Hi...
I'm a fairly new(ish) Perl programmer, and I don't use it often enough to stay good at it. I've got a file full of lines like the one below, and I'm trying to break each line into fields at the commas using Perl's split function:
1007_s_at,Human Genome U133A 2.0 Array,Homo sapiens,16-Sep-05,Exemplar sequence,Affymetrix Proprietary Database,U48705mRNA," U48705 /FEATURE=mRNA /DEFINITION=HSU48705 Human receptor tyrosine kinase DDR gene, complete cds ",U48705,---,Hs.520004,May 2004 (NCBI 35),chr6:30964144-30975910 (+) // 95.63 // p21.33,"discoidin domain receptor family, member 1",DDR1,chr6p21.3,full length,ENSG00000137332,780,Q08345 /// Q6ZNR9 /// Q96T61 /// Q96T62 /// Q6NSK4,EC:2.7.1.112,600408,NP_001945.3 /// NP_054699.1 /// NP_054700.1,NM_001954 /// NM_013993 /// NM_013994,---,---,---,---,---,---,6468 // protein amino acid phosphorylation // inferred from electronic annotation /// 7155 // cell adhesion // traceable author statement /// 7169 // transmembrane receptor protein tyrosine kinase signaling pathway // inferred from electronic annotation,5887 // integral to plasma membrane // traceable author statement /// 16020 // membrane // inferred from electronic annotation,166 // nucleotide binding // inferred from electronic annotation /// 4672 // protein kinase activity // inferred from electronic annotation /// 4674 // protein serine/threonine kinase activity // inferred from electronic annotation /// 4713 // protein-tyrosine kinase activity // inferred from electronic annotation /// 4714 // transmembrane receptor protein tyrosine kinase activity // traceable author statement /// 4872 // receptor activity // inferred from electronic annotation /// 5524 // ATP binding // inferred from electronic annotation /// 16740 // transferase activity // inferred from electronic annotation,---,---,---,IPR000421 // Coagulation factor 5/8 type C domain (FA58C) /// IPR000719 // Protein kinase,AAA18019.1 // span:417-439 // numtm:1 /// NP_054699.1 // span:417-439 // numtm:1 /// NP_054700.1 // span:417-439 // numtm:1 /// NP_001945.3 // span:417-439 // numtm:1,---,This probe set was annotated using the Matching Probes based pipeline to a Entrez Gene identifier using 4 transcripts. 
// false // Matching Probes // A,"L20817(16),NM_001954(16),NM_013993(16),NM_013994(16)","L20817 // Homo sapiens tyrosine protein kinase (CAK) gene, complete cds. // gb // 16 // --- /// NM_013993 // Homo sapiens discoidin domain receptor family, member 1 (DDR1), transcript variant 1, mRNA. // refseq // 16 // --- /// NM_013994 // Homo sapiens discoidin domain receptor family, member 1 (DDR1), transcript variant 3, mRNA. // refseq // 16 // --- /// NM_001954 // Homo sapiens discoidin domain receptor family, member 1 (DDR1), transcript variant 2, mRNA. // refseq // 16 // --- /// ENST00000324771 // cdna:known chromosome:NCBI35:6:30631412:30975908:1 gene:ENSG00000137332 // ensembl // 16 // --- /// ENST00000361741 // cdna:ccds chromosome:NCBI35:6:30631412:30975908:1 gene:ENSG00000137332 CCDS4690.1 // ensembl // 16 // --- /// ENST00000259875 // cdna:ccds chromosome:NCBI35:6:30959840:30975910:1 gene:ENSG00000137332 CCDS4690.1 // ensembl // 16 // ---",ENSESTT00000096742 // ensembl // 3 // Cross Hyb Matching Probes /// ENSESTT00000096744 // ensembl // 3 // Cross Hyb Matching Probes /// ENSESTT00000096749 // ensembl // 3 // Cross Hyb Matching Probes /// ENSESTT00000096750 // ensembl // 3 // Cross Hyb Matching Probes /// ENSESTT00000096751 // ensembl // 3 // Cross Hyb Matching Probes /// S57212 // gb // 1 // Negative Strand Matching Probes /// GENSCAN00000004090 // ensembl // 1 // Negative Strand Matching Probes /// ENST00000357034 // ensembl // 1 // Negative Strand Matching Probes /// ENST00000340208 // ensembl // 1 // Negative Strand Matching Probes /// ENST00000357057 // ensembl // 1 // Negative Strand Matching Probes
This is an Excel CSV file (i.e., a comma-separated values file). In theory, splitting on the comma would be fine (i.e., split /,/). However, a few fields are surrounded by double quotes, and the commas inside those quotes shouldn't be treated as separators. For example, I'd like this part to remain together as a single scalar, and not be split: " U48705 /FEATURE=mRNA /DEFINITION=HSU48705 Human receptor tyrosine kinase DDR gene, complete cds "
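To make the problem concrete, here is a minimal sketch of what a plain split does. The record is a shortened, made-up stand-in for the real line (only four fields, one of them quoted), and the variable names are just for illustration:

```perl
use strict;
use warnings;

# Shortened, illustrative record (an assumption, not the full line from the file).
# It should contain 4 fields; the third is quoted and has a comma inside it.
my $line = '1007_s_at,U48705mRNA," U48705 /FEATURE=mRNA Human receptor tyrosine kinase DDR gene, complete cds ",U48705';

# Naive approach: split on every comma, paying no attention to the quotes.
my @fields = split /,/, $line;

# The comma inside the quoted field is treated as a separator too,
# so the quoted field is cut in two and the count comes out as 5 instead of 4.
printf "%d fields\n", scalar @fields;
print "[$_]\n" for @fields;
```

The quoted description ends up in two separate elements, one ending in "DDR gene" and the next holding " complete cds" plus the closing quote, which is exactly what I'm trying to avoid.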
I need a regular expression that says something like “split up the line based on commas unless the commas come in between quote marks.” I know enough Perl to know that I don’t know how to do this! Do you have any ideas as to what the regular expression would look like for this? Any ideas would be most welcome!
Thanks!
-Steve