Perl


Is there any way to split this? Posted by stevele on 21 Nov 2007 at 12:01 AM
Hi...

I'm a fairly new(ish) Perl programmer, and I don't write enough of it to stay good at it. I've got a file full of lines like the one below, and I'm trying to break each line into fields at the commas using Perl's split function:

1007_s_at,Human Genome U133A 2.0 Array,Homo sapiens,16-Sep-05,Exemplar sequence,Affymetrix Proprietary Database,U48705mRNA," U48705 /FEATURE=mRNA /DEFINITION=HSU48705 Human receptor tyrosine kinase DDR gene, complete cds ",U48705,---,Hs.520004,May 2004 (NCBI 35),chr6:30964144-30975910 (+) // 95.63 // p21.33,"discoidin domain receptor family, member 1",DDR1,chr6p21.3,full length,ENSG00000137332,780,Q08345 /// Q6ZNR9 /// Q96T61 /// Q96T62 /// Q6NSK4,EC:2.7.1.112,600408,NP_001945.3 /// NP_054699.1 /// NP_054700.1,NM_001954 /// NM_013993 /// NM_013994,---,---,---,---,---,---,6468 // protein amino acid phosphorylation // inferred from electronic annotation /// 7155 // cell adhesion // traceable author statement /// 7169 // transmembrane receptor protein tyrosine kinase signaling pathway // inferred from electronic annotation,5887 // integral to plasma membrane // traceable author statement /// 16020 // membrane // inferred from electronic annotation,166 // nucleotide binding // inferred from electronic annotation /// 4672 // protein kinase activity // inferred from electronic annotation /// 4674 // protein serine/threonine kinase activity // inferred from electronic annotation /// 4713 // protein-tyrosine kinase activity // inferred from electronic annotation /// 4714 // transmembrane receptor protein tyrosine kinase activity // traceable author statement /// 4872 // receptor activity // inferred from electronic annotation /// 5524 // ATP binding // inferred from electronic annotation /// 16740 // transferase activity // inferred from electronic annotation,---,---,---,IPR000421 // Coagulation factor 5/8 type C domain (FA58C) /// IPR000719 // Protein kinase,AAA18019.1 // span:417-439 // numtm:1 /// NP_054699.1 // span:417-439 // numtm:1 /// NP_054700.1 // span:417-439 // numtm:1 /// NP_001945.3 // span:417-439 // numtm:1,---,This probe set was annotated using the Matching Probes based pipeline to a Entrez Gene identifier using 4 transcripts. 
// false // Matching Probes // A,"L20817(16),NM_001954(16),NM_013993(16),NM_013994(16)","L20817 // Homo sapiens tyrosine protein kinase (CAK) gene, complete cds. // gb // 16 // --- /// NM_013993 // Homo sapiens discoidin domain receptor family, member 1 (DDR1), transcript variant 1, mRNA. // refseq // 16 // --- /// NM_013994 // Homo sapiens discoidin domain receptor family, member 1 (DDR1), transcript variant 3, mRNA. // refseq // 16 // --- /// NM_001954 // Homo sapiens discoidin domain receptor family, member 1 (DDR1), transcript variant 2, mRNA. // refseq // 16 // --- /// ENST00000324771 // cdna:known chromosome:NCBI35:6:30631412:30975908:1 gene:ENSG00000137332 // ensembl // 16 // --- /// ENST00000361741 // cdna:ccds chromosome:NCBI35:6:30631412:30975908:1 gene:ENSG00000137332 CCDS4690.1 // ensembl // 16 // --- /// ENST00000259875 // cdna:ccds chromosome:NCBI35:6:30959840:30975910:1 gene:ENSG00000137332 CCDS4690.1 // ensembl // 16 // ---",ENSESTT00000096742 // ensembl // 3 // Cross Hyb Matching Probes /// ENSESTT00000096744 // ensembl // 3 // Cross Hyb Matching Probes /// ENSESTT00000096749 // ensembl // 3 // Cross Hyb Matching Probes /// ENSESTT00000096750 // ensembl // 3 // Cross Hyb Matching Probes /// ENSESTT00000096751 // ensembl // 3 // Cross Hyb Matching Probes /// S57212 // gb // 1 // Negative Strand Matching Probes /// GENSCAN00000004090 // ensembl // 1 // Negative Strand Matching Probes /// ENST00000357034 // ensembl // 1 // Negative Strand Matching Probes /// ENST00000340208 // ensembl // 1 // Negative Strand Matching Probes /// ENST00000357057 // ensembl // 1 // Negative Strand Matching Probes

This is an Excel CSV file (i.e., a comma-separated values file). In theory, splitting on the comma would be fine (i.e., split /,/). However, in a few places there are fields surrounded by quotes, and the commas inside those quotes shouldn't be split on. For example, I'd like this part to stay together as a single scalar rather than being split: " U48705 /FEATURE=mRNA /DEFINITION=HSU48705 Human receptor tyrosine kinase DDR gene, complete cds "

I need a regular expression that says something like “split up the line based on commas unless the commas come in between quote marks.” I know enough Perl to know that I don’t know how to do this! Do you have any ideas as to what the regular expression would look like for this? Any ideas would be most welcome!

Thanks!

-Steve

Re: Is there any way to split this? Posted by Jonathan on 21 Nov 2007 at 4:24 AM
Hi,

Ouch. The obvious but evil thing that comes to mind is to encode the commas that are inside the quotes and then do the split. But it'd be nice to do better. How about this?

my @bits = split /,(?![^",]+")/, $string;


Lookahead ROCKS. The idea is that you can say "here's something that must not come straight after the thing I do want to match, otherwise we fail", without that something becoming part of the match. So we match a comma. Then we do a negative lookahead (meaning I want this to not be the case) and say that the match fails if the comma is followed by a run of characters that are neither commas nor quotes and then a quote, since that suggests the comma sits inside a quoted field.
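To watch the lookahead decide comma by comma, here is a minimal sketch that runs the pattern above on a made-up line. The sample string and variable names are only illustrative and are not taken from the real data.

```perl
use strict;
use warnings;

# Hypothetical sample line: one quoted field containing a single comma.
my $line = 'a,"b, c",d';

# Split on a comma unless it is followed by a run of characters that are
# neither commas nor quotes and then a quote (we appear to be inside quotes).
my @bits = split /,(?![^",]+")/, $line;

print "$_\n" for @bits;
```

On this input the split should give three fields, with "b, c" kept in one piece (quotes included), because the comma after b is followed by ` c"` and so the lookahead rejects it.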

Well, it works if there is only one comma nested inside the quotes, but for two or more it fails. So we need to consider that case...

my @bits = split /,(?!(?:[^",]|[^"],[^"])+")/, $string;


Which is probably starting to make your head spin and took me a good five minutes to come up with. But it appears to work.

my $string = <<TEXT;
blah,"blah, blah, blah",blah,"blah"
TEXT

my @bits = split /,(?!(?:[^",]|[^"],[^"])+")/, $string;
print join "\n", @bits;


Run it and you get the output:

blah
"blah, blah, blah"
blah
"blah"


Hope it works for you!
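One caveat worth testing against your real data: the pattern is a heuristic, not a CSV parser. From tracing it by hand, it seems to miss a split when an empty field sits directly before a quoted one. Below is a minimal sketch of a sturdier route, assuming the core module Text::ParseWords is acceptable. The sample line and the split_csv_line helper are made up for illustration.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# Hypothetical helper: split one CSV line into fields.
# The second argument to parse_line is "keep"; 0 strips the surrounding quotes.
sub split_csv_line {
    my ($line) = @_;
    chomp $line;
    return parse_line(',', 0, $line);
}

# Made-up line with an empty field right before a quoted field.
my $line = qq{a,x,,"q, r",z\n};

my @regex_bits  = split /,(?!(?:[^",]|[^"],[^"])+")/, $line;
my @parsed_bits = split_csv_line($line);

print scalar(@regex_bits), " fields from the regex split\n";
print scalar(@parsed_bits), " fields from parse_line\n";
print "[$_]\n" for @parsed_bits;
```

On this input the regex split should give 4 pieces (the first being a,x), while parse_line gives the expected 5. Note that parse_line also treats single quotes and backslashes as special, which annotation text could trip over, so for files you don't control the CPAN module Text::CSV is probably the safer choice.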

Jonathan
###
for(74,117,115,116){$::a.=chr};(($_.='qwertyui')&&
(tr/yuiqwert/her anot/))for($::b);for($::c){$_.=$^X;
/(p.{2}l)/;$_=$1}$::b=~/(..)$/;print("$::a$::b $::c hack$1.");



 
