
Splitting CSV with regex

Posted on Wednesday, November 21, 2007 at 6:01 AM
I answered a question on the Perl forum today about splitting CSV. CSV is a comma-separated format; for example:
blah,blah,blah
You can put values in quotes:
blah,"blah blah",blah
And commas inside those quotes are treated as ordinary characters rather than separators:
blah,"blah,blah,blah",blah
If we do the naive thing and implement it using split on a comma:
my @fields = split(/,/, $string);
Then we will obviously get the Wrong Answer, because the commas inside the quotes are treated as separators too. The question was: is there a regex we can use with split that will do the Right Thing? The answer was yes, though it took me a few minutes to come up with it. The thing is that we don't want to match anything more than the commas we are splitting on, but we do need to look at the string that is up ahead (or behind us) to detect whether the comma we are seeing is inside quotes.
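To make the Wrong Answer concrete, here is a small sketch that runs the naive split on one of the sample lines above. The variable names are just for illustration.

```perl
use strict;
use warnings;

# Sample line from the examples above: three logical fields.
my $string = 'blah,"blah,blah,blah",blah';

# Naive split: every comma is a separator, including the quoted ones.
my @fields = split(/,/, $string);

print scalar(@fields), " fields\n";   # 5, rather than the 3 we wanted
print "[$_]\n" for @fields;           # the quoted field comes back in pieces
```

The quoted field is broken into three pieces, and the quote characters end up attached to the first and last of them.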

We can use the (?!...) construct to achieve what we want. This is called negative lookahead; it says "if you can match the pattern here then the regex fails to match". It is a zero-width assertion, just like the start and end of string anchors. My first thought was that we could do something like:
my @fields = split /,(?![^",]+")/, $string;
So we split on a comma, unless the pattern in the negative lookahead matches. That pattern says: if, after this comma, we see a run of characters that are neither commas nor quotes, followed by a quote, then we fail to match. That would imply we have seen a comma inside some quotes. This nearly works, but it doesn't handle the case where there is another comma further along inside the quotes (so it works for "blah,blah" but not for "blah,blah,blah"). This isn't hard to solve: we just need to say that the lookahead may also step over a comma, provided there isn't a quote on either side of it (so the comma is inside the quoted string, not between two quoted strings). That gives us:
my @fields = split /,(?!(?:[^",]|[^"],[^"])+")/, $string;
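As a quick check, here is a minimal sketch that runs the first attempt and the final pattern side by side on the three sample lines from this post. The names `$first_try`, `$final` and `@lines` are made up for the example; the patterns themselves are copied unchanged.

```perl
use strict;
use warnings;

# The two patterns from this post, compiled with qr// so they can be reused.
my $first_try = qr/,(?![^",]+")/;
my $final     = qr/,(?!(?:[^",]|[^"],[^"])+")/;

# The sample lines used earlier in the post.
my @lines = (
    'blah,"blah blah",blah',
    'blah,"blah,blah",blah',
    'blah,"blah,blah,blah",blah',
);

for my $line (@lines) {
    my @a = split $first_try, $line;
    my @b = split $final,     $line;
    printf "%-28s first try: %d fields, final: %d fields\n",
        $line, scalar(@a), scalar(@b);
}
```

Both patterns give three fields for the first two lines. On the last line the first attempt gives four fields, while the final pattern gives three. Note that the quotes stay in the returned field (you get "blah,blah,blah" with its quotes), so strip them afterwards if you need to. This only exercises the samples above; input with escaped quotes or consecutive commas inside quotes is not covered here.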
You can probably transplant the regex (the bit between the slashes) into any other language whose regex engine supports negative lookahead, though I only tested it in Perl. Hope this is useful, anyway.
