Love this site? Hate it? Leave us some comments.
Theme Graphic
Theme Graphic

$jonathan.ramble(topic => 'work' & 'play');

My random-ish ramblings about my work at Programmer's Heaven and on the Parrot project, my crazy conference schedule and whatever else I...

Subscribe

Author

I wrote my first computer program when I was 8 years old, and I haven't bothered to stop since! I graduated from the University Of Cambridge, England, in June 2006 having read Computer Science at Emmanuel College. Currently I'm doing consultancy work, a lot of it for this very site. I'm into a whole range of music, including metal, trance, old skool, and industrial. I'm a Christian and have previously been involved in youth and children's work with what was then my local church.

Archive

Tags

Posted on Wednesday, November 21, 2007 at 6:01 AM

Splitting CSV with regex

I answered a question on the Perl forum today about splitting CSV. CSV is a comma separated format; for example:
blah,blah,blah
You can put values in quotes:
blah,"blah blah",blah
And those quotes make commas within them meaningless too:
blah,"blah,blah,blah",blah
If we do the naive thing and implement it using split on a comma:
my @fields = split(/,/, $string);
Then we will obviously get the Wrong Answer. The question was, is there a regex we can use with split that will do the Right Thing? And the answer was yes, though it took me a few minutes to come up with it. The thing is that we don't want to match anything more than the commas we are splitting on, but we do need to do some analysis on the string that is up ahead (or behind us) to detect if the comma we are seeing is in quotes.

We can use the (?!...) construct to achieve what we want. This is called negative lookahead; it says "if you can match the pattern here then the regex fails to match". It is a zero-width assertion, just like the start and end of string anchors. My first thought was that we could do something like:
my @fields = split /,(?![^",]+")/, $string;
So we split on a comma, unless the pattern in the negative lookahead matches. And that pattern says, if we see a load of characters after this comma that don't include a comma or a quote, followed by a quote, then we fail to match. That would imply we have seen a comma inside some quotes. This nearly works, but it doesn't handle the case where we see another comma in the quotes (so it would work for "blah,blah" but not "blah,blah,blah"). This isn't hard to solve - we just need say that it is OK to match a comma if there isn't a quote either side of it (so it's inside the quoted string, but not between two quoted strings). That gives us:
my @fields = split /,(?!(?:[^",]|[^"],[^"])+")/, $string;
You can probably transplant the regex (the bit between the slashes) into the regex engine of lots of other languages, though I only tested it in Perl. Hope this is useful, anyway.
Bookmark: Submit To Digg Submit To reddit Submit To del.icio.us Bookmark With StumbleUpon Bookmark With FaceBook Bookmark With Google Bookmarks   Share: Share By Email By Email

6 comments on "Splitting CSV with regex"
Posted by Guillaume Roderick on Thursday, July 31, 2008 at 1:42 AM
Image Of Author
excellent
This also works nicely in Ruby:

string.split(/,(?!(?:[^",]|[^"],[^"])+")/)
Posted by Mark on Saturday, August 09, 2008 at 2:19 AM
Image Of Author
Perfect!
It worked great with PHP. Best CSV regular expression I've seen. Thank you so much!

$fields = preg_split("/,(?!(?:[^\\\",]|[^\\\"],[^\\\"])+\\\")/", $string);
Posted by Guillaume Roderick on Thursday, August 14, 2008 at 6:28 AM
Image Of Author
Actually, this doesn't work in all cases
This, however, is the daddy:

Ruby: string.split(/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/)

Perl: $fields = preg_split("/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/", $string)
Posted by noname on Friday, September 12, 2008 at 1:56 PM
Image Of Author
bugged
Doesn't work in the case when empty quoted field is followed by empty unquoted field (ex: a,"",,,,c ) This <a href="http://us.php.net/manual/en/function.split.php#82050">PHP CSV splitting function</a> on the other hand works...
Posted by Martin S. on Wednesday, October 29, 2008 at 8:17 AM
Image Of Author
Use the csv module for python
Seems not to work in python: But the 'csv' module can be used there: http://www.python.org/doc/2.5.2/lib/module-csv.html
Posted by Anon on Thursday, February 26, 2009 at 5:41 PM
Image Of Author
Didn't work
Yeh didn't work for me in all situations. Guillaume Roderick's one (above) does though.

Leave A Comment
Subject:


Comment:
   Bold Italic Underline          Code Link Image Horizontal Rule


Because you do not have or are not logged in to your Programmer's Heaven account, please enter your name.

Name:


To help prevent comment SPAM, please enter the magic code '525' in the box:




Posting Rules
Please follow these rules when posting comments on blog posts.
  • Do not post anything that is racist, hate speech or of a sexual or adult nature.
  • Do not post or link to anything that infringes copyrighted laws.
  • Posting about security or legal topics is fine so long as you are not glorifying or encouraging people to perform illegal activities.
  • Both the author of this blog and the Programmer's Heaven administrators may delete any inappropriate comments without notice at their own discretion.
 
Popular resources and forums for programmers on Programmersheaven.com
Assembly, Basic, C, C#, C++, Delphi, Java, JavaScript, Pascal, Perl, PHP, Python, Ruby, Visual Basic
© Copyright 2009 Programmersheaven.com - All rights reserved.
Reproduction in whole or in part, in any form or medium without express written permission is prohibited.
Violators of this policy may be subject to legal action. Please read our Terms Of Use and Privacy Statement for more information.
Publisher: Lars Hagelin. Read the latest words from the publisher here.
Be the first to sign up for Lars Hagelin’s In-depth Outsourcing Newsletter here.
bootstrapLabs Logo A bootstrapLabs project.