<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Posts Tagged With 'CSV' RSS Feed</title>
    <link>http://www.programmersheaven.com/blog/tags/CSV</link>
    <description>Contains the latest posts from the Programmer's Heaven blogs that are tagged with the label 'CSV'</description>
    <lastBuildDate>Thu, 24 Jul 2008 06:37:53 -0700</lastBuildDate>
    <generator>Argotic Syndication Framework 2007.3.0.1, http://www.codeplex.com/Argotic</generator>
    <docs>http://www.rssboard.org/rss-specification</docs>
    <item>
      <title>Splitting CSV with regex</title>
      <link>http://www.programmersheaven.com/user/Jonathan/blog/73-Splitting-CSV-with-regex/</link>
      <description>I answered a question on the Perl forum today about splitting CSV. CSV is a comma-separated format; for example:&lt;br /&gt;
&lt;pre class="sourcecode"&gt;blah,blah,blah&lt;/pre&gt;&lt;br /&gt;
You can put values in quotes:&lt;br /&gt;
&lt;pre class="sourcecode"&gt;blah,"blah blah",blah&lt;/pre&gt;&lt;br /&gt;
And commas inside those quotes are part of the value, not field separators:&lt;br /&gt;
&lt;pre class="sourcecode"&gt;blah,"blah,blah,blah",blah&lt;/pre&gt;&lt;br /&gt;
If we do the naive thing and implement it using split on a comma:&lt;br /&gt;
&lt;pre class="sourcecode"&gt;my @fields = split(/,/, $string);&lt;/pre&gt;&lt;br /&gt;
Then we get the Wrong Answer as soon as a quoted value contains a comma, because split cuts at those commas too. The question was, is there a regex we can use with split that will do the Right Thing? And the answer was yes, though it took me a few minutes to come up with it. The thing is that we don't want to match anything more than the commas we are splitting on, but we do need to look at the string up ahead (or behind us) to detect whether the comma we are seeing is inside quotes.&lt;br /&gt;
&lt;br /&gt;
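To make the problem concrete, here is a minimal sketch that runs the naive split on the quoted example from above. The variable names are only for illustration.&lt;br /&gt;

```perl
use strict;
use warnings;

# Sample line taken from the quoted-comma example above.
my $string = 'blah,"blah,blah,blah",blah';

# Naive split: every comma is a separator, even the ones inside the quotes.
my @fields = split(/,/, $string);

print scalar(@fields), " fields\n";   # 5 fields, not the 3 we want
print "[$_]\n" for @fields;           # [blah] ["blah] [blah] [blah"] [blah]
```

The quoted value comes back in three pieces, which is why a plain comma pattern is not enough.&lt;br /&gt;
&lt;br /&gt;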
We can use the (?!...) construct to achieve what we want. This is called negative lookahead; it says "if you can match the pattern here then the regex fails to match". It is a zero-width assertion, just like the start and end of string anchors. My first thought was that we could do something like:&lt;br /&gt;
&lt;pre class="sourcecode"&gt;my @fields = split /,(?![^",]+")/, $string;&lt;/pre&gt;&lt;br /&gt;
So we split on a comma, unless the pattern in the negative lookahead matches. That pattern says: if the comma is followed by a run of characters that are neither commas nor quotes, and then by a quote, we fail to match. That would imply we have seen a comma inside some quotes. This nearly works, but it doesn't handle the case where there is another comma further along inside the quotes (so it works for "blah,blah" but not "blah,blah,blah"). This isn't hard to solve - we just need to say that the lookahead may also step over a comma, as long as there isn't a quote on either side of it (so the comma is inside the quoted string, not between two quoted strings). That gives us:&lt;br /&gt;
&lt;pre class="sourcecode"&gt;my @fields = split /,(?!(?:[^",]|[^"],[^"])+")/, $string;&lt;/pre&gt;&lt;br /&gt;
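As a quick check, the sketch below runs both patterns against the sample line from earlier and prints what each split returns. The variable names and the extra edge-case line are assumptions added for illustration, not part of the original forum answer.&lt;br /&gt;

```perl
use strict;
use warnings;

# The two patterns from the post, precompiled so they can be reused.
my $first_try = qr/,(?![^",]+")/;
my $final     = qr/,(?!(?:[^",]|[^"],[^"])+")/;

# Sample line from earlier: one quoted field containing two commas.
my $string = 'blah,"blah,blah,blah",blah';

my @first = split $first_try, $string;
my @fixed = split $final,     $string;

print scalar(@first), " fields: ", join(' | ', @first), "\n";
# 4 fields: blah | "blah | blah,blah" | blah
print scalar(@fixed), " fields: ", join(' | ', @fixed), "\n";
# 3 fields: blah | "blah,blah,blah" | blah

# Hypothetical edge case: a comma directly before the closing quote.
# The lookahead needs at least one non-quote character after the comma,
# so this comma is still treated as a separator.
my @edge = split $final, 'x,"y,",z';
print scalar(@edge), " fields: ", join(' | ', @edge), "\n";
# 4 fields: x | "y | " | z
```

Note that the fields keep their surrounding quotes, and that the last line shows the pattern is a handy trick for simple input rather than a complete CSV parser.&lt;br /&gt;
&lt;br /&gt;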
You can probably transplant the regex (the bit between the slashes) into the regex engine of lots of other languages, though I only tested it in Perl. Hope this is useful, anyway.</description>
      <guid isPermaLink="true">http://www.programmersheaven.com/user/Jonathan/blog/73-Splitting-CSV-with-regex/</guid>
      <pubDate>Wed, 21 Nov 2007 06:01:18 -0700</pubDate>
      <dc:creator>Jonathan</dc:creator>
    </item>
  </channel>
</rss>