Regular Expressions In Perl

The Perl programming language is one of the few that considers regexes to be important enough to have their own operator. Its regexes also go way beyond the regular, with all kinds of useful features. This guide won't cover all of these features, but a few of the useful ones will be highlighted. To understand this article, you need to have read the regular expressions tutorial and have a reasonable understanding of basic Perl syntax.

Note: This article discusses Perl 5's pattern matching features. Perl 6 represents a major overhaul of the Perl language and therefore does things somewhat differently. An overview of the changes can be found here.

Matching A String Against A Pattern

The =~ operator matches the scalar to its left against the pattern to its right. Patterns are usually written between forward slashes, e.g. /pattern/. Here are some examples:-
# Match a pattern against a constant string.
"Hello world" =~ /(H\w+)/;
# Check if a pattern matches a scalar.
if ($maybeNumber =~ /^\d+$/) {
    # Matched.
}
# We can invert the logic of the match by doing !~.
if ($maybeNumber !~ /^\d+$/) {
    # Didn't match.
}
# To match against $_, just write
/pattern/;
The problem with putting a pattern between /'s is that you might have a / in your pattern. If this happens, you can either escape it, e.g. \/, or use an alternative pattern delimiter. If you put an "m" before your pattern, e.g. m/pattern/, then you can replace the /'s with something else.
# So these are all the same.
$x =~ /pattern/;
$x =~ m/pattern/;
$x =~ m|pattern|;
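As a small sketch of why an alternative delimiter helps, the following matches a Unix-style path twice, first with escaped slashes and then with m{...}. The variable names and the sample path are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data for illustration.
my $path = "/usr/local/bin/perl";

# With / as the delimiter, every / in the pattern needs escaping.
my $escaped = ($path =~ /^\/usr\/local\/bin\/(\w+)$/) ? $1 : '';

# With m{...} the slashes can be written as they are.
my $braces = ($path =~ m{^/usr/local/bin/(\w+)$}) ? $1 : '';

print "escaped: $escaped\n";   # prints escaped: perl
print "braces:  $braces\n";    # prints braces:  perl
```

Both patterns are the same regex; only the delimiter differs, so the choice is purely about readability.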


Extracting Matches

Captures, made using parentheses, are stored in special numbered variables. The first capture is stored in $1, the second capture in $2, etc. Note that these hold the captures from the last successful pattern match only, so it is important to save results before doing another match.
# Continuation of our earlier example...
"Hello world" =~ /(H\w+)/;
print $1; # prints Hello
# Extract name/value pair.
$test = "weapon=railgun";
$test =~ /^(\w+)=(.+)$/;
print "My $1 is a $2."; # prints My weapon is a railgun.
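One way to save captures before the next match, sketched here with made-up sample strings and variable names, is to evaluate the match in list context, where it returns the captures directly.

```perl
use strict;
use warnings;

# Hypothetical sample data for illustration.
my $first  = "weapon=railgun";
my $second = "armour=heavy";

# A match in list context returns its captures, so they can be
# copied into named variables straight away.
my ($name, $value) = $first =~ /^(\w+)=(.+)$/;

# This second successful match overwrites $1 and $2 ...
$second =~ /^(\w+)=(.+)$/;

# ... but the copies made from the first match are unaffected.
print "Saved: $name=$value\n";   # prints Saved: weapon=railgun
print "Now:   $1=$2\n";          # prints Now:   armour=heavy
```

If the first match had failed, $name and $value would be undefined, so real code would normally check the result before using them.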


Modifiers

It is possible to add a number of modifiers to a pattern that affect the way it is matched. Modifiers are single letters and you can use combinations of them. Commonly used modifiers include:
  • i for doing a case insensitive match.
  • x for making whitespace in the pattern not count as part of the pattern; you can spread the pattern over multiple lines and even add comments.
  • g for repetitive matching, i.e. remembering where the last match ended and finding the next one when the regex is evaluated again. This means we can easily extract all matches with a loop.
  • s makes the . metacharacter match newline characters; it doesn't by default.
  • m makes ^ and $ match the start and end of a line rather than the start and end of a string.
  • o tells Perl to only compile patterns with variables interpolated into them once. This is occasionally useful for performance reasons. See the section on interpolating variables below.
Here are a few examples.
# Case insensitive match.
print 1 if "bUbblE" =~ /bubble/i; # prints 1
# Whitespace allowed and case insensitive
print 1 if "bUbblE" =~ /bub ble/ix; # prints 1
# Find all duplicated words - prints dupes and in. Note
# that \1 is a back-reference to the first capture - that
# is whatever was captured into $1 must also exist in the
# string where \1 is placed.
while ("some dupes dupes are in in here" =~ /\b(\w+)\s+\1\b/g) {
    print "$1\n";
}
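The x, m and s modifiers are not shown above, so here is a minimal sketch of each. The sample strings and variable names are invented for the example.

```perl
use strict;
use warnings;

# x: whitespace and comments inside the pattern are ignored.
my $date = "2004-06-09";
my ($year, $month, $day) = $date =~ /
    ^ (\d{4})    # four-digit year
    - (\d{2})    # two-digit month
    - (\d{2}) $  # two-digit day
/x;
print "$day.$month.$year\n";   # prints 09.06.2004

# m: ^ and $ match at the start and end of every line.
# Combined with g in list context, this collects one word per line.
my $text  = "alpha\nbeta\ngamma\n";
my @lines = $text =~ /^(\w+)$/mg;
print scalar(@lines), "\n";    # prints 3

# s: . also matches a newline.
my $block = "start\nend";
print "no s\n"   if $block !~ /start.end/;    # prints no s
print "with s\n" if $block =~ /start.end/s;   # prints with s
```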


Substitutions

The syntax for substitutions is similar, and takes the form s/find/replace/. Note that this time the "s" before the first / is not optional, unlike in the m// construct. In the replacement string, captures can be referenced as $1, $2, etc. There are also a number of modifiers that can be used with substitutions:
  • i, x, s and m work as for the match construct.
  • g means replace globally. You do not need to set up a loop; using the g modifier means that anything that matches the pattern we're finding will be replaced "at once".
  • e means that what is in the replacement part is Perl code and should be executed. Use with care!
Here are some examples.
# Without the g modifier, just one replacement.
$test = 'I saw a play being played.';
$test =~ s/play/monkey/;
print $test; # Prints I saw a monkey being played.
# With the g modifier, all replacements are made.
$test = 'I saw a play being played.';
$test =~ s/play/monkey/g;
print $test; # Prints I saw a monkey being monkeyed.
# Re-writing a date from American to UK format.
$date = '09-06-2004';
$date =~ s/^(\d\d)-(\d\d)/$2-$1/;
print $date; # Prints 06-09-2004
# Code execution is possible. This reverses every word
# while retaining word order.
$sentence = 'something blah whatever';
$sentence =~ s/\b(\w+)\b/reverse($1)/eg;
print $sentence; # Prints gnihtemos halb revetahw
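As a further sketch of the e modifier, the replacement below is an arithmetic expression rather than a function call. The sample string and variable name are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data for illustration.
my $order = "3 apples and 10 pears";

# With e, the replacement is evaluated as Perl code for each match;
# with g, that happens for every number in the string.
$order =~ s/(\d+)/$1 * 2/eg;

print "$order\n";   # prints 6 apples and 20 pears
```

Without the e modifier the same replacement would be treated as a string, giving "3 * 2 apples and 10 * 2 pears".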


Interpolating Variables

You can interpolate a variable into a regular expression. For example:-
$search = 'dog';
print 1 if 'I own a dog.' =~ /\b$search\b/; # prints 1
However, it is important to be very careful when doing this, as the interpolation is done before the regular expression is compiled. For example:-
$search = '.+';
print 1 if 'I own a dog.' =~ /\b$search\b/; # prints 1
Here the substring ".+" clearly doesn't exist in the string we're matching against, so .+ must have been interpreted as part of the pattern, not as a literal. Because of this, you must be very careful with values that come from outside your program. Someone who controls the interpolated value can make the pattern match things you never intended or take an extremely long time to run, and if use re 'eval' is in effect they can run arbitrary code through embedded code constructs. If the script is, for example, a CGI script, this leaves your code and thus your server open to remote exploitation. Unless you really want to interpolate the variable as a pattern, put a \Q before it and a \E afterwards.
$search = '.+';
print 1 if 'I own a dog.' =~ /\b\Q$search\E\b/; # prints nothing
This escapes any meta-characters for you automatically, and therefore is safer.
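The same escaping is available as the built-in quotemeta function, and a pattern that is built once and reused can be stored with qr//. The sketch below is one way to combine the two; the sub name contains_literal and the sample strings are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $text contains $needle as a literal
# substring, whatever metacharacters $needle holds.
sub contains_literal {
    my ($text, $needle) = @_;
    my $safe = quotemeta($needle);   # same effect as \Q...\E
    my $re   = qr/$safe/;            # compiled pattern object
    return $text =~ $re ? 1 : 0;
}

print contains_literal('I own a dog.', '.+'), "\n";      # prints 0
print contains_literal('Cost: 3.+ units', '.+'), "\n";   # prints 1
print contains_literal('I own a dog.', 'dog'), "\n";     # prints 1
```

Keeping the escaping inside one helper means callers cannot forget it, which is the main point of the design.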

Further Reading

See the Perl Regular Expressions manual page for further details on Perl 5 regular expressions. If you have O'Reilly's Programming Perl, then Chapter 4 is certainly worth reading too.





 
© 1996-2008 Community Networks Ltd. All rights reserved.