Regular Expressions Tutorial

Regular expressions provide a powerful way of determining whether a string matches a given pattern and extracting pieces of the string as required. This can be useful for validation and some data parsing tasks. Be aware that the syntax for regular expressions has changed over time, and support for the features that are found in this tutorial may not exist in all implementations.

Literals And Metacharacters

A regular expression (we will use the term regex from this point onwards) is simply a string that describes a pattern that we want to match other strings against. A regex contains two types of characters - literals and metacharacters.

A literal is a character that we are looking to exist (or not exist) in the strings we are matching. For example, the regex "aabb" will match any string that contains the character "a" twice followed immediately by the character "b" twice: it would match "12aabbxyz", but not "abcabc".

A metacharacter is a character that has special meaning in the pattern. Metacharacters allow us to define much more complex patterns. For example, the * metacharacter allows us to match 0 or more of a particular character. The regex "a*b" would match "5bcde" and "aaaaabq", but not "a5cde". That is, it matches a "b" that may have the character "a" appearing before it zero or more times. Note that "a5bcde" would still match: the "b" is found with zero "a"s immediately before it, and the rest of the string is ignored.

If we wanted to match a literal "*" character, we would escape the metacharacter by putting a \ immediately before it. For example, "a\*b" would match the strings "a*b" and "zxa*bcd", but not "ab". To match a literal \, use \\.
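
As a rough illustration of literals, the * metacharacter and escaping, here is a small sketch using Python's re module. The choice of Python is an assumption made purely for illustration; re.search is used because it looks for the pattern anywhere in the string, which is the sense of "match" used above.

```python
import re

# Literal characters must appear in the string exactly as written.
literal_hit = re.search(r"aabb", "12aabbxyz") is not None    # True
literal_miss = re.search(r"aabb", "abcabc") is not None      # False

# "a*b" needs a "b", optionally preceded by any number of "a"s.
star_hit = re.search(r"a*b", "aaaaabq") is not None          # True
star_miss = re.search(r"a*b", "xyz") is not None             # False, there is no "b"

# Escaping * with a backslash turns it into a literal asterisk.
escaped_hit = re.search(r"a\*b", "zxa*bcd") is not None      # True
escaped_miss = re.search(r"a\*b", "ab") is not None          # False
```

The r"..." prefix marks a raw string, which stops Python itself from interpreting the backslashes before the regex engine sees them.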

Quantifiers

The * metacharacter is an example of a quantifier - it says how many times we require what comes before it to appear in a string in order for the pattern to match. There are three quantifiers supported by most forms of regex you'll find today:
  • a* matches if we have zero or more of the "a" character together in the string. As it can match even when there are none of the character it follows, alone the pattern "a*" will match every string. Usually this quantifier is used as part of a bigger pattern or when the matched sequence is being extracted.
  • a+ matches if we have one or more of the "a" character together in a string. For example, it would match "a", "aaaa" and "batman", but not a string that does not contain the character "a".
  • a? matches if we have zero or one of the character "a".
We can start using these quantifiers together to build more interesting patterns. For example, the regex "a*bc+" would match zero or more of the character "a", followed by a "b", followed by one or more "c". Here, the string "bc" would match, as would "abc" and "bcc" and "aaabccc"; the strings "ac" or "b" would not.
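
The combined pattern can be tried out with a short sketch like the one below. It assumes Python's re module (any engine with these three quantifiers should behave the same way), and the "ab?c" pattern is an extra example added here to show the ? quantifier.

```python
import re

pattern = re.compile(r"a*bc+")

# Record, for each candidate string, whether the pattern is found in it.
candidates = ["bc", "abc", "bcc", "aaabccc", "ac", "b"]
results = {s: pattern.search(s) is not None for s in candidates}
# "bc", "abc", "bcc" and "aaabccc" are found; "ac" and "b" are not.

# "b?" allows zero or one "b" between the "a" and the "c".
optional_b = {s: re.search(r"ab?c", s) is not None for s in ["ac", "abc", "abbc"]}
# "ac" and "abc" are found; "abbc" is not.
```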

Quantifiers are (almost always) greedy by default. This means that an "a*" in a regex will cause the regex engine to attempt to match as many of the character "a" as it can, while still making the entire regex match (if it can). Why this matters will become clear soon.

Character Classes

Sometimes we want the pattern to match when we find one of a number of possible characters. For example, imagine we needed to write a regex to match a binary number. We know we want one or more of the characters 1 and 0, so the + quantifier seems like the right thing to use. Character classes allow us to state that both a 1 and a 0 are valid.

Character classes are usually defined using square brackets in which the characters that match are placed. For example, the regex "[01]" would match a 1 or a 0. A character class on its own (i.e. without a quantifier) will only ever match one character. To match a binary number, we could use the regex "[01]+".

It is sometimes useful to match any character that is not a certain character or in a certain set of characters. To do this, we can negate the character class by putting a ^ after the opening square bracket. For example, "[^a]" matches any character that is not an "a". The string "b" will match, but "a" will not. Remember that the string "baby" would match as there are characters in the string that are not "a"s.

There are a couple of other useful tricks with character classes that are found amongst many regex implementations. Ranges are a useful shorthand for matching any of a sequence of characters. They are written as the first character in the sequence, followed by a dash, followed by the last character in the sequence. For example, to match a (base 10) integer the regex "[0-9]+" could be used, which is equivalent to "[0123456789]+". Another example is a regex that matches any string containing an upper case letter or a digit, which would be written "[A-Z0-9]". Note that if you want to put a literal dash (-) in a character class, it should be escaped; many implementations also treat a dash placed first or last in the class as a literal.
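
The character class examples above can be sketched as follows, again assuming Python's re module. re.fullmatch (which requires the whole string to fit the pattern) and re.findall (which returns every matched substring) are used here only to make the results easy to inspect; the sample strings are made up.

```python
import re

# A binary number: the whole string must consist of 0s and 1s.
is_binary = re.fullmatch(r"[01]+", "10110") is not None       # True
not_binary = re.fullmatch(r"[01]+", "10210") is not None      # False

# A negated class matches any single character that is not an "a".
baby = re.search(r"[^a]", "baby").group(0)                    # "b"
only_a = re.search(r"[^a]", "aaa")                            # None

# Ranges: runs of digits, and "contains an upper case letter or a digit".
integers = re.findall(r"[0-9]+", "room 101, floor 7")         # ["101", "7"]
has_upper_or_digit = re.search(r"[A-Z0-9]", "hello World") is not None   # True
no_upper_or_digit = re.search(r"[A-Z0-9]", "hello world") is not None    # False

# An escaped dash inside a class is a literal dash.
dashed = re.findall(r"[a-z\-]+", "well-known fact")           # ["well-known", "fact"]
```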

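In most regex implementations, "." is a metacharacter and denotes a character class that matches any character (in many implementations, any character except a newline unless an option is set). For example, "a.c" would match "abc", "aac", and "aZc", but not "ac". Be careful to use a bare "." and not "[.]": inside square brackets the dot loses its special meaning and matches only a literal ".". If you want to match a literal "." outside a character class, escape it (i.e. \.).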
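
A brief sketch of the dot, assuming Python's re module. The newline behaviour and the re.DOTALL flag shown at the end are specific to Python; other engines have their own option for this.

```python
import re

# "." stands for exactly one character of any kind.
dot_hits = [s for s in ["abc", "aac", "aZc", "ac"] if re.search(r"a.c", s)]
# ["abc", "aac", "aZc"]; "ac" has nothing between the "a" and the "c"

# Inside square brackets, or when escaped, the dot is a literal ".".
class_dot = re.search(r"a[.]c", "abc")                        # None
escaped_dot = re.search(r"a\.c", "a.c") is not None           # True

# In Python the dot skips newlines unless re.DOTALL is given.
newline_default = re.search(r"a.c", "a\nc")                   # None
newline_dotall = re.search(r"a.c", "a\nc", re.DOTALL) is not None   # True
```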

Regex implementations often also contain some "escape sequences" that act as predefined character classes. For example, \d will match any digit, making the regex "\d+" equivalent to "[0-9]+" (if we ignore possible internationalisation issues - which we probably shouldn't). Here is a list of common escape sequences:
  • \d matches a digit
  • \w matches an alphanumeric character (i.e. a letter or a digit) and usually the underscore (_) too.
  • \s matches whitespace (space, tab, new line, formfeed, etc)
  • \D matches anything that is not a digit; \W and \S work the same way for \w and \s respectively.
  • \n matches a newline character
  • \r matches a carriage return character (a formfeed is usually written \f)
  • \t matches a tab character
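
The escape sequences listed above can be explored with a sketch such as this one, which assumes Python's re module and a made-up sample string. Note that in Python 3, \d and \w also match non-ASCII digits and letters by default, which is one of the internationalisation issues mentioned earlier.

```python
import re

text = "Order 66\tshipped on day 12\n"

digits = re.findall(r"\d+", text)        # ["66", "12"]
words = re.findall(r"\w+", text)         # ["Order", "66", "shipped", "on", "day", "12"]
spaces = re.findall(r"\s", text)         # [" ", "\t", " ", " ", " ", "\n"]

# The upper case forms are the negations: \D is anything that is not a digit.
non_digits = re.findall(r"\D+", "a1b22c")    # ["a", "b", "c"]

# \t and \n match the tab and newline characters themselves.
has_tab = re.search(r"\t", text) is not None         # True
ends_with_newline = re.search(r"\n$", text) is not None   # True
```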

Groups And Capturing

Sometimes it is useful to capture parts of what we match. Other times, it is useful to be able to group things together and then apply quantifiers to the group. Brackets are used for both of these. Grouping can be understood by thinking about the way brackets are used in mathematics. Imagine that a regex was needed that matched a string that contained the word badger at least once, and maybe a few times in a row. The regex "(badger)+" would do this, matching "badger" and "www.badgerbadgerbadger.com". Some regex implementations can handle nested groups (certainly Perl Compatible Regular Expressions can), but others can't.

When a regex matches and it has groups, the contents of those groups is captured and can be accessed later (how this is done varies between programming languages and implementations). Imagine we had a string that contained three fields delimited by commas, e.g. "abc,123,xyz", and we wanted to extract the middle field. The regex ",(.+)," could be applied to the string. The first capture would contain the string "123".
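
How captures are read back differs between languages; the following is a sketch of one way to do it, assuming Python's re module, where group(0) is the whole match and group(1) is the first capture. The four-field string is an extra example added to show how greediness affects ".+".

```python
import re

# Grouping: the + applies to the whole word "badger".
badgers = re.search(r"(badger)+", "www.badgerbadgerbadger.com")
whole_match = badgers.group(0)           # "badgerbadgerbadger"

# Capturing the middle field of three.
middle = re.search(r",(.+),", "abc,123,xyz").group(1)         # "123"

# With more fields, the greedy ".+" runs on to the last comma it can reach.
greedy_fields = re.search(r",(.+),", "a,b,c,d").group(1)      # "b,c"

# Excluding commas from the capture keeps it to a single field.
one_field = re.search(r",([^,]+),", "a,b,c,d").group(1)       # "b"
```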

Alternatives

Character classes are good for matching one of a number of characters. Alternatives enable matching one of a number of more complex expressions. Each alternative is separated by the pipe character. For example, to match a string that contains the word "cat" or the word "dog", the regex "cat|dog" could be used. To find out which was found, simply use capturing, e.g. "(cat|dog)". To match a literal pipe character, escape it with a backslash.
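
A minimal sketch of alternatives, assuming Python's re module and made-up sample sentences. When both words are present, the capture holds whichever one appears first in the string.

```python
import re

pets = re.compile(r"(cat|dog)")

found = pets.search("my dog chased the cat").group(1)    # "dog", the leftmost match
missing = pets.search("a bird in the hand")              # None

# An escaped pipe is an ordinary character.
literal_pipe = re.search(r"a\|b", "x a|b y") is not None     # True
```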

The Beginning, The End And The Boundary

It is often very useful to force a match to be anchored at the start or the end of the string. This is done using the ^ and $ metacharacters. These don't actually match any particular character; instead, they match if a particular condition is true. We call this a zero width assertion. ^ matches at the start of the string. $ matches at the end of the string.

For example, to extract the first 5 characters of a string, the regex "^(.....)" could be used. Matching against "stuffing" would capture "stuff". Likewise, "(.....)$" would capture the last 5 characters. Obviously, there are more efficient ways to do that than with regexes.

A more useful example is to check whether a string is a number, which could be done with the regex "^-?\d+(\.\d+)?$". The ^ matches the start of the string, so we know there is nothing before the number. "-?" means we can have a "-" at the start of the number, but we may not - this handles negative numbers. "\d+" is the integer part of the number - one or more digits. We may or may not have a fractional part. This would look like "\.\d+", with the "\." matching a literal "." and the "\d+" matching at least one digit. We may not have this at all, so we put it in brackets to group it and use a "?" on the end to say that group may or may not be there. Finally, the $ matches the end of the string, so we know there is nothing after the number.
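
The anchored patterns above can be sketched like this, assuming Python's re module; is_number is a hypothetical helper name introduced for the example. One Python-specific quirk worth knowing: $ also matches just before a single trailing newline, so "42\n" would be accepted by this check.

```python
import re

# Anchors pin the capture to the start or the end of the string.
first_five = re.search(r"^(.....)", "stuffing").group(1)     # "stuff"
last_five = re.search(r"(.....)$", "stuffing").group(1)      # "ffing"

NUMBER = re.compile(r"^-?\d+(\.\d+)?$")

def is_number(text):
    """Return True if text is an optionally negative integer or decimal."""
    return NUMBER.search(text) is not None

accepted = [s for s in ["42", "-3.14", "0.5"] if is_number(s)]
rejected = [s for s in ["3.", ".5", "12abc", "abc", ""] if not is_number(s)]
```

Here every string in the first list is accepted and every string in the second is rejected, so accepted and rejected end up equal to the lists they were filtered from.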

Some regex implementations also have \b, a zero width assertion that matches at a word boundary (i.e. the start or end of a word).
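
As a sketch of \b, assuming Python's re module: the raw string prefix matters here, because in an ordinary Python string "\b" means a backspace character rather than a word boundary.

```python
import re

# Without boundaries, "cat" is also found inside longer words.
loose = re.findall(r"cat", "cat concatenate cat.")        # ["cat", "cat", "cat"]

# With \b on both sides, only the standalone word is found.
strict = re.findall(r"\bcat\b", "cat concatenate cat.")   # ["cat", "cat"]

inside_word = re.search(r"\bcat\b", "concatenate")        # None
```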

Greediness And Minimal Matching

Matches are greedy by default. That means that if you have a string "aaaa" and match it against "(a*)(a*)", the first capture will greedily extract all of the a's, and the second capture will be empty. The + and * quantifiers are greedy, and forgetting this can make your regex fail or run slowly. Some implementations enable you to request minimal matching by adding a question mark after the quantifier, e.g. *?, +? or ??.
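
The difference between greedy and minimal matching can be seen in a short sketch, assuming Python's re module (one of the implementations that supports the *? form). The tag-like string is a made-up example of a common case where greediness surprises people.

```python
import re

# Greedy: the first group takes everything, leaving nothing for the second.
greedy_groups = re.match(r"(a*)(a*)", "aaaa").groups()     # ("aaaa", "")

# Minimal: the first group takes as little as it can, so the second gets the rest.
lazy_groups = re.match(r"(a*?)(a*)", "aaaa").groups()      # ("", "aaaa")

# Greedy ".+" runs to the last ">", minimal ".+?" stops at the first one.
greedy_tag = re.search(r"<.+>", "<b>bold</b>").group(0)    # "<b>bold</b>"
lazy_tag = re.search(r"<.+?>", "<b>bold</b>").group(0)     # "<b>"
```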

Other Tricks

Sometimes it can be useful to match a certain number of occurrences of a character or group. Some regex implementations provide an {n,m} quantifier construct for specifying the number of occurrences that are allowed. For example, to match exactly 6 "a"s, we could write "a{6}". For 6 or more "a"s, "a{6,}" would work. In implementations that allow the lower bound to be omitted, "a{,6}" would match up to, but no more than, 6; elsewhere, write "a{0,6}". To match between 6 and 10 "a"s, "a{6,10}" can be used.
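
A sketch of the counted quantifier, assuming Python's re module; count_ok is a hypothetical helper introduced for the example. re.fullmatch is used so that the whole string has to fit, since a search would happily find 6 "a"s inside a longer run. The upper bound only case is spelled "a{0,6}" here because that form works across implementations.

```python
import re

def count_ok(pattern, n):
    """Return True if a string of n "a"s fits the pattern from start to end."""
    return re.fullmatch(pattern, "a" * n) is not None

exactly_six = [n for n in range(12) if count_ok(r"a{6}", n)]        # [6]
six_or_more = [n for n in range(12) if count_ok(r"a{6,}", n)]       # [6, 7, 8, 9, 10, 11]
up_to_six = [n for n in range(12) if count_ok(r"a{0,6}", n)]        # [0, 1, 2, 3, 4, 5, 6]
six_to_ten = [n for n in range(12) if count_ok(r"a{6,10}", n)]      # [6, 7, 8, 9, 10]
```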

Some regex implementations support non-capturing groups. These are useful when you want to group, but not capture. The syntax often looks like "(?:pattern here)".

Some regex implementations support looking for existing captures later on in the pattern. The syntax for this is usually a backslash followed by the number of the capture, e.g. \1 for the first capture. For example, a regex to match repeated words would look like "\b(\w+)\s+\1\b". Here we match a word boundary, capture a word which we extract with \w+, then match some whitespace (e.g. the space between the words) and then have the word that we captured followed by a word boundary.
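
Both back references and non-capturing groups can be sketched briefly, assuming Python's re module, which supports the \1 and (?:...) syntax described above. The sample strings are made up.

```python
import re

repeated = re.compile(r"\b(\w+)\s+\1\b")

# The capture holds the word that appears twice in a row.
doubled = repeated.search("this is is a test").group(1)    # "is"
clean = repeated.search("no repeats here")                 # None

# A capturing group shows up in the results; a non-capturing group does not.
capturing = re.search(r"(ab)+(c)", "ababc").groups()       # ("ab", "c")
non_capturing = re.search(r"(?:ab)+(c)", "ababc").groups() # ("c",)
```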

Substitution

Once we have matched a string or part of a string against a pattern, it is often useful to replace that string or substring with something else. This is called substitution. The substitution may be based on characters captured in the match or just a hardcoded replacement. The details of doing substitutions vary between implementations and languages, and will not be covered here.
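
Purely as an illustration of the idea, and not as a general recipe, here is how substitution looks in one implementation, Python's re module. re.sub replaces every match unless a count is given, and \1, \2 in the replacement refer to captures. Other languages differ in both syntax and defaults.

```python
import re

# A hardcoded replacement applied to every match.
all_replaced = re.sub(r"b", "a", "abababab")               # "aaaaaaaa"

# Limiting the substitution to the first match only.
first_replaced = re.sub(r"b", "a", "abababab", count=1)    # "aaababab"

# A replacement built from captures: swap the two fields.
swapped = re.sub(r"(\w+),(\w+)", r"\2,\1", "abc,123")      # "123,abc"
```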

The Limits Of Regular Expressions

Regular expressions are rooted in regular language theory. It is beyond the scope of this article to go into exactly what that means, though there is another one that does. Basically, regular expressions are not good for all parsing tasks - trying to use regexes to determine whether a string has properly nested brackets will not work out, for example. If you are really struggling to come up with a regex for something, it may be that regexes are the wrong tool for the job.

Where From Here

Various regex engines will have their own quirks and additional features which have not been covered here - this tutorial tried to keep things as general as possible. A set of extra guides have been written to cover using regexes in a range of languages, and you may wish to refer to these as well as your programming language's or regex library's documentation.

  User Comments


Anonymous
new String("123")????
new String(<literal>) is a newbie mistake... just use the literal directly.
Anonymous

the comment above is incorrect
It is true that declaring variables unnecessarily is a frequent newbie mistake. However, in this case, it is necessary to have a String object to operate on.

For example, if you don't put the string in var text here, you can't use the replace method on it:

var text = "abababab";
var altered = text.replace(/b/, 'a');

Would you write that as
var altered = "abababab".replace(/b/, 'a');

I think not.

However, I do think that there is a typo in the telephone number example. The string is placed into a variable named "phone", so the second line should probably be:
var lastfour = phone.match(/\d{4}$/);

instead of
var lastfour = text.match(/\d{4}$/);

which (probably inadvertently) references a variable named "text" which is used in adjacent examples.