Regular Expressions In .NET

The Microsoft .NET Framework Class Library contains several classes relating to regexes in the System.Text.RegularExpressions namespace. These can be used from any .NET language. The examples in this tutorial are in C#, but the focus is on how to use the various classes that provide support for regexes in .NET. Armed with that understanding, it should be simple to translate the code examples to other .NET languages.

An Issue For Several .NET Languages

Many regexes use sequences such as \d to match a digit. Regexes in .NET are represented as strings, which is a problem in some .NET languages (C# included) because their compilers treat the backslash in "\d" as the start of a string escape sequence. The workaround is to double each backslash, e.g. "\\d". Be careful not to forget this: an unrecognized escape such as "\d" is rejected by the C# compiler, but a valid one such as "\b" (a backspace character in a C# string, a word boundary in a regex) compiles without complaint, and your regex then won't work as you expect at runtime.

Also note that in C# you can use a verbatim string literal instead, by putting an @ before the opening quote. Thus "\\d" and @"\d" are equivalent. To help with readability, this tutorial will use the @"..." form.
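
As a minimal sketch of the point above, the following program compares the two literal forms and shows the "\b" pitfall. The class name EscapeDemo and the helper Matches are made up for this illustration and are not part of .NET.

```csharp
using System;
using System.Text.RegularExpressions;

static class EscapeDemo
{
    // Hypothetical helper: true if the pattern finds a match in the input.
    public static bool Matches(string pattern, string input)
    {
        return Regex.IsMatch(input, pattern);
    }

    static void Main()
    {
        // Both literals produce the same two-character pattern: \d
        Console.WriteLine("\\d" == @"\d");                // True

        // "\b" compiles, but it is a backspace character, not a word boundary.
        Console.WriteLine(Matches("\bcat\b", "a cat"));   // False
        Console.WriteLine(Matches(@"\bcat\b", "a cat"));  // True
    }
}
```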

Matching A String Against A Pattern

There are three steps involved in checking if a string matches a particular regex. The first is to create an instance of the Regex class. An instance of Regex stores a single regular expression, which is passed as a string in its constructor.
// Create a regex object.
Regex r = new Regex(@"\d+");
The second step is to get a Match object. To obtain one, call the "Match" method of the Regex object, passing the string to match the regex against as a parameter.
// Get a match object.
Match m = r.Match("aa123bb");
The third and final step is to check the Match object's "Success" property, a boolean that is true if the match was successful.
// Check if a match was found.
if (m.Success)
	System.Console.WriteLine("Found a match!");
This is quite a few method calls if all that is needed is to test whether a string matches a particular regex. Thankfully, there are some shortcuts for this. If a Regex object has already been instantiated, the "IsMatch" method can be called on it. It takes the string to try to match, as the "Match" method does, but instead of returning a Match object it simply returns a boolean stating whether or not a match was found.
// Create a regex object.
Regex r = new Regex(@"\d+");
// Check if there's a match.
if (r.IsMatch("aa123bb"))
	System.Console.WriteLine("Found a match!");
There is an even shorter way than this in the case that no Regex object has been instantiated. A static method exists, also called "IsMatch", which takes two string parameters: the first is the string to match the regex against and the second is the regex.
// Check if there's a match.
if (Regex.IsMatch("aa123bb", @"\d+"))
	System.Console.WriteLine("Found a match!");


Extracting Matches

Having got a Match object by calling the "Match" method on a Regex object, it is possible to access the substring that matched through the "Value" property. The following example prints "Found a match: 123".
Regex r = new Regex(@"\d+");
Match m = r.Match("aa123vv");
if (m.Success)
	System.Console.WriteLine("Found a match: " + m.Value);
The "NextMatch" method of a Match object returns another Match object representing the next substring that matches, if one should exist. If there are no further matches, the returned object's "Success" property is false. The following example sets up a loop that retrieves all matches.
// Create a regex object.
Regex r = new Regex(@"\d+");
// Get initial Match object.
Match m = r.Match("aa123bb456cc789");
// Loop through all matches.
while (m.Success) {
	// Display this match.
	System.Console.WriteLine(m.Value);
	
	// Go to the next one.
	m = m.NextMatch();
}
// The output of this code will be:
// 123
// 456
// 789
It is also possible to obtain all of the matches at once as a MatchCollection. This is obtained by calling the "Matches" method on the Regex object rather than the "Match" method. The following code has the same end result as the previous example.
// Create a regex object.
Regex r = new Regex(@"\d+");
// Get all the matches.
MatchCollection mc = r.Matches("aa123bb456cc789");
// Iterate over the collection and display.
foreach (Match m in mc)
	System.Console.WriteLine(m.Value);


Extracting Groups And Captures

.NET regexes make a marked distinction between groups and captures. A group is something in the regex that could capture data, and a capture is a substring that was captured by that group. In some cases, these are the same thing. For example, in the regex "(c.t)", there is one group and it will only ever capture one substring per match. However, if a quantifier is placed after a group, e.g. "(c.t)+", things become somewhat more interesting as there may be many substrings captured by that group.

In .NET, each Match object holds a GroupCollection, which is a collection of Group objects. This is accessible through the "Groups" property of a Match object. Each Group object holds a CaptureCollection, which is a collection of Capture objects. This is accessible through the "Captures" property of a Group object. A Capture object contains the substring that was captured. Note that groups are numbered by the position of their opening parenthesis, from left to right, and that group 0 always holds the entire match. So for the regex "((a)b)(c)", the outer group "((a)b)" is the first, the inner group "(a)" the second and the group "(c)" the third.
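
The numbering of groups can be checked with a short sketch like the one below, which prints each group of the regex "((a)b)(c)" by index. The class GroupOrderDemo and the helper GroupValues are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupOrderDemo
{
    // Hypothetical helper: returns the value of every group of the
    // first match, indexed by group number.
    public static string[] GroupValues(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        string[] values = new string[m.Groups.Count];
        for (int i = 0; i < values.Length; i++)
            values[i] = m.Groups[i].Value;
        return values;
    }

    static void Main()
    {
        string[] values = GroupValues(@"((a)b)(c)", "abc");
        for (int i = 0; i < values.Length; i++)
            Console.WriteLine("Group " + i + ": " + values[i]);
        // Group 0: abc
        // Group 1: ab
        // Group 2: a
        // Group 3: c
    }
}
```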

The following example shows how GroupCollection and CaptureCollection can be used.
// Create a regex object.
Regex r = new Regex(@"(ab)+(bc)+");
// Get all the matches.
string text = "ababbc abbc abbcbcbc";
MatchCollection mc = r.Matches(text);
// Iterate over the matches.
foreach (Match m in mc) {
	// Get the GroupCollection for this match.
	GroupCollection gc = m.Groups;
	
	// Display a message.
	System.Console.WriteLine("The match " + m.Value +
	                         " has " + gc.Count + " groups.");
	
	// Iterate over the groups.
	foreach (Group g in gc) {
		// Display the group value.
		System.Console.WriteLine("--- " + g.Value);
		
		// Get the captures.
		CaptureCollection cc = g.Captures;
		
		// Display each one.
		foreach (Capture c in cc)
			System.Console.WriteLine("------- " + c.Value);
	}
}
The output of this program is shown below. Note that the first group always contains the entire substring matched, and that the "Value" property of the Group always corresponds to the final capture.
The match ababbc has 3 groups.
--- ababbc
------- ababbc
--- ab
------- ab
------- ab
--- bc
------- bc
The match abbc has 3 groups.
--- abbc
------- abbc
--- ab
------- ab
--- bc
------- bc
The match abbcbcbc has 3 groups.
--- abbcbcbc
------- abbcbcbc
--- ab
------- ab
--- bc
------- bc
------- bc
------- bc
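The relationship between a group's "Value" and its captures is easier to see when each repetition captures different text. The following is a minimal sketch under that assumption; the pattern (a\w)+, the class LastCaptureDemo and the helper Summarize are chosen for this illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LastCaptureDemo
{
    // Hypothetical helper: reports the Value of group 1 of the first match
    // alongside every substring that group captured.
    public static string Summarize(string pattern, string input)
    {
        Group g = Regex.Match(input, pattern).Groups[1];
        List<string> captures = new List<string>();
        foreach (Capture c in g.Captures)
            captures.Add(c.Value);
        return "Value=" + g.Value + " Captures=" + string.Join(",", captures.ToArray());
    }

    static void Main()
    {
        // The group repeats three times; Value holds only the last repetition.
        Console.WriteLine(Summarize(@"(a\w)+", "abadaf"));
        // Value=af Captures=ab,ad,af
    }
}
```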
Walking the CaptureCollection is rather long-winded when no groups are quantified. In that case each group captures at most one substring per match, so the GroupCollection can be used directly: the "Value" property of each group holds the captured text. The following example simply extracts all of the integers from a string.
// Create a regex object.
Regex r = new Regex(@"\b(\d+)\b");
// Get all the matches.
string text = "123 456 789";
MatchCollection mc = r.Matches(text);
// Iterate over the matches.
foreach (Match m in mc) {
	// Get the GroupCollection for this match.
	GroupCollection gc = m.Groups;
	
	// Display what was captured by the 1st group.
	System.Console.WriteLine(gc[1].Value);
}
The output of this program will be:
123
456
789


Modifiers

Modifiers change the way that a regex is matched against a string by changing the meaning of the regex itself. In .NET, these modifiers can be passed in an optional second parameter when we instantiate a Regex object. This parameter is created by ORing together one or more of the values in the RegexOptions enumeration. Some of the most commonly used options are:
  • IgnoreCase, which makes the match case insensitive. With this modifier the regex "net" would match the string ".NET"; without this modifier it would not.
  • IgnorePatternWhitespace, which makes the regex engine ignore unescaped whitespace in your pattern and allows you to add comments. Comments start with a # and end at the next linebreak. Note that inside an @"..." string the sequence \n is not a linebreak character, so the pattern must contain an actual line break to end a comment. This can help make regexes more readable.
  • Multiline, which makes the ^ and $ metacharacters match the start and end of a line rather than the start and end of the entire string.
  • Singleline, which makes the . metacharacter match a newline character. By default, it does not.
  • None, which specifies that no options are set. This is the default when the parameter is omitted.
The following example demonstrates the use of the IgnoreCase and IgnorePatternWhitespace options. With these options, the program prints "It matched!"; if they are removed, it doesn't print anything.
// Create a regex object.
Regex r = new Regex(@"\. net", 
                    RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
// Does it match?
if (r.IsMatch("I love .NET!"))
	System.Console.WriteLine("It matched!");
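
The Multiline and Singleline options described above can be sketched in the same way. This is an illustrative example only; the class OptionsDemo, the helper CountMatches and the sample text are assumptions made for this sketch.

```csharp
using System;
using System.Text.RegularExpressions;

static class OptionsDemo
{
    // Hypothetical helper: counts the matches of a pattern under the given options.
    public static int CountMatches(string pattern, string input, RegexOptions options)
    {
        return new Regex(pattern, options).Matches(input).Count;
    }

    static void Main()
    {
        // A regular (non-verbatim) string, so \n is a real newline character.
        string text = "one\ntwo\nthree";

        // Without Multiline, ^ only matches at the very start of the string.
        Console.WriteLine(CountMatches(@"^\w+", text, RegexOptions.None));          // 1
        // With Multiline, ^ also matches at the start of each line.
        Console.WriteLine(CountMatches(@"^\w+", text, RegexOptions.Multiline));     // 3

        // Without Singleline, . does not match the newline between the words.
        Console.WriteLine(CountMatches(@"one.two", text, RegexOptions.None));       // 0
        // With Singleline, it does.
        Console.WriteLine(CountMatches(@"one.two", text, RegexOptions.Singleline)); // 1
    }
}
```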


Substitutions

.NET provides two ways of replacing data that was matched by a regex, both involving the Regex class and objects instantiated from it. The first is to simply provide a text string to replace what was matched.
// Replace all numbers with 42.
string original = "12 47 83 100";
Regex r = new Regex(@"\d+");
string replaced = r.Replace(original, "42");
System.Console.WriteLine(replaced);  // Prints 42 42 42 42
A third parameter can be added specifying the number of replacements to make.
// Replace first two numbers with 42.
string original = "12 47 83 100";
Regex r = new Regex(@"\d+");
string replaced = r.Replace(original, "42", 2);
System.Console.WriteLine(replaced); // Prints 42 42 83 100
The second involves using a MatchEvaluator delegate. The method you supply through the delegate is called each time a match is found. It is passed the Match object and is expected to return a string, which will be substituted in place of the substring that was matched. The passed Match object can be used to extract the substring matched and any groups and captures that may be needed to build the replacement. The following example program finds all integers in a string and returns the same string with the numbers incremented by 1.
using System.Text.RegularExpressions;
class RegexExample
{
	// This static method is our match evaluator for
	// doing replacements.
	private static string increment (Match m)
	{
		// Take what was matched and convert it to an integer.
		int i = System.Int32.Parse(m.Value);
		
		// Increment.
		i++;
		
		// Return it as a string.
		return i.ToString();
	}
	
	public static void Main(string[] args)
	{
		// Here's our string.
		string input = "4 10 29 85 100";
		
		// Match all integers in it.
		Regex r = new Regex(@"\d+");
		
		// Do the replacement.
		string output = r.Replace(input, 
		                          new MatchEvaluator(RegexExample.increment));
		
		// Display.
		System.Console.WriteLine(output);
	}
}
The output of the program is "5 11 30 86 101".
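
In versions of C# that support lambda expressions (3.0 and later), the evaluator can also be written inline, since a lambda with a matching signature converts to a MatchEvaluator. The following is a sketch of the same increment example in that style, using the static Regex.Replace overload; the class LambdaEvaluatorDemo and the helper IncrementAll are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

static class LambdaEvaluatorDemo
{
    // Hypothetical helper: adds 1 to every integer found in the input.
    public static string IncrementAll(string input)
    {
        // The lambda is converted to a MatchEvaluator delegate.
        return Regex.Replace(input, @"\d+",
            m => (int.Parse(m.Value) + 1).ToString());
    }

    static void Main()
    {
        Console.WriteLine(IncrementAll("4 10 29 85 100")); // 5 11 30 86 101
    }
}
```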

Interpolating Variables

Sometimes it is useful to insert (interpolate) a user-specified string into a pattern. This can, however, be dangerous; if the user-specified string contains regex metacharacters then the user can manipulate what the regex matches. The Regex class provides a static method, "Escape", which takes a string and returns a copy of it with the metacharacters escaped.
// Unescaped test
string unescaped = "What? 1+1=3! No way.";
// Escape it.
string escaped = Regex.Escape(unescaped);
// Display; prints "What\?\ 1\+1=3!\ No\ way\."
System.Console.WriteLine(escaped);
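
To see why escaping matters, the sketch below searches for the user-supplied text "1+1" with and without escaping. The class EscapeInterpolationDemo, the helper CountLiteral and the sample input are assumptions made for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class EscapeInterpolationDemo
{
    // Hypothetical helper: counts literal occurrences of userText in input
    // by escaping it before using it as a pattern.
    public static int CountLiteral(string input, string userText)
    {
        string pattern = Regex.Escape(userText);
        return Regex.Matches(input, pattern).Count;
    }

    static void Main()
    {
        string input = "1+1=2, but 11=2 is not 1+1";

        // Unescaped, "1+1" means "one or more 1s followed by a 1",
        // so it only finds the "11" in the middle of the string.
        Console.WriteLine(Regex.Matches(input, "1+1").Count);  // 1

        // Escaped, it finds the two literal occurrences of "1+1".
        Console.WriteLine(CountLiteral(input, "1+1"));         // 2
    }
}
```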


Further Reading

See the MSDN documentation for the various classes, collections and enumerations in the System.Text.RegularExpressions namespace.
