Regular Expressions In JAVA

The JAVA class library provides two classes to support regexes in JAVA, namely Pattern and Matcher from the java.util.regex package. Pattern simply stores a regex that we want to match strings against. We can get a Matcher object by matching a Pattern against a String, and then use the Matcher object to see if the pattern matched, extract captures, etc.

An Annoying JAVA Issue

Many regexes use sequences such as \d to match a digit. As in many programming languages, a regex in JAVA is represented as a string. This is a problem because "\d" is seen by the JAVA compiler as a JAVA escape sequence, and the compiler will complain that it doesn't understand the escape sequence \d. Therefore, every backslash in the regex must be doubled in the string literal, e.g. "\\d". Be careful: \b is a valid JAVA escape sequence (a backspace character), so "\b" compiles without complaint, but the regex will then look for a backspace instead of a word boundary and won't behave as expected at runtime. Write "\\b" instead.
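As a rough illustration of the backslash issue, the sketch below compares a correctly doubled "\\b" with a single "\b". The class name EscapeDemo and the helper method found are made up for this example; only Pattern and Matcher come from the JAVA class library.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Hypothetical helper: true if the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "a cat sat";

        // "\\b" in JAVA source is the two characters \ and b, which the
        // regex engine reads as a word boundary.
        System.out.println(found("\\bcat\\b", input)); // true

        // "\b" in JAVA source is a single backspace character, so this
        // regex searches for backspace, "cat", backspace.
        System.out.println(found("\bcat\b", input));   // false
    }
}
```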

Matching A String Against A Pattern

There are three simple steps involved in checking if the data held in a String object matches a particular regex. The first is to instantiate a Pattern object for the regex, which can be done by using the static "compile" method of the Pattern class.
Pattern isInteger = Pattern.compile("\\d+");
Next, this needs to be matched against a String to get a Matcher object.
String test = "123";
Matcher m = isInteger.matcher(test);
Finally, the Matcher's "matches" method can be called to check if the string is matched by the pattern. It returns true if the whole string matches and false otherwise.
if (m.matches())
    System.out.println("This is an integer!");
Sometimes checking if the whole string matches the pattern isn't suitable. The Matcher's "lookingAt" method checks whether the pattern matches at the start of the string, without requiring the match to extend to the end of the string.
if (m.lookingAt())
    System.out.println("This starts with digits.");
To find matches anywhere in the string, the "find" method should be used. Each call looks for the next match, returning true if one is found and false when there are no more matches. A Matcher that has already been used continues from the end of its last successful match, so call its "reset" method first to search from the start again. The number of matches could be counted as follows:
int numMatches = 0;
while (m.find())
    numMatches++;
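To make the difference between the three methods concrete, here is a small sketch that runs each of them on a fresh Matcher for several inputs. The class name MatchKinds and its helper methods are invented for illustration; "matches" tests the whole string, "lookingAt" tests only the start of the string, and "find" searches anywhere.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchKinds {
    static final Pattern DIGITS = Pattern.compile("\\d+");

    // True only if the entire string is digits.
    static boolean whole(String s) {
        return DIGITS.matcher(s).matches();
    }

    // True if the string starts with digits.
    static boolean prefix(String s) {
        return DIGITS.matcher(s).lookingAt();
    }

    // Number of separate runs of digits anywhere in the string.
    static int count(String s) {
        Matcher m = DIGITS.matcher(s);
        int n = 0;
        while (m.find())
            n++;
        return n;
    }

    public static void main(String[] args) {
        String[] inputs = { "123", "123abc", "abc123", "1 and 22" };
        for (String s : inputs)
            System.out.println(s + ": matches=" + whole(s)
                    + ", lookingAt=" + prefix(s)
                    + ", find count=" + count(s));
    }
}
```

Each helper creates its own Matcher, so no call to "reset" is needed here.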
If a simple "does this entire string match this regex" test is needed (i.e. with no capturing, counting matches or searching for a matching substring), the three steps can be carried out with a single method call. The Pattern class has a static method named "matches". The first parameter is a String containing the regex, and the second parameter is the String to match against the regex. It returns true if the entire string matches and false otherwise, so the ^ and $ anchors are not needed. Thus we could simply have written:
String test = "123";
if (Pattern.matches("\\d+", test))
    System.out.println("This is an integer!");
The rest of this tutorial will not use this shortcut, because more advanced operations need the other methods of the Matcher object.

Extracting Matches

After a call (on the Matcher object) to the "matches", "lookingAt" or "find" methods has taken place, and provided true was returned, the "group" method can be used to get the substring that matched and anything captured using brackets. A call to "group" with no parameters or with the parameter 0 returns the substring that matched; if the "matches" method was used, that will be the entire string that is being matched against. The following example extracts numbers separated by non-alphanumeric characters. It is a complete JAVA program; save this code as Test.java, then try compiling and running it.
import java.util.regex.*;
class Test {
	public static void main(String[] argv) {
		// Here's the string.
		String fib = "1, 1, 2, 3, 5, 8, 13...";
		
		// Compile the pattern.
		Pattern p = Pattern.compile("\\b\\d+\\b");
		
		// Match it.
		Matcher m = p.matcher(fib);
		
		// Get all matches.
		while (m.find())
			System.out.println(m.group());
	}
}
To get captured substrings, pass the number of the group/capture to the "group" method. The following example takes a string containing some real numbers and extracts their integral and fractional parts.
import java.util.regex.*;
class Test {
	public static void main(String[] argv) {
		// Here's the string.
		String reals = "1.5 27.3 2.0 9.8";
		
		// Compile the pattern.
		Pattern p = Pattern.compile("\\b(\\d+)\\.(\\d+)\\b");
		
		// Match it.
		Matcher m = p.matcher(reals);
		
		// Get all matches.
		while (m.find())
			System.out.println("The number " + m.group() +
			                   " has integer part " + m.group(1) +
			                   " and fractional part " + m.group(2));
	}
}


Modifiers

Modifiers change the way that a regex is matched against a string by changing the meaning of the regex itself. In JAVA, these modifiers exist as flags, which can be combined with the bitwise OR operator (|) and passed as an optional second parameter to the static compile method of the Pattern class. The flags are static constants of the Pattern class.
  • Pattern.CASE_INSENSITIVE makes the regex case insensitive. For example, the regex "JAVA" would match the string "I love java!" with this flag. If Unicode is being used, also using the Pattern.UNICODE_CASE flag is advisable.
  • Pattern.COMMENTS makes the regex engine ignore whitespace in your pattern and allows comments, which start with a # and end at the next line break (e.g. \n). This helps make regexes more readable, which is advisable with all that double backslashing going on! Remember not to double the backslash in a \n that ends a comment, however: it must reach the regex engine as a real newline character.
  • Pattern.DOTALL makes the . metacharacter match a newline character. By default, it does not.
  • Pattern.MULTILINE makes the ^ and $ metacharacters match the start and end of a line rather than the start and end of the entire string.
The following example identifies imaginary integers or, for those who didn't get that far with maths, an integer followed by an i or a j that may be upper or lower case.
Pattern p = Pattern.compile("\\b \\d+ [ij] \\b",
        Pattern.CASE_INSENSITIVE | Pattern.COMMENTS);
Matcher m = p.matcher("5 + 3i");
if (m.find())
    System.out.println("Found an imaginary integer!");
In the above example, the substring that is matched will be "3i".
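The effect of the other flags can be sketched in the same way. The example below is only an illustration: the class name FlagsDemo, the helper method found and the sample text are assumptions made for this sketch. A flags value of 0 means that no modifiers are set.

```java
import java.util.regex.Pattern;

class FlagsDemo {
    // Hypothetical helper: compile with the given flags and search the input.
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String text = "first line\nsecond line";

        // Without MULTILINE, ^ only matches at the very start of the input.
        System.out.println(found("^second", 0, text));                 // false
        System.out.println(found("^second", Pattern.MULTILINE, text)); // true

        // Without DOTALL, . does not match the newline between the lines.
        System.out.println(found("line.second", 0, text));              // false
        System.out.println(found("line.second", Pattern.DOTALL, text)); // true

        // With COMMENTS, whitespace is ignored and # starts a comment that
        // runs up to a real newline ("\n", not "\\n").
        String regex = "\\d+   # one or more digits\n"
                     + "[ij]   # the imaginary unit";
        int flags = Pattern.COMMENTS | Pattern.CASE_INSENSITIVE;
        System.out.println(found(regex, flags, "5 + 3J"));              // true
    }
}
```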

Substitutions

Substituting a replacement for matched substrings in JAVA is done by calling the "replaceFirst" and "replaceAll" methods of the Matcher object. They simply take the replacement String and return a String that has the substitutions made on it; the original input String given to create the Matcher is not changed. The "replaceFirst" method only replaces the first match that is found; the "replaceAll" method replaces all matches. Captures from the match can be used in the replacement string. Use $1 to get the first capture, $2 to get the second capture, etc. The following example takes a list of real numbers and removes the fractional part.
import java.util.regex.*;
class Test {
	public static void main(String[] argv) {
		// Here's the string.
		String reals = "1.5 27.3 2.0 9.8";
		
		// Compile the pattern.
		Pattern p = Pattern.compile("\\b(\\d+)\\.\\d+\\b");
		
		// Match it.
		Matcher m = p.matcher(reals);
		
		// Get substituted string and display it.
		String sub = m.replaceAll("$1");
		System.out.println(sub);
	}
}
The output of this code is "1 27 2 9".
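For comparison, the following sketch shows "replaceFirst" and a replacement that uses two captures at once. The class name ReplaceDemo and its helper methods are invented for this example. It also uses Matcher.quoteReplacement, a static method of the Matcher class that is not otherwise covered in this tutorial, to show one way of inserting a literal $ into the output.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceDemo {
    static final Pattern REAL = Pattern.compile("\\b(\\d+)\\.(\\d+)\\b");

    // Removes the fractional part of the first real number only.
    static String truncateFirst(String s) {
        return REAL.matcher(s).replaceFirst("$1");
    }

    // Swaps the integer and fractional parts of every real number.
    static String swapParts(String s) {
        return REAL.matcher(s).replaceAll("$2.$1");
    }

    // Replaces every real number with the literal text $1.
    static String literalDollar(String s) {
        return REAL.matcher(s).replaceAll(Matcher.quoteReplacement("$1"));
    }

    public static void main(String[] args) {
        String reals = "1.5 27.3 2.0 9.8";
        System.out.println(truncateFirst(reals)); // 1 27.3 2.0 9.8
        System.out.println(swapParts(reals));     // 5.1 3.27 0.2 8.9
        System.out.println(literalDollar(reals)); // $1 $1 $1 $1
    }
}
```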

Interpolating Variables

Sometimes it is useful to use a string (often from user input) inside a regex. When doing this, it's important to remember that any characters in that string that are metacharacters to the regex engine will be seen as such. To use a string inside a regex as a literal, use the \Q...\E construct, remembering to double the backslashes in the JAVA string. For example:
String mysearch = "whatever";
Pattern p = Pattern.compile("\\b\\Q" + mysearch + "\\E\\b");
// And so on...
This works out fine until the user input contains \E, which ends the quoting early. The static "quote" method of the Pattern class handles that case: Pattern.quote(mysearch) returns a regex that matches the string literally.
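A minimal sketch of interpolating a string safely, assuming the static Pattern.quote method is used in place of hand-written \Q and \E. The class name QuoteDemo and the helper methods wholeWord and found are made up for this example.

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // Builds a pattern that finds the given text as a whole word, with
    // every character of the text treated literally.
    static Pattern wholeWord(String literal) {
        return Pattern.compile("\\b" + Pattern.quote(literal) + "\\b");
    }

    static boolean found(String literal, String input) {
        return wholeWord(literal).matcher(input).find();
    }

    public static void main(String[] args) {
        // The dot is literal here, so "abc" does not match "a.c".
        System.out.println(found("a.c", "abc"));       // false
        System.out.println(found("a.c", "x a.c y"));   // true

        // Search text containing \E is still matched literally.
        String tricky = "a\\E.c"; // the five characters a \ E . c
        System.out.println(found(tricky, "x a\\E.c y")); // true
        System.out.println(found(tricky, "x aXc y"));    // false
    }
}
```

The word boundaries only make sense when the search text starts and ends with a word character, as it does in these samples.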

Further Reading

See the documentation for the java.util.regex package, as well as the JAVA regular expressions lesson.
