Perl 6 FAQ - Regexes and Grammars
This FAQ is part of the Programmer's Heaven Perl 6 FAQ. It answers questions about the new Perl 6 regex syntax, how to use Perl 5 regex syntax in Perl 6, and what grammars are. This is not a "how to" for regexes in general; if you are totally new to them, please consider reading around the area first and coming back to this afterwards. The Programmer's Heaven Regex Area is one good place to start.
Why the name "Regex"? Why not just say "Regular Expression"?
The term "regular expression" is strictly defined in formal language theory. Modern implementations of "regular expressions" normally offer features for matching things that are not regular languages. This was the case in Perl 5 and is even more the case in Perl 6.
Therefore, for the sake of correctly using terminology, the Perl 6 designers are strongly encouraging the use of the term "regex". Happily, it is less to type too.
Argh, they changed the regex syntax! How do I use the Perl 5 regex syntax in Perl 6?
You can use the "Perl5" modifier to specify that a regex should be parsed using the Perl 5 regex syntax. However, note that you must use the new smartmatch operator ("~~").
    if $age ~~ m:Perl5/^\d+$/ {
        say "Valid age!";
    }
Also, modifiers (such as the case-insensitive modifier "i") must be brought to the front, as required in Perl 6, rather than placed on the end. That is, you must write:
    m:Perl5:i/^\d+$/
Rather than:
    m:Perl5/^\d+$/i
How is whitespace interpreted in Perl 6 regexes (or how do I match whitespace)?
Whitespace in Perl 6 regexes is not taken as literal whitespace that should be matched (that is, it is not part of the pattern). This means that these three patterns:
    /ab/
    /a b/
    /a   b/
Will all match the same thing: an "a" followed immediately by a "b". Hopefully this will encourage people to space out their regexes a bit so they are more readable.
There are a number of ways to match whitespace characters:
| <sp> | Matches a space |
| \n | Matches a new line character |
| \t | Matches a tab |
| \s | Matches any whitespace character (spaces, tabs and new line characters) |
| \h | Matches any horizontal whitespace character (spaces and tabs) |
| \v | Matches any vertical whitespace character (new line characters) |
| <ws> | Intelligent whitespace matching |
Intelligent whitespace matching may match "\s+" or "\s*". It decides which of these to match based upon the last character that was matched before the "<ws>" and the next character to be matched.
- If the character matched before the "<ws>" is alphanumeric AND the character that will be matched next is alphanumeric, "<ws>" will be equivalent to "\s+"
- In all other cases, it will be equivalent to "\s*"
For example, consider the following pattern:
    /<variable> <ws> <op> <ws> <variable>/
Where:
- <variable> will match variable names of the form "$a", "$monkey" and so on.
- <op> will match operators in the language, which includes "+" and "-" and so on, but also "cmp" and "and" (like we have in Perl 6).
This pattern will match all of the following strings:
    $a + $b
    $a-$b
    $a cmp $b
    $a cmp$b
    $a   +   $b
However, it will not match these strings:
    $acmp$b
    $acmp $b
Since in these cases, "$acmp" would just be a variable.
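As a rough check of the rule above, the following sketch assumes a current Rakudo (Raku) compiler, where a lexical regex is declared with "my regex". The "variable" and "op" regexes and the "is-expression" sub are hypothetical stand-ins written for this illustration, and the "^" and "$" anchors are added so that the whole string has to match.

```raku
# Hypothetical helper regexes, written only for this illustration.
my regex variable { '$' \w+ }
my regex op       { '+' | '-' | cmp | and }

# Returns True when the whole string has the form <variable> <op> <variable>.
sub is-expression(Str $s --> Bool) {
    so $s ~~ / ^ <variable> <ws> <op> <ws> <variable> $ /;
}

for '$a + $b', '$a-$b', '$a cmp $b', '$a cmp$b', '$acmp$b', '$acmp $b' -> $s {
    say "$s => ", is-expression($s);
}
```

The first four strings should report True and the last two False, because no whitespace separates the variable name from "cmp".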
What is the syntax for quantifiers?
A quantifier specifies how many times the pattern or group that precedes it can appear in the string being matched. Quantifiers are mostly unchanged from Perl 5 and most other implementations of regexes. The exception is that the "{n,m}" construct for matching between n and m times has been changed.
| ? | Zero or one (greedy) |
| * | Zero or more (greedy) |
| + | One or more (greedy) |
| ?? | Zero or one (minimal) |
| *? | Zero or more (minimal) |
| +? | One or more (minimal) |
| **{5..10} (NEW) | Match between 5 and 10 times (greedy) |
| **{5..10}? (NEW) | Match between 5 and 10 times (minimal) |
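To see the difference between greedy and minimal matching in practice, here is a small sketch. It assumes a current Rakudo (Raku) compiler, and the variable names are made up for the example. The range quantifier is left out because its spelling has changed between revisions of the specification.

```raku
my $digits = '1234567';

# A greedy quantifier takes as many digits as it can.
my $greedy = ~($digits ~~ / \d+ /);

# A minimal quantifier stops as soon as the pattern is satisfied.
my $minimal = ~($digits ~~ / \d+? /);

# A minimal "zero or more" is satisfied by matching nothing at all.
my $maybe-none = ~($digits ~~ / \d*? /);

say $greedy;            # 1234567
say $minimal;           # 1
say $maybe-none.chars;  # 0
```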
What is the syntax for alternation?
As in Perl 5, the vertical bar is used for alternation.
    if $pet_idea ~~ /dog|cat|fish|rabbit/ {
        say "You may have a $pet_idea";
    } else {
        say "You can't have a $pet_idea as a pet!";
    }
What is the syntax for character classes?
The syntax for character classes has changed since Perl 5. The characters making up a character class are now written between "<[" and "]>", for example:
    if $answer !~~ /<[yn]>/ { # Perl 5: /[yn]/
        say "You must answer (y)es or (n)o";
    }
Additionally, the syntax for ranges of characters has been changed; it is now consistent with the way that ranges are specified elsewhere in the Perl 6 language.
    if $grade !~~ /<[A..EU]>/ { # Perl 5: /[A-EU]/
        say "$grade is not a valid exam grade (enter A to E or U)";
    }
Note that the common built-in character classes, such as "\w", "\d" and "\s", are the same as they were in Perl 5.
How do I write modifiers on regexes?
Modifiers (for example, the case-insensitive modifier "i") must now be written at the start of a regex.
    if $text ~~ m:i/monkey/ {
        say "Found a case-insensitive monkey"
    }
If you need to supply multiple modifiers, write them one after another, each prefixed with a colon. For example, to apply the global and case-insensitive modifiers ("g" and "i"), write:
    m:i:g/monkey/
You can find a full list of modifiers in the specification.
Where did the single-line (s) and multi-line (m) modifiers go?
The "s" (single line) and "m" (multi-line) modifiers from Perl 5 were removed in Perl 6.
The "s" modifier is no longer needed since the "." metacharacter matches anything, including a newline, all of the time. (To match anything but a newline, use "\N".)
The "m" modifier is no longer needed due to changes to the "^" and "$" metacharacters and the introduction of "^^" and "$$".
| ^ | Always matches the start of the string |
| $ | Always matches the end of the string |
| ^^ | Always matches the start of a line |
| $$ | Always matches the end of a line |
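The anchors in the table, together with the "." and "\N" behavior described above, can be tried out with a short sketch like this one. It assumes a current Rakudo (Raku) compiler; the two-line sample string is invented for the example.

```raku
my $text = "alpha\nbeta";

say so $text ~~ / ^  beta /;         # False: "beta" is not at the start of the string
say so $text ~~ / ^^ beta /;         # True:  "beta" is at the start of a line
say so $text ~~ / alpha $  /;        # False: "alpha" is not at the end of the string
say so $text ~~ / alpha $$ /;        # True:  "alpha" is at the end of a line
say so $text ~~ / alpha .  beta /;   # True:  "." matches the newline
say so $text ~~ / alpha \N beta /;   # False: "\N" refuses the newline
```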
What is the syntax for groups?
In Perl 5, the syntax for groups was also the syntax for captures. Perl 6 uses square brackets to denote a group that does not capture, replacing the "(?: ...)" construct in Perl 5. The following Perl 6 regex:
    /[badger <ws>]+/
Will match one or more repetitions of the word "badger", each followed by whitespace as matched by "<ws>". The brackets only group; nothing is captured.
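The difference between grouping and capturing can be checked by counting positional captures after a match. This is only a sketch, assuming a current Rakudo (Raku) compiler; it uses "\s*" in place of "<ws>" to keep the capture count simple, and the variable names are invented for the example.

```raku
my $text = 'badger badger badger';

# Square brackets group without capturing.
$text ~~ / [badger \s*]+ /;
my $group-captures = $/.list.elems;

# Parentheses group and capture; each repetition is kept.
$text ~~ / (badger \s*)+ /;
my $paren-captures = $0.elems;

say $group-captures;   # 0
say $paren-captures;   # 3
```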
How do I use captures?
A capture is used to extract parts of the data that was matched by a regex. As in Perl 5, a capture is specified using parentheses. For example, to capture a number from a string, do something like this:
    my $text = 'I am 500 years old';
    $text ~~ /I <ws> am <ws> (\d+)/;
A pattern match will result in a Match object being created and placed in $/. You can then use this like an array reference in order to get at individual captures.
    my $text = 'I am 500 years old';
    $text ~~ /I <ws> am <ws> (\d+)/;
    say $/[0]; # 500
Aliases are created to this array as $0, $1, etc. This means you can also write:
    my $text = 'I am 500 years old';
    $text ~~ /I <ws> am <ws> (\d+)/;
    say $0; # 500
This is just as you would have done in Perl 5, except that captures are numbered from $0 in Perl 6 rather than from $1. If you want the entire string that was matched by the pattern, use the Match object in string context.
    my $text = 'I am 500 years old';
    $text ~~ /I <ws> am <ws> (\d+)/; # <ws> = whitespace
    say ~$/; # I am 500 (~ coerces to string)
In boolean context, the Match object evaluates to true if the match was successful and false if not.
    if $/ {
        say "It matched!";
    }
How do I do named captures?
As well as accessing captured data using $/ or $0, $1 and so on, Perl 6 allows you to give specific captures a name. You do this using the syntax "$<name> := (pattern)", and then access what was captured using "$<name>". For example:
    my $text = 'I am 500 years old.';
    $text ~~ /I <ws> am <ws> $<age>:=(\d+)/;
    say "You are $<age> years old?!"; # You are 500 years old?!
This can aid readability of programs using regexes.
How do I match the contents of a string literally in a regex (or are strings taken as part of the regex syntax in Perl 6)?
Unlike in Perl 5, when you use a variable inside a regex in Perl 6 it is passed directly to the regex engine and its contents are matched literally. They will not be treated as part of the regex syntax.
To demonstrate this difference, consider the following Perl 6 program:
    my $char_to_find = '.';
    my $text = "There is no dot in here";
    if $text ~~ /$char_to_find/ {
        say "Found a $char_to_find in $text";
    }
The message will not print since there is no "." in the string $text. However, if we "translate" this to Perl 5:
    my $char_to_find = '.';
    my $text = "There is no dot in here";
    if ($text =~ /$char_to_find/) {
        print "Found a $char_to_find in $text\n";
    }
Then the third line will behave as if we had written:
    if ($text =~ /./) {
And since the "." metacharacter matches anything, the message will print.
The Perl 6 behavior is safer and more likely to be what most people expect.
How do I give a regex a name so it can be re-used many times in a program?
Perl 6 makes it simple to give a name to a regex so you can refer to it later.
    regex price { \$ \d+ [\.\d**{2}]? };
You can then match against that regex by enclosing its name in angle brackets.
    if $sales_leaflet ~~ /<price>/ {
        say "Price: $/";
    } else {
        say "Could not find a price!";
    }
This not only allows for better code re-use, but is also much more self-documenting.
How do I "insert" one named regex into another?
You can insert one regex into another using angle brackets. For example, we can define a currency regex that matches currency symbols, then use it in a price regex.
    regex currency { \$ | £ | € | ¥ }; # \ before $ is escape
    regex price { <currency> \d+ [\.\d**{2}]? };
This makes it easy to reuse regexes, which, given their potential complexity, can only be a good thing.
It is also worth noting that it is possible to mention the regex currently being defined within itself. This enables you to use Perl 6 regexes to parse things like programming languages!
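As an illustration of a regex that mentions itself, here is a minimal sketch that recognizes balanced parentheses, something a strictly regular expression cannot do. It assumes a current Rakudo (Raku) compiler and puts the regexes in a grammar (grammars are covered in the next question). The grammar name "Parens" and the regex name "nested" are hypothetical, and "TOP" and ".parse" are conventions of current Raku rather than anything described in this FAQ.

```raku
# Hypothetical grammar for balanced parentheses; the names are illustrative.
grammar Parens {
    regex TOP    { <nested> }
    # A "(" then any mix of non-parenthesis text and further nested groups, then ")".
    regex nested { '(' [ <-[ \( \) ]>+ | <nested> ]* ')' }
}

say so Parens.parse('(a(b)c)');   # True
say so Parens.parse('((()))');    # True
say so Parens.parse('(a(b c');    # False
```

Because "nested" refers to itself, the nesting can go to any depth without the pattern having to spell out each level.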
What is a grammar?
Just as a class may contain many methods, a grammar may contain many regexes. Grammars are used to group together related regexes into their own namespace, so as to avoid naming conflicts.
    grammar Price {
        regex currency { \$ | £ | € | ¥ }
        regex price { <currency> \d+ [\.\d**{2}]? };
    }
Grammars can, like classes, inherit from each other. In this case, all regexes are inherited, but you can override them. Like classes, grammars are polymorphic too: you can use a sub-grammar in place of its base grammar. Unlike classes, you do not need to instantiate a grammar to use the regexes contained within it.
    grammar NamedCurrencyPrice is Price {
        # Override currency; price is inherited.
        regex currency { USD | GBP | EUR | JPY };
    }
Where can I find the Perl 6 regex specification?
This document, known as S05, can be found here.