Regular Expressions

Parentheses as memory

Parentheses in a regular expression can "remember" text matched by the subexpression they enclose, in a way that enables later use of that text in the program.

How does this work?

We have seen that a regular expression or pattern usually matches a group of strings within which several variations are allowed.

Sometimes we wish to know exactly which sequence of characters in the searched string matched the regular expression.

If we enclose part of the regular expression in parentheses and the match succeeds, the sequence of characters in the searched string that matched that part is automatically assigned to a variable named $1.

If we enclose several parts of the regular expression in parentheses, their matched substrings in the searched string are assigned to $1, $2, $3, etc., numbered from left to right by opening parenthesis.
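
As a minimal sketch of this numbering (the sample string and the variable names $text, $first and $second are made up for illustration), the following checks that the match succeeded before reading $1 and $2, since those variables are only set by a successful match:

```perl
use strict;
use warnings;

# Hypothetical sample input, chosen only for illustration.
my $text = "width=640 height=480";

my ($first, $second);

# Groups are numbered left to right by their opening parenthesis.
if ($text =~ /width=(\d+) height=(\d+)/) {
    ($first, $second) = ($1, $2);
    print "first group:  $first\n";
    print "second group: $second\n";
} else {
    print "no match, so \$1 and \$2 should not be used\n";
}
```

Here the first pair of parentheses captures 640 and the second captures 480.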

Example 1

Recall the example in which we asked the user to enter date and time in a given format.

Let us now extract from the entry the day, month, year, hour and minutes.


print "Please enter date and time, as in \"08-OCT-1997  16:30\"\n";
my $entry = <STDIN>;
chomp($entry);    # remove the trailing newline, if there is one

if ($entry =~ /(\d\d)-(\w\w\w)-(\d\d\d\d)  (\d\d):(\d\d)/) {

    # $1 now contains the day
    # $2     contains the month
    # $3     contains the year
    # $4     contains the hour
    # $5     contains the minutes

    # for example, to print the month we would write:

    print "Month: $2\n";
}

Note: as will be shown later, you can assign the substrings matched by the parenthesized parts directly to variables named $day, $month, etc.
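
As a preview of that direct assignment, here is a minimal sketch. It assumes a hard-coded sample entry instead of reading from STDIN, and the variable names are only illustrative. In list context, a successful match returns the captured substrings, and a failed match returns an empty list:

```perl
use strict;
use warnings;

# Sample entry in the requested format, hard-coded for this sketch.
my $entry = "08-OCT-1997  16:30";

# In list context the match returns the captured substrings in order,
# so they can be assigned to named variables in a single step.
my ($day, $month, $year, $hour, $minutes) =
    $entry =~ /(\d\d)-(\w\w\w)-(\d\d\d\d)  (\d\d):(\d\d)/;

if (defined $day) {
    print "Day: $day, Month: $month, Year: $year\n";
    print "Time: ${hour}:${minutes}\n";
} else {
    print "The entry does not match the expected format.\n";
}
```

If the match fails, all five variables are left undefined, which is why the sketch tests $day before printing.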

Example 2

Given an HTML text containing a link tag, extract the URL.

The link:
List of Lecture Slides

The HTML source:
List of <A HREF="index.html">Lecture Slides</A>

The Perl program:


my $html = "List of <A HREF=\"index.html\">Lecture Slides</A>.";

# [^"]* stops at the closing quote, so only this link's URL is captured
if ($html =~ /<A HREF="([^"]*)">/) {
    print "URL: $1\n";
}

The output:
URL: index.html
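
A caution about the captured part: a greedy .* matches as much as it can, so on a line with more than one link it would run on to the last "> rather than stop at the first. The sketch below uses a made-up two-link string (not part of the original example) to compare .* with the more restrictive [^"]*:

```perl
use strict;
use warnings;

# Hypothetical line containing two links, used only to show the difference.
my $html = '<A HREF="a.html">One</A> and <A HREF="b.html">Two</A>';

# Greedy: .* extends to the last "> in the string.
my ($greedy) = $html =~ /<A HREF="(.*)">/;

# Restricted: [^"]* cannot cross a double quote, so it stops at the first one.
my ($limited) = $html =~ /<A HREF="([^"]*)">/;

print "greedy:  $greedy\n";
print "limited: $limited\n";
```

With this input, the greedy version captures everything from a.html up to and including b.html, while the restricted version captures only a.html.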
