Regular Expressions


A quantifier applies to the item immediately before it in the regular expression (a single character or a character class such as [AG]) and determines how many times that item must or may occur.

If you want the quantifier to affect a sequence of characters, enclose those characters in parentheses.

The quantifiers are:

{n}      Must occur exactly n times
{n,m}    Must occur at least n times but no more than m times
{n,}     Must occur at least n times
*        0 or more times (same as {0,})
+        1 or more times (same as {1,})
?        0 or 1 time (same as {0,1})
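
The following is a minimal sketch of how these quantifiers behave, including the effect of parentheses. The helper name matches and the test strings are made up for illustration; the patterns are anchored with ^ and $ so that the whole string must match.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $string matches $pattern, else 0.
sub matches {
    my ($string, $pattern) = @_;
    return $string =~ $pattern ? 1 : 0;
}

# A quantifier binds only to the single item before it ...
print matches("abbb", qr/^ab+$/),   "\n";   # 1: "a", then one or more "b"
print matches("abab", qr/^ab+$/),   "\n";   # 0
# ... unless parentheses group a longer sequence.
print matches("abab", qr/^(ab)+$/), "\n";   # 1

# The shorthand forms agree with their {n,m} equivalents.
print matches("",   qr/^x*$/), " ", matches("",   qr/^x{0,}$/),  "\n";   # 1 1
print matches("",   qr/^x+$/), " ", matches("",   qr/^x{1,}$/),  "\n";   # 0 0
print matches("xx", qr/^x?$/), " ", matches("xx", qr/^x{0,1}$/), "\n";   # 0 0
```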

Example 1

We would like to find out whether the consensus sequence (ACCCC, followed by three bases that are each A or G, followed by GTGT)
is contained (somewhere) in a given sequence $a.

Without quantifiers:
if ($a =~ /ACCCC[AG][AG][AG]GTGT/) {...};
With quantifiers:
if ($a =~ /AC{4}[AG]{3}(GT){2}/) {...};
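
As a quick check that the two patterns are interchangeable, the sketch below runs both against the same input. The helper check_both and the two test sequences are invented for illustration and are not real data.

```perl
use strict;
use warnings;

my $long  = qr/ACCCC[AG][AG][AG]GTGT/;   # without quantifiers
my $short = qr/AC{4}[AG]{3}(GT){2}/;     # with quantifiers

# Hypothetical helper: report whether each pattern matches $dna.
sub check_both {
    my ($dna) = @_;
    return (($dna =~ $long) ? 1 : 0, ($dna =~ $short) ? 1 : 0);
}

# Made-up sequences: the second one has only three Cs after the first A.
for my $dna ("TTACCCCAGAGTGTCC", "TTACCCAGAGTGTCC") {
    my ($l, $s) = check_both($dna);
    print "$dna: long=$l short=$s\n";
}
# prints:
# TTACCCCAGAGTGTCC: long=1 short=1
# TTACCCAGAGTGTCC: long=0 short=0
```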

Example 2

The date and time example from the previous slide will look much nicer if we use quantifiers:

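print "Please enter date and time, as in \"08-OCT-1997  16:30\"\n";
my $entry = <STDIN>;
chomp($entry);   # remove the trailing newline (chop would delete any last character)

# two digits, dash, three word characters, dash, four digits, two spaces, hh:mm
if ($entry =~ /\d{2}-\w{3}-\d{4}  \d{2}:\d{2}/) {
   print "good!\n";
} else {
   print "wrong format!\n";
}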

Example 3

To check whether a given sequence contains two or more consecutive repeats of the GATA tetranucleotide, write:
if ($seq =~ /(GATA){2,}/) {  }

# note that we enclosed the sequence to be repeated in parentheses
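
One point worth spelling out is that {2,} counts back-to-back copies only. The sketch below wraps the pattern in a hypothetical helper, has_gata_repeat, and tries it on made-up sequences.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if $seq contains GATA at least twice in a row.
sub has_gata_repeat {
    my ($seq) = @_;
    return $seq =~ /(GATA){2,}/ ? 1 : 0;
}

print has_gata_repeat("CCGATAGATACC"), "\n";   # 1: two adjacent copies
print has_gata_repeat("CCGATACCGATA"), "\n";   # 0: two copies, but not adjacent
print has_gata_repeat("CCGATACC"),     "\n";   # 0: only one copy
```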

Example 4

The Genome Database accession IDs are composed of the characters GDB: followed by several digits (see example).
To check whether a Genome Database accession ID is entered correctly, use the following conditional:
if ($entry =~ /GDB:\d+/) {  }

# i.e. "GDB:" followed by one or more digits
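
Note that /GDB:\d+/ succeeds if an ID appears anywhere in $entry. If the entry should consist of the ID and nothing else, one option is to anchor the pattern, as in this sketch. The helper is_gdb_id and the ID GDB:120231 are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if $entry is exactly "GDB:" plus one or more digits.
sub is_gdb_id {
    my ($entry) = @_;
    return $entry =~ /^GDB:\d+$/ ? 1 : 0;   # ^ and $ anchor both ends
}

print is_gdb_id("GDB:120231"),      "\n";   # 1
print is_gdb_id("GDB:"),            "\n";   # 0: no digits
print is_gdb_id("see GDB:120231x"), "\n";   # 0: extra text around the ID
```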

Example 5

To check whether a sentence contains either the word "color" or "colour", write:
if ($sentence =~ /colou?r/) {  }

# the question mark here denotes an optional "u"
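
A short sketch of the same pattern applied to a few made-up sentences; the helper name mentions_colour is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if $sentence contains "color" or "colour".
sub mentions_colour {
    my ($sentence) = @_;
    return $sentence =~ /colou?r/ ? 1 : 0;
}

print mentions_colour("What color is it?"),  "\n";   # 1
print mentions_colour("What colour is it?"), "\n";   # 1
print mentions_colour("What colouur is it?"), "\n";  # 0: "u" may occur at most once
```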

Example 6

Suppose we want to accept extra whitespace inside HTML tags,
treating < TITLE    > and <\tTITLE> the same as <TITLE>.
To check whether an HTML text contains the TITLE tag, write:

if ($text =~ /<\s*TITLE\s*>/) {  }

# the word "TITLE" may optionally be surrounded by any number
# of spaces, tabs etc.
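
The sketch below tries the pattern on the three spellings mentioned above, plus one that should fail. The helper has_title_tag is a made-up name; "\t" inside the double-quoted Perl string is a real tab character.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if $text contains a TITLE tag, allowing whitespace
# between the angle brackets and the word TITLE.
sub has_title_tag {
    my ($text) = @_;
    return $text =~ /<\s*TITLE\s*>/ ? 1 : 0;
}

for my $text ("<TITLE>", "< TITLE    >", "<\tTITLE>", "<TI TLE>") {
    print has_title_tag($text), "\n";   # prints 1, 1, 1, 0 in turn
}
```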
