When to use "regular expressions"?
Let's say that you opened a file and are checking it line by line for a
given text. If you are checking each line for an exact match, such as
the string "Here is the end", then you can test it inside a loop with
'eq', as in the following example:
if ($line eq "Here is the end") { ... }
In real life, however, we often want to check only part of the incoming
line, or to look for a given pattern (always two digits following a
letter, for example). For this kind of job we need regular expressions.
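As a minimal sketch of the difference, the following compares an exact
test with 'eq' against a pattern test with '=~'. The sample lines and
the sub name has_letter_two_digits are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the line contains a letter
# immediately followed by two digits (for example "B42").
sub has_letter_two_digits {
    my ($line) = @_;
    return $line =~ /[A-Za-z]\d\d/ ? 1 : 0;
}

# Made-up sample lines standing in for lines read from a file.
my @lines = ("Here is the end", "Part B42 shipped", "no code here");

for my $line (@lines) {
    if ($line eq "Here is the end") {
        print "exact match: $line\n";
    }
    elsif (has_letter_two_digits($line)) {
        print "pattern match: $line\n";
    }
    else {
        print "no match: $line\n";
    }
}
```

The 'eq' test only succeeds when the whole line is identical to the
string, while the pattern test succeeds wherever the pattern occurs
inside the line.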
What are "regular expressions"?
Regular expressions are a way of describing a PATTERN - for example,
"all the words that begin with the letter A" or "every 10-digit phone
number".
How do we create a PATTERN?
Creating a PATTERN suitable for the task is the fun part.
When testing for actual characters, like the 'A' in the first example,
we will directly use the same character A. In other cases, like when
testing for a digit, we can use special symbols (metacharacters) that
indicate a group of similar characters ('\d' indicates a digit) or that
match a special position in the string ('^' matches the beginning of
the line).
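A short sketch of these three building blocks (a literal character, a
character-group metacharacter and a position metacharacter) might look
like this. The sample string is an assumption chosen for illustration.

```perl
use strict;
use warnings;

my $text = "Apple 7";   # made-up sample string

my $has_A        = ($text =~ /A/)   ? 1 : 0;  # literal A anywhere: 1
my $has_digit    = ($text =~ /\d/)  ? 1 : 0;  # a digit anywhere: 1
my $starts_A     = ($text =~ /^A/)  ? 1 : 0;  # A at the beginning: 1
my $starts_digit = ($text =~ /^\d/) ? 1 : 0;  # digit at the beginning: 0

print "$has_A $has_digit $starts_A $starts_digit\n";  # prints "1 1 1 0"
```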
The 'full' metacharacter table is as follows:
\w  Match a "word" character (alphanumeric plus "_")
\W  Match a non-word character
\s  Match a whitespace character
\S  Match a non-whitespace character
\d  Match a digit character
\D  Match a non-digit character
\   Quote the next metacharacter
^   Match the beginning of the line
.   Match any character (except newline)
$   Match the end of the line (or before newline at the end)
|   Alternation
()  Grouping
[]  Character class
\b  Match a word boundary
\B  Match a non-(word boundary)
\A  Match only at beginning of string
\Z  Match only at end of string, or before newline at the end
\z  Match only at end of string
\G  Match only where previous m//g left off (works only with /g)
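Some entries in the table are easier to tell apart with an example. The
sketch below, using made-up sample strings, contrasts '$', '\Z' and
'\z' on a string that ends with a newline, shows what '\b' adds, and
uses grouping, alternation and a character class.

```perl
use strict;
use warnings;

my $line = "end\n";   # sample string with a trailing newline

# '$' and \Z can match before a trailing newline; \z cannot.
my $dollar  = ($line =~ /end$/)  ? 1 : 0;   # 1
my $big_z   = ($line =~ /end\Z/) ? 1 : 0;   # 1
my $small_z = ($line =~ /end\z/) ? 1 : 0;   # 0

# \b keeps "cat" from matching inside a longer word.
my $whole_word = ("concatenate" =~ /\bcat\b/) ? 1 : 0;   # 0
my $substring  = ("concatenate" =~ /cat/)     ? 1 : 0;   # 1

# Grouping with alternation, and the same idea as a character class.
my $alt   = ("gray" =~ /^gr(a|e)y$/) ? 1 : 0;   # 1
my $class = ("grey" =~ /^gr[ae]y$/)  ? 1 : 0;   # 1

print "$dollar $big_z $small_z $whole_word $substring $alt $class\n";
```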
Going back to our example, "^A" will represent "a character A at the
beginning of the line". To represent the 10-digit number, we can either
write "\d\d\d\d\d\d\d\d\d\d" or, better, use a quantifier to specify
that we want a digit repeated 10 times: "\d{10}".
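To check that the two spellings behave the same, here is a small
sketch. The anchors '^' and '\z' are an addition so that the whole
string must consist of exactly ten digits; the sample strings and the
helper name is_ten_digits are hypothetical.

```perl
use strict;
use warnings;

my $spelled_out = '\d' x 10;                # "\d" written ten times
my $long_re     = qr/^${spelled_out}\z/;
my $short_re    = qr/^\d{10}\z/;            # the same with a quantifier

# Hypothetical helper built on the short form.
sub is_ten_digits {
    my ($s) = @_;
    return $s =~ $short_re ? 1 : 0;
}

for my $s ("5551234567", "555123456", "A551234567") {
    my $long  = ($s =~ $long_re) ? 1 : 0;
    my $short = is_ten_digits($s);
    print "$s: long=$long short=$short\n";
}
```

Without the anchors, "\d{10}" would also match inside a longer run of
digits, which may or may not be what is wanted.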
The following standard quantifiers are recognized:
*      Match 0 or more times
+      Match 1 or more times
?      Match 1 or 0 times
{n}    Match exactly n times
{n,}   Match at least n times
{n,m}  Match at least n but not more than m times
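The quantifiers can be tried out on a few made-up strings. In this
sketch each one is applied to the letter b between an a and a c, and
the patterns are anchored so that the whole string has to match.

```perl
use strict;
use warnings;

my $star  = ("ac"     =~ /^ab*c$/)     ? 1 : 0;  # 1: zero b's are allowed
my $plus  = ("ac"     =~ /^ab+c$/)     ? 1 : 0;  # 0: at least one b needed
my $quest = ("abbc"   =~ /^ab?c$/)     ? 1 : 0;  # 0: at most one b allowed
my $exact = ("abbc"   =~ /^ab{2}c$/)   ? 1 : 0;  # 1: exactly two b's
my $least = ("abbbc"  =~ /^ab{2,}c$/)  ? 1 : 0;  # 1: two or more b's
my $range = ("abbbbc" =~ /^ab{1,3}c$/) ? 1 : 0;  # 0: four b's is too many

print "$star $plus $quest $exact $least $range\n";  # prints "1 0 0 1 1 0"
```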
Where do we use "regular expressions"?
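Regular expressions are used to m/atch/ and s/ubsti/tute/ text. The
tr/ans/late/ operator is covered here as well because it is bound to a
string in the same way, although it works on lists of characters rather
than on a PATTERN. In the following text, PATTERN means a group of
characters, metacharacters and quantifiers as described above.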
m/PATTERN/cgimosx
/PATTERN/cgimosx
Searches a string for a pattern match, and in scalar context returns
true (1) or false (''). If no string is specified via the =~ or !~
operator, the $_ string is searched.
Options are:
c Do not reset search position on a failed match when /g is in effect.
g Match globally, i.e., find all occurrences.
i Do case-insensitive pattern matching.
m Treat string as multiple lines ('^' and '$' also match at line breaks).
o Compile pattern only once.
s Treat string as single line ('.' also matches a newline).
x Use extended regular expressions (whitespace and comments allowed).
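The effect of the most common options can be seen in a small sketch
like the following. All sample strings and variable names are made up
for illustration.

```perl
use strict;
use warnings;

my $text = "Perl perl PERL";

# /i ignores case; /g in list context returns every match.
my @all   = $text =~ /perl/gi;
my $count = scalar @all;                              # 3

# /m lets ^ and $ match at line boundaries inside the string.
my $with_m    = ("one\ntwo" =~ /^two$/m) ? 1 : 0;     # 1
my $without_m = ("one\ntwo" =~ /^two$/)  ? 1 : 0;     # 0

# /s lets . match a newline as well.
my $with_s    = ("a\nb" =~ /a.b/s) ? 1 : 0;           # 1
my $without_s = ("a\nb" =~ /a.b/)  ? 1 : 0;           # 0

# /x allows whitespace and comments inside the pattern.
my $date_re = qr/
    ^ (\d{4})    # four-digit year
    -            # literal hyphen
    (\d{2})      # two-digit month
    $
/x;
my ($year, $month) = "2024-01" =~ $date_re;

print "$count $with_m $without_m $with_s $without_s $year $month\n";
```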
Examples:
if ($line =~ /^A/) { ... }
if (/Version: *([0-9.]*)/) { $version = $1; }
if (($F1, $F2, $Etc) = ($foo =~ /^(\S+)\s+(\S+)\s*(.*)/)) { ... }
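To see what these examples capture, they can be run against sample
data. The input values below are assumptions chosen for illustration;
the patterns are the ones shown above.

```perl
use strict;
use warnings;

my $line = "Apple pie";                               # sample value
my $starts_with_A = ($line =~ /^A/) ? 1 : 0;          # 1

# With no =~, the match is applied to $_.
my $version;
$_ = "Version: 5.36.0";                               # sample value
if (/Version: *([0-9.]*)/) { $version = $1; }         # "5.36.0"

# In list context a match returns the captured groups.
my $foo = "alpha beta gamma delta";                   # sample value
my ($F1, $F2, $Etc);
if (($F1, $F2, $Etc) = ($foo =~ /^(\S+)\s+(\S+)\s*(.*)/)) {
    print "F1=$F1 F2=$F2 Etc=$Etc\n";
}
print "version=$version\n";
```

The third form is handy for splitting a line into its first two words
and "everything else" in a single step.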
s/PATTERN/REPLACEMENT/egimosx
Searches a string for a pattern, and if found, replaces that pattern
with the replacement text and returns the number of substitutions made.
Otherwise it returns false (specifically, the empty string).
Options are:
e Evaluate the right side as an expression.
g Replace globally, i.e., all occurrences.
i Do case-insensitive pattern matching.
m Treat string as multiple lines ('^' and '$' also match at line breaks).
o Compile pattern only once.
s Treat string as single line ('.' also matches a newline).
x Use extended regular expressions (whitespace and comments allowed).
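A brief sketch of the 'g', 'i' and 'e' options, using made-up strings,
could look like this:

```perl
use strict;
use warnings;

my $text = "Cat cat CAT";                   # sample value

# No options: only the first case-sensitive match is replaced.
my $first_only = $text;
$first_only =~ s/cat/dog/;                  # "Cat dog CAT"

# /g replaces every match, /i ignores case.
my $global_ci = $text;
$global_ci =~ s/cat/dog/gi;                 # "dog dog dog"

# /e evaluates the replacement as Perl code for each match.
my $doubled = "3 apples and 4 pears";
$doubled =~ s/(\d+)/$1 * 2/ge;              # "6 apples and 8 pears"

print "$first_only\n$global_ci\n$doubled\n";
```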
Examples:
$line =~ s/^A/B/;
$path =~ s|/usr/bin|/usr/local/bin|; # here we use '|' as delimiter
($foo = $bar) =~ s/this/that/; # copy first, then change
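The return value and the copy-then-change idiom are easier to follow
with concrete data. The values below are assumptions used only for this
sketch.

```perl
use strict;
use warnings;

my $line = "Apple";                                   # sample value
my $changed = ($line =~ s/^A/B/);                     # 1, $line is "Bpple"

my $path = "/usr/bin/perl";                           # sample value
my $n = ($path =~ s|/usr/bin|/usr/local/bin|);        # 1

# Copy first, then change: $bar itself is left untouched.
my $bar = "this and this";
(my $foo = $bar) =~ s/this/that/;                     # "that and this"

# A substitution that finds nothing returns the empty string.
my $none = ($bar =~ s/nothing/something/);

print "$line $path\n$foo | $bar\n";
```

Choosing '|' as the delimiter avoids having to escape every '/' in the
path.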
tr/SEARCHLIST/REPLACEMENTLIST/cds
y/SEARCHLIST/REPLACEMENTLIST/cds
Transliterates all occurrences of the characters found in the search
list with the corresponding character in the replacement list. It
returns the number of characters replaced or deleted. If no string is
specified via the =~ or !~ operator, the $_ string is transliterated.
Options:
c Complement the SEARCHLIST (the last character on REPLACEMENTLIST is repeated)
d Delete found but unreplaced characters.
s Squash duplicate replaced characters.
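The three options can be illustrated with a short sketch. The sample
strings and variable names are made up; each copy is made first so the
original literal is not needed again.

```perl
use strict;
use warnings;

# /d deletes characters that have no replacement.
(my $no_vowels = "regular") =~ tr/aeiou//d;           # "rglr"

# /c complements the search list; with /d, everything
# that is NOT a digit is deleted.
(my $digits_only = "tel: 555-1234") =~ tr/0-9//cd;    # "5551234"

# /c with a replacement: every non-alphanumeric becomes "_".
(my $masked = "id=42;x") =~ tr/a-zA-Z0-9/_/c;         # "id_42_x"

# /s squashes runs of the same replaced character.
(my $squashed = "aaabbbccc") =~ tr/a-z//s;            # "abc"

print "$no_vowels $digits_only $masked $squashed\n";
```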
Examples:
$ARGV[1] =~ tr/A-Z/a-z/; # canonicalize to lower case
$cnt = tr/*/*/; # count the stars in $_
$cnt = $sky =~ tr/*/*/; # count the stars in $sky
$cnt = tr/0-9//; # count the digits in $_
tr/a-zA-Z//s; # bookkeeper -> bokeper
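Run against sample data, the examples above behave as follows. The
array @args is a stand-in for @ARGV, and all values are assumptions
chosen for this sketch.

```perl
use strict;
use warnings;

my @args = ("prog", "MiXeD");               # stand-in for @ARGV
$args[1] =~ tr/A-Z/a-z/;                    # "mixed"

my $sky   = "* . * * .";                    # sample value
my $stars = ($sky =~ tr/*/*/);              # 3, $sky is unchanged

# An empty replacement list with no /d only counts.
$_ = "room 101, floor 3";                   # sample value
my $digits = tr/0-9//;                      # 4, $_ is unchanged

(my $word = "bookkeeper") =~ tr/a-zA-Z//s;  # "bokeper"

print "$args[1] $stars $digits $word\n";
```

Counting with tr is a common idiom because the string is left as it was
and only the number of matching characters is returned.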