When to use "regular expressions"?

Let's say that you opened a file and are checking it line by line for a
given text. If you are checking the lines for an exact match, such as the
string "Here is the end", then you can test each line inside a loop with
'eq', as in the following example:
       if ($line eq "Here is the end") { ... }
       
In a real programmer's life, we usually want to check only a portion of
the incoming line, or to check for a given pattern (always 2 digits
following a letter, for example). For this kind of job we need regular
expressions.
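
As a rough sketch of the difference, the following compares an exact test with 'eq' against two pattern tests. The sample lines and the variable names are invented for this illustration.

```perl
use strict;
use warnings;

# Hypothetical sample lines, invented for this illustration.
my @lines = ("Here is the end", "Well, Here is the end of it",
             "A12 shelf", "B7 shelf");

my (@exact, @partial, @coded);
for my $line (@lines) {
    # Exact match: the whole line must equal the string.
    push @exact, $line if $line eq "Here is the end";

    # Portion match: the text may appear anywhere in the line.
    push @partial, $line if $line =~ /Here is the end/;

    # Pattern match: a letter followed by 2 digits.
    push @coded, $line if $line =~ /[A-Za-z]\d\d/;
}

print "exact:   ", scalar(@exact), " line(s)\n";
print "partial: ", scalar(@partial), " line(s)\n";
print "coded:   @coded\n";
```

Only the first line passes the 'eq' test, while the pattern test also accepts the second line, where the text is just a portion of the line.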

What are "regular expressions"?

Regular expressions are a way of describing a PATTERN: for example, "all the words that begin with the letter A" or "every 10-digit phone number".

How do we create a PATTERN?

Creating a PATTERN suitable for the task is the fun part. To test for an actual character, like the 'A' in the first example, we use that character directly. In other cases, such as testing for a digit, we use special symbols (metacharacters) that stand for a group of similar characters ('\d' stands for a digit) or that match a special position in the string ('^' matches the beginning of the line). The 'full' metacharacter table is as follows:

    \w   Match a "word" character (alphanumeric plus "_")
    \W   Match a non-word character
    \s   Match a whitespace character
    \S   Match a non-whitespace character
    \d   Match a digit character
    \D   Match a non-digit character
    \    Quote the next metacharacter
    ^    Match the beginning of the line
    .    Match any character (except newline)
    $    Match the end of the line (or before newline at the end)
    |    Alternation
    ()   Grouping
    []   Character class
    \b   Match a word boundary
    \B   Match a non-(word boundary)
    \A   Match only at beginning of string
    \Z   Match only at end of string, or before newline at the end
    \z   Match only at end of string
    \G   Match only where previous m//g left off (works only with /g)

Going back to our example, "^A" represents "a character A at the beginning of the line". To represent the 10-digit number, we can either write "\d\d\d\d\d\d\d\d\d\d" or, better, use a quantifier to specify that we want a digit repeated 10 times: "\d{10}".

The following standard quantifiers are recognized:

    *      Match 0 or more times
    +      Match 1 or more times
    ?      Match 1 or 0 times
    {n}    Match exactly n times
    {n,}   Match at least n times
    {n,m}  Match at least n but not more than m times
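
The two patterns discussed above can be tried out with a short sketch like the one below. The word list and the phone numbers are invented sample data. The anchored form "^\d{10}$" is an addition for this illustration: it shows that "\d{10}" on its own also matches inside a longer run of digits.

```perl
use strict;
use warnings;

# Hypothetical sample data, invented for this illustration.
my @words  = qw(Apple banana Avocado cherry);
my @phones = ("0123456789", "12345", "01234567890");

# "^A": a character A at the beginning of the string.
my @a_words = grep { /^A/ } @words;

# "\d{10}": a digit repeated 10 times, anywhere in the string.
# An 11-digit string also matches, because it contains 10 digits in a row.
my @unanchored = grep { /\d{10}/ } @phones;

# "^\d{10}$": exactly 10 digits between the beginning and the end.
my @ten_digits = grep { /^\d{10}$/ } @phones;

print "begin with A:      @a_words\n";
print "contain 10 digits: @unanchored\n";
print "exactly 10 digits: @ten_digits\n";
```

When the whole string has to be a 10-digit number, combining the quantifier with the '^' and '$' position metacharacters is one way to rule out longer numbers.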

Where do we use "regular expressions"?

Regular expressions are used to m/atch/ and s/ubsti/tute/ text. The related tr/ans/late/ operator is described here as well, although it works on lists of characters rather than on a PATTERN. In the following text, PATTERN means a group of characters, metacharacters and quantifiers as described above.

m/PATTERN/cgimosx
/PATTERN/cgimosx

Searches a string for a pattern match, and in scalar context returns true (1) or false (''). If no string is specified via the =~ or !~ operator, the $_ string is searched.

Options are:

    c   Do not reset search position on a failed match when /g is in effect.
    g   Match globally, i.e., find all occurrences.
    i   Do case-insensitive pattern matching.
    m   Treat string as multiple lines.
    o   Compile pattern only once.
    s   Treat string as single line.
    x   Use extended regular expressions.

Examples:

       if ($line =~ /^A/) { ... }
       if (/Version: *([0-9.]*)/) { $version = $1; }
       if (($F1, $F2, $Etc) = ($foo =~ /^(\S+)\s+(\S+)\s*(.*)/)) { ... }
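
The following sketch puts the match operator to work in scalar context, in list context, and with the /g and /i options. The input strings and the variable names are invented for this illustration.

```perl
use strict;
use warnings;

# Hypothetical input, invented for this illustration.
my $line = "Version: 5.36.0 released";

# Scalar context: true or false; the parentheses capture into $1.
my $version;
if ($line =~ /Version: *([0-9.]*)/) {
    $version = $1;
}

# List context: the match returns the captured groups.
my ($first, $second, $rest) =
    ("alpha beta gamma delta" =~ /^(\S+)\s+(\S+)\s*(.*)/);

# /g in list context returns every match; /i ignores case.
my @found = ("Cat, cat and CAT" =~ /cat/gi);

print "version: $version\n";
print "first=$first second=$second rest=$rest\n";
print "found ", scalar(@found), " matches\n";
```

Here $rest receives everything after the second word ("gamma delta"), because '.*' takes as much of the remaining line as it can.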

s/PATTERN/REPLACEMENT/egimosx

Searches a string for a pattern, and if found, replaces that pattern with the replacement text and returns the number of substitutions made. Otherwise it returns false (specifically, the empty string).

Options are:

    e   Evaluate the right side as an expression.
    g   Replace globally, i.e., all occurrences.
    i   Do case-insensitive pattern matching.
    m   Treat string as multiple lines.
    o   Compile pattern only once.
    s   Treat string as single line.
    x   Use extended regular expressions.

Examples:

       $line =~ s/^A/B/;
       $path =~ s|/usr/bin|/usr/local/bin|;   # here we use '|' as delimiter
       ($foo = $bar) =~ s/this/that/;         # copy first, then change
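
A minimal sketch of the substitution operator follows, showing the effect of the /g and /e options and the returned count. The strings and the variable names are invented for this illustration.

```perl
use strict;
use warnings;

# Hypothetical strings, invented for this illustration.
my $text = "this and this";

# Without /g only the first occurrence is replaced.
(my $once = $text) =~ s/this/that/;

# With /g every occurrence is replaced; the return value is the count.
my $all   = $text;
my $count = ($all =~ s/this/that/g);

# With /e the right side is evaluated as an expression for each match.
my $prices = "2 apples and 10 pears";
(my $doubled = $prices) =~ s/(\d+)/$1 * 2/ge;

print "$once\n";
print "$all ($count substitutions)\n";
print "$doubled\n";
```

The "copy first, then change" form leaves $text and $prices untouched, which is handy when the original string is still needed afterwards.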

tr/SEARCHLIST/REPLACEMENTLIST/cds
y/SEARCHLIST/REPLACEMENTLIST/cds

Transliterates all occurrences of the characters found in the search list with the corresponding character in the replacement list. It returns the number of characters replaced or deleted. If no string is specified via the =~ or !~ operator, the $_ string is transliterated.

Options:

    c   Complement the SEARCHLIST (the last character on REPLACEMENTLIST is repeated)
    d   Delete found but unreplaced characters.
    s   Squash duplicate replaced characters.

Examples:

       $ARGV[1] =~ tr/A-Z/a-z/;    # canonicalize to lower case
       $cnt = tr/*/*/;             # count the stars in $_
       $cnt = $sky =~ tr/*/*/;     # count the stars in $sky
       $cnt = tr/0-9//;            # count the digits in $_
       tr/a-zA-Z//s;               # bookkeeper -> bokeper
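
The transliteration options can be tried with a short sketch such as the following. The strings and the variable names are invented for this illustration, and the /cd combination is one possible way to keep only the digits of a string.

```perl
use strict;
use warnings;

# Hypothetical strings, invented for this illustration.
my $name = "Hello World";
(my $lower = $name) =~ tr/A-Z/a-z/;       # transliterate to lower case

my $sky   = "* . * . *";
my $stars = ($sky =~ tr/*/*/);            # count: each * is replaced by itself

my $phone = "(555) 123-4567";
(my $digits = $phone) =~ tr/0-9//cd;      # /c complements the list, /d deletes

my $word = "bookkeeper";
(my $squashed = $word) =~ tr/a-zA-Z//s;   # /s squashes runs of the same letter

print "$lower\n$stars\n$digits\n$squashed\n";
```

Note that the search list is a plain list of characters and ranges, so the '*' in tr/*/*/ is an ordinary star and not a quantifier.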

Jaime Prilusky