Cablechip Solution's Blog: September 2008

Regex : Tutorial

Regular expressions, or regex, are one of the best things about Perl. So good, in fact, that they have been copied by most other programming languages.

Sadly, they are let down by the Perl "man pages" and by the chapter in the 'Programming Perl' camel book, which in no way does them justice. It explains them, but only in a way a Computer Science PhD student would understand.

So, here's my tutorial with lots of examples.

NB 1: this is based on Perl, but you can use regex in JavaScript, Python etc.

NB 2: once you understand this, you should be ready for the reference material.

Regex : Using a regular expression (Perl specific)

There are 3 ways to call a regex:

Use: $x =~ //
Use: $x =~ m//
or m## or m"" or m{} or anything else you care to choose
Use: qr()
or qr## or qr"" or qr{}

Why so many ways? First, that's Perl. But it also helps with escaping, see the examples below. Just choose delimiters that don't appear inside the regex.

The usual choices are // () ## and ""


$x =~ /andrew/;
$x =~ m/andrew/;
$x =~ m#andrew#;
$x =~ qr(andrew);

# matches speech marks
$x =~ m/"andrew"/;

#matches HTML tags - there's no need to escape the / in </b>
$x =~ qr#<b>andrew</b>#;
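
One thing worth knowing about qr is that you don't have to match with it straight away: it gives you a compiled regex that you can keep in a variable and reuse. A minimal sketch, where $tag_re and @lines are names made up for this example:

```perl
use strict;
use warnings;

# qr{} compiles the regex; with {} as delimiters the / in </b> needs no escaping
my $tag_re = qr{<b>andrew</b>};

my @lines = ( 'hello <b>andrew</b>', 'hello andrew', '<b>andrew</b> was here' );

# use the stored regex on its own ...
my @bold = grep { $_ =~ $tag_re } @lines;

# ... or drop it inside a bigger regex
my @at_start = grep { m{^$tag_re} } @lines;

print scalar(@bold), " lines contain the tag\n";         # 2
print scalar(@at_start), " line starts with the tag\n";  # 1
```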

Regex : Using whitespace and comments

The /x modifier

Using whitespace and comments inside your regular expression is a brilliant idea, especially as regex often look like line noise, even to yourself a week later.

If you need to match a space, escape it as "\ " (a \ followed by a SPACE) or use \s (which matches any whitespace character).

$x =~ / andrew /ix;

# these next 2 are the same
$x =~ m# <b> \s* andrew \s* </b> #ix;

$x =~ qr"                 # double quotes start and end the regex, so
                          # don't use one inside the regex or its comments
      <b>
        \s* andrew \s*  # allow optional whitespace (0+ chars) at start and end
      </b>
   "ix;                 # i = ignore case, x = allow whitespace and comments

Regex : Some simple examples

Some simple examples to start


$x = 'Andrew';

print 'match' if $x =~ /rew/;
match

print 'match' if $x =~ /And/;
match

# prints nothing, as and doesn't match And
print 'match' if $x =~ /and/;

print 'match' if $x =~ /and/i; # the 'i' is ignore case
match

Regex : How many times to match?

Place these quantifiers after a character:

+ matches 1+ times
* matches 0+ times
{3} after a character matches it 3 times
{2,3} after a character matches it 2 to 3 times
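
To see the four of them side by side, here is a small sketch that runs each one over the same list of made-up words. The ^ and $ anchors (covered further down) are only there so the whole word has to match:

```perl
use strict;
use warnings;

my @words = ( 'ct', 'cat', 'caat', 'caaat', 'caaaat' );

my @plus  = grep { /^ca+t$/     } @words;   # 1 or more a
my @star  = grep { /^ca*t$/     } @words;   # 0 or more a
my @three = grep { /^ca{3}t$/   } @words;   # exactly 3 a
my @range = grep { /^ca{2,3}t$/ } @words;   # 2 to 3 a

print "+     : @plus\n";    # cat caat caaat caaaat
print "*     : @star\n";    # ct cat caat caaat caaaat
print "{3}   : @three\n";   # caaat
print "{2,3} : @range\n";   # caat caaat
```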

Greedy and Not Greedy


$x = "<b> one </b> <b> two </b>";

# .* is greedy, so this matches from the first <b> to the last </b>
$x =~ m# <b> .* </b> #x;

# .*? isn't greedy, so this matches <b> one </b> only
$x =~ m# <b> .*? </b> #x;
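
The easiest way to see the difference is to capture what each version actually matched. A minimal sketch, using a made-up string with two pairs of bold tags:

```perl
use strict;
use warnings;

my $text = '<b>one</b> and <b>two</b>';

my ($greedy) = $text =~ m#(<b>.*</b>)#;    # .*  grabs as much as it can
my ($lazy)   = $text =~ m#(<b>.*?</b>)#;   # .*? grabs as little as it can

print "greedy: $greedy\n";   # <b>one</b> and <b>two</b>
print "lazy:   $lazy\n";     # <b>one</b>
```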

Regex : Predefined Groups of characters

Groups of characters

\s matches a whitespace character
\S matches anything except a whitespace character
\w matches a word character (a-z, A-Z, 0-9, _ and letters from the Persian, Greek, Chinese alphabets, etc)
\W matches anything except a word character
\d matches 0..9
\D matches anything EXCEPT 0..9
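
Here is a short sketch of these groups at work. Note that inside a regex they are written with a backslash (\s, \w, \d and so on). The sample line is made up, and the /g modifier (covered further down) is used to collect every match:

```perl
use strict;
use warnings;

my $line = "Order 66:\tshipped 2008-09-15";

my @numbers = $line =~ /(\d+)/g;     # runs of digits
my @words   = $line =~ /(\w+)/g;     # runs of word characters
my @chunks  = split /\s+/, $line;    # split on runs of whitespace
(my $squashed = $line) =~ s/\D//g;   # delete everything that is NOT a digit

print "@numbers\n";            # 66 2008 09 15
print "@words\n";              # Order 66 shipped 2008 09 15
print scalar(@chunks), "\n";   # 4
print "$squashed\n";           # 6620080915
```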

Regex : Groups

Use ( ) for groups.


$html =~ /(andrew)/;
# matches andrew

$html =~ /((andrew)|(john)|(peter))/;
# matches andrew OR john OR peter

# the same, but with whitespace
# m# - use # as the delimiter
# #x - allow whitespace
$html =~ m#
 (
   ( andrew   )  |
   ( peter    )  |
   ( john     )
 )#x;
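
A group does two jobs: it keeps the | alternatives together, and it remembers which one matched in $1. A small sketch, with a made-up list of strings to test:

```perl
use strict;
use warnings;

my $name_re = qr/(andrew|john|peter)/;

my @found;
for my $html ( '<b>john</b>', 'hello peter', 'no one here' ) {
    # $1 holds whichever alternative matched
    push @found, $html =~ $name_re ? $1 : 'nobody';
}

print "@found\n";   # john peter nobody
```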

Perl Complex Variables

Perl has 3 types: scalar, array and hash. These can be combined to make complex variables.

Unfortunately, the syntax isn't kind.

Defining Scalars, Arrays and Hashes.


$name = 'andrew' ;

# these 2 are the same, qw() is 'syntax sugar'
@names = qw( andrew john peter );
@names = ('andrew', 'john', 'peter');

%age = (
 'andrew' => 20,
 'peter'  => 30,
 'paul'   => 17,
 );

# note the outside hash uses (
# - but the inside hash uses {

%names = (
 'andrew' => { 
      'age'    => 20, 
      'height' => 1.76,
      },
 'peter'  => { 
      'age'    => 30, 
      'height' => 1.2, 
      'friends'=> ['andrew', 'paul'], 
      }
 );

Scalars


$x =  $names{'andrew'}{'age'};
# now $x = 20 

$names{'andrew'}{'age'} = 35 ;

Arrays

Arrays are a bit more complex - use @{ } to access them and [ ] to set them


@x = @{ $names{'peter'}{'friends'} } ;

$names{'peter'}{'friends'} = [ 'bill', 'john', 'fred' ] ;

Hashes

Hashes are similar, use %{ } to access them and { } to set them


%hash = %{ $names{'andrew'} } ;

$names{'andrew'} = { 'age' => 20, 'height' => 1.76 } ;
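
Putting the three together, here is a sketch that walks over the whole structure. It uses the -> arrow, which isn't shown above: $person->{'age'} is the same thing as $names{$name}{'age'} once $person holds the inner hash. The @report array is just a name made up for the example:

```perl
use strict;
use warnings;

my %names = (
    'andrew' => { 'age' => 20, 'height' => 1.76 },
    'peter'  => { 'age' => 30, 'height' => 1.2, 'friends' => [ 'andrew', 'paul' ] },
);

my @report;
for my $name ( sort keys %names ) {
    my $person  = $names{$name};                     # the inner hash (a reference)
    my @friends = @{ $person->{'friends'} || [] };   # empty list if there are none
    push @report, "$name is $person->{'age'} and has " . scalar(@friends) . " friends";
}

print "$_\n" for @report;
# andrew is 20 and has 0 friends
# peter is 30 and has 2 friends
```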

Important Gotcha

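When using the %hash = %{ } and @array = @{ } syntax, you are making a shallow copy: the top level is copied, but any nested references inside it still point to the same data as the original.


%hash = %{ $names{'peter'} } ;
# $hash{'age'} is a copy, change it and %names stays the same
# $hash{'friends'} is still a reference to the same array as
# $names{'peter'}{'friends'}, change one and you change the other

use Storable qw(dclone);
%new_hash = %{ dclone( $names{'peter'} ) } ;
# now safe, dclone copies the nested data as well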
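
If you are ever unsure what is shared and what isn't, change the copy and then look at the original. A minimal sketch of that test, using dclone from the core Storable module for the deep copy (%deep is a name made up for the example):

```perl
use strict;
use warnings;
use Storable qw(dclone);   # core module, used here for the deep copy

my %names = (
    'peter' => { 'age' => 30, 'friends' => [ 'andrew', 'paul' ] },
);

my %hash = %{ $names{'peter'} };      # copies the top level only

$hash{'age'} = 31;                    # changes the copy only
push @{ $hash{'friends'} }, 'bill';   # same array underneath, so %names sees it

my %deep = %{ dclone( $names{'peter'} ) };   # copies the nested data as well
push @{ $deep{'friends'} }, 'fred';          # %names is not affected

print "$names{'peter'}{'age'}\n";            # 30
print "@{ $names{'peter'}{'friends'} }\n";   # andrew paul bill
```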

Regex : Doesn't match


# $x does not contain andrew
$x !~ /andrew/;

Regex : Match on word Boundary

\b matches on a word boundary

\B matches anywhere that is not a word boundary
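
A word boundary is the gap between a word character and a non-word character (or the start or end of the string). A quick sketch with some made-up strings:

```perl
use strict;
use warnings;

my @tests = ( 'andrew was here', 'andrews shop', 'mcandrew' );

my @whole   = grep { /\bandrew\b/ } @tests;   # andrew as a whole word
my @partial = grep { /\Bandrew/   } @tests;   # andrew NOT at the start of a word

print "@whole\n";     # andrew was here
print "@partial\n";   # mcandrew
```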

Regex : . and the /s modifier

A . matches anything except a newline

To make . match a newline as well, use the /s modifier (treat as a Single line)


$x = qq(
andrew
was
here
);

$x =~ /andrew .* here/xs;
# matches as the /s carries the search over the newlines
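
To check this for yourself, run the same regex with and without the /s. A minimal sketch:

```perl
use strict;
use warnings;

my $x = "andrew\nwas\nhere\n";

my $without = $x =~ /andrew.*here/  ? 'match' : 'no match';
my $with    = $x =~ /andrew.*here/s ? 'match' : 'no match';

print "without /s: $without\n";   # no match
print "with /s:    $with\n";      # match
```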

Regex : Anchors and the /m modifier

Anchors
^ matches at the start of a string
$ matches at the end of a string


$x = 'andrew';

$x =~ /^a/ ;
# matches as andrew starts with an a

$x =~ /rew$/;
# matches as andrew ends with rew

But what if you have a string containing several lines (e.g. a file with \n characters) ?

The /m modifier.

This treats the string as multiple lines, so ^ and $ will match the start and end of each line in the file


$x = qq(
andrew
john andrew
andrew
);
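$x =~ s/^andrew/ANDREW/mg;
# will replace 1st and 3rd andrew
# without the /m, it would replace nothing: ^ would only match at the
# very start of the string, and $x starts with a newline

Without the /m, the string is treated as a single line, so ^ and $ only match at the start and end of the whole string.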
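
Here is a small sketch of what /m changes, picking out the first and last word of each line. The text is made up, and unlike the example above it doesn't start with a newline:

```perl
use strict;
use warnings;

my $text = "andrew\njohn andrew\nandrew smith\n";

my @without = $text =~ /^(\w+)/g;    # ^ is the start of the string only
my @with    = $text =~ /^(\w+)/mg;   # ^ is the start of every line
my @last    = $text =~ /(\w+)$/mg;   # $ is the end of every line

print "@without\n";   # andrew
print "@with\n";      # andrew john andrew
print "@last\n";      # andrew andrew smith
```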

Regex : Global Replace

The /g modifier

This repeats the regex until there are no more matches


$x = 'andrew andrew andrew';

$x =~ s/a/A/;
# Andrew andrew andrew (1 replace)

$x =~ s/a/A/g;
# Andrew Andrew Andrew (all the 'a' get replaced)
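
Two more things /g gives you: a substitution returns the number of replacements it made, and a plain match in list context returns every match. A minimal sketch (the variable names are made up):

```perl
use strict;
use warnings;

my $x = 'andrew andrew andrew';

(my $first_only = $x) =~ s/a/A/;   # copy $x, then replace once

my $count = ( $x =~ s/a/A/g );     # s///g returns how many it replaced
my @names = $x =~ /(\w+)/g;        # list context: every match

print "$first_only\n";             # Andrew andrew andrew
print "$count: $x\n";              # 3: Andrew Andrew Andrew
print scalar(@names), " names\n";  # 3 names
```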

Regex : Using expressions

This is the /e modifier

Example: make HTML tags upper case


# ( ) - capture the text
# < .*? > - match the least amount of text between < and >, i.e. an HTML tag
# uc ($1) - make the matched text uppercase
# gxe - g is global replace
# gxe - x is allow whitespace and comments
# gxe - e means the bit on the right is an expression to be evaluated

$html =~ s# ( < .*? > ) # uc($1) #gxe ;
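
With /e the right-hand side is ordinary Perl code, and whatever it returns becomes the replacement text. A short sketch with two made-up strings, one for the tag example and one doing a bit of arithmetic:

```perl
use strict;
use warnings;

my $html = '<b>Andrew</b> was <i>here</i>';
(my $upper = $html) =~ s# ( < .*? > ) # uc($1) #gxe;   # upper case the tags only

my $prices = 'tea 10 coffee 20';
(my $doubled = $prices) =~ s/(\d+)/$1 * 2/ge;          # double every number

print "$upper\n";     # <B>Andrew</B> was <I>here</I>
print "$doubled\n";   # tea 20 coffee 40
```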

Example : find hi byte characters


# find hi-byte characters in HTML
# - and keep a record of all the hi byte chars found in %found
my %found ;

sub find {
 my ($char) = @_ ;
 $found{ $char } ++; ## keep a record
 return '[' . ord( $char ) . ']';
 }

## this will find € and £ symbols, but not the HTML entity &euro;
## \x80 is hex for character 128, the first one above ASCII
## \x80-\x{ffff} is a range - if the file is utf8, it will match hi byte chars as well
## ( ) captures the character as $1
## gxe - expression, whitespace, and global replace
$html =~ s# ( [\x80-\x{ffff}] ) # &find( $1 ) #gxe;

Regex : Substitutions

Regex : Capture Matched text

Use brackets to capture text, and $1, $2 ... to refer to it


$x = 'andrew john';

# swap the first and second word around

$x =~ s/ (\w+) \s+ (\w+) /$2 $1/x;

# s/    - substitute
# (\w+) - 1+ word characters, captured as $1
# \s+   - 1+ whitespace characters
# (\w+) - 1+ word chars, captured as $2
# $2 $1 - the replacement; /x doesn't apply here, so its spaces are kept
# /x    - use whitespace and comments

# $1 and $2 survive
print $2;
# prints john
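
Captures are handy for rearranging text, and in list context a match hands them straight back to you. A minimal sketch, assuming a made-up date in day/month/year format:

```perl
use strict;
use warnings;

my $date = '15/09/2008';

# rearrange day/month/year into year-month-day
(my $iso = $date) =~ s{ (\d\d) / (\d\d) / (\d{4}) }{$3-$2-$1}x;

# in list context a match returns its captures directly
my ($first, $second) = 'andrew john' =~ /(\w+)\s+(\w+)/;

print "$iso\n";             # 2008-09-15
print "$second $first\n";   # john andrew
```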

Regex: Ranges

Use [ ] for ranges - match any 1 char inside the range


$x =~ /[0-9]/;
$x =~ /[a-z]/i;   # so match a..z and A..Z as /i is ignore case
$x =~ /[\n\r]/;

# match 1+ chars in the range 0..9 + . and -
# inside [ ] a dot is just a dot, so there's no need to escape it
# put the - last, or it is read as a range like the one in 0-9
$x =~ /[0-9+.-]+/;
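
Ranges are often used to check that a string contains nothing but the characters you allow. A small sketch with some made-up inputs; it also uses a ^ as the first character inside the [ ], which turns the range round to mean "anything NOT in here":

```perl
use strict;
use warnings;

my @inputs = ( '3.14', '-42', '2008', '12a' );

my @numeric = grep { /^[0-9+.-]+$/ } @inputs;   # only chars from the range
my @digits  = grep { /^[0-9]+$/    } @inputs;   # only 0..9
my @other   = grep { /[^0-9+.-]/   } @inputs;   # has a char NOT in the range

print "@numeric\n";   # 3.14 -42 2008
print "@digits\n";    # 2008
print "@other\n";     # 12a
```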

Perl Poetry

This isn't mine, but it's quite cool.

$_ = reverse sort qw p ekca lre uJ reh ts
p, $/.r, map $_.$", qw e p h tona e; print
