Regular expressions, or regex, are one of the best things about Perl. So good, in fact, that they have been copied by most other programming languages.
They are sadly let down, however, by the Perl "man pages" and the chapter in the 'Programming Perl' camel book, which in no way do them justice. They explain regex, but only in a way a Computer Science PhD student would understand.
So, here's my tutorial with lots of examples.
NB 1: this is based on Perl, but you can use most of it in JavaScript, Python etc.
NB 2: once you understand this, you should be ready for the reference material.
Regex : Using a regular expression (perl specific).
There are 3 ways to call a regex:
- Use: $x =~ //
- Use: $x =~ m//
or m## or m"" or m{} or anything else you care to choose
- Use: qr()
or qr## or qr"" or qr{}
The usual choices are // () ## and ""
$x =~ /andrew/;
$x =~ m/andrew/;
$x =~ m#andrew#;
$x =~ qr(andrew);
# matches speech marks
$x =~ m/"andrew"/;
# matches HTML tags - there's no need to escape the / in </b>
# (no spaces inside the delimiters: without /x they would be matched literally)
$x =~ qr#<b>andrew</b>#;
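Here's a small sketch of why qr is worth knowing: it compiles the pattern and hands it back as a value you can store and reuse. The string and the variable names ($name_re and friends) are made up for illustration.

```perl
use strict;
use warnings;

# hypothetical test string
my $x = 'My name is <b>andrew</b>';

# qr// compiles the pattern once and returns it as a value
my $name_re = qr/andrew/;

# a compiled pattern can be used on its own ...
my $plain  = $x =~ $name_re           ? 1 : 0;
# ... or dropped into a bigger pattern
my $tagged = $x =~ m#<b>$name_re</b># ? 1 : 0;
# a different delimiter, same idea: no need to escape the / in </b>
my $braces = $x =~ m{<b>andrew</b>}   ? 1 : 0;

print "plain=$plain tagged=$tagged braces=$braces\n";   # plain=1 tagged=1 braces=1
```

Pick whichever delimiter doesn't appear inside your pattern and you save yourself a lot of backslashes.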
Labels:
perl regex regular expressions
Regex : Using whitespace and comments
The /x modifier
Using whitespace and comments inside your regular expression is a brilliant idea, especially as regex often look like line noise, even to yourself a week later.
With /x, whitespace in the pattern is ignored. If you need to match a space, escape it as "\ " (a \ followed by a SPACE) or use \s (which means any whitespace character).
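To see what /x does to spaces, here is a minimal sketch using a made-up string. The variable names are only for illustration.

```perl
use strict;
use warnings;

# hypothetical test string
my $x = 'andrew was here';

# under /x the spaces in the pattern are ignored, so this looks for "andrewwas"
my $ignored = $x =~ / andrew was /x     ? 1 : 0;   # 0
# a backslash followed by a space matches one literal space
my $escaped = $x =~ / andrew \  was /x  ? 1 : 0;   # 1
# \s matches any single whitespace character (space, tab, newline ...)
my $class   = $x =~ / andrew \s was /x  ? 1 : 0;   # 1

print "ignored=$ignored escaped=$escaped class=$class\n";
```

I'd normally reach for \s, as it is easier to spot than an escaped space.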
$x =~ / andrew /ix;
# these next 2 are the same
$x =~ m# <b> \s* andrew \s* </b> #ix;
$x =~ qr" # the speech mark is the delimiter here, so never type one inside a comment
<b>
\s* andrew \s* # allow optional whitespace (0+ chars) at start and end
</b>
"ix; # ignore case
Regex : Some simple examples
Some simple examples to start
$x = 'Andrew';
print 'match' if $x =~ /rew/;
match
print 'match' if $x =~ /And/;
match
# prints nothing, as and doesn't match And
print 'match' if $x =~ /and/;
print 'match' if $x =~ /and/i; # the 'i' is ignore case
match
Regex : How many times to match?
Place these quantifiers after a character:
- + matches 1+ times
- * matches 0+ times
- ? matches 0 or 1 times
- {3} after a character matches it 3 times
- {2,3} after a character matches it 2 to 3 times
$x = "<b>one</b> <b>two</b>";
# .* is greedy, so this matches from the first <b> to the last </b>
$x =~ m#<b>.*</b>#x;
# .*? isn't greedy, so this matches <b>one</b> only
$x =~ m#<b>.*?</b>#x;
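A short sketch of the quantifiers in action, including greedy against non-greedy. The strings are invented for the example, and the brackets are only there to capture what was matched.

```perl
use strict;
use warnings;

# hypothetical test strings
my $x    = 'aaa bb c';
my $html = '<b>one</b> <b>two</b>';

my $three_a = $x =~ /a{3}/   ? 1 : 0;   # 1, there are three a's in a row
my $two_b   = $x =~ /b{2,3}/ ? 1 : 0;   # 1, two b's is within 2 to 3
my $two_c   = $x =~ /c{2}/   ? 1 : 0;   # 0, only one c
my $star    = $x =~ /bbz*/   ? 1 : 0;   # 1, z* is happy with zero z's

# greedy: .* grabs as much as it can
my ($greedy) = $html =~ m#(<b>.*</b>)#;    # <b>one</b> <b>two</b>
# non-greedy: .*? stops at the first </b> it can
my ($lazy)   = $html =~ m#(<b>.*?</b>)#;   # <b>one</b>

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```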
Regex : Predefined Groups of characters
Groups of characters
- \s matches a whitespace character
- \S matches anything except a whitespace character
- \w matches a word character (a-z, A-Z, 0-9, _ and letters from Persian, Greek, Chinese alphabets, etc)
- \W matches anything except a word character
- \d matches a digit 0..9
- \D matches anything EXCEPT a digit 0..9
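These groups are written with a backslash (\d, \w, \s). A quick sketch, assuming a made-up string, that pulls out the digits and the words. It uses /g in list context, which returns every match.

```perl
use strict;
use warnings;

# hypothetical test string (note the tab between "shipped" and "on")
my $x = "Order 66 shipped\ton 2024-01-15";

# \d+ : runs of digits
my @numbers = $x =~ /(\d+)/g;    # 66, 2024, 01, 15
# \w+ : runs of word characters (the - is not a word character)
my @words   = $x =~ /(\w+)/g;    # Order, 66, shipped, on, 2024, 01, 15
# \s : single whitespace characters (3 spaces and 1 tab here)
my @spaces  = $x =~ /(\s)/g;

print "numbers: @numbers\n";
print "words:   @words\n";
print "whitespace chars: ", scalar(@spaces), "\n";
```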
Regex : Groups
Use ( ) for groups.
$html =~ /(andrew)/;
# matches andrew
$html =~ /((andrew)|(john)|(peter))/;
# matches andrew OR john OR peter
# the same, but with whitespace
# m# - use # as the delimiter
# #x - allow whitespace
$html =~ m#
(
( andrew ) |
( peter ) |
( john )
)#x;
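A small sketch of groups with | (or), assuming an invented list of names. The ^ and $ are there so the whole string has to be one of the names.

```perl
use strict;
use warnings;

# hypothetical names to test
my @candidates = qw(andrew john paul);

my %is_known;
for my $name (@candidates) {
    # the ( ) keeps the | choices together
    $is_known{$name} = $name =~ /^(andrew|john|peter)$/ ? 1 : 0;
}

# the group also captures which choice matched
my ($who) = 'hello john' =~ /(andrew|john|peter)/;

print "$_: $is_known{$_}\n" for @candidates;
print "who: $who\n";    # who: john
```

Without the ( ), /^andrew|john|peter$/ would mean "starts with andrew, OR contains john, OR ends with peter", which is rarely what you want.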
Perl Complex Variables
Perl has 3 variable types: scalar, array and hash. These can be combined to make complex variables.
Unfortunately, the syntax isn't kind.
Defining Scalars, Arrays and Hashes.
$name = 'andrew' ;
# these 2 are the same, qw() is 'syntax sugar'
@names = qw( andrew john peter );
@names = ('andrew', 'john', 'peter');
%age = (
'andrew' => 20,
'peter' => 30,
'paul' => 17,
);
# note the outside hash uses (
# - but the inside hash uses {
%names = (
'andrew' => {
'age' => 20,
'height' => 1.76,
},
'peter' => {
'age' => 30,
'height' => 1.2,
'friends'=> ['andrew', 'paul'],
}
);
Scalars
$x = $names{'andrew'}{'age'};
# now $x = 20
$names{'andrew'}{'age'} = 35 ;
Arrays
Arrays are a bit more complex - use @{ } to access them and [ ] to set them
@x = @{ $names{'peter'}{'friends'} } ;
$names{'peter'}{'friends'} = [ 'bill', 'john', 'fred' ] ;
Hashes
Hashes are similar, use %{ } to access them and { } to set them
%hash = %{ $names{'andrew'} } ;
$names{'andrew'} = { 'age' => 20, 'height' => 1.76 } ;
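Important Gotcha
%hash = %{ } and @array = @{ } only copy the top level (a shallow copy). Any references nested inside the copy still point at the original data.
%hash = %{ $names{'peter'} } ;
# $hash{'age'} = 31 changes only the copy
# but $hash{'friends'} is the same array as $names{'peter'}{'friends'},
# so change one and you change the other, and vice versa
use Storable qw(dclone);
%new_hash = %{ dclone( $names{'peter'} ) } ;
# now safe - a deep copy shares nothing with %names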
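Here's a runnable sketch of what copying a nested structure really does. It uses a cut-down version of %names, and dclone from the core Storable module for the deep copy. The variable names %shallow and %deep are just for this example.

```perl
use strict;
use warnings;
use Storable qw(dclone);

my %names = (
    'peter' => {
        'age'     => 30,
        'friends' => [ 'andrew', 'paul' ],
    },
);

# shallow copy: top level keys are copied, nested references are shared
my %shallow = %{ $names{'peter'} };
$shallow{'age'} = 31;                    # changes the copy only
push @{ $shallow{'friends'} }, 'bill';   # same array, so %names changes too

# deep copy: everything is duplicated
my %deep = %{ dclone( $names{'peter'} ) };
push @{ $deep{'friends'} }, 'fred';      # %names is not touched

print "age in names:     $names{'peter'}{'age'}\n";                  # 30
print "friends in names: @{ $names{'peter'}{'friends'} }\n";         # andrew paul bill
print "friends in deep:  @{ $deep{'friends'} }\n";                   # andrew paul bill fred
```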
Regex : Doesn't match
# true if $x does not contain andrew
$x !~ /andrew/;
Regex : Match on word Boundary
\b matches on a word boundary (the point between a \w character and a \W character, or the start or end of the string next to a \w character)
\B matches anywhere that is not a word boundary
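A quick sketch of word boundaries, assuming a made-up string where "and" turns up three times but only once as a word of its own.

```perl
use strict;
use warnings;

# hypothetical test string
my $x = 'andrew and sandy';

# no boundaries: matches inside andrew, the word and, and inside sandy
my @anywhere = $x =~ /and/g;

# \b on both sides: only the stand-alone word
my @whole    = $x =~ /\band\b/g;

# \B on both sides: only where there are word characters either side (sandy)
my @inside   = $x =~ /\Band\B/g;

print "anywhere: ", scalar(@anywhere), "\n";   # anywhere: 3
print "whole:    ", scalar(@whole),    "\n";   # whole:    1
print "inside:   ", scalar(@inside),   "\n";   # inside:   1
```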
Regex : . and the /s modifier
A . matches any character except a newline
To make . match a newline as well, use the /s modifier (treat as a Single line)
$x = qq(
andrew
was
here
);
$x =~ /andrew .* here/xs;
# matches as the /s carries the search over the newlines
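The same idea as a before and after, sketched with a made-up string, so you can see that it is the /s doing the work.

```perl
use strict;
use warnings;

# hypothetical three line string
my $x = "andrew\nwas\nhere";

# without /s the . stops at the first newline, so .* can never reach "here"
my $without_s = $x =~ /andrew.*here/  ? 1 : 0;   # 0

# with /s the . matches newlines too
my $with_s    = $x =~ /andrew.*here/s ? 1 : 0;   # 1

print "without /s: $without_s, with /s: $with_s\n";
```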
Regex : Anchors and the /m modifier
Anchors
^ matches at the start of a string
$ matches at the end of a string
$x = 'andrew';
$x =~ /^a/ ;
# matches as andrew starts with an a
$x =~ /rew$/;
# matches as andrew ends with a rew
But what if you have a string containing several lines (e.g. a file with \n characters) ?
The /m modifier.
This treats the string as multiple lines, so ^ and $ will match the start and end of each line in the file
Without /m, the string is treated as a single line, so ^ and $ only match at the very start and end of the whole string.
$x = qq(
andrew
john andrew
andrew
);
$x =~ s/^andrew/ANDREW/mg;
# will replace 1st and 3rd andrew
# without the /m, ^ only matches at the very start of the string,
# and this string starts with a newline, so nothing would be replaced
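Here is a sketch of /m on and off, side by side. The string is invented, and this one starts with andrew rather than a newline, so the version without /m still gets one replacement.

```perl
use strict;
use warnings;

# hypothetical three line string
my $text = "andrew\njohn andrew\nandrew\n";

# with /m, ^ matches at the start of every line
(my $multi = $text) =~ s/^andrew/ANDREW/mg;
# $multi is "ANDREW\njohn andrew\nANDREW\n"

# without /m, ^ only matches at the very start of the string
(my $single = $text) =~ s/^andrew/ANDREW/g;
# $single is "ANDREW\njohn andrew\nandrew\n"

# with /m, $ matches at the end of every line: all 3 lines end in andrew
my @line_ends = $text =~ /andrew$/mg;

print $multi;
print $single;
print "lines ending in andrew: ", scalar(@line_ends), "\n";
```

The (my $copy = $text) =~ s/// trick makes a copy first, so $text itself is left alone.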
Regex : Global Replace
The /g modifier
This repeats the regex until there are no more matches
$x = 'andrew andrew andrew';
$x =~ s/a/A/;
# Andrew andrew andrew (1 replace)
$x =~ s/a/A/g;
# Andrew Andrew Andrew (all the 'a' get replaced)
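A runnable sketch of the same thing, working on copies so the two results can be compared. It also shows that s///g hands back the number of replacements it made. Variable names are made up.

```perl
use strict;
use warnings;

my $x = 'andrew andrew andrew';

# without /g only the first a is replaced
(my $once = $x) =~ s/a/A/;      # Andrew andrew andrew

# with /g every a is replaced
(my $all = $x) =~ s/a/A/g;      # Andrew Andrew Andrew

# s/// returns how many replacements were made
my $count = ( my $scratch = $x ) =~ s/a/A/g;   # 3

print "$once\n$all\n$count replacements\n";
```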
Regex : Using expressions
This is the /e modifier
Example: make HTML tags upper case
# ( ) - capture the text
# < .*? > - match the least amount of text between < and >, i.e. an HTML tag
# uc($1) - make the matched text uppercase
# gxe - g is global replace
# gxe - x is allow whitespace and comments
# gxe - e means the bit on the right is an expression to be evaluated
$html =~ s# ( < .*? > ) # uc($1) #gxe ;
Example : find hi byte characters
# find hi-byte characters in HTML
# - and keep a record of all the hi byte chars found in %found
my %found ;
sub find {
my ($char) = @_ ;
$found{ $char } ++; ## keep a record
return '[' . ord( $char ) . ']';
}
## this will find euro and £ symbols, but not the &euro; entity
## \x80 is hex (char 128, the first one above ASCII)
## \x80-\x{ffff} is a range - if the file is utf8, it will match hi byte chars as well
## ( ) - capture the char so it is passed to find() as $1
## gxe - expression, whitespace, and global replace
$html =~ s# ( [\x80-\x{ffff}] ) # find( $1 ) #gxe;
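Two more small /e sketches with made-up strings: the upper case tags again, and a bit of arithmetic on each number found. The right hand side is plain Perl code, and whatever it returns becomes the replacement.

```perl
use strict;
use warnings;

# hypothetical inputs
my $html   = '<b>bold</b> and <i>italic</i>';
my $prices = 'tea 2 coffee 3';

# uc($1) is run for every tag found
(my $upper = $html) =~ s{ ( < .*? > ) }{ uc($1) }gxe;
# <B>bold</B> and <I>italic</I>

# $1 * 2 is worked out for every number found
(my $doubled = $prices) =~ s/(\d+)/$1 * 2/ge;
# tea 4 coffee 6

print "$upper\n$doubled\n";
```

Don't put print in the expression: print returns 1, so you would replace every match with a 1.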
Regex : Capture Matched text
Use brackets to capture text, and $1, $2 ... to refer to it
$x = 'andrew john';
# swap the first and second word around
$x =~ s/ (\w+) \s+ (\w+) /$2 $1/x;
# s/ - substitute
# (\w+) - 1+ word characters, captured as $1
# \s+ - 1+ whitespace characters
# (\w+) - 1+ word chars, captured as $2
# /x - use whitespace and comments (pattern side only, so keep stray spaces out of $2 $1)
# $1 and $2 survive
print $2;
# prints john
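You don't have to use $1 and $2 directly. A minimal sketch, with invented variable names: a match in list context returns the captured text, which you can assign straight to your own variables.

```perl
use strict;
use warnings;

my $x = 'andrew john';

# in list context a match returns what the brackets captured
my ($first, $second) = $x =~ /(\w+)\s+(\w+)/;

# the swap, done on a copy so $x is left alone
(my $swapped = $x) =~ s/(\w+)\s+(\w+)/$2 $1/;

print "first=$first second=$second\n";   # first=andrew second=john
print "swapped=$swapped\n";              # swapped=john andrew
```

This is safer than $1 and $2, which keep their old values if a later match fails.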
Regex: Ranges
Use [ ] for ranges - match any 1 char inside the range
$x =~ /[0-9]/;
$x =~ /[a-z]/i; # so match a..z and A..Z as /i is ignore case
$x =~ /[\n\r]/;
# match 1+ chars in the range 0..9 + - and .
# put the - last so it isn't read as a range; a . inside [ ] is literal, so no escape is needed
$x =~ /[0-9+.-]+/;
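Where you put the - inside [ ] matters. A short sketch with made-up inputs: between two characters it makes a range, so +-. means "everything from + to .", and that happens to include the comma.

```perl
use strict;
use warnings;

# - at the end of the class is a plain minus sign
my $number_ok  = '-12.5' =~ /^[0-9+.-]+$/ ? 1 : 0;   # 1

# here +-. is a range from + to . which also contains the comma
my $comma_slip = '1,5'   =~ /^[0-9+-.]+$/ ? 1 : 0;   # 1 (probably not wanted)

# with the - moved to the end the comma is rejected
my $comma_safe = '1,5'   =~ /^[0-9+.-]+$/ ? 1 : 0;   # 0

print "number_ok=$number_ok comma_slip=$comma_slip comma_safe=$comma_safe\n";
```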
Perl Poetry
This isn't mine, but it's quite cool.
$_ = reverse sort qw p ekca lre uJ reh ts
p, $/.r, map $_.$", qw e p h tona e; print
Labels:
perl poetry