Definitions

Character Sets

A character set is a collection of characters, in which each character is represented by a number.
e.g. 65=A, 66=B, 67=C, ... 163=£, ...
Examples of character sets are:
ASCII, ISO 8859-1, Windows 1252, ANSI

Character Encoding

A character encoding is a means of representing a character set in a file.
For ASCII and Windows 1252 (or ANSI), it's easy: 1 byte = 1 character.
For large character sets, with more than 256 characters, it is more complex, as more than 1 byte per character is used.
"UTF-8" uses 1, 2, 3 or 4 bytes per character.

Character References in HTML and XML

An entity reference is of the form:
&quot; &euro; &copy;
A numeric reference (in HTML and XML) for character 255 (ÿ) is of the form:
&#255; (decimal) or &#xFF; (hexadecimal)

Character Sets

ASCII

ASCII is the original Character Set, with 128 characters defined.
1 byte = 1 character.

ISO 8859-1

This is the ISO "Western European" character set.
It is the original "web" character set, and used as the default by older browsers.
ISO 8859-1 is a subset of the larger UCS/Unicode character set: the first 256 Unicode characters are the ISO 8859-1 characters.
It is now deprecated (obsolete) - use UTF-8 instead.

Unicode and UCS (Universal Character Set)

This is a very large character set. It is a combination of the ISO 8859-1 characters,
plus mathematical and other symbols,
plus the Chinese, Hebrew, Japanese, Greek, Thai, Persian and other alphabets.
Some special cases:
- there are spaces reserved for 'user defined' characters,
- some characters can be combined to make composite characters (e.g. Thai)
Unicode/UCS is a character set. It is usually encoded using UTF-8 (UTF-16 and UTF-32 are other encodings).

Windows 1252 / ANSI Character Set

This is the Windows character set.
It is encoded using 1 byte (0-255) per character.
From 0-127, it's the same as ASCII.
Between 0x80 and 0x9F there are differences. This is the problem area: Windows 1252 puts printable characters here (e.g. the euro symbol and the 'curly' quotes), whereas ISO 8859-1 and Unicode have no printable characters at these positions.
From 0xA0 and above, it's "the same" as ISO 8859-1.
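
To see the problem area for yourself, here is a minimal sketch using Perl's core Encode module. The helper name code_point is my own invention for illustration. It decodes the same byte under both character sets and reports the Unicode character number each one produces.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode one byte under the given character set and
# return the Unicode character number it maps to.
# (code_point is a hypothetical helper name.)
sub code_point {
    my ( $charset, $byte ) = @_;
    return ord( decode( $charset, $byte ) );
}

# A, the euro byte, a curly quote byte, and e-acute
for my $byte ( "\x41", "\x80", "\x93", "\xE9" ) {
    printf "byte 0x%02X: cp1252 -> U+%04X, iso-8859-1 -> U+%04X\n",
        ord $byte,
        code_point( 'cp1252',     $byte ),
        code_point( 'iso-8859-1', $byte );
}
```

Only the bytes between 0x80 and 0x9F should give different answers in the two columns.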

Character Encoding

UTF-8

UTF-8 is actually a character encoding, not a character set. Colloquially, it is now used to mean "Unicode/UCS with the UTF-8 encoding"
It's a means of using 1, 2, 3 or 4 bytes per character to store a very large character set.
ASCII characters (0-127) take up 1 byte, so it's backwards compatible.
£, maths symbols, Chinese and Japanese characters take up 2, 3 or 4 bytes.
Some Windows editors use a 'BOM' (byte order mark), a character at the front of a file to indicate that the file contains UTF-8 encoded characters. (It's the character U+FEFF, stored in UTF-8 as the 3 bytes EF BB BF.) The Unicode standard permits it in UTF-8, but neither requires nor recommends it.
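
As a rough illustration of the 1, 2, 3 or 4 byte rule, this sketch encodes a few characters with Perl's core Encode module and counts the bytes. The helper name utf8_length and the choice of sample characters are mine, not part of any standard.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the number of bytes UTF-8 needs for a single character.
# (utf8_length is a hypothetical helper name.)
sub utf8_length {
    my ($char) = @_;
    return length( encode( 'UTF-8', $char ) );
}

# A (U+0041), pound sign (U+00A3), euro sign (U+20AC), an emoji (U+1F600)
for my $code ( 0x41, 0xA3, 0x20AC, 0x1F600 ) {
    printf "U+%04X takes %d byte(s)\n", $code, utf8_length( chr $code );
}
```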

Character Set Conversion Problems

Whatever you've used in the past, UTF-8 (Unicode/UCS) is the thing to aim for. Google, Blogger etc. all use it.

From Windows 1252 / ANSI to the ISO character sets

If converting text from a Windows file to a web page in ISO format, you may have to map some 'hi byte' characters, e.g. the euro symbol, as the character numbers will not be the same.
If copy-and-pasting, Windows will take care of the conversion for you.

From ISO 8859-1 to UTF-8/UCS/Unicode

Viewing an ISO 8859-1 file in a web browser page set to UTF-8 will display any characters greater than 127 as illegal characters.
In terms of character sets, the conversion is straightforward, as there are "no" differences you are likely to encounter.
The character encoding is the problem. Example: In ISO 8859-1, character 165 is stored as the single byte 165. In UTF-8, it should be 2 bytes (0xC2 0xA5). The single byte on its own is an illegal UTF-8 sequence.
The solution is programming language dependent or editor dependent.
- For example, in the Notepad++ editor, there is a 'convert ANSI to UTF-8' option.
- In Perl: use Encode; $utf8 = encode('UTF-8', decode('ISO-8859-1', $string));
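
Here is a minimal sketch of that conversion as a complete script, using the core Encode module. The sub name latin1_to_utf8 and the sample string are just for illustration. It prints the resulting bytes in hex so you can see each 'hi byte' character become 2 bytes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Convert a string of ISO 8859-1 bytes into a string of UTF-8 bytes.
# (latin1_to_utf8 is a hypothetical helper name.)
sub latin1_to_utf8 {
    my ($bytes) = @_;
    my $chars = decode( 'ISO-8859-1', $bytes );    # bytes -> characters
    return encode( 'UTF-8', $chars );              # characters -> UTF-8 bytes
}

# "\xE9" is e-acute and "\xA5" is character 165, each a single ISO 8859-1 byte
my $utf8 = latin1_to_utf8("caf\xE9 \xA5");

# show every output byte in hex
print join( ' ', map { sprintf '%02X', ord($_) } split //, $utf8 ), "\n";
# prints 63 61 66 C3 A9 20 C2 A5
```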

UTF-8 to ISO 8859-1

Viewing a UTF-8 file in a web browser page set to ISO 8859-1 will display 2 (or more) characters for each UTF-8 'hi byte' character.
e.g. For 2 byte UTF-8 characters, it will display an unwanted character followed by a second character, so £ (bytes 0xC2 0xA3) is displayed as Â£.
The solution: First, identify all characters in your input stream that don't have ISO 8859-1 equivalents.
Maybe convert
- all the exotic UTF-8 bullet points to &#nn;
- the exotic hyphens to - (minus sign)
- the various 6, 66, 9, 99 style quotes to ' and "
For XML feeds with character codes greater than 255, consider &#nn; escape sequences (rather than &name; or the binary code, both of which will cause problems).

Case Study

A well known international newspaper has a publishing system that uses UTF-8, and a series of XML feeds that use ISO 8859-1

Analyse an entire year's worth of newspaper articles.
- Make a list of every unique character used.
- Cater for &#nn; and &name; style characters.
Map all the UTF-8 characters found (with character code greater than 127) to ISO 8859-1 equivalents.
Flag up any UTF-8 characters encountered in the conversion process which are not covered by this mapping.
- Again, cater for &#nn; and &name; characters.
Escape all characters greater than 127 with the XML &#nn; escape sequence, so the output file is pure ASCII.
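
The steps above could be sketched roughly as follows. This is not the newspaper's actual code: the %map table, the %unmapped hash and the sub name to_ascii_xml are all made up for illustration, and a real mapping would come from the year's worth of analysis. It assumes the input has already been decoded from UTF-8 into Perl characters.

```perl
use strict;
use warnings;

# Hypothetical mapping from 'exotic' characters to plain equivalents.
my %map = (
    "\x{2018}" => "'",    # left single quote
    "\x{2019}" => "'",    # right single quote
    "\x{201C}" => '"',    # left double quote
    "\x{201D}" => '"',    # right double quote
    "\x{2013}" => '-',    # en dash
    "\x{2014}" => '-',    # em dash
);

my %unmapped;             # characters above 255 that the mapping missed

# Takes a string of decoded characters, returns pure ASCII text.
sub to_ascii_xml {
    my ($text) = @_;

    # 1. replace the characters we have a mapping for
    $text =~ s/ ( [^\x00-\x7F] ) / exists $map{$1} ? $map{$1} : $1 /gxe;

    # 2. flag anything left over that has no ISO 8859-1 equivalent
    $unmapped{$1}++ while $text =~ / ( [^\x00-\xFF] ) /gx;

    # 3. escape every remaining non-ASCII character as &#nn;
    $text =~ s/ ( [^\x00-\x7F] ) / '&#' . ord($1) . ';' /gxe;

    return $text;
}

print to_ascii_xml("\x{201C}caf\x{E9}\x{201D} \x{2013} \x{20AC}5"), "\n";
printf "unmapped: U+%04X\n", ord($_) for sort keys %unmapped;
```

The quotes and the dash are mapped to plain ASCII, the é becomes &#233;, and the euro sign is both escaped and recorded in %unmapped so that somebody can decide what to do about it.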

Appendix : Differences between Windows 1252 and the ISO Character Sets

In practice, you may wish to map characters to ', ", and - rather than "left single quote" etc.
Each row below gives the Windows 1252 byte, the Unicode character number, and the character name.


0x80 0x20ac ;Euro Sign
0x81 0x0081
0x82 0x201a ;Single Low-9 Quotation Mark
0x83 0x0192 ;Latin Small Letter F With Hook
0x84 0x201e ;Double Low-9 Quotation Mark
0x85 0x2026 ;Horizontal Ellipsis
0x86 0x2020 ;Dagger
0x87 0x2021 ;Double Dagger
0x88 0x02c6 ;Modifier Letter Circumflex Accent
0x89 0x2030 ;Per Mille Sign
0x8a 0x0160 ;Latin Capital Letter S With Caron
0x8b 0x2039 ;Single Left-Pointing Angle Quotation Mark
0x8c 0x0152 ;Latin Capital Ligature Oe
0x8d 0x008d
0x8e 0x017d ;Latin Capital Letter Z With Caron
0x8f 0x008f
0x90 0x0090
0x91 0x2018 ;Left Single Quotation Mark
0x92 0x2019 ;Right Single Quotation Mark
0x93 0x201c ;Left Double Quotation Mark
0x94 0x201d ;Right Double Quotation Mark
0x95 0x2022 ;Bullet
0x96 0x2013 ;En Dash
0x97 0x2014 ;Em Dash
0x98 0x02dc ;Small Tilde
0x99 0x2122 ;Trade Mark Sign
0x9a 0x0161 ;Latin Small Letter S With Caron
0x9b 0x203a ;Single Right-Pointing Angle Quotation Mark
0x9c 0x0153 ;Latin Small Ligature Oe
0x9d 0x009d
0x9e 0x017e ;Latin Small Letter Z With Caron
0x9f 0x0178 ;Latin Capital Letter Y With Diaeresis

How to Install Template Toolkit on Windows

If using ActiveState Perl on Windows XP, Windows Vista etc.:

Have a look at: http://cpan.uwinnipeg.ca/PPMPackages/10xx/
These modules are for Perl 5.10 (only). Use perl -v to check you're using 5.10.
Run ppm (ActiveState's Perl package manager)
- you can just type ppm at a DOS prompt (in Vista at least)
In ppm: Edit > Preferences > Add Repository
- name = university of winnipeg
- url = http://cpan.uwinnipeg.ca/PPMPackages/10xx/
- this will then sync with the repository
In ppm: Select Template Toolkit (may have to do "view > all packages" to see it)
In ppm: File > Run Marked Actions

As a bonus, there are loads of other Perl modules as well.

Regex : Tutorial

Regular expressions, or regex, are one of the best things about Perl. So good in fact, that they have been copied by most other programming languages.

They are sadly let down, however, by the Perl "man pages" and the chapter in the 'Programming Perl' camel book, which in no way does them justice. It explains them, but only in a way a Computer Science PhD student would understand.

So, here's my tutorial with lots of examples.

NB 1: this is based on Perl, but you can use them in Javascript, Python etc.

NB 2: once you understand this, you should be ready for the reference material

Regex : Using a regular expression (perl specific).

There are 3 ways to write a regex:

Use: $x =~ //
Use: $x =~ m//
or m## or m"" or m{} or anything else you care to choose
Use: qr()
or qr## or qr"" or qr{}
(qr compiles the regex into a value that you can store in a variable and reuse)

Why so many ways? First, that's Perl. But it helps with escaping, see examples below. Just choose delimiters that don't appear inside the regex.

The usual choices are // () ## and ""


$x =~ /andrew/;
$x =~ m/andrew/;
$x =~ m#andrew#;
$x =~ qr(andrew);

# matches speech marks
$x =~ m/"andrew"/;

# matches HTML tags - there's no need to escape the / in </b>
$x =~ m#<b>andrew</b>#;

Regex : Using whitespace and comments

The /x modifier

Using whitespace and comments inside your regular expression is a brilliant idea, especially as regexes often look like line noise, even to yourself a week later.

If you need a space, you'll need to escape it as \ (that \ then a SPACE ) or \s (which means any whitespace character)

$x =~ / andrew /ix;

# these last 2 are the same
$x =~ m# <b> \s* andrew \s* </b> #ix;

$x =~ m{                # braces are the delimiters that start and end the regex
      <b>
        \s* andrew \s*  # allow optional whitespace (0+ chars) at start and end
      </b>
   }ix;                 # i = ignore case, x = allow whitespace and comments

Regex : Some simple examples

Some simple examples to start


$x = 'Andrew';

print 'match' if $x =~ /rew/;
# prints match

print 'match' if $x =~ /And/;
# prints match

# no match, as "and" doesn't match "And"
print 'match' if $x =~ /and/;
# prints nothing

print 'match' if $x =~ /and/i; # the 'i' is ignore case
# prints match

Regex : How many times to match?

Place these quantifiers after a character:

+ matches 1+ times
* matches 0+ times
{3} after a character matches it 3 times
{2,3} after a character matches it 2 to 3 times

Greedy and Not Greedy


$x = "<p> one </p> <p> two </p>";

# .* is greedy, so this matches from the first <p> to the last </p>
$x =~ m# <p> .* </p> #x;

# .*? isn't greedy, so this matches <p> one </p> only
$x =~ m# <p> .*? </p> #x;
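
If you want to see the difference rather than take my word for it, here is a small runnable sketch. The sample HTML and the variable names are just for illustration. It captures what each version matched and prints it.

```perl
use strict;
use warnings;

my $html = '<p> one </p> <p> two </p>';

# Greedy: .* grabs as much as possible, so the match runs to the LAST </p>
my ($greedy) = $html =~ m# ( <p> .* </p> ) #x;

# Not greedy: .*? grabs as little as possible, so it stops at the FIRST </p>
my ($lazy) = $html =~ m# ( <p> .*? </p> ) #x;

print "greedy: $greedy\n";    # greedy: <p> one </p> <p> two </p>
print "lazy:   $lazy\n";      # lazy:   <p> one </p>
```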

Regex : Predefined Groups of characters

Groups of characters

\s matches a whitespace character
\S matches anything except a whitespace character
\w matches a word character (a-z, A-Z, 0-9, _, and letters from the Persian, Greek, Chinese alphabets, etc)
\W matches anything except a word character
\d matches 0..9
\D matches anything EXCEPT 0..9

Regex : Groups

Use ( ) for groups.


$html =~ /(andrew)/;
# matches andrew

$html =~ /((andrew)|(john)|(peter))/;
# matches andrew OR john OR peter

# the same, but with whitespace
# m# - use # as the delimiter
# #x - allow whitespace
$html =~ m#
 (
   ( andrew   )  |
   ( peter    )  |
   ( john     )
 )#x;
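
As a quick runnable sketch of groups with alternatives, the snippet below wraps the same idea in a sub. The name find_name and the sample strings are made up for illustration. The group captures whichever name matched first into $1.

```perl
use strict;
use warnings;

# Return the first of the three names found in the text, or undef.
# (find_name is a hypothetical helper name.)
sub find_name {
    my ($text) = @_;
    return $text =~ m# ( andrew | john | peter ) #x ? $1 : undef;
}

for my $text ( 'hello john', 'peter and andrew', 'nobody here' ) {
    my $name = find_name($text);
    printf "%-18s => %s\n", $text, defined $name ? $name : '(no match)';
}
```

Note that 'peter and andrew' gives peter: the regex finds the leftmost match in the string, not the first alternative in the list.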

Perl Complex Variables

Perl has 3 variable types: scalar, array and hash. These can be combined to make complex variables.

Unfortunately, the syntax isn't kind.

Defining Scalars, Arrays and Hashes.


$name = 'andrew' ;

# these 2 are the same, qw() is 'syntax sugar'
@names = qw( andrew john peter );
@names = ('andrew', 'john', 'peter');

%age = (
 'andrew' => 20,
 'peter'  => 30,
 'paul'   => 17,
 );

# note the outside hash uses (
# - but the inside hash uses {

%names = (
 'andrew' => { 
      'age'    => 20, 
      'height' => 1.76,
      },
 'peter'  => { 
      'age'    => 30, 
      'height' => 1.2, 
      'friends'=> ['andrew', 'paul'], 
      }
 );

Scalars


$x =  $names{'andrew'}{'age'};
# now $x = 20 

$names{'andrew'}{'age'} = 35 ;

Arrays

Arrays are a bit more complex - use @{ } to access them and [ ] to set them


@x = @{ $names{'peter'}{'friends'} } ;

$names{'peter'}{'friends'} = [ 'bill', 'john', 'fred' ] ;

Hashes

Hashes are similar, use %{ } to access them and { } to set them


%hash = %{ $names{'andrew'} } ;

$names{'andrew'} = { 'age' => 20, 'height' => 1.76 } ;

Important Gotcha

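When using the %hash = %{ } and @array = @{ } syntax, you are making a shallow copy of the data: the top-level values are copied, but any nested arrays or hashes are still shared references, NOT copies.


%hash = %{ $names{'peter'} } ;
# %hash is a shallow copy, so changing $hash{'age'} does not change %names
# but $hash{'friends'} is still the same array as $names{'peter'}{'friends'}

$ref = $names{'peter'} ;
# $ref is a reference, change $ref->{'age'} and you change %names, and vice versa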
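
Here is a small sketch that shows which changes leak back into the original data and which don't. It uses dclone from the core Storable module for a full (deep) copy. The variable names are mine, and the data is a cut-down version of the %names hash above.

```perl
use strict;
use warnings;
use Storable qw(dclone);

my %names = (
    'peter' => { 'age' => 30, 'friends' => [ 'andrew', 'paul' ] },
);

my %copy = %{ $names{'peter'} };          # shallow copy of the inner hash
my $ref  = $names{'peter'};               # reference to the same inner hash
my $deep = dclone( $names{'peter'} );     # deep copy, shares nothing

$copy{'age'} = 31;                        # does NOT change %names
push @{ $copy{'friends'} }, 'bill';       # DOES change %names (shared array)
$ref->{'age'} = 40;                       # DOES change %names
push @{ $deep->{'friends'} }, 'fred';     # does NOT change %names

print "age: $names{'peter'}{'age'}\n";                 # age: 40
print "friends: @{ $names{'peter'}{'friends'} }\n";    # friends: andrew paul bill
```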

Regex : Doesn't match


# $x does not contain andrew
$x !~ /andrew/;

Regex : Match on word Boundary

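\b matches on a word boundary (between a word character and a non-word character, or at the start or end of the string)

\B matches anywhere that is not a word boundary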
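
A short sketch of both in action. The sub name count_word_and and the sample text are only for illustration. It counts how many times 'and' appears as a whole word, ignoring the 'and' inside andrew, sandy and band.

```perl
use strict;
use warnings;

# Count how many times 'and' appears as a whole word.
# (count_word_and is a hypothetical helper name.)
sub count_word_and {
    my ($text) = @_;
    my $count = () = $text =~ /\band\b/g;    # list of matches, counted
    return $count;
}

my $text = 'andrew and sandy and band';
print count_word_and($text), "\n";           # prints 2

# \B is the opposite: 'and' with a word character on both sides
print "inside a word\n" if 'sandy' =~ /\Band\B/;
```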

Regex : . and the /s modifier

A . matches anything except a newline

To make . match a newline as well, use the /s modifier (treat as a Single line)


$x = qq(
andrew
was
here
);

$x =~ /andrew .* here/xs;
# matches as the /s carries the search over the newlines

Regex : Anchors and the /m modifier

Anchors
^ matches at the start of a string
$ matches at the end of a string


$x = 'andrew';

$x =~ /^a/ ;
# matches as andrew starts with an a

$x =~ /rew$/;
# matches as andrew ends with a rew

But what if you have a string containing several lines (e.g. a file with \n characters) ?

The /m modifier.

This treats the string as multiple lines, so ^ and $ will match the start and end of each line in the file


$x = qq(andrew
john andrew
andrew
);
$x =~ s/^andrew/ANDREW/mg;
# will replace the 1st and 3rd andrew
# without the /m, it would only replace the first

Regex : Global Replace

The /g modifier

This repeats the regex until there are no more matches


$x = 'andrew andrew andrew';

$x =~ s/a/A/;
# Andrew andrew andrew (1 replace)

$x =~ s/a/A/g;
# Andrew Andrew Andrew (all the 'a' get replaced)

Regex : Using expressions

This is the /e modifier

Example: make HTML tags upper case


# ( ) - capture the text
# .*? - match the least amount of text, i.e. an HTML tag
# uc ($1) - make the matched text uppercase
# gxe - g is global replace
# gxe - x is allow whitespace and comments
# gxe - e means the bit on the right is an expression to be evaluated

$html =~ s# ( < .*? > ) # uc($1) #gxe ;
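
To try the /e modifier out, here is a minimal runnable sketch. The sub name uppercase_tags and the sample HTML are made up for illustration. The right-hand side of the substitution is Perl code, and whatever it returns becomes the replacement text.

```perl
use strict;
use warnings;

# Upper-case every HTML tag, leaving the text between the tags alone.
# (uppercase_tags is a hypothetical helper name.)
sub uppercase_tags {
    my ($html) = @_;
    $html =~ s# ( < .*? > ) # uc($1) #gxe;
    return $html;
}

print uppercase_tags('<p>andrew <b>was</b> here</p>'), "\n";
# prints <P>andrew <B>was</B> here</P>
```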

Example : find hi byte characters


# find hi-byte characters in HTML
# - and keep a record of all the hi byte chars found in %found
my %found ;

sub find {
 my ($char) = @_ ;
 $found{ $char } ++; ## keep a record
 return '[' . ord( $char ) . ']';
 }

## this will find euro and £ symbols, but not &euro;
## \x80 is hex (character 128)
## \x80-\x{ffff} is a range - if the file is utf8 (and has been decoded), it will match hi byte chars as well
## ( ) - capture the character so it can be passed on as $1
## gxe - expression, whitespace, and global replace
$html =~ s# ( [\x80-\x{ffff}] ) # &find( $1 ) #gxe;

Regex : Substitutions

Regex : Capture Matched text

Use brackets to capture text, and $1, $2 ... to refer to it


$x = 'andrew john';

# swap the first and second word around

$x =~ s/ (\w+) \s+ (\w+) /$2 $1/x;

# s/    - substitute
# (\w+) - 1+ word characters, captured as $1
# \s+   - 1+ whitespace characters
# (\w+) - 1+ word chars, captured as $2
# /x    - use whitespace and comments (in the left hand side only)

# $1 and $2 survive
print $2;
# prints john

Regex: Ranges

Use [ ] for ranges - match any 1 char inside the range


$x =~ /[0-9]/;
$x =~ /[a-z]/i;   # so match a..z and A..Z as /i is ignore case
$x =~ /[\n\r]/;

# match 1+ chars in the range 0..9 + - and .
# inside [ ] the dot is literal, and the - goes last so it doesn't make a range
$x =~ /[0-9+.-]+/;
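
A small sketch of that last range at work. The sub name first_number and the sample strings are just for illustration. It pulls the first run of digits, +, - and . out of a string.

```perl
use strict;
use warnings;

# Return the first run of digits, +, - and . found in the text, or undef.
# (first_number is a hypothetical helper name.)
sub first_number {
    my ($text) = @_;
    return $text =~ /([0-9+.-]+)/ ? $1 : undef;
}

print first_number('temp is -12.5 degrees'), "\n";    # prints -12.5
print first_number('call +44 20 7946'), "\n";         # prints +44
```

It's only a rough match: a string like "1.2.3" or "+-." would also be accepted, as the range says nothing about the order of the characters.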

Perl Poetry

This isn't mine, but it's quite cool.

$_ = reverse sort qw p ekca lre uJ reh ts
p, $/.r, map $_.$", qw e p h tona e; print

Recommended PC Shareware

On my new Windows Vista PC I have:

Google Pack (which includes)

Firefox (web browser)
Picasa (organises photos)
Adobe PDF Viewer
Spyware Doctor (anti virus)
Google Earth

Comodo (firewall) NB: Online Armour is better for Windows XP

Ubuntu (so the PC can dual boot in Linux)

Activestate Perl (a programming language)

OpenOffice (free alternative to MS Word, MS Excel etc)

Notepad++ (an editor that believes in Perl)

The National Trust in Southeast England

The National Trust is a charity that owns many historic buildings and gardens in the UK. This is a review of its properties in the southeast.

Many NT places are country houses set in grounds, or in rural areas that are nice for a walk, so it's best to visit them on pleasant days. Very few have public transport, so you really need a car to visit most of them.

Their pricing policy makes it expensive to visit an individual property, but the annual membership is very good value. Join via the website to get 25% off your first year. Note that they send out your next year's membership pack after 9 months or so to 'force' you into renewing.

Kent (southeast of London)

Knole Palace : 4/5
Large Bishop's palace set in a massive deer park. One of the biggies. You can walk from Sevenoaks train station. Nice tea room. Outside is stunning. Interior is a dimly lit (conservation) museum of very old furniture and faded portraits. The landscaped deer park is beautiful. Recommended.
To Do
Chartwell, Emmetts, Ightham Mote, Quebec House, Scotney Castle, Sissinghurst, Smallhythe

Sussex (south of London)

Nymans Gardens : 5/5
A large, beautiful, world-class garden with a ruined house set on a hill. One of the biggies (for gardeners). Hundreds of different plants, shrubs and trees. South of London, just off the M23, between Gatwick Airport and Brighton. Waymarked walks down in the nearby forest. OK tearoom. Best when the garden is in flower (spring and summer). Go there!
Petworth : 5/5
Very large house, almost a palace, containing a large selection of paintings and sculpture, set in a massive landscaped park. One of the biggies. Most of the ground floor is open (and some family rooms on weekdays only). Great for a walk or picnic. Go there!

Uppark : 2/5
Great looking house, and a great location on a quiet part of the South Downs. However the interior is very "cold", like an unlived in museum. Visit for a walk or picnic. A little information about the people that lived in it would have made for a much more interesting visit. No public transport. A long way southwest of London, near Portsmouth. Poor tearoom, but it would be amazing if you could sit outside in summer, so best on a nice day.

Surrey

Clandon Park and Hatchlands : 4/5
2 country houses close to each other in parkland. Takes about 3 hours with a walk in the parkland
Polesden Lacey : 4/5
Large country house overlooking a secluded valley. Formal grassed terrace with a view. Waymarked walks in the nearby forest. South of London, just outside the M25. OK tearoom (it's being relocated). Recommended.
Winkworth Arboretum : 3/5
Pretty trees on a hillside

Hampshire (Southwest of London)

To Do
Hinton Ampner, Mottisfont Abbey, The Vyne

Berkshire (west of London)

Basildon Park : 2/10
Nice, medium sized house in a park, close to the Thames. Little about the family. Takes 1 hour.
To Do
Ashdown House

Oxfordshire (west of London)

To Do
Buscot Park, Chastleton House, Greys Court

Buckinghamshire (northwest of London)

Claydon House : 5/5
Beautiful, medium sized house, remote location, near Stowe. Very homely. Highly recommended. Nice tearoom. Takes 1-2 hours.
Stowe : 3/5
A large park with many follies. One of the biggies. I was really looking forward to this place as I like follies, but it was a bit of an anticlimax. Takes about 2-3 hours to visit as you walk around. NB the house belongs to the school, and isn't NT.
To Do
Hughenden Manor, Waddesdon Park, West Wycombe Manor, Cliveden, Ascott

Cablechip Solution's Blog


Cablechip Solutions

web development with Unix, Perl, Javascript, HTML and web services
