Definitions

Character Sets

A character set is a sequence of characters, in which each character is represented by a number.
e.g. 65=A, 66=B, 67=C, ... 163=£, ...
Examples of character sets are:
ASCII, ISO 8859-1, Windows 1252 (often loosely called ANSI)
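As a quick illustration of the "character = number" idea, here is a minimal sketch in Python (chosen purely as an example language, it is not used elsewhere in this post). The built-in ord() and chr() functions convert between a character and its number:

```python
# Look up the number behind a character, and the character behind a number.
print(ord("A"))        # 65
print(ord("\u00a3"))   # 163 (the pound sign)
print(chr(66))         # B

# The numbers for a short run of text.
codes = [ord(c) for c in "ABC"]
print(codes)           # [65, 66, 67]
```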

Character Encoding

A character encoding is a means of representing a character set in a file.
For ASCII, ISO 8859-1 and Windows 1252 (or ANSI), it's easy: 1 byte = 1 character.
For large character sets, with more than 256 characters, it is more complex, as more than 1 byte per character is used.
"UTF-8" uses 1, 2, 3 or 4 bytes per character.

Character References in HTML and XML

An entity reference is of the form:
&quot; &euro; &copy;
A numeric reference (in HTML and XML) for character 255 is of the form:
&#255; (decimal) or &#xff; (hexadecimal)
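To see what these references resolve to, here is a small Python sketch using the standard html module. The helper name numeric_reference is made up for this example:

```python
import html

# Resolve named and numeric references to the characters they stand for.
named = html.unescape("&quot; &euro; &copy;")
decimal = html.unescape("&#255;")
hexadecimal = html.unescape("&#xff;")
print(decimal == hexadecimal == "\u00ff")   # True


def numeric_reference(ch: str) -> str:
    """Build the decimal numeric reference for a single character."""
    return f"&#{ord(ch)};"


print(numeric_reference("\u00ff"))   # &#255;
print(numeric_reference("\u20ac"))   # &#8364;
```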

Character Sets

ASCII

ASCII is the original Character Set, with 128 characters defined.
1 byte = 1 character.

ISO 8859-1

This is the ISO "Western European" character set.
It is the original "web" character set, and used as the default by older browsers.
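ISO 8859-1 is a subset of the larger UCS/Unicode character set: the first 256 Unicode characters have the same numbers. The way those numbers are stored as bytes differs, though (see UTF-8 below).
It is now deprecated (obsolete) - use UTF-8 instead.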

Unicode and UCS (Universal Character Set)

This is a very large character set. It is a combination of the ISO 8859-1 characters,
plus mathematical and other symbols,
plus the Chinese, Hebrew, Japanese, Greek, Thai, Persian and other scripts.
Some special cases:
- there are ranges reserved for 'user defined' (private use) characters,
- some characters can be combined to make composite characters (e.g. Thai)
Unicode/UCS is a character set. It is most commonly encoded using UTF-8; UTF-16 and UTF-32 are alternative encodings.
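Composite characters are easy to demonstrate. The sketch below uses a Latin example (e plus a combining acute accent) rather than Thai, simply because it is easier to read; the Python unicodedata module does the composing:

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: two characters, one glyph.
decomposed = "e\u0301"

# NFC normalisation merges the pair into the single precomposed character.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))   # 2 1
print(composed == "\u00e9")             # True
```

The practical consequence is that two strings which look identical on screen can contain different character numbers, so normalise before comparing.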

Windows 1252 / ANSI Character Set

This is the Windows character set.
It is encoded using 1 byte (0-255) per character.
From 0-127, it's the same as ASCII.
Between 0x80 and 0x9F there are differences. This is the problem area: Windows 1252 puts printable characters here (e.g. the euro symbol and the 'smart' quotes), whereas in ISO 8859-1 and Unicode these positions are non-printing control codes.
From 0xA0 and above, it's the same as ISO 8859-1.
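The three ranges can be checked with a few lines of Python. This is only a sketch, relying on Python's built-in "cp1252" and "iso-8859-1" codecs:

```python
# The same byte means different things in the 0x80-0x9F range.
byte = b"\x80"
as_cp1252 = byte.decode("cp1252")        # euro sign, U+20AC
as_latin1 = byte.decode("iso-8859-1")    # control code, U+0080
print(hex(ord(as_cp1252)), hex(ord(as_latin1)))   # 0x20ac 0x80

# Outside that range the two character sets agree.
low = bytes(range(0x00, 0x80))
high = bytes(range(0xA0, 0x100))
same_low = low.decode("cp1252") == low.decode("iso-8859-1")
same_high = high.decode("cp1252") == high.decode("iso-8859-1")
print(same_low, same_high)               # True True
```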

Character Encoding

UTF-8

UTF-8 is actually a character encoding, not a character set. Colloquially, it is now used to mean "Unicode/UCS with the UTF-8 encoding".
It's a means of using 1, 2, 3 or 4 bytes to store a very large character set.
ASCII characters (0-127) take up 1 byte, so it's backwards compatible.
£, maths symbols, Chinese and Japanese characters take up 2, 3 or 4 bytes.
Some Windows editors write a 'BOM' (byte order mark) at the front of a file to indicate that the file contains UTF-8 encoded characters. It is the character U+FEFF, stored as the 3 bytes EF BB BF. The Unicode standard permits it but neither requires nor recommends it for UTF-8, and some tools do not expect it.
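Both points (variable length, and the BOM) can be seen in a short Python sketch. The sample characters are my own choice, one for each byte length:

```python
import codecs

# One sample character per UTF-8 length: A, pound sign, euro sign, an emoji.
samples = ["A", "\u00a3", "\u20ac", "\U0001F600"]
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_lengths)                  # [1, 2, 3, 4]

# ASCII text is byte-for-byte identical in ASCII and UTF-8.
ascii_compatible = "plain text".encode("utf-8") == "plain text".encode("ascii")

# The UTF-8 BOM is U+FEFF encoded as three bytes.
bom = codecs.BOM_UTF8
print(bom.hex())                     # efbbbf

data = bom + "caf\u00e9".encode("utf-8")
kept = data.decode("utf-8")          # BOM survives as a leading U+FEFF
stripped = data.decode("utf-8-sig")  # BOM removed if present
print(len(kept), len(stripped))      # 5 4
```

When reading files that may or may not start with a BOM, decoding with "utf-8-sig" handles both cases.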

Character Set Conversion Problems

Whatever you've used in the past, UTF-8 (Unicode/UCS) is the thing to aim for. Google, Blogger etc. all use it.

From Windows 1252 / ANSI to the ISO character sets

If converting text from a Windows file to a web page in ISO format, you may have to map some 'hi byte' characters, e.g. the euro symbol, as the character numbers will not be the same.
If you are copying and pasting, Windows will take care of the conversion for you.
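A minimal Python sketch of that mapping follows. It assumes you are happy to fall back to &#nn; references for anything ISO 8859-1 cannot hold, which is one option among several; the sample bytes are invented for the example:

```python
# Windows 1252 bytes: euro sign (0x80), en dash (0x96), curly quotes (0x93, 0x94).
windows_bytes = b"Price: \x80 5 \x96 \x93special\x94"

# Step 1: turn the bytes into characters using the Windows character set.
text = windows_bytes.decode("cp1252")

# Step 2: write ISO 8859-1, replacing anything it lacks with a numeric reference.
iso_bytes = text.encode("iso-8859-1", errors="xmlcharrefreplace")
print(iso_bytes)
# b'Price: &#8364; 5 &#8211; &#8220;special&#8221;'
```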

From ISO 8859-1 to UTF-8/UCS/Unicode

Viewing an ISO 8859-1 file in a web browser page set to UTF-8 will display any characters greater than 127 as illegal characters.
In terms of character sets, the conversion is straightforward, as there are "no" differences you are likely to encounter.
The character encoding is the problem. Example: In ISO 8859-1, character(165) is stored as the single byte 165 (0xA5). In UTF-8, it should be the 2 bytes 0xC2 0xA5. The single byte on its own is an invalid UTF-8 sequence.
The solution is programming language dependent or editor dependent.
- For example, in the Notepad++ editor, there is a 'convert ANSI to UTF-8' option.
- In Perl (on a string of raw ISO 8859-1 bytes): $string =~ s/ ( [\x80-\xff] ) / chr(0xC0 | ord($1) >> 6) . chr(0x80 | ord($1) & 0x3F) /gxe;
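For comparison, here is the same conversion sketched in Python, using the character(165) example from above. It shows both the failure and the fix:

```python
# Character 165 (the yen sign) stored as a single ISO 8859-1 byte.
latin1_bytes = b"\xa5 100"

# Read as UTF-8, the lone 0xA5 byte is rejected.
try:
    latin1_bytes.decode("utf-8")
    valid_as_utf8 = True
except UnicodeDecodeError:
    valid_as_utf8 = False
print(valid_as_utf8)     # False

# Decode with the encoding the bytes are really in, then re-encode as UTF-8.
utf8_bytes = latin1_bytes.decode("iso-8859-1").encode("utf-8")
print(utf8_bytes)        # b'\xc2\xa5 100'
```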

UTF-8 to ISO 8859-1

Viewing a UTF-8 file in a web browser page set to ISO 8859-1 will display 2 (or more) characters for each UTF-8 'hi byte' character.
e.g. For 2 byte UTF-8 characters, it will display a spurious character followed by a second character, which is sometimes the one you want: £ appears as Â£, but é appears as Ã©.
The solution: First, identify all characters in your input stream that don't have ISO 8859-1 equivalents.
Maybe convert
- all the exotic UTF-8 bullet points to &#nn;
- the exotic hyphens to - (hyphen-minus)
- the various 6, 66, 9, 99 style quotes to ' and "
For XML feeds with character codes greater than 255, consider &#nn; escape sequences (rather than &name;, which an XML parser will not recognise without a DTD unless it is one of the five predefined names, or the binary code, which ISO 8859-1 cannot represent).
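The substitutions above can be sketched in Python as a lookup table plus a fallback. The REPLACEMENTS table and the function name utf8_to_latin1 are illustrative only; a real table would be built from the characters you actually find in your data:

```python
# Illustrative mapping from Unicode character numbers to plain replacements.
REPLACEMENTS = {
    0x2018: "'", 0x2019: "'",            # 6 and 9 style single quotes
    0x201C: '"', 0x201D: '"',            # 66 and 99 style double quotes
    0x2010: "-", 0x2013: "-", 0x2014: "-",  # hyphen, en dash, em dash
    0x2022: "&#8226;",                   # bullet, kept as a numeric reference
}


def utf8_to_latin1(data: bytes) -> bytes:
    """Convert UTF-8 bytes to ISO 8859-1 bytes, escaping what does not fit."""
    text = data.decode("utf-8").translate(REPLACEMENTS)
    # Anything still outside ISO 8859-1 becomes a &#nn; reference.
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


source = "\u201cHi\u201d \u2013 caf\u00e9 \u2022 \u20ac".encode("utf-8")
result = utf8_to_latin1(source)
print(result)   # b'"Hi" - caf\xe9 &#8226; &#8364;'
```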

Case Study

A well known international newspaper has a publishing system that uses UTF-8, and a series of XML feeds that use ISO 8859-1

Analyse an entire year's worth of newspaper articles.
- Make a list of every unique character used.
- Cater for &#nn; and &name; style characters.
Map all the UTF-8 characters found (with character code greater than 127) to ISO 8859-1 equivalents.
Flag up any UTF-8 characters encountered in the conversion process which are not covered by this mapping.
- Again, cater for &#nn; and &name; characters.
Escape all characters greater than 127 with the XML &#nn; escape sequence, so the output file is pure ASCII.
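The steps above could be sketched in Python roughly as follows. This is not the newspaper's actual code: the MAPPING table and the function names survey and convert are hypothetical, and the sketch assumes each article is plain text (not markup) that may contain &#nn; and &name; references.

```python
import html
from collections import Counter

# Hypothetical mapping, to be built from the survey results.
MAPPING = {
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
    "\u2013": "-", "\u2014": "-",
}


def survey(articles):
    """Count every non-ASCII character, after resolving &#nn; and &name;."""
    counts = Counter()
    for article in articles:
        counts.update(ch for ch in html.unescape(article) if ord(ch) > 127)
    return counts


def convert(article):
    """Return (pure ASCII text, characters with no ISO 8859-1 equivalent)."""
    unmapped = set()
    out = []
    for ch in html.unescape(article):
        if ch in MAPPING:
            ch = MAPPING[ch]
        elif ord(ch) > 255:
            unmapped.add(ch)             # flag for a human to review
        if ord(ch) < 128:
            out.append(html.escape(ch, quote=False))  # re-escape & < >
        else:
            out.append(f"&#{ord(ch)};")
    return "".join(out), unmapped


counts = survey(["caf\u00e9 &eacute; &#233;"])
text, unmapped = convert("\u201cCaf\u00e9\u201d &euro;5 &#8212; ok \u4e2d")
print(text)   # "Caf&#233;" &#8364;5 - ok &#20013;
```

Characters that end up in the unmapped set are still escaped, so the output stays valid, but they are the ones worth adding to the mapping.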

Appendix : Differences between Windows 1252 and the ISO Character Sets

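In practice, you may wish to map characters to ', ", and - rather than "left single quote" etc.
Each row below gives the Windows 1252 byte, the corresponding Unicode character number, and the Unicode character name. Rows with no name (0x81, 0x8d, 0x8f, 0x90, 0x9d) are positions that Windows 1252 leaves undefined.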


0x80 0x20ac ;Euro Sign
0x81 0x0081
0x82 0x201a ;Single Low-9 Quotation Mark
0x83 0x0192 ;Latin Small Letter F With Hook
0x84 0x201e ;Double Low-9 Quotation Mark
0x85 0x2026 ;Horizontal Ellipsis
0x86 0x2020 ;Dagger
0x87 0x2021 ;Double Dagger
0x88 0x02c6 ;Modifier Letter Circumflex Accent
0x89 0x2030 ;Per Mille Sign
0x8a 0x0160 ;Latin Capital Letter S With Caron
0x8b 0x2039 ;Single Left-Pointing Angle Quotation Mark
0x8c 0x0152 ;Latin Capital Ligature Oe
0x8d 0x008d
0x8e 0x017d ;Latin Capital Letter Z With Caron
0x8f 0x008f
0x90 0x0090
0x91 0x2018 ;Left Single Quotation Mark
0x92 0x2019 ;Right Single Quotation Mark
0x93 0x201c ;Left Double Quotation Mark
0x94 0x201d ;Right Double Quotation Mark
0x95 0x2022 ;Bullet
0x96 0x2013 ;En Dash
0x97 0x2014 ;Em Dash
0x98 0x02dc ;Small Tilde
0x99 0x2122 ;Trade Mark Sign
0x9a 0x0161 ;Latin Small Letter S With Caron
0x9b 0x203a ;Single Right-Pointing Angle Quotation Mark
0x9c 0x0153 ;Latin Small Ligature Oe
0x9d 0x009d
0x9e 0x017e ;Latin Small Letter Z With Caron
0x9f 0x0178 ;Latin Capital Letter Y With Diaeresis
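A table in this layout can be regenerated from Python's own "cp1252" codec, as a cross-check. This is only a sketch: the function name cp1252_row is made up, and Python's codec reports the five unused positions as errors rather than mapping them to control codes.

```python
import unicodedata


def cp1252_row(byte: int) -> str:
    """Format one table row for a Windows 1252 byte value."""
    try:
        ch = bytes([byte]).decode("cp1252")
    except UnicodeDecodeError:
        # Python's codec treats the unused positions as undefined.
        return f"0x{byte:02x} (undefined)"
    name = unicodedata.name(ch).title()
    return f"0x{byte:02x} 0x{ord(ch):04x} ;{name}"


rows = [cp1252_row(b) for b in range(0x80, 0xA0)]
print(rows[0])    # 0x80 0x20ac ;Euro Sign
print(rows[1])    # 0x81 (undefined)
```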

Cablechip Solution's Blog

Cablechip Solutions

web development with Unix, Perl, Javascript, HTML and web services
