Definitions
Character Sets
- A character set is a collection of characters, each of which is represented by a number.
e.g. 65=A, 66=B, 67=C, ... 163=£, ...
- Examples of character sets are:
ASCII, ISO 8859-1, Windows 1252 (often called ANSI)
Character Encoding
- A character encoding is a means of representing a character set in a file.
- For ASCII and Windows 1252 (or ANSI), it's easy: 1 byte = 1 character.
- For large character sets, with more than 256 characters, it is more complex, as more than 1 byte per character is used.
- "UTF-8" uses 1, 2, 3 or 4 bytes per character.
- An entity reference is of the form:
&quot; &euro; &copy;
- A numeric reference (in HTML and XML) for character 255 is of the form:
&#255; (decimal) or &#xFF; (hexadecimal)
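The definitions above can be tried out in a few lines. This is a minimal sketch using only Python's standard library; the variable names are illustrative, and the pound sign is written as an escape (\u00a3) so that the example does not depend on how the source file itself is encoded.

```python
import html

# A character set gives each character a number (its code point).
letter_a = ord("A")                 # 65
pound = "\u00a3"                    # the pound sign
code_point = ord(pound)             # 163

# A character encoding turns that number into bytes in a file.
latin1_bytes = pound.encode("iso-8859-1")   # 1 byte
utf8_bytes = pound.encode("utf-8")          # 2 bytes

# Entity and numeric references spell a character out in plain ASCII.
from_entity = html.unescape("&copy;")
from_decimal = html.unescape("&#255;")
from_hex = html.unescape("&#xFF;")

print(letter_a, code_point, latin1_bytes, utf8_bytes)
print(ord(from_entity), ord(from_decimal), ord(from_hex))
```

The same character has one number but a different byte sequence in each encoding, which is the distinction the rest of this post relies on.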
Character Sets
ASCII
- ASCII is the original Character Set, with 128 characters defined.
- 1 byte = 1 character.
ISO 8859-1
- This is the ISO "Western European" character set.
- It is the original "web" character set, and used as the default by older browsers.
- ISO 8859-1 is a subset of the larger UCS/Unicode character set: its 256 positions have the same numbers in Unicode.
- It is now deprecated (obsolete) - use UTF-8 instead.
Unicode and UCS (Universal Character Set)
- This is a very large character set. It is a combination of the ISO 8859-1 characters,
plus mathematical and other symbols,
plus the Chinese, Hebrew, Japanese, Greek, Thai, Persian and other alphabets.
- Some special cases:
- there are ranges reserved for 'user defined' (private use) characters,
- some characters can be combined to make composite characters (e.g. Thai)
- Unicode/UCS is a character set. It is most commonly encoded using UTF-8; UTF-16 and UTF-32 are alternative encodings.
Windows 1252 (ANSI)
- This is the Windows character set.
- It is encoded using 1 byte (0-255) per character.
- From 0-127, it's the same as ASCII.
- Between 0x80 and 0x9F there are differences. This is the problem area: Windows 1252 puts printable characters such as the euro sign and the curly quotes here, whereas in ISO 8859-1 and Unicode these positions hold only control codes.
- From 0xA0 and above, it's "the same" as ISO 8859-1.
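The three byte ranges described above can be checked directly. The sketch below assumes Python's built-in "cp1252" and "iso-8859-1" codecs are an acceptable stand-in for the two character sets; note that Python's cp1252 codec rejects the five positions Windows 1252 leaves unassigned, which other tools may handle differently.

```python
# 0x80-0x9F: printable in Windows 1252, control codes in ISO 8859-1.
problem_byte = bytes([0x80])
as_cp1252 = problem_byte.decode("cp1252")       # euro sign, U+20AC
as_latin1 = problem_byte.decode("iso-8859-1")   # control code U+0080

# 0xA0-0xFF: both character sets agree.
high_bytes = bytes(range(0xA0, 0x100))
same_above_a0 = high_bytes.decode("cp1252") == high_bytes.decode("iso-8859-1")

# Positions in the problem area that Python's cp1252 codec treats as undefined.
undefined = []
for value in range(0x80, 0xA0):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(hex(value))

print(hex(ord(as_cp1252)), hex(ord(as_latin1)), same_above_a0)
print(undefined)
```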
Character Encoding
UTF-8
- UTF-8 is actually a character encoding, not a character set. Colloquially, it is now used to mean "Unicode/UCS with the UTF-8 encoding"
- It's a means of using 1, 2, 3 or 4 bytes to store a very large character set.
- ASCII characters (0-127) take up 1 byte, so it's backwards compatible.
- £, maths symbols, Chinese and Japanese characters take up 2, 3 or 4 bytes
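- Some Windows editors write a 'BOM' (byte order mark) at the front of a file to indicate that the file contains UTF-8 encoded characters. It is the character U+FEFF, stored in UTF-8 as the 3 bytes EF BB BF. The Unicode standard permits it in UTF-8 files but neither requires nor recommends it, and some tools do not expect it.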
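To make the byte counts and the BOM concrete, here is a small sketch using Python's standard codecs. The sample characters are my own choice for illustration, written as escapes so the example does not depend on the encoding of the source file.

```python
import codecs

samples = {
    "letter A": "A",
    "pound sign": "\u00a3",
    "euro sign": "\u20ac",
    "Chinese character": "\u4e2d",
    "emoji": "\U0001f600",
}

# Number of bytes each character needs in UTF-8.
utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

# A file that starts with a BOM, as some Windows editors write it.
bom = codecs.BOM_UTF8
file_bytes = bom + "hello".encode("utf-8")

kept = file_bytes.decode("utf-8")          # BOM survives as U+FEFF
stripped = file_bytes.decode("utf-8-sig")  # BOM removed if present

print(utf8_lengths)
print(bom, repr(kept), repr(stripped))
```

Decoding with "utf-8-sig" is a convenient way to read files that may or may not start with a BOM, since it accepts both.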
Character Set Conversion Problems
Whatever you've used in the past, UTF-8 (Unicode/UCS) is the thing to aim for. Google, Blogger etc. all use it.
From Windows 1252 / ANSI to the ISO character sets
- If converting text from a Windows file to a web page in ISO format, you may have to map some 'hi byte' characters, as the character numbers will not be the same. For example, the euro symbol is byte 0x80 in Windows 1252 but has no position at all in ISO 8859-1.
- If copying and pasting, Windows will take care of the conversion for you.
From ISO 8859-1 to UTF-8/UCS/Unicode
- Viewing an ISO 8859-1 file in a web browser page set to UTF-8 will display any characters greater than 127 as illegal characters.
- In terms of character sets, the conversion is straightforward, as there are "no" differences you are likely to encounter.
- The character encoding is the problem. Example: In ISO 8859-1, character 165 is stored as the single byte 165 (0xA5). In UTF-8, it should be the 2 bytes 0xC2 0xA5. The single byte on its own is an illegal UTF-8 sequence.
- The solution is programming language dependent or editor dependent.
- For example, in the Notepad++ editor, there is a 'convert ANSI to UTF-8' option.
- In Perl, assuming $string holds raw ISO 8859-1 bytes: $string =~ s/ ( [\x80-\xff] ) / chr(0xC0 | (ord($1) >> 6)) . chr(0x80 | (ord($1) & 0x3F)) /gxe;
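The same conversion can be sketched in Python, where the codecs do the byte arithmetic for you. The function names latin1_to_utf8 and is_valid_utf8 are made up for this example; the approach (decode with the source encoding, then encode with the target one) is a common pattern rather than the only way to do it.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO 8859-1 bytes as UTF-8."""
    return data.decode("iso-8859-1").encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form a legal UTF-8 sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


original = bytes([165])                 # character 165 stored as one byte
converted = latin1_to_utf8(original)

print(original, is_valid_utf8(original))    # b'\xa5' False
print(converted, is_valid_utf8(converted))  # b'\xc2\xa5' True
```

Plain ASCII text passes through unchanged, so the function is safe to run on files that are mostly English.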
UTF-8 to ISO 8859-1
- Viewing a UTF-8 file in a web browser page set to ISO 8859-1 will display 2 (or more) characters for each UTF-8 'hi byte' character.
e.g. For 2 byte UTF-8 characters, it will typically display a stray character followed by the character you want, so £ appears as Â£.
- The solution: First, identify all characters in your input stream that don't have ISO 8859-1 equivalents.
- Maybe convert
- all the exotic UTF-8 bullet points to &#nn;
- the exotic hyphens to - (minus sign)
- the various 6, 66, 9, 99 style quotes to ' and "
- For XML feeds with character codes greater than 255, consider &#nn; escape sequences (rather than &name; or the binary code, both of which will cause problems).
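One way to sketch the steps above in Python: translate the exotic punctuation to plain equivalents first, then let the "xmlcharrefreplace" error handler turn anything else that has no ISO 8859-1 position into a numeric reference. The mapping table and the function name utf8_to_latin1 are assumptions for illustration; a real table would be built from the characters actually found in your input.

```python
# Illustrative mapping only: curly quotes and dashes to plain ASCII.
PUNCTUATION_MAP = str.maketrans({
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
})


def utf8_to_latin1(data: bytes) -> bytes:
    """Convert UTF-8 bytes to ISO 8859-1, escaping what does not fit."""
    text = data.decode("utf-8").translate(PUNCTUATION_MAP)
    # Characters above 255 become decimal numeric references.
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


sample = "\u201cCaf\u00e9\u201d \u2013 \u2022 5\u20ac".encode("utf-8")
print(utf8_to_latin1(sample))
```

Here the bullet and the euro sign are not in the mapping, so they come out as numeric references, while the accented e fits in ISO 8859-1 and is stored as a single byte.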
Case Study
A well known international newspaper has a publishing system that uses UTF-8, and a series of XML feeds that use ISO 8859-1
- Analyse an entire year's worth of newspaper articles.
- Make a list of every unique character used.
- Cater for &#nn; and &name; style characters.
- Map all the UTF-8 characters found (with character code greater than 127) to ISO 8859-1 equivalents.
- Flag up any UTF-8 characters encountered in the conversion process which are not covered by this mapping.
- Again, cater for &#nn; and &name; characters.
- Escape all characters greater than 127 with the XML &#nn; escape sequence, so the output file is pure ASCII.
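The case study steps could be sketched roughly as below. This is not the newspaper's actual code: REPLACEMENTS, inventory and convert are hypothetical names, the mapping is a tiny stand-in for the real one, and the input is assumed to be article text (not markup) that may contain &#nn; and &name; references.

```python
import html
from collections import Counter
from xml.sax.saxutils import escape

# Hypothetical mapping; the real one would come from the year's inventory.
REPLACEMENTS = {"\u2019": "'", "\u201c": '"', "\u201d": '"', "\u2014": "-"}


def inventory(articles):
    """Count every non-ASCII character, resolving references first."""
    counts = Counter()
    for article in articles:
        text = html.unescape(article)
        counts.update(ch for ch in text if ord(ch) > 127)
    return counts


def convert(article):
    """Return (pure ASCII text, characters above 255 with no mapping)."""
    text = html.unescape(article)
    unmapped = {ch for ch in text if ord(ch) > 255 and ch not in REPLACEMENTS}
    for source, target in REPLACEMENTS.items():
        text = text.replace(source, target)
    text = escape(text)  # restore &amp; &lt; &gt; for the XML feed
    ascii_text = text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
    return ascii_text, unmapped


counts = inventory(["caf\u00e9 caf&eacute;", "&#233;"])
output, flagged = convert("It\u2019s 5\u20ac &mdash; caf&#233; \u2022")
print(output)
print(sorted(hex(ord(ch)) for ch in flagged))
```

The flagged set is what you would review by hand and then add to the mapping, so that the list of unmapped characters shrinks over time.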
Appendix : Differences between Windows 1252 and the ISO Character Sets
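In practice, you may wish to map characters to ', ", and - rather than "left single quote" etc. In the table below, the first column is the Windows 1252 byte and the second is the corresponding Unicode code point.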
0x80 0x20ac ;Euro Sign
0x81 0x0081
0x82 0x201a ;Single Low-9 Quotation Mark
0x83 0x0192 ;Latin Small Letter F With Hook
0x84 0x201e ;Double Low-9 Quotation Mark
0x85 0x2026 ;Horizontal Ellipsis
0x86 0x2020 ;Dagger
0x87 0x2021 ;Double Dagger
0x88 0x02c6 ;Modifier Letter Circumflex Accent
0x89 0x2030 ;Per Mille Sign
0x8a 0x0160 ;Latin Capital Letter S With Caron
0x8b 0x2039 ;Single Left-Pointing Angle Quotation Mark
0x8c 0x0152 ;Latin Capital Ligature Oe
0x8d 0x008d
0x8e 0x017d ;Latin Capital Letter Z With Caron
0x8f 0x008f
0x90 0x0090
0x91 0x2018 ;Left Single Quotation Mark
0x92 0x2019 ;Right Single Quotation Mark
0x93 0x201c ;Left Double Quotation Mark
0x94 0x201d ;Right Double Quotation Mark
0x95 0x2022 ;Bullet
0x96 0x2013 ;En Dash
0x97 0x2014 ;Em Dash
0x98 0x02dc ;Small Tilde
0x99 0x2122 ;Trade Mark Sign
0x9a 0x0161 ;Latin Small Letter S With Caron
0x9b 0x203a ;Single Right-Pointing Angle Quotation Mark
0x9c 0x0153 ;Latin Small Ligature Oe
0x9d 0x009d
0x9e 0x017e ;Latin Small Letter Z With Caron
0x9f 0x0178 ;Latin Capital Letter Y With Diaeresis