Cablechip Solutions

web development with Unix, Perl, Javascript, HTML and web services

Cablechip Solution's Blog

Character Sets

Definitions

Character Sets
  • A character set is a collection of characters in which each character is represented by a number.
    e.g. 65=A, 66=B, 67=C, ... 163=£, ...
  • Examples of character sets are:
    ASCII, ISO 8859-1, Windows 1252 (often called ANSI), Unicode/UCS
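
To make "each character is represented by a number" concrete, here is a minimal sketch in Python (chosen only for illustration; the helper name code_point is hypothetical). Python strings use Unicode numbering, which agrees with ASCII and ISO 8859-1 for the characters shown.

```python
def code_point(ch: str) -> int:
    """Return the number (Unicode code point) assigned to a single character."""
    return ord(ch)


# Print a few characters next to their numbers.
for ch in "ABC\u00a3":  # \u00a3 is the pound sign
    print(ch, code_point(ch))

# chr() goes the other way: number -> character.
print(chr(65), chr(163))
```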

Character Encoding
  • A character encoding is a means of representing a character set in a file.
  • For ASCII and Windows 1252 (or ANSI), it is easy: 1 byte = 1 character.
  • For large character sets, with more than 256 characters, it is more complex, as more than 1 byte per character is used.
  • "UTF-8" uses 1, 2, 3 or 4 bytes per character.

Character References in HTML and XML
  • An entity reference is of the form:
    &quot; &euro; &copy;
  • A numeric reference (in HTML and XML) for character 255 (ÿ) is of the form:
    &#255; (decimal) or &#xFF; (hexadecimal)
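
As a rough illustration of the two kinds of reference, the sketch below builds numeric references and resolves both kinds back to characters. The function name numeric_reference is hypothetical; html.unescape is from the Python standard library and understands the HTML named entities.

```python
import html


def numeric_reference(ch: str, hexadecimal: bool = False) -> str:
    """Build an HTML/XML numeric character reference for one character."""
    n = ord(ch)
    return f"&#x{n:X};" if hexadecimal else f"&#{n};"


print(numeric_reference("\u00ff"))                    # decimal form for character 255
print(numeric_reference("\u00ff", hexadecimal=True))  # hexadecimal form

# Resolve entity and numeric references back to characters.
print(html.unescape("&quot; &euro; &copy; &#255; &#xFF;"))
```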

Character Sets

ASCII

  • ASCII is the original Character Set, with 128 characters defined.
  • 1 byte = 1 character.

ISO 8859-1

  • This is the ISO "Western European" character set.
  • It is the original "web" character set, and used as the default by older browsers.
  • ISO 8859-1 is a subset of the larger UCS/Unicode character set: its 256 character numbers are the same as the first 256 Unicode character numbers.
  • It is now deprecated (obsolete) - use UTF-8 instead.

Unicode and UCS (Universal Character Set)

  • This is a very large character set. It is a combination of the ISO 8859-1 characters,
    plus mathematical and other symbols,
    plus the Chinese, Hebrew, Japanese, Greek, Thai, Persian and other alphabets.
  • Some special cases :
    - there are spaces reserved for 'user defined' characters,
    - some characters can be combined to make composite characters (e.g. Thai)
  • Unicode/UCS is a character set. It is most commonly encoded using UTF-8 (UTF-16 and UTF-32 are other encodings of the same character set).

Windows 1252 / ANSI Character Set

  • This is the Windows character set.
  • It is encoded using 1 byte (0-255) per character.
  • From 0-127, it is the same as ASCII.
  • Between 0x80 and 0x9F there are differences. This is the problem area: Windows 1252 puts printable characters (e.g. the euro symbol and curly quotes) here, whereas in ISO 8859-1 and Unicode these positions hold no printable characters.
  • From 0xA0 and above, it is the same as ISO 8859-1.

Character Encoding


UTF-8

  • UTF-8 is actually a character encoding, not a character set. Colloquially, it is now used to mean "Unicode/UCS with the UTF-8 encoding"
  • It is a means of using 1, 2, 3 or 4 bytes to store a very large character set.
  • ASCII characters (0-127) take up 1 byte, so it is backwards compatible with ASCII.
  • £, maths symbols, Chinese and Japanese characters take up 2, 3 or 4 bytes.
  • Some Windows editors write a 'BOM' (byte order mark) at the front of a file to indicate that the file contains UTF-8 encoded characters. It is the character U+FEFF, stored in UTF-8 as the 3 bytes 0xEF 0xBB 0xBF. The Unicode standard permits it in UTF-8 files but neither requires nor recommends it, and some tools do not expect it.
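
The byte counts can be checked directly. This is a small sketch, not a reference implementation; utf8_length is a hypothetical helper, while codecs.BOM_UTF8 and the "utf-8-sig" codec are part of the Python standard library.

```python
import codecs


def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 uses to store one character."""
    return len(ch.encode("utf-8"))


# ASCII letter, pound sign, euro sign, a Chinese character, an emoji.
for ch in ["A", "\u00a3", "\u20ac", "\u4e2d", "\U0001F600"]:
    print(ch, utf8_length(ch), ch.encode("utf-8").hex())

# The optional BOM is the character U+FEFF encoded as UTF-8.
bom = "\ufeff".encode("utf-8")
print(bom == codecs.BOM_UTF8, bom.hex())

# Decoding with "utf-8-sig" drops a leading BOM if one is present.
text = (codecs.BOM_UTF8 + "\u00a35".encode("utf-8")).decode("utf-8-sig")
print(text)
```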

Character Set Conversion Problems


Whatever you've used in the past, UTF-8 (Unicode/UCS) is the thing to aim for. Google, Blogger etc. all use it.

From Windows 1252 / ANSI to the ISO character sets

  • If converting text from a Windows file to a web page in ISO format, you may have to map some 'hi byte' characters, e.g. the euro symbol, as the character numbers will not be the same.
  • If copying and pasting, Windows will usually take care of the conversion for you.
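
One possible way to do this mapping in code is sketched below, assuming the output is destined for HTML or XML so that numeric references are acceptable. The function name cp1252_to_iso_8859_1 and the sample bytes are made up for illustration; the codec names and the "xmlcharrefreplace" error handler are standard Python.

```python
def cp1252_to_iso_8859_1(raw: bytes) -> bytes:
    """Re-encode Windows 1252 bytes as ISO 8859-1.

    Characters with no ISO 8859-1 equivalent (e.g. the euro symbol) are
    written as numeric references, which only makes sense for HTML/XML output.
    """
    text = raw.decode("cp1252")
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


# 0x80 is the euro symbol and 0x96 an en dash in Windows 1252; 0xE9 is e-acute in both sets.
windows_bytes = b"Price: \x80 10 \x96 caf\xe9"
print(cp1252_to_iso_8859_1(windows_bytes))
```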

From ISO 8859-1 to UTF-8/UCS/Unicode

  • Viewing an ISO 8859-1 file in a web browser page set to UTF-8 will display any characters greater than 127 as illegal characters.
  • In terms of character sets, the conversion is straightforward, as there are "no" differences you are likely to encounter.
  • The character encoding is the problem. Example: In ISO 8859-1, character 165 (¥) is stored as the single byte 165 (0xA5). In UTF-8, it should be the 2 bytes 0xC2 0xA5. The single byte on its own is an illegal UTF-8 sequence.
  • The solution is programming language dependent or editor dependent.
    • For example, in the Notepad++ editor, there is a 'convert ANSI to UTF-8' option.
    • In Perl: use Encode qw(decode encode); $utf8 = encode('UTF-8', decode('ISO-8859-1', $bytes));
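
The character 165 example can be reproduced in a few lines. This is a sketch only; latin1_to_utf8 is a hypothetical helper name.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Convert ISO 8859-1 bytes to UTF-8 bytes."""
    return raw.decode("iso-8859-1").encode("utf-8")


yen = bytes([165])  # character 165 as stored in an ISO 8859-1 file

# Reading the single byte as UTF-8 fails, which is why browsers show an illegal character.
try:
    yen.decode("utf-8")
except UnicodeDecodeError as err:
    print("not valid UTF-8:", err)

# After conversion the same character occupies 2 bytes.
print(latin1_to_utf8(yen).hex())
```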

UTF-8 to ISO 8859-1

  • Viewing a UTF-8 file in a web browser page set to ISO 8859-1 will display 2 (or more) characters for each UTF-8 'hi byte' character.
    e.g. For 2 byte UTF-8 characters, it will display a stray character (often Â or Ã) followed by a second character, so £ appears as Â£ and é appears as Ã©.
  • The solution: First, identify all characters in your input stream that don't have ISO 8859-1 equivalents.
  • Then consider converting
    • all the exotic UTF-8 bullet points to &#nn;
    • the exotic hyphens to - (minus sign)
    • the various 6, 66, 9, 99 style quotes to ' and "
  • For XML feeds with character codes greater than 255, consider &#nn; escape sequences (rather than &name; or the binary code, both of which will cause problems).
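
The suggestions above can be sketched as follows. The PUNCTUATION table is a hypothetical, deliberately short mapping; a real one would be built from the characters actually found in the input. Anything left over that ISO 8859-1 cannot hold is written as a &#nn; reference.

```python
# Hypothetical mapping of "exotic" punctuation to plain ASCII equivalents.
PUNCTUATION = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single 6 / 9 style quotes
    "\u201c": '"', "\u201d": '"',   # double 66 / 99 style quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
})


def utf8_to_iso_8859_1(raw: bytes) -> bytes:
    """Convert UTF-8 bytes to ISO 8859-1, escaping what cannot be represented."""
    text = raw.decode("utf-8").translate(PUNCTUATION)
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


# What a browser set to ISO 8859-1 would show for a UTF-8 pound sign.
print("\u00a3".encode("utf-8").decode("iso-8859-1"))

# Curly quotes, e-acute, en dash, bullet and euro symbol in one sample.
sample = "\u201cCaf\u00e9\u201d \u2013 \u2022 item \u20ac5".encode("utf-8")
print(utf8_to_iso_8859_1(sample))
```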

Case Study

A well known international newspaper has a publishing system that uses UTF-8, and a series of XML feeds that use ISO 8859-1.

  1. Analyse an entire year's worth of newspaper articles.
    • Make a list of every unique character used.
    • Cater for &#nn; and &name; style characters.
  2. Map all the UTF-8 characters found (with character code greater than 127) to ISO 8859-1 equivalents.
  3. Flag up any UTF-8 characters encountered in the conversion process which are not covered by this mapping.
    • Again, cater for &#nn; and &name; characters.
  4. Escape all characters greater than 127 with the XML &#nn; escape sequence, so the output file is pure ASCII.
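
The four steps might look roughly like this in code. It is a sketch under stated assumptions, not the newspaper's actual system: the MAPPING table, the function names, and the assumption that each article is a plain text fragment without markup are all hypothetical.

```python
import html
from collections import Counter

# Hypothetical result of step 2: non-ASCII character -> plain equivalent.
MAPPING = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"', "\u2013": "-"}


def inventory(articles):
    """Step 1: count every non-ASCII character, resolving &#nn; and &name; first."""
    counts = Counter()
    for article in articles:
        counts.update(ch for ch in html.unescape(article) if ord(ch) > 127)
    return counts


def convert(article):
    """Steps 2-4: apply the mapping, flag unmapped characters, escape to pure ASCII."""
    flagged = set()
    out = []
    for ch in html.unescape(article):
        if ord(ch) <= 127:
            # Re-escape &, < and > so the text stays valid inside XML.
            out.append(html.escape(ch, quote=False))
        elif ch in MAPPING:
            out.append(MAPPING[ch])
        else:
            if ord(ch) > 255:
                flagged.add(ch)  # outside ISO 8859-1 and not covered by the mapping
            out.append(f"&#{ord(ch)};")
    return "".join(out), flagged


text, flagged = convert("\u201cCaf\u00e9\u201d &euro;5 &#8211; R&amp;D")
print(text, flagged)
```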

Appendix : Differences between Windows 1252 and the ISO Character Sets


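In practice, you may wish to map characters to ', ", and - rather than "left single quote" etc. Each row below gives the Windows 1252 byte, the Unicode character number it maps to, and the character name. Rows without a name (0x81, 0x8d, 0x8f, 0x90, 0x9d) are unassigned in Windows 1252.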

0x80 0x20ac ;Euro Sign
0x81 0x0081
0x82 0x201a ;Single Low-9 Quotation Mark
0x83 0x0192 ;Latin Small Letter F With Hook
0x84 0x201e ;Double Low-9 Quotation Mark
0x85 0x2026 ;Horizontal Ellipsis
0x86 0x2020 ;Dagger
0x87 0x2021 ;Double Dagger
0x88 0x02c6 ;Modifier Letter Circumflex Accent
0x89 0x2030 ;Per Mille Sign
0x8a 0x0160 ;Latin Capital Letter S With Caron
0x8b 0x2039 ;Single Left-Pointing Angle Quotation Mark
0x8c 0x0152 ;Latin Capital Ligature Oe
0x8d 0x008d
0x8e 0x017d ;Latin Capital Letter Z With Caron
0x8f 0x008f
0x90 0x0090
0x91 0x2018 ;Left Single Quotation Mark
0x92 0x2019 ;Right Single Quotation Mark
0x93 0x201c ;Left Double Quotation Mark
0x94 0x201d ;Right Double Quotation Mark
0x95 0x2022 ;Bullet
0x96 0x2013 ;En Dash
0x97 0x2014 ;Em Dash
0x98 0x02dc ;Small Tilde
0x99 0x2122 ;Trade Mark Sign
0x9a 0x0161 ;Latin Small Letter S With Caron
0x9b 0x203a ;Single Right-Pointing Angle Quotation Mark
0x9c 0x0153 ;Latin Small Ligature Oe
0x9d 0x009d
0x9e 0x017e ;Latin Small Letter Z With Caron
0x9f 0x0178 ;Latin Capital Letter Y With Diaeresis
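
The table can be regenerated from Python's built-in cp1252 codec, as in the sketch below (the function name cp1252_differences is hypothetical). Note one difference from the listing above: Python's codec treats the five unassigned bytes as undefined and raises an error, rather than mapping them to the same-numbered control codes.

```python
def cp1252_differences():
    """Map each byte 0x80-0x9F to its Unicode number under Windows 1252.

    Bytes that Python's cp1252 codec leaves undefined map to None.
    """
    table = {}
    for byte in range(0x80, 0xA0):
        try:
            table[byte] = ord(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            table[byte] = None
    return table


for byte, number in cp1252_differences().items():
    print(f"0x{byte:02x}", "undefined" if number is None else f"0x{number:04x}")
```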
