
UNICODE

  • According to unicode.org :
    • Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
    • In all, the Unicode Standard, Version 5.2 provides codes for 107,361 characters from the world's alphabets, ideograph sets, and symbol collections.
    • The majority of common-use characters fit into the first 64K code points, an area of the codespace that is called the basic multilingual plane, or BMP for short.
    • There are sixteen other supplementary planes available for encoding other characters which currently have over eight hundred thousand unused code points.
    • The Unicode Standard defines three encoding forms that allow the same data to be transmitted in a byte, word or double word oriented format (i.e. in 8, 16 or 32 bits per code unit).
      • UTF-8
        • Popular for HTML and similar protocols.
        • A way of transforming all Unicode characters into a variable-length encoding of one to four bytes per character.
        • It has the advantage that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII.
        • Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.
      • UTF-16
        • Popular in many environments that need to balance efficient access to characters with economical use of storage.
        • It is reasonably compact and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.
      • UTF-32
        • Useful where memory space is no concern, but fixed width, single code unit access to characters is desired.
        • Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.
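As a rough illustration of the three encoding forms above, the following Python sketch counts how many code units a character needs in each form and derives the pair of 16-bit code units that UTF-16 uses for a character outside the BMP. The helper names (`plane`, `code_units`, `surrogate_pair`) and the sample characters are choices made for this example only.

```python
from typing import Tuple


def plane(cp: int) -> int:
    """Plane number of a code point: 0 is the BMP, 1 to 16 are supplementary."""
    return cp >> 16


def code_units(ch: str, codec: str, unit_bytes: int) -> int:
    """Number of code units of width unit_bytes in the encoded form of ch."""
    return len(ch.encode(codec)) // unit_bytes


def surrogate_pair(cp: int) -> Tuple[int, int]:
    """Split a supplementary code point into its two UTF-16 code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points need a surrogate pair")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# U+0053, U+00E9, U+20AC (all in the BMP) and U+1D11E (plane 1)
for ch in ["S", "\u00e9", "\u20ac", "\U0001D11E"]:
    print(
        f"U+{ord(ch):04X} plane={plane(ord(ch))}",
        f"utf-8={code_units(ch, 'utf-8', 1)}",
        f"utf-16={code_units(ch, 'utf-16-be', 2)}",
        f"utf-32={code_units(ch, 'utf-32-be', 4)}",
    )

high, low = surrogate_pair(0x1D11E)
print(f"{high:04X} {low:04X}")  # D834 DD1E
```

The ASCII character takes one byte in UTF-8, the BMP characters take one 16-bit unit in UTF-16, and only the plane 1 character needs a pair. Every character takes exactly one 32-bit unit in UTF-32.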
  • The character identified by a Unicode code point is an abstract entity, such as "LATIN CAPITAL LETTER S".
  • The mark made on screen or paper -- called a glyph -- is a visual representation of the character.
  • The Unicode Standard does not define glyph images.
  • The standard defines how characters are interpreted, not how glyphs are rendered. The Unicode Standard does not specify the size, shape, or style of on-screen characters.
  • The Unicode Standard directly addresses only the encoding and semantics of text.
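To make the character/glyph distinction concrete, here is a minimal Python sketch that looks up the code point and formal name of a character using the standard `unicodedata` module. The helper name `describe` is made up for this example. The result does not depend on any font: the name identifies the abstract character, and how it is drawn is left to the rendering system.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and formal Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe("S"))       # U+0053 LATIN CAPITAL LETTER S
print(describe("\uff33"))  # U+FF33 FULLWIDTH LATIN CAPITAL LETTER S
```

The two characters may look alike on screen, but they are distinct code points with distinct names.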
  • The Unicode Standard primarily encodes scripts rather than languages.
  • Where more than one language shares a set of symbols that have a historically related derivation, the union of the set of symbols of each such language is unified into a single collection identified as a single script.
  • Some scripts like Latin and Devanagari can support many languages.
  • Some languages may also make use of more than one script; for example, Japanese traditionally makes use of the Han (Kanji), Hiragana, and Katakana scripts, and modern Japanese usage commonly mixes in the Latin script as well.
  • The primary scripts currently supported by Unicode 5.2.0 are:
  • Arabic
  • Aramaic, Imperial
  • Armenian
  • Avestan
  • Balinese
  • Bamum
  • Bengali
  • Bopomofo
  • Buginese
  • Buhid
  • Canadian Syllabics
  • Carian
  • Cham
  • Cherokee
  • Coptic
  • Cypriot
  • Cyrillic
  • Deseret
  • Devanagari
  • Egyptian Hieroglyphs
  • Ethiopic
  • Georgian
  • Glagolitic
  • Gothic
  • Greek
  • Gujarati
  • Gurmukhi
  • Han
  • Hangul
  • Hanunóo
  • Hebrew
  • Hiragana
  • Javanese
  • Kaithi
  • Kannada
  • Katakana
  • Kayah Li
  • Kharoshthi
  • Khmer
  • Lao
  • Latin
  • Lepcha (Rong)
  • Limbu
  • Linear B
  • Lisu
  • Lycian
  • Lydian
  • Malayalam
  • Meetei Mayek
  • Mongolian
  • Myanmar
  • New Tai Lue
  • N'Ko
  • Ogham
  • Ol Chiki
  • Old Italic (Etruscan)
  • Old Persian Cuneiform
  • Old South Arabian
  • Old Turkic
  • Osmanya
  • Oriya
  • Pahlavi, Inscriptional
  • Parthian, Inscriptional
  • Phags-pa
  • Phoenician
  • Rejang
  • Runic
  • Saurashtra
  • Samaritan
  • Shavian
  • Sinhala
  • Sumero-Akkadian Cuneiform
  • Sundanese
  • Syloti Nagri
  • Syriac
  • Tagalog
  • Tagbanwa
  • Tai Le
  • Tai Tham
  • Tai Viet
  • Tamil
  • Telugu
  • Thaana
  • Thai
  • Tibetan
  • Tifinagh (Berber)
  • Ugaritic
  • Vai
  • Yi
  • Unicode also encodes a number of other collections of symbols. These other collections are as follows:

    • Numbers
    • General Diacritics
    • General Punctuation
    • General Symbols
    • Mathematical Symbols
    • Musical Symbols (Western, Byzantine, and Ancient Greek)
    • Technical Symbols
    • Dingbats
    • Arrows, Blocks, Box Drawing Forms, and Geometric Shapes
    • Game Symbols
    • Miscellaneous Symbols
    • Presentation Forms
    • Braille Patterns
    • Kangxi Radicals
    Some members of the Unicode Consortium (source: unicode.org)
    • Adobe
    • Apple
    • Microsoft
    • Google
    • IBM
    • Oracle
    • SAP
    • Sybase
    • Yahoo
    • Government of India
    • Columbia University
    • SAS
    • Verisign
    • Sony Ericsson
    • Nokia
    • HP
    Some technologies using Unicode
    • XML
    • Java
    • ECMAScript (Official standard defining JavaScript)
    Some important terms:
    • Unicode transformation format (UTF)
      • UTF is an algorithmic mapping from every Unicode code point (except surrogate code points) to a unique byte sequence.
      • UTF-8 is most common on the web.
      • UTF-16 is used by Java and Windows.
      • UTF-32 is used by various Unix systems.
      • The conversions between all of them are algorithmically based, fast and lossless.
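      • UTF-16 uses 2 bytes (one 16-bit code unit) for each character in the BMP and 4 bytes (a pair of 16-bit code units) for each supplementary character.
      • UTF-32 uses 4 bytes for every character.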
      • UTF-16 is available in 3 forms.
        • UTF-16 (Unmarked) : Uses big-endian byte serialization by default, but may include a byte order mark at the beginning to indicate the actual byte serialization used.
        • UTF-16BE : Uses big-endian byte serialization (most significant byte first).
        • UTF-16LE : Uses little-endian byte serialization (least significant byte first).
      • UTF-32 is available in 3 forms.
        • UTF-32 (Unmarked) : Uses big-endian byte serialization by default, but may include a byte order mark at the beginning to indicate the actual byte serialization used.
        • UTF-32BE : Uses big-endian byte serialization (most significant byte first).
        • UTF-32LE : Uses little-endian byte serialization (least significant byte first).
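The byte serialization rules above can be sketched in Python for the UTF-16 case. The function `decode_utf16` is a hypothetical helper written for this example: it checks for a byte order mark and otherwise falls back to big-endian, as described for the unmarked form. Python's built-in `utf-16` codec is deliberately not used for the unmarked case, since its handling of input without a byte order mark may differ from that default.

```python
import codecs


def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes, honoring a leading byte order mark if present."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return data[2:].decode("utf-16-le")
    # No byte order mark: assume big-endian, the default for the unmarked form.
    return data.decode("utf-16-be")


euro = "\u20ac"  # EURO SIGN, U+20AC
print(euro.encode("utf-16-be").hex())  # 20ac (most significant byte first)
print(euro.encode("utf-16-le").hex())  # ac20 (least significant byte first)

# The same character recovered from all three serializations.
print(decode_utf16(b"\x20\xac") == euro)          # True
print(decode_utf16(b"\xfe\xff\x20\xac") == euro)  # True
print(decode_utf16(b"\xff\xfe\xac\x20") == euro)  # True
```

The same approach carries over to UTF-32, using `codecs.BOM_UTF32_BE` and `codecs.BOM_UTF32_LE` with the `utf-32-be` and `utf-32-le` codecs.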
