- According to unicode.org:
- Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
- In all, the Unicode Standard, Version 5.2 provides codes for 107,361 characters from the world's alphabets, ideograph sets, and symbol collections.
- The majority of common-use characters fit into the first 64K code points, an area of the codespace that is called the basic multilingual plane, or BMP for short.
- Sixteen supplementary planes are available for encoding other characters; together they currently have over eight hundred thousand unused code points.
- The Unicode Standard defines three encoding forms, UTF-8, UTF-16, and UTF-32, that allow the same data to be transmitted in a byte-, word-, or double-word-oriented format (that is, in 8, 16, or 32 bits per code unit).
- UTF-8: Popular for HTML and similar protocols.
- UTF-8 transforms every Unicode character into a variable-length sequence of one to four bytes.
- Its advantage is that the Unicode characters corresponding to the familiar ASCII set have the same byte values as in ASCII.
- Unicode characters transformed into UTF-8 can be used with much existing software without extensive rewrites.
- UTF-16: Popular in many environments that need to balance efficient access to characters with economical use of storage.
- UTF-16 is reasonably compact: all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.
- UTF-32: Useful where memory space is no concern but fixed-width, single-code-unit access to characters is desired.
- Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.
- The character identified by a Unicode code point is an abstract entity, such as "LATIN CAPITAL LETTER S".
- The mark made on screen or paper -- called a glyph -- is a visual representation of the character.
- The Unicode Standard does not define glyph images.
- The standard defines how characters are interpreted, not how glyphs are rendered. It does not specify the size, shape, or style of on-screen characters.
- The Unicode Standard directly addresses only the encoding and semantics of text.
- The Unicode Standard primarily encodes scripts rather than languages.
- Where more than one language shares a set of symbols that have a historically related derivation, the union of the set of symbols of each such language is unified into a single collection identified as a single script.
- Some scripts like Latin and Devanagari can support many languages.
- Some languages may also make use of more than one script; for example, Japanese traditionally makes use of the Han (Kanji), Hiragana, and Katakana scripts, and modern Japanese usage commonly mixes in the Latin script as well.
- The primary scripts currently supported by Unicode 5.2.0 are:
Arabic, Imperial Aramaic, Armenian, Avestan, Balinese, Bamum, Bengali, Bopomofo, Buginese, Buhid, Canadian Syllabics, Carian, Cham, Cherokee, Coptic, Cypriot, Cyrillic, Deseret, Devanagari, Egyptian Hieroglyphs, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunóo, Hebrew, Hiragana, Javanese, Kaithi, Kannada, Katakana, Kayah Li, Kharoshthi, Khmer, Lao, Latin, Lepcha (Rong), Limbu, Linear B, Lisu, Lycian, Lydian, Malayalam, Meetei Mayek, Mongolian, Myanmar, New Tai Lue, N'Ko, Ogham, Ol Chiki, Old Italic (Etruscan), Old Persian Cuneiform, Old South Arabian, Old Turkic, Osmanya, Oriya, Inscriptional Pahlavi, Inscriptional Parthian, Phags-pa, Phoenician, Rejang, Runic, Saurashtra, Samaritan, Shavian, Sinhala, Sumero-Akkadian Cuneiform, Sundanese, Syloti Nagri, Syriac, Tagalog, Tagbanwa, Tai Le, Tai Tham, Tai Viet, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh (Berber), Ugaritic, Vai, Yi
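The ideas above (a unique number per character, planes of 64K code points, abstract character names, and text that mixes scripts) can be sketched in a few lines of Python. This is only an illustration: the helper name `describe` and the sample characters are chosen here, not taken from unicode.org, and the names returned depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

def describe(ch):
    """Return (code point in U+ notation, plane number, character name)."""
    cp = ord(ch)                      # the character's unique number
    plane = cp >> 16                  # each plane holds 0x10000 (64K) code points; plane 0 is the BMP
    name = unicodedata.name(ch, "<unnamed>")
    return (f"U+{cp:04X}", plane, name)

# Latin, Hiragana, Katakana and Han characters (all in the BMP),
# followed by an Ugaritic letter from a supplementary plane.
samples = ["S", "\u3072", "\u30ab", "\u65e5", "\U00010380"]
for ch in samples:
    print(describe(ch))
```

The first four samples report plane 0, while the Ugaritic letter reports plane 1. The names identify abstract characters only; how each one is drawn is left to the font.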
Unicode also encodes a number of other collections of symbols. These other collections are as follows:
- General Diacritics
- General Punctuation
- General Symbols
- Mathematical Symbols
- Musical Symbols (Western, Byzantine, and Ancient Greek)
- Technical Symbols
- Arrows, Blocks, Box Drawing Forms, and Geometric Shapes
- Game Symbols
- Miscellaneous Symbols
- Presentation Forms
- Braille Patterns
- Kangxi Radicals
Some of the members of the Unicode Consortium (source: unicode.org)
- Government of India
- Columbia University
- Sony Ericsson
Some technologies using Unicode
Some important terms:
- Unicode transformation format (UTF)
- UTF is an algorithmic mapping from every Unicode code point (except surrogate code points) to a unique byte sequence.
- UTF-8 is most common on the web.
- UTF-16 is used by Java and Windows.
- UTF-32 is used by various Unix systems.
- The conversions between all of them are algorithmically based, fast and lossless.
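- UTF-16 uses 2 bytes for each character in the BMP and 4 bytes (a pair of 16-bit code units) for every other character.
- UTF-32 uses 4 bytes for every character.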
- UTF-16 is available in 3 forms.
- UTF-16 (Unmarked) : Uses big-endian byte serialization by default, but may include a byte order mark at the beginning to indicate the actual byte serialization used.
- UTF-16BE : Uses big-endian byte serialization (most significant byte first).
- UTF-16LE : Uses little-endian byte serialization (least significant byte first).
- UTF-32 is available in 3 forms.
- UTF-32 (Unmarked) : Uses big-endian byte serialization by default, but may include a byte order mark at the beginning to indicate the actual byte serialization used.
- UTF-32BE : Uses big-endian byte serialization (most significant byte first).
- UTF-32LE : Uses little-endian byte serialization (least significant byte first).
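The code-unit sizes and byte serializations listed above can be checked with Python's built-in codecs. The following is a minimal sketch, not a reference implementation: the helper name `code_units` and the sample characters are illustrative choices. Note one Python-specific detail: when encoding, Python's plain `"utf-16"` codec always writes a byte order mark and then uses the machine's native byte order, whereas the `-be` and `-le` codecs write no mark.

```python
import codecs

def code_units(ch):
    """Return how many code units ch needs in (UTF-8, UTF-16, UTF-32)."""
    return (
        len(ch.encode("utf-8")),            # 8-bit code units
        len(ch.encode("utf-16-be")) // 2,   # 16-bit code units
        len(ch.encode("utf-32-be")) // 4,   # 32-bit code units
    )

# An ASCII letter, an accented Latin letter, a Han character,
# and a supplementary-plane (non-BMP) character.
samples = ["S", "\u00e9", "\u65e5", "\U00010380"]
for ch in samples:
    print(f"U+{ord(ch):04X}", code_units(ch))

text = "".join(samples)

# Explicit byte orders: no byte order mark is written.
be = text.encode("utf-16-be")   # most significant byte first
le = text.encode("utf-16-le")   # least significant byte first

# Python's plain "utf-16" codec prepends a BOM when encoding.
marked = text.encode("utf-16")
has_bom = marked[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)
print("BOM present:", has_bom)

# Conversions between the forms are lossless: every round trip
# returns the original string.
for enc in ("utf-8", "utf-16", "utf-16-be", "utf-16-le",
            "utf-32", "utf-32-be", "utf-32-le"):
    assert text.encode(enc).decode(enc) == text
```

Only the supplementary-plane character needs two UTF-16 code units (a surrogate pair) and four UTF-8 bytes, while every sample fits in exactly one UTF-32 code unit.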