
Journey of Bsdconv

Information about Journey of Bsdconv
Technology

Published on March 1, 2014

Author: Buganini

Source: slideshare.net

Description

Unicode, Charset, Encoding, Conversion, Detection, Variants

BSDCONV
  Buganini
  Since 2009

Charset & Encoding
  Character Set: Collection of characters
  Encoding: Binary representation

Charset & Encoding
  Figure: Character Sets
    Unicode (up to U+10FFFF)
    Unicode BMP (Basic Multilingual Plane, up to U+FFFF)
    GB18030
    CNS11643
    CP950
    Latin1

Charset & Encoding
  Character set                  Encodings
  Unicode (up to U+10FFFF)       UTF-8 / UTF-32 / UCS4, UTF-16
  Unicode BMP (up to U+FFFF)     UCS2
  GB18030                        GB18030
  CNS11643                       CNS11643
  CP950                          CP950 (DBCS)
  Latin1                         ISO-8859-1 / EBCDIC-037 [1]

  [1] aka IBM-37; some control characters differ from ISO-8859-1
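To make the charset/encoding distinction concrete, here is a minimal sketch using Python's standard codecs (not bsdconv). It assumes that Python's `cp037` codec is an acceptable stand-in for the EBCDIC-037 encoding named on the slide; the sample string is arbitrary.

```python
# One character set (Latin1), two different binary representations.
text = "BSDCONV"

iso = text.encode("iso-8859-1")   # ASCII-compatible byte values
ebcdic = text.encode("cp037")     # EBCDIC code page 037 (assumed stand-in)

print(iso.hex(" "))
print(ebcdic.hex(" "))

# Same characters come back from either byte sequence.
round_trip_ok = ebcdic.decode("cp037") == iso.decode("iso-8859-1") == text
print(round_trip_ok)
```

The characters are identical in both cases; only the bytes differ, which is why a converter needs to know the encoding and not just the character set.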

Encoding :: UTF-32 / UCS4
  Fixed length: 4 bytes
  Filesize *= 4 for an ASCII text file
  Incompatible with the C-style string convention
  Endianness concern

Encoding :: UCS2
  Fixed length: 2 bytes
  Filesize *= 2 for an ASCII text file
  Incompatible with the C-style string convention
  Endianness concern
  BMP-only
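The size, endianness, and C-string points for the fixed-length encodings can be checked with a short sketch. Python has no dedicated UCS2 codec, so this assumes `utf-16-le` / `utf-16-be` as a stand-in, which produces the same bytes as UCS2 for BMP-only text.

```python
ascii_text = "hello"

ucs2 = ascii_text.encode("utf-16-le")    # same bytes as UCS2 for BMP-only text
utf32 = ascii_text.encode("utf-32-le")

# (ASCII size, UCS2 size, UTF-32 size)
sizes = (len(ascii_text.encode("ascii")), len(ucs2), len(utf32))
print(sizes)

# Endianness: the same character has two possible byte orders.
little = "A".encode("utf-16-le")
big = "A".encode("utf-16-be")
print(little, big)

# Embedded NUL bytes are what break NUL-terminated C strings.
has_nul = b"\x00" in ucs2
print(has_nul)
```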

Encoding :: UTF-16
  Variable length: 2 bytes / 4 bytes (surrogate pairs)
  Surrogates use U+D800..U+DFFF
  Incompatible with the C-style string convention
  Endianness concern

  Table: UTF-16 Structure
    ******** ********
    110110** ******** 110111** ********
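The surrogate-pair layout in the table can be sketched as follows. The function name `to_surrogates` is hypothetical; the arithmetic is the standard UTF-16 rule of subtracting 0x10000 and splitting the remaining 20 bits into two 10-bit halves.

```python
def to_surrogates(cp: int) -> tuple:
    """Split a code point above the BMP into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    value = cp - 0x10000                 # 20 significant bits remain
    high = 0xD800 | (value >> 10)        # 110110** ********
    low = 0xDC00 | (value & 0x3FF)       # 110111** ********
    return high, low


high, low = to_surrogates(0x1F600)
print(f"U+{high:04X} U+{low:04X}")
```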

Encoding :: UTF-8
  Variable length: 1~6 bytes in the original design (RFC 3629 restricts UTF-8 to 1~4 bytes, which is enough for U+10FFFF)
  Compatible with the C-style string convention
  Self-synchronizing
  Endian-neutral
  Byte-wise sorting order = code point order

  Table: UTF-8 Structure
    0*******   (ASCII)
    110***** 10******
    1110**** 10****** 10******
    11110*** 10****** 10****** 10******
    111110** 10****** 10****** 10****** 10******
    1111110* 10****** 10****** 10****** 10****** 10******
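The structure table translates almost directly into code. The sketch below follows the original 1-to-6-byte layout shown on the slide; `utf8_encode` is a hypothetical helper name, and for valid Unicode code points it should agree with Python's built-in encoder.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point using the 1..6-byte layout from the table."""
    if cp < 0:
        raise ValueError("negative code point")
    if cp < 0x80:
        return bytes([cp])               # 0*******
    # (sequence length, exclusive upper limit, lead-byte marker)
    layouts = (
        (2, 0x800, 0xC0),                # 110*****
        (3, 0x10000, 0xE0),              # 1110****
        (4, 0x200000, 0xF0),             # 11110***
        (5, 0x4000000, 0xF8),            # 111110**
        (6, 0x80000000, 0xFC),           # 1111110*
    )
    for length, limit, marker in layouts:
        if cp < limit:
            tail = []
            for _ in range(length - 1):
                tail.append(0x80 | (cp & 0x3F))   # 10******
                cp >>= 6
            return bytes([marker | cp] + tail[::-1])
    raise ValueError("value does not fit in 6 bytes")


print(utf8_encode(0x840C).hex(" "))      # U+840C is 萌
```

Because longer sequences always start with a larger lead byte, comparing the encoded bytes orders text the same way as comparing code points.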

Encoding :: CNS11643 (全字庫)
  http://www.cns11643.gov.tw/
  Only used by the Taiwan government
  NOT a subset of Unicode
  Not just a charset/encoding; it also provides:
    Font
    Pronunciation
    Radical
    Component
    Stroke
    Tra/Sim mapping

  Table: Examples of information provided by 全字庫 for「萌」
    Pronunciation      ㄇㄥˊ / méng
    Radical            艸
    Component          艹 日 月
    Tra/Sim mapping    萌 蕄

Encoding :: CCCII
  Variants: variant glyphs are placed at different planes
  Mostly used for library indexing

    強          彊          强
    21 3D 48    2D 3D 48    33 3D 48

Encoding :: Big5
  Many incompatible variations (abusing the PUA); no standard tool handles them all
    http://moztw.org/docs/big5/

    Scenario      Dominating encoding
    Microsoft     CP950
    Taiwan BBS    UAO (Unicode-at-Once)
    gov.tw        Big5-2003
    gov.hk        HKSCS (1999, 2001, 2004)

  Special characters conflict
    The second byte can be 0x5C (\), 0x7C (|), or 0x7E (~), which may have special meaning in certain contexts
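The second-byte conflict is easy to observe with Python's built-in `big5` codec (a stand-in here; it is not one of the Big5 variants listed above, and the three sample characters are the ones traditionally used to demonstrate the problem).

```python
# Second (trail) byte of each character when encoded as Big5.
trail_bytes = {ch: ch.encode("big5")[1] for ch in "許功蓋"}
for ch, trail in trail_bytes.items():
    print(ch, hex(trail))

# A byte-level tool that treats 0x5C as a backslash cuts a character in half.
encoded = "成功".encode("big5")
pieces = encoded.split(b"\\")
print(encoded, pieces)
```

Any program that scans Big5 text byte by byte for `\`, `|`, or `~` without tracking lead bytes can misread half of a character as ASCII punctuation.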

Bsdconv :: Decoding and Encoding
  Alternative to iconv

    ISO-8859-1 : UTF-8
    from         to

  Figure: Basic two-phase conversion
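The two-phase idea (decode with the "from" codec, then encode with the "to" codec) can be illustrated with Python's standard codecs. This is an analogy only, not the bsdconv API; the `convert` helper is hypothetical.

```python
def convert(data: bytes, from_codec: str, to_codec: str) -> bytes:
    """Phase 1 decodes bytes to code points; phase 2 encodes them again."""
    code_points = data.decode(from_codec)    # "from" phase
    return code_points.encode(to_codec)      # "to" phase


result = convert(b"caf\xe9", "iso-8859-1", "utf-8")
print(result)
```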

Bsdconv :: Codecs & Fallback
  Optionally produce a question mark (U+003F) as replacement

    UTF-8,3F : ASCII,3F
    from       to

  Figure: Fallback codec

  Transliteration

    UTF-8 : CP936,CP936-TRANS,3F
    from    to

  Figure: Multiple fallback codecs
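A rough model of a fallback chain, assuming each character is offered to the codecs in order and the first one that accepts it wins. This uses Python's standard codecs, not bsdconv; `encode_with_fallbacks` and its parameters are hypothetical names.

```python
def encode_with_fallbacks(text: str, codecs, replacement: bytes = b"?") -> bytes:
    """Try each codec per character; emit `replacement` if none can encode it."""
    out = bytearray()
    for ch in text:
        for codec in codecs:
            try:
                out += ch.encode(codec)
                break
            except UnicodeEncodeError:
                continue                 # fall through to the next codec
        else:
            out += replacement           # last resort, like the trailing 3F
    return bytes(out)


print(encode_with_fallbacks("naïve 萌", ["ascii"]))
print(encode_with_fallbacks("naïve 萌", ["ascii", "latin-1"]))
```

The trailing `replacement` plays the role of the final `3F` entry in the conversion strings above: it only fires when every earlier codec has declined the character.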


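Big5 5C issue (許功蓋)
  The second byte of 功 is 0x5C, the ASCII backslash. The BIG5-5C codec doubles that byte so the text survives tools that treat a backslash as an escape character.

  BIG5:BIG5-5C,BIG5
              Big5 literal    ASCII/Hex
    Input     "成功"          "\xA6\xA8\xA5\"
    Output    "成功\"         "\xA6\xA8\xA5\\"

  BIG5-5C,BIG5:BIG5
              Big5 literal    ASCII/Hex
    Input     "成功\"         "\xA6\xA8\xA5\\"
    Output    "成功"          "\xA6\xA8\xA5\"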
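The escaping shown on this slide can be approximated at the byte level. The sketch below is not bsdconv's implementation; it assumes Big5 lead bytes lie in 0x81..0xFE and only doubles a 0x5C that appears as the second byte of a double-byte character. Both function names are hypothetical.

```python
def escape_5c(data: bytes) -> bytes:
    """Append an extra 0x5C after any double-byte character ending in 0x5C."""
    out = bytearray()
    i = 0
    while i < len(data):
        if 0x81 <= data[i] <= 0xFE and i + 1 < len(data):
            out += data[i:i + 2]
            if data[i + 1] == 0x5C:
                out.append(0x5C)
            i += 2
        else:
            out.append(data[i])          # single-byte (ASCII) character
            i += 1
    return bytes(out)


def unescape_5c(data: bytes) -> bytes:
    """Reverse escape_5c by dropping the extra 0x5C after such characters."""
    out = bytearray()
    i = 0
    while i < len(data):
        if 0x81 <= data[i] <= 0xFE and i + 1 < len(data):
            out += data[i:i + 2]
            i += 2
            if out[-1] == 0x5C and i < len(data) and data[i] == 0x5C:
                i += 1                   # skip the doubled backslash
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


escaped = escape_5c(b"\xa6\xa8\xa5\x5c")
print(escaped, unescape_5c(escaped))
```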

Traditional/Simplified Chinese
  NOT a one-to-one mapping
    Traditional 乾幹干 vs. Simplified 干干干
  Context dependent
    之後、夜之后、入夜之後
  Variants
    峰、峯

Project Chvar (1/2)
  https://github.com/buganini/chvar

  Figure: Two-level grouping in Chvar
    Compatibility group
      Canonical group: 签 簽
      Canonical group: 籖 籤

  Table: Canonical Group
       TW CN CP950 GB2312
    签 簽 簽 -
    簽 签 签
    籖 籤 籤 ×
    籤 籖 ×

  Table: Compatibility Group
       TW CN CP950 GB2312
    签 簽 簽 -
    簽 签 签
    籖 簽 签 簽 签
    籤 簽 签 簽 签

Project Chvar (2/2)
  https://github.com/buganini/chvar
  Normalization
    Canonical Equivalence
  Transliteration
    Converted, or Canonical Equivalence, or Compatibility Equivalence
  Fuzzy character matching
    Compatibility Equivalence
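The two-level grouping can be modelled with two lookup tables. This is only a sketch of the idea, not Chvar's data format or code; the group contents are the four characters from the Chvar slides, and all names are hypothetical.

```python
# Two canonical groups nested inside one compatibility group.
CANONICAL_GROUPS = [{"签", "簽"}, {"籖", "籤"}]
COMPATIBILITY_GROUPS = [{"签", "簽", "籖", "籤"}]


def build_index(groups):
    """Map each character to the index of the group that contains it."""
    return {ch: i for i, group in enumerate(groups) for ch in group}


CANONICAL = build_index(CANONICAL_GROUPS)
COMPATIBILITY = build_index(COMPATIBILITY_GROUPS)


def equivalent(a: str, b: str, index) -> bool:
    """True if both characters are identical or share a group in `index`."""
    if a == b:
        return True
    return a in index and index[a] == index.get(b)


print(equivalent("签", "籤", CANONICAL))       # stricter level
print(equivalent("签", "籤", COMPATIBILITY))   # looser level, for fuzzy matching
```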

Bsdconv :: Phases
  Traditional Chinese ⇔ Simplified Chinese

    UTF-8 : ZHTW  : UTF-8
    from    inter   to

  Figure: Conversion with inter-mapping phase

Bsdconv :: Phases
  Furthermore, phrase mapping

    UTF-8 : ZHTW  : ZHTW-WORDS : UTF-8
    from    inter   inter        to

  Figure: Conversion with multiple inter-mapping phases
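A conversion with several inter phases can be pictured as a list of code-point mappings applied between decoding and encoding. The sketch below is not bsdconv; the two tiny tables are hypothetical and exist only to show why a phrase phase is useful after a character phase.

```python
# Hypothetical tables, far smaller than any real mapping.
CHAR_TABLE = {"后": "後"}
PHRASE_TABLE = {"夜之後": "夜之后", "入夜之後": "入夜之後"}


def map_chars(text: str) -> str:
    """Character-level phase: context-free, so it over-converts 后."""
    return "".join(CHAR_TABLE.get(ch, ch) for ch in text)


def map_phrases(text: str) -> str:
    """Phrase-level phase: longest match first, fixes context-dependent cases."""
    keys = sorted(PHRASE_TABLE, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(PHRASE_TABLE[key])
                i += len(key)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def convert(data: bytes, phases) -> bytes:
    """from (UTF-8 decode) -> inter phases in order -> to (UTF-8 encode)."""
    text = data.decode("utf-8")
    for phase in phases:
        text = phase(text)
    return text.encode("utf-8")


for sample in ("之后", "夜之后", "入夜之后"):
    print(convert(sample.encode("utf-8"), [map_chars, map_phrases]).decode("utf-8"))
```

Longest-match ordering matters here: without it, the shorter phrase 夜之後 would also fire inside 入夜之後 and undo a correct conversion.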

Unicode :: Casing
  Casing IS complicated
  Default Case Folding

  Table: English
    Lowercase    a    i
    Uppercase    A    I

  Table: French
    Lowercase    a    à
    Uppercase    A    A

  Table: Turkic
    Lowercase    ı    i
    Uppercase    I    İ

  Table: Greek
    Lowercase    σ    ς
    Uppercase    Σ    Σ
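Some of these cases can be observed with Python's built-in string methods, which apply the locale-independent Unicode default mappings (so the Turkic and French tables are not what Python produces). A short sketch:

```python
# Greek: two lowercase sigmas share one uppercase form.
upper_sigmas = ("σ".upper(), "ς".upper())

# Turkic letters under the default (non-Turkic) mappings.
dotless_upper = "ı".upper()       # dotless i uppercases to plain I
dotted_lower = "İ".lower()        # becomes 'i' followed by U+0307

# Default case folding is meant for caseless comparison.
folded_final_sigma = "ς".casefold()

print(upper_sigmas, dotless_upper, [hex(ord(c)) for c in dotted_lower])
print(folded_final_sigma)
```

Round trips are therefore lossy: `"ı".upper().lower()` yields a plain `i`, not the original dotless letter.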

Unicode :: Normalization Forms (1/2)
  UAX #15
  Indexing
  Identification security
    Username, Domain name

  Table: Canonical Equivalence
    Combining sequence             Ç              ↔  C + ◌̧
    Ordering of combining marks    q + ◌̇ + ◌̣    ↔  q + ◌̣ + ◌̇
    Hangul                         가             ↔  ᄀ + ᅡ
    Singleton                      Ω (U+2126)     ↔  Ω (U+03A9)
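The four canonical-equivalence rows can be reproduced with the standard `unicodedata` module. Escapes are used instead of literal characters so the code points are unambiguous.

```python
import unicodedata

nfc = lambda s: unicodedata.normalize("NFC", s)
nfd = lambda s: unicodedata.normalize("NFD", s)

composed = nfc("C\u0327")             # C + combining cedilla -> U+00C7
reordered = nfd("q\u0307\u0323")      # dot above, dot below -> reordered
hangul = nfd("\uAC00")                # syllable -> leading + vowel jamo
singleton = nfc("\u2126")             # OHM SIGN -> GREEK CAPITAL LETTER OMEGA

for value in (composed, reordered, hangul, singleton):
    print([f"U+{ord(c):04X}" for c in value])
```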

Unicode :: Normalization Forms (2/2)
  UAX #15

  Table: Compatibility Equivalence
    Font variants              ℌ       →  H
    Breaking differences       NBSP    →  SP
    Cursive forms              ﻧ       →  ﻨ
    Circled                    ①       →  1
    Width, size, rotated       ｶ       →  カ
                               ︷      →  {
    Superscripts/subscripts    ⁹       →  9
    Squared characters         ㍿      →  株 + 式 + 会 + 社
    Fractions                  ¾       →  3 + / + 4
    Others                     ǆ       →  d + z + ◌̌
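Most rows of the compatibility table can be checked with `unicodedata` as well. The code points below are the standard ones for each example; note that the fraction decomposes to U+2044 FRACTION SLASH rather than an ASCII slash.

```python
import unicodedata

nfkc = lambda s: unicodedata.normalize("NFKC", s)
nfkd = lambda s: unicodedata.normalize("NFKD", s)

results = {
    "font variant": nfkc("\u210C"),    # BLACK-LETTER CAPITAL H
    "no-break space": nfkc("\u00A0"),
    "circled": nfkc("\u2460"),         # CIRCLED DIGIT ONE
    "halfwidth": nfkc("\uFF76"),       # HALFWIDTH KATAKANA LETTER KA
    "rotated": nfkc("\uFE37"),         # vertical left curly bracket
    "superscript": nfkc("\u2079"),
    "squared": nfkc("\u337F"),         # SQUARE CORPORATION
    "fraction": nfkc("\u00BE"),
    "digraph": nfkd("\u01C6"),         # SMALL LETTER DZ WITH CARON
}
for name, value in results.items():
    print(name, [f"U+{ord(c):04X}" for c in value])
```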

Normalization for fuzzy matching
  UTF-8:UPPER:UTF-8
    Input:  aăⅷǅбⓐᾥ
    Output: AĂⅧǄБⒶᾭ

  UTF-8:ZH-FUZZY-TW:KANA-PHONETIC:NFKDCASEFOLD:UTF-8
    Input:  ¼ℌℍăǅⓐ⁹ 灣湾ド
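A partial analogue of the second pipeline, covering only compatibility decomposition plus case folding with Python's standard library. The ZH-FUZZY-TW and KANA-PHONETIC phases are not modelled, so 灣 and 湾 still compare as different here; `fuzzy_key` is a hypothetical helper name.

```python
import unicodedata


def fuzzy_key(text: str) -> str:
    """Compatibility-decompose, case-fold, then decompose again."""
    folded = unicodedata.normalize("NFKD", text).casefold()
    return unicodedata.normalize("NFKD", folded)


print(fuzzy_key("ℌℍⓐ⁹"))
print(fuzzy_key("ǅ") == fuzzy_key("ǆ"))
print(fuzzy_key("灣") == fuzzy_key("湾"))
```

Two strings are treated as a fuzzy match when their keys are equal, so the key can also be stored in an index for lookups.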
