9. Encoding :: CNS11643 (全字庫) #issue
http://www.cns11643.gov.tw/
Only used by Taiwan government
NOT a subset of Unicode
Not just a charset/encoding
Font
Pronunciation                    ㄇㄥˊ / méng
Radical                          艸
Component                        艹日月
Stroke
Traditional/Simplified mapping   萌蕄
Table: Examples for some information provided by 全字庫 for「萌」
11. Encoding :: Big5
Many incompatible variants (abusing the PUA); no standard
tool handles them all
http://moztw.org/docs/big5/
Scenario      Dominant encoding
Microsoft     CP950
Taiwan BBS    UAO (Unicode-at-Once)
gov.tw        Big5-2003
gov.hk        HKSCS (1999, 2001, 2004)
Special characters conflict
The second byte can be 0x5C (\), 0x7C (|), or 0x7E (~), which
may have special meaning in certain contexts
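The trail-byte conflict can be checked with Python's built-in big5 codec. This is a minimal sketch: the stdlib codec is only a stand-in for the variants listed above, and the helper name risky_trail_bytes is made up for illustration.

```python
# Find characters whose Big5 trail byte collides with an ASCII metacharacter.
SPECIAL = {0x5C: "\\", 0x7C: "|", 0x7E: "~"}

def risky_trail_bytes(text, encoding="big5"):
    """Return (char, colliding ASCII char) for each double-byte character
    whose second byte is one of the special values."""
    hits = []
    for ch in text:
        raw = ch.encode(encoding)
        if len(raw) == 2 and raw[1] in SPECIAL:
            hits.append((ch, SPECIAL[raw[1]]))
    return hits

# The first three characters are the well-known "trailing backslash" cases.
print(risky_trail_bytes("許功蓋台北"))
```

Any tool that scans such bytes for backslashes, pipes, or tildes before decoding them will split these characters in half.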
12. Bsdconv :: Decoding and Encoding
Alternative to iconv
ISO-8859-1:UTF-8
(from : to)
Figure: Basic two-phase conversion
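The two phases can be sketched with Python's standard codecs. This is not bsdconv's API, only an illustration of the same from/to split; the function name convert is an assumption.

```python
def convert(data: bytes, from_codec: str, to_codec: str) -> bytes:
    """Two-phase conversion: decode to Unicode, then encode to the target."""
    text = data.decode(from_codec)   # "from" phase: bytes -> Unicode
    return text.encode(to_codec)     # "to" phase: Unicode -> bytes

latin1 = b"caf\xe9"                  # "café" in ISO-8859-1
print(convert(latin1, "iso-8859-1", "utf-8"))
```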
13. Bsdconv :: Codecs & Fallback
Optionally produce a question mark (U+003F) as the replacement
UTF-8,3F:ASCII,3F
(from : to)
Figure: Fallback codec
Transliteration
UTF-8:CP936,CP936-TRANS,3F
(from : to)
Figure: Multiple fallback codecs
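The idea of a comma-separated fallback list can be sketched as a chain of per-character encoders tried in order. Everything here is hypothetical (helper names, the tiny transliteration table); it mirrors the shape of "codec, transliteration, question mark" rather than bsdconv's actual implementation.

```python
def codec_encoder(name):
    """Encoder backed by a Python codec; returns None when it cannot encode."""
    def encode(ch):
        try:
            return ch.encode(name)
        except UnicodeEncodeError:
            return None
    return encode

def table_encoder(table):
    """Encoder backed by a lookup table (a stand-in for a -TRANS codec)."""
    return lambda ch: table.get(ch)

def constant_encoder(replacement):
    """Last-resort encoder that accepts everything (like 3F)."""
    return lambda ch: replacement

def encode_with_fallback(text, encoders):
    out = bytearray()
    for ch in text:
        for encode in encoders:
            piece = encode(ch)
            if piece is not None:     # first codec that accepts the char wins
                out += piece
                break
        else:
            raise ValueError(f"no codec accepts {ch!r}")
    return bytes(out)

TRANSLIT = {"é": b"e", "ï": b"i"}     # hypothetical transliteration table
chain = [codec_encoder("ascii"), table_encoder(TRANSLIT), constant_encoder(b"?")]
print(encode_with_fallback("naïve café ☃", chain))
```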
19. Bsdconv :: Phases
Traditional Chinese ⇔ Simplified Chinese
UTF-8:ZHTW:UTF-8
(from : inter : to)
Figure: Conversion with inter-mapping phase
20. Bsdconv :: Phases
Furthermore, phrase mapping
UTF-8:ZHTW:ZHTW-WORDS:UTF-8
(from : inter : inter : to)
Figure: Conversion with multiple inter-mapping phases
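A pipeline with inter-mapping phases can be sketched as functions applied to the decoded text between the from and to phases. The two tables below are tiny hypothetical samples, not the real ZHTW or ZHTW-WORDS data, and the function names are assumptions.

```python
CHAR_TABLE = {"软": "軟", "湾": "灣"}     # hypothetical character mapping
PHRASE_TABLE = {"軟件": "軟體"}           # hypothetical phrase mapping

def map_chars(text: str) -> str:
    """Inter phase 1: character-by-character mapping."""
    return text.translate(str.maketrans(CHAR_TABLE))

def map_phrases(text: str) -> str:
    """Inter phase 2: phrase replacement on the already mapped text."""
    for src, dst in PHRASE_TABLE.items():
        text = text.replace(src, dst)
    return text

def convert(data: bytes, from_codec: str, inters, to_codec: str) -> bytes:
    text = data.decode(from_codec)        # from phase
    for phase in inters:                  # zero or more inter phases, in order
        text = phase(text)
    return text.encode(to_codec)          # to phase

result = convert("台湾软件".encode("utf-8"), "utf-8",
                 [map_chars, map_phrases], "utf-8")
print(result.decode("utf-8"))
```

Phase order matters in this sketch: the phrase table is keyed on text that has already gone through the character mapping.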
21. Unicode :: Casing
IS complicated
Lowercase Uppercase
a A
i I
Table: English
Lowercase Uppercase
ı I
i İ
Table: Turkic
Lowercase Uppercase
a A
à A
Table: French
Lowercase Uppercase
σ Σ
ς Σ
Table: Greek
Default Case Folding
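Python's str methods apply the default, locale-independent Unicode case mappings, so they can illustrate several rows of the tables above. This is a minimal sketch; caseless_equal is a made-up helper, and it does not implement the Turkic tailoring.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings using default case folding."""
    return a.casefold() == b.casefold()

# Greek: both sigma forms share one uppercase letter.
print("σ".upper(), "ς".upper())

# Without Turkic tailoring, dotless ı and dotted i both uppercase to I,
# and İ lowercases to i followed by a combining dot above (two code points).
print("ı".upper(), "i".upper(), len("İ".lower()))

# Case folding handles one-to-many mappings and final sigma.
print(caseless_equal("Straße", "STRASSE"))
print(caseless_equal("ΟΔΟΣ", "οδος"))
```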
22. Unicode :: Normalization Forms (1/2)
UAX#15
Indexing
Identification security
Username, Domain name
Combining sequence             Ç ↔ C + ◌̧
Ordering of combining marks    q+◌̇+◌̣ ↔ q+◌̣+◌̇
Hangul                         가 ↔ ᄀ + ᅡ
Singleton                      Ω (U+2126) ↔ Ω (U+03A9)
Table: Canonical Equivalence
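The rows of the canonical equivalence table can be checked with the standard unicodedata module. A minimal sketch; canonically_equal is an illustrative helper name.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

pairs = [
    ("\u00C7", "C\u0327"),              # combining sequence
    ("q\u0307\u0323", "q\u0323\u0307"), # ordering of combining marks
    ("\uAC00", "\u1100\u1161"),         # Hangul syllable vs. conjoining jamo
    ("\u2126", "\u03A9"),               # singleton: OHM SIGN vs. OMEGA
]
for a, b in pairs:
    # Raw comparison fails, normalized comparison succeeds.
    print(a == b, canonically_equal(a, b))
```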
23. Unicode :: Normalization Forms (2/2)
UAX#15
Font variants              ℌ → H
Breaking differences       NBSP → SP
Cursive forms              ﻧ ﻨ → ن
Circled                    ① → 1
Width, size, rotated       ｶ → カ
                           ︷ → {
Superscripts/subscripts    ⁹ → 9
Squared characters         ㍿ → 株 + 式 + 会 + 社
Fractions                  ¾ → 3 + ⁄ (U+2044) + 4
Others                     dž → d + z + ◌̌
Table: Compatibility Equivalence
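Compatibility decomposition can be inspected the same way with unicodedata. A minimal sketch that prints the NFKD code points of one sample per category; the sample list is chosen here for illustration.

```python
import unicodedata

def nfkd(text: str) -> str:
    return unicodedata.normalize("NFKD", text)

SAMPLES = [
    "\u210C",  # font variant (black-letter H)
    "\u00A0",  # no-break space
    "\u2460",  # circled digit one
    "\uFF76",  # halfwidth katakana KA
    "\u2079",  # superscript nine
    "\u337F",  # squared "corporation"
    "\u00BE",  # vulgar fraction three quarters
    "\u01C6",  # dz with caron digraph
]
for ch in SAMPLES:
    decomposed = " ".join(f"U+{ord(c):04X}" for c in nfkd(ch))
    print(f"U+{ord(ch):04X} -> {decomposed}")
```

Canonical forms (NFC/NFD) leave all of these samples except the last unchanged; only the K forms fold them.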
24. Normalization for fuzzy matching
UTF-8:UPPER:UTF-8
Input: aăⅷDžбⓐᾥ
Output: AĂⅧDŽБⒶᾭ
UTF-8:ZH-FUZZY-TW:KANA-PHONETIC:NFKD-CASEFOLD:UTF-8
Input: ¼ℌℍăDžⓐ⁹ 灣湾ド鬒鬒㊣ æß
Output: 1⁄4hhădža9 灣灣 do 鬒鬒正 æss
                 Composition    Decomposition
Canonical        NFC            NFD
Compatibility    NFKC           NFKD
Table: The four Unicode normalization forms and the transformations
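A fuzzy-matching key in the spirit of the NFKD plus case folding step can be sketched with the standard library. This is an approximation under stated assumptions: it does not cover the ZH-FUZZY-TW or KANA-PHONETIC mappings, and fuzzy_key is a made-up name.

```python
import unicodedata

def fuzzy_key(text: str) -> str:
    """Compatibility-decompose, case-fold, then decompose again so that
    characters produced by folding are also in decomposed form."""
    folded = unicodedata.normalize("NFKD", text).casefold()
    return unicodedata.normalize("NFKD", folded)

print(fuzzy_key("ℌℍ⁹ⓐ"))
print(fuzzy_key("Straße") == fuzzy_key("STRASSE"))
```

Two strings can then be treated as a fuzzy match when their keys are equal.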
25. Bsdconv :: Cascade
Re-encode
UTF-8:ISO-8859-1|BIG5:UTF-8
(from : to | from : to)
Figure: Cascaded conversions
Input Output
¥x¥_ 台北
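The cascade in the figure can be reproduced with Python's codecs: the output bytes of the first conversion become the input of the second. A minimal sketch; cascade is an illustrative name and the stdlib big5 codec stands in for bsdconv's BIG5.

```python
def cascade(data: bytes, *conversions) -> bytes:
    """Run several (from_codec, to_codec) conversions back to back."""
    for from_codec, to_codec in conversions:
        data = data.decode(from_codec).encode(to_codec)
    return data

# Big5 bytes that were wrongly decoded as ISO-8859-1 and stored as UTF-8.
mojibake = "¥x¥_".encode("utf-8")
fixed = cascade(mojibake, ("utf-8", "iso-8859-1"), ("big5", "utf-8"))
print(fixed.decode("utf-8"))
```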
26. Bsdconv :: Codec argument
Other than question mark
UTF-8,ANY#0121:ASCII,ANY#21
(from : to)
Figure: Codec argument
Or more than one character
UTF-8,ANY#013F.0121:ASCII,ANY#21
(from : to)
Figure: Data list, separated by dots
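A comparable effect in Python is a custom codec error handler that substitutes a chosen replacement, of one or more characters, for anything that cannot be converted. A minimal sketch using codecs.register_error; the handler names "bang" and "qbang" and the helper register_replacement are made up for illustration.

```python
import codecs

def register_replacement(name: str, replacement: str) -> None:
    """Register an error handler that emits `replacement` once per
    unconvertible unit, for both encoding and decoding."""
    def handler(error):
        if isinstance(error, (UnicodeEncodeError, UnicodeDecodeError)):
            count = error.end - error.start
            return (replacement * count, error.end)
        raise error
    codecs.register_error(name, handler)

register_replacement("bang", "!")     # single character, like ANY#0121
register_replacement("qbang", "?!")   # two characters, like ANY#013F.0121

print("héllo".encode("ascii", "bang"))
print("日本".encode("ascii", "qbang"))
print(b"ok\xff".decode("utf-8", "bang"))
```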
28. Charset & Encoding
[Diagram of character sets, largest to smallest: Unicode (32-bit address space); Unicode up to U+10FFFF; Unicode BMP (Basic Multilingual Plane, up to U+FFFF). Also shown: GB18030, CNS11643, CP950, Latin1]
Figure: Character Sets
29. Bsdconv :: Types
(01) Unicode
(02) CNS11643
(03) Byte
(04) Chinese components
(1B) ANSI control sequences
(00) Bsdconv special characters
38. Unicode :: East Asian Width (1/2)
UAX#11
[Venn diagram of the categories Narrow, Halfwidth, Wide, Fullwidth, Ambiguous, Neutral]
Figure: Venn Diagram Showing the Set Relations for Six Categories
39. Unicode :: East Asian Width (2/2)
UAX#11
Class    Example    Resolved width
A        Я          Ambiguous
N        ऊ          Narrow
Na       A          Narrow
F        Ａ         Wide
H        ｶ          Narrow
W        カ         Wide
W        咦         Wide
Table: Examples for Each Character Class and Their Resolved Widths
Na Narrow
N Neutral, usually treated as Narrow
W Wide
F Fullwidth
H Halfwidth
Table: Width attributes
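The width classes above are exposed by unicodedata.east_asian_width, which allows a rough column count for terminal output. A minimal sketch: display_width is a made-up helper, the ambiguous_wide flag is an assumption about how a caller would choose the resolution, and combining marks (which occupy no column) are not handled.

```python
import unicodedata

def display_width(text: str, ambiguous_wide: bool = False) -> int:
    """Count terminal columns: W and F take two, A depends on context,
    everything else (Na, N, H) takes one."""
    width = 0
    for ch in text:
        cls = unicodedata.east_asian_width(ch)
        if cls in ("W", "F"):
            width += 2
        elif cls == "A":
            width += 2 if ambiguous_wide else 1
        else:
            width += 1
    return width

for ch in "ЯऊA\uFF21\uFF76カ咦":
    print(f"U+{ord(ch):04X}", unicodedata.east_asian_width(ch))

print(display_width("Я咦"), display_width("Я咦", ambiguous_wide=True))
```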
42. Khmer legacy font converter
https://github.com/buganini/khmerconv
Issues
Encodings without a registered name, bound to fonts
Stored in CP1252 or UTF-8
Solution
Two-pass detection
Detect encoding
Detect font family (currently not working)
(High coverage in SBCS)
Algorithm ported from Khmer Converter (http://www.khmeros.info/en/khmer-converter)
Khmer Converter
Mapping
Reordering
Visual order vs. Unicode model
Unicode model: baseCharacter [+ [Robat/Shifter] + [Coeng*] + [Shifter] + [Vowel] + [Sign]]
46. Terminal transcoding
https://github.com/buganini/bug5
Issues
UAO: non-standard Big5 extension
Double color hack
ANSI control sequences in the middle of a DBCS character
Ambiguous width characters
luit/screen cannot help
Solution (tl;dr)
Big5 to Unicode
ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
Unicode to Big5
UTF-8,00,BYTE:ZHTW:AMBIGUOUS-UNPAD:BIG5,CP950-TRANS,UAO,00,ANY#3F
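The problem of a control sequence sitting between the two bytes of a DBCS character can be illustrated with a small defragmenting pass. This is a guess at the general idea, not the actual BIG5-DEFRAG codec: the function name, the lead-byte range 0x81-0xFE, and the choice to move the sequence after the character are all assumptions, and only CSI sequences of the form ESC [ ... letter are recognized.

```python
import re

CSI = re.compile(rb"\x1b\[[0-9;]*[A-Za-z]")   # e.g. ESC [ 1 ; 3 1 m

def defrag_big5(data: bytes) -> bytes:
    """Rejoin double-byte characters split by ANSI control sequences,
    emitting the sequences right after the rejoined character."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        m = CSI.match(data, i) if b == 0x1B else None
        if m:                                  # ordinary control sequence
            out += m.group(0)
            i = m.end()
            continue
        if 0x81 <= b <= 0xFE:                  # assumed lead-byte range
            j, pending = i + 1, bytearray()
            while j < n:                       # collect sequences after the lead byte
                m = CSI.match(data, j)
                if not m:
                    break
                pending += m.group(0)
                j = m.end()
            if j < n:                          # trail byte found
                out += bytes([b, data[j]]) + pending
                i = j + 1
                continue
        out.append(b)                          # single byte or dangling lead byte
        i += 1
    return bytes(out)

split = b"\xa5\x1b[1;31m\x78\xa5\x5f"          # colour change inside the first character
joined = defrag_big5(split)
print(joined)
print(CSI.sub(b"", joined).decode("big5"))
```

After this pass every double-byte character is contiguous, so a normal Big5 decoder can run on the text while the control sequences pass through untouched.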