Journey of Bsdconv

Charset & Encoding
Character Set
Collection of characters
Encoding
Binary representation

Charset & Encoding
Sets shown: Unicode (32-bit address space), Unicode up to U+10FFFF, Unicode BMP (Basic Multilingual Plane, up to U+FFFF), GB18030, CNS11643, CP950, Latin1
Figure: Character Sets

Charset & Encoding
Character set                  Encoding
Unicode (32-bit addr. space)   UTF-32 / UCS4
Unicode up to U+10FFFF         UTF-8 [1] / UTF-16
Unicode BMP (up to U+FFFF)     UCS2
GB18030                        GB18030
CNS11643                       CNS11643
CP950                          CP950 (DBCS)
Latin1                         ISO-8859-1 / EBCDIC-037 [2]
[1] Could cover more, but restricted by RFC 3629
[2] Aka IBM-37; some control characters are different from ISO-8859-1

Encoding :: UTF-32 / UCS4
Fixed Length
4 bytes
Filesize *= 4 for ASCII text file
Incompatible with C-style string convention
Endianness concern

Encoding :: UCS2
Fixed Length
2 bytes
Filesize *= 2 for ASCII text file
Endianness concern
BMP-only

Encoding :: UTF-16
Variable Length
2 bytes / 4 bytes (Surrogate pairs)
Surrogates
Using U+D800..U+DFFF
Endianness concern
******** ********
110110** ******** 110111** ********
Table: UTF-16 Structure
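The surrogate layout in the table can be sketched in a few lines of Python. This is an illustration of the UTF-16 bit arithmetic only, not bsdconv code; the function name `utf16_code_units` is made up for this example.

```python
def utf16_code_units(cp):
    """Return the UTF-16 code units (as ints) for one Unicode code point."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x10000:
        return [cp]                    # BMP: a single 16-bit unit
    v = cp - 0x10000                   # 20 bits are left to distribute
    high = 0xD800 | (v >> 10)          # 110110** ******** (top 10 bits)
    low = 0xDC00 | (v & 0x3FF)         # 110111** ******** (low 10 bits)
    return [high, low]


# U+1F600 lies outside the BMP, so it needs a surrogate pair.
print([f"{u:04X}" for u in utf16_code_units(0x1F600)])
```

The byte order of each unit is still up to the serialization (UTF-16LE or UTF-16BE), which is the endianness concern listed above.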

Encoding :: UTF-8
Variable Length
1~6 bytes in the original design (RFC 3629 restricts it to 1~4 bytes)
Compatible with C-style string convention
Self-synchronizing
Endian-neutral
Sorting order = Code point order
0******* (ASCII)
110***** 10******
1110**** 10****** 10******
11110*** 10****** 10****** 10******
111110** 10****** 10****** 10****** 10******
1111110* 10****** 10****** 10****** 10****** 10******
Table: UTF-8 Structure
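The first four rows of the table can be written out directly. The sketch below is an illustration, not bsdconv code: it covers only the 1 to 4 byte forms allowed by RFC 3629, skips surrogate checks for brevity, and the name `utf8_encode` is an assumption of this example.

```python
def utf8_encode(cp):
    """Encode one code point following the bit patterns in the table."""
    if cp < 0x80:                       # 0*******
        return bytes([cp])
    if cp < 0x800:                      # 110***** 10******
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 1110**** 10****** 10******
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp < 0x110000:                   # 11110*** 10****** 10****** 10******
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("beyond U+10FFFF")


for cp in (0x41, 0x3B1, 0x840C, 0x1F600):
    print(f"U+{cp:04X}", utf8_encode(cp).hex().upper())
```

Because every continuation byte starts with the bits 10, a decoder that lands in the middle of a sequence can skip forward to the next lead byte, which is what makes the encoding self-synchronizing.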

Encoding :: CNS11643 (全字庫) #issue
http://www.cns11643.gov.tw/
Used only by the Taiwan government
NOT a subset of Unicode
Not just a charset/encoding
Font
Pronunciation ㄇㄥˊ / méng
Radical 艸
Component 艹日月
Stroke
Tra/Sim mapping 萌蕄
Table: Examples of the information provided by 全字庫 for「萌」

Encoding :: CCCII
Variants
Variant glyphs on different planes
Mostly used for library indexing
強 21 3D 48
彊 2D 3D 48
强 33 3D 48

Encoding :: Big5
Many incompatible variants (abusing the PUA); no standard tool handles them all
http://moztw.org/docs/big5/
Scenario     Dominating encoding
Microsoft    CP950
Taiwan BBS   UAO (Unicode-at-Once)
gov.tw       Big5-2003
gov.hk       HKSCS (1999, 2001, 2004)
Special character conflicts
The second byte can be 0x5C (\), 0x7C (|), or 0x7E (~), which may have special meaning in certain contexts
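The 0x5C conflict is easy to observe with Python's built-in `big5` codec. This is a small illustration using the standard library rather than bsdconv, and the helper name `trail_byte` is made up here.

```python
def trail_byte(ch):
    """Return the second byte of a two-byte Big5 character."""
    raw = ch.encode("big5")
    return raw[1]


# The three characters of the classic example all end in 0x5C.
for ch in "許功蓋":
    print(ch, ch.encode("big5").hex().upper(), hex(trail_byte(ch)))

# A byte-oriented tool that treats 0x5C as an escape character
# sees a dangling backslash at the end of this string.
raw = "成功".encode("big5")
print(raw, raw.endswith(b"\\"))
```

Any shell, C compiler, or SQL parser that scans such text byte by byte may consume the 0x5C as an escape and corrupt the character.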

Bsdconv :: Decoding and Encoding
Alternative to iconv
ISO-8859-1:UTF-8
from = ISO-8859-1, to = UTF-8
Figure: Basic two-phase conversion

Bsdconv :: Codecs & Fallback
Optionally produce question mark (U+003F) as replacement
UTF-8,3F:ASCII,3F
from = UTF-8,3F; to = ASCII,3F
Figure: Fallback codec
Transliteration
UTF-8:CP936,CP936-TRANS,3F
from = UTF-8; to = CP936,CP936-TRANS,3F
Figure: Multiple fallback codecs
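The idea of a comma-separated fallback list can be sketched as "try each codec in order and take the first one that accepts the character". The code below is a conceptual model in Python, not bsdconv's implementation; the codec functions and the one-entry transliteration table are hypothetical.

```python
def ascii_codec(ch):
    """Primary codec: accepts only ASCII."""
    return ch.encode("ascii") if ord(ch) < 0x80 else None


TRANSLIT = {"\u00EF": b"i"}  # hypothetical table: i-diaeresis -> i


def translit_codec(ch):
    """Fallback codec: transliterate what the primary codec rejects."""
    return TRANSLIT.get(ch)


def question_mark(ch):
    """Last resort, like the 3F codec: always answers with '?'."""
    return b"?"


def encode(text, codecs):
    out = bytearray()
    for ch in text:
        for codec in codecs:          # first codec that accepts ch wins
            data = codec(ch)
            if data is not None:
                out += data
                break
        else:
            raise ValueError(f"no codec accepts U+{ord(ch):04X}")
    return bytes(out)


print(encode("na\u00EFve", [ascii_codec, question_mark]))
print(encode("na\u00EFve", [ascii_codec, translit_codec, question_mark]))
```

Leaving out the last-resort codec turns an unconvertible character into an error instead of a silent replacement.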

Big5 5C issue (許功蓋)
BIG5:BIG5-5C,BIG5
#             Input               Output
Big5 Literal  "成功"              "成功\"
ASCII/Hex     "\xA6\xA8\xA5\"     "\xA6\xA8\xA5\\"
BIG5-5C,BIG5:BIG5
#             Input               Output
Big5 Literal  "成功\"             "成功"
ASCII/Hex     "\xA6\xA8\xA5\\"    "\xA6\xA8\xA5\"

Traditional/Simplified Chinese
NOT one-to-one mapping
Traditional 乾幹干
vs.
Simplified 干干干
Context dependent
之後、夜之后、入夜之後
Variants
峰、峯

Project Chvar (1/2)
https://github.com/buganini/chvar
Compatibility group: { Canonical group: 签 簽 } { Canonical group: 籖 籤 }
Figure: Two level grouping in Chvar
         签  簽  籖  籤
TW       簽  -   籤  -
CN       -   签  -   籖
CP950    簽  -   籤  -
GB2312   -   签  ×   ×
Table: Canonical Group
         签  簽  籖  籤
TW       簽  -   簽  簽
CN       -   签  签  签
CP950    簽  -   簽  簽
GB2312   -   签  签  签
Table: Compatibility Group

Project Chvar (2/2)
https://github.com/buganini/chvar
Normalization: Canonical Equivalence
Transliteration: Converted, or Canonical Equivalence, or Compatibility Equivalence
Fuzzy character matching: Compatibility Equivalence
         签  簽  籖  籤
TW       簽  -   籤  -
CN       -   签  -   籖
CP950    簽  -   籤  -
GB2312   -   签  ×   ×
Table: Canonical Group
         签  簽  籖  籤
TW       簽  -   簽  簽
CN       -   签  签  签
CP950    簽  -   簽  簽
GB2312   -   签  签  签
Table: Compatibility Group

Bsdconv :: Phases
Traditional Chinese ⇔ Simplified Chinese
UTF-8:ZHTW:UTF-8
from = UTF-8, inter = ZHTW, to = UTF-8
Figure: Conversion with inter-mapping phase

Bsdconv :: Phases
Furthermore, phrase mapping
UTF-8:ZHTW:ZHTW-WORDS:UTF-8
from = UTF-8, inter = ZHTW, inter = ZHTW-WORDS, to = UTF-8
Figure: Conversion with multiple inter-mapping phases
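Why a second inter phase helps can be shown with a toy model: a per-character pass followed by a phrase pass that repairs context-dependent cases. The tables below are tiny hypothetical stand-ins, not the data shipped with ZHTW or ZHTW-WORDS, and the longest-match scan is an assumption of this sketch rather than bsdconv's algorithm.

```python
CHAR_MAP = {"后": "後"}            # hypothetical character-level default

PHRASES = {                         # hypothetical phrase-level corrections
    "入夜之後": "入夜之後",         # "after nightfall": keep 後
    "夜之後": "夜之后",             # "Queen of the Night": restore 后
}


def map_chars(text):
    """First inter phase: context-free character mapping."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)


def map_phrases(text):
    """Second inter phase: longest phrase match at each position."""
    keys = sorted(PHRASES, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(PHRASES[key])
                i += len(key)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


simplified = "之后、夜之后、入夜之后"
print(map_phrases(map_chars(simplified)))
```

The longer phrase has to win over the shorter one it contains, otherwise "入夜之後" would be damaged by the "夜之後" rule.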

Unicode :: Casing
IS complicated
Lowercase Uppercase
a A
i I
Table: English
Lowercase Uppercase
ı I
i İ
Table: Turkic
Lowercase Uppercase
a A
à A
Table: French
Lowercase Uppercase
σ Σ
ς Σ
Table: Greek
Default Case Folding
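A few of the cases in these tables can be checked with Python's built-in string methods, which implement the locale-independent Unicode default mappings. This is an illustration with the standard library, not bsdconv; the helper `same_ignoring_case` is named here for the example only.

```python
def same_ignoring_case(a, b):
    """Caseless comparison using Unicode default case folding."""
    return a.casefold() == b.casefold()


# Greek: two lowercase sigmas share one uppercase form.
print("\u03C3".upper(), "\u03C2".upper())          # sigma, final sigma
print(same_ignoring_case("\u03C3", "\u03C2"))

# Turkic: the default mapping is not locale-aware, so "I" lowers to
# dotted "i", and U+0130 lowers to "i" plus a combining dot above.
print("I".lower(), [hex(ord(c)) for c in "\u0130".lower()])

# Case folding can change the length of a string.
print("\u00DF".casefold())                          # sharp s
```

Locale-specific behaviour such as the Turkic dotless i needs tailored mappings on top of these defaults.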

Unicode :: Normalization Forms (1/2)
UAX#15
Indexing
Identification security
Username, Domain name
Combining sequence Ç C + ◌̧
Ordering of combining marks q+◌̇+◌̣ q+◌̣+◌̇
Hangul 가 ᄀ + ᅡ
Singleton Ω Ω
Table: Canonical Equivalence
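Each row of the table can be reproduced with `unicodedata.normalize` from the Python standard library. This is shown only to make the equivalences concrete; it is not how bsdconv implements them.

```python
import unicodedata


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Combining sequence: U+00C7 versus C + combining cedilla.
print(canonically_equivalent("\u00C7", "C\u0327"))

# Ordering of combining marks: dot above and dot below in either order.
print(canonically_equivalent("q\u0307\u0323", "q\u0323\u0307"))

# Hangul: a precomposed syllable versus its conjoining jamo.
print(canonically_equivalent("\uAC00", "\u1100\u1161"))

# Singleton: OHM SIGN normalizes to GREEK CAPITAL LETTER OMEGA.
print(unicodedata.normalize("NFC", "\u2126") == "\u03A9")
```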

Unicode :: Normalization Forms (2/2)
UAX#15
Font variants ℌ H
Breaking differences NBSP SP
Cursive forms ‫ﻧ‬ ‫ﻨ‬
Circled ① 1
Width, size, rotated
ｶカ
︷ {
Superscripts/subscripts ⁹ 9
Squared characters ㍿株 + 式 + 会 + 社
Fractions ¾ 3 + / + 4
Others ǆ d + z + ◌̌
Table: Compatibility Equivalence
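The compatibility mappings in the table can be inspected the same way with the NFKC form from Python's `unicodedata` module. The dictionary below simply restates rows of the table as code point pairs; the labels are taken from the table.

```python
import unicodedata

# (source, expected NFKC result)
EXAMPLES = {
    "font variant":   ("\u210C", "H"),
    "breaking":       ("\u00A0", " "),
    "circled":        ("\u2460", "1"),
    "width":          ("\uFF76", "\u30AB"),      # halfwidth KA -> KA
    "rotated":        ("\uFE37", "{"),
    "superscript":    ("\u2079", "9"),
    "squared":        ("\u337F", "株式会社"),
    "fraction":       ("\u00BE", "3\u20444"),    # 3, FRACTION SLASH, 4
}


def nfkc(s):
    return unicodedata.normalize("NFKC", s)


for label, (src, expected) in EXAMPLES.items():
    print(f"{label:14} U+{ord(src):04X} -> {nfkc(src)!r}", nfkc(src) == expected)
```

Unlike canonical equivalence, these mappings discard information (for example the circle or the superscript position), so they suit searching and matching rather than storage.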

Normalization for fuzzy matching
UTF-8:UPPER:UTF-8
Input: aăⅷǅбⓐᾥ
Output: AĂⅧǄБⒶᾭ
UTF-8:ZH-FUZZY-TW:KANA-PHONETIC:NFKD-CASEFOLD:UTF-8
Input: ¼ℌℍăǅⓐ⁹ 灣湾ド鬒鬒㊣ æß
Output: 1⁄4hhădža9 灣灣 do 鬒鬒正 æss
Composition Decomposition
Canonical NFC NFD
Compatibility NFKC NFKD
Table: The four Unicode normalization forms and the transformations
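A fuzzy-matching key along the lines of the NFKD-CASEFOLD step can be approximated with the standard library. The chain below follows the alias expansion NFD:CASEFOLD:NFKD:CASEFOLD:NFKD listed on the Alias slide, using Python's `unicodedata` and `str.casefold` as stand-ins; it does not cover ZH-FUZZY-TW or KANA-PHONETIC, and `fuzzy_key` is a name invented for this sketch.

```python
import unicodedata


def fuzzy_key(s):
    """NFD -> casefold -> NFKD -> casefold -> NFKD."""
    s = unicodedata.normalize("NFD", s)
    s = s.casefold()
    s = unicodedata.normalize("NFKD", s)
    s = s.casefold()
    return unicodedata.normalize("NFKD", s)


def fuzzy_equal(a, b):
    return fuzzy_key(a) == fuzzy_key(b)


print(fuzzy_key("\u210C"))                  # black-letter H
print(fuzzy_key("\u24B6"))                  # circled capital A
print(fuzzy_key("\u00DF"))                  # sharp s
print(fuzzy_equal("\u01C5", "DZ\u030C"))    # titlecase digraph vs. D, Z, caron
```

The key is meant for comparison only; the original text should still be stored unmodified.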

Bsdconv :: Cascade
Re-encode
UTF-8:ISO-8859-1|BIG5:UTF-8
first conversion: from = UTF-8, to = ISO-8859-1; second conversion: from = BIG5, to = UTF-8
Figure: Cascaded conversions
Input Output
¥x¥_ 台北
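The same re-encoding can be reproduced with Python's built-in codecs: encode the mojibake back to the bytes it came from, then decode those bytes with the encoding they were really in. This mirrors the cascade conceptually and is not bsdconv's API; `repair_mojibake` is a name chosen for this sketch.

```python
def repair_mojibake(text, wrong="iso-8859-1", right="big5"):
    """Undo a wrong decode: text -> original bytes -> correct text."""
    original_bytes = text.encode(wrong)   # like UTF-8:ISO-8859-1
    return original_bytes.decode(right)   # like BIG5:UTF-8


garbled = "\u00A5x\u00A5_"                # the yen signs are byte 0xA5
print(garbled.encode("iso-8859-1").hex().upper())
print(repair_mojibake(garbled))
```

This only works when the wrong decode was lossless, which is why a single-byte encoding that maps every byte, such as ISO-8859-1, is the usual culprit that can be undone.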

Bsdconv :: Codec argument
Other than question mark
UTF-8,ANY#0121:ASCII,ANY#21
from = UTF-8,ANY#0121; to = ASCII,ANY#21
Figure: Codec argument
Or more than one character
UTF-8,ANY#013F.0121:ASCII,ANY#21
from = UTF-8,ANY#013F.0121; to = ASCII,ANY#21
Figure: Data list, separated by dot

Bsdconv :: Alias
Alias                Expands to
from/3F              ANY#013F&ERROR
to/3F                ANY#3F&ERROR
from/UTF-8           ASCII,_UTF-8
inter/NFKD           _NFKD:_NF-HANGUL-DECOMPOSITION:_NF-ORDER
inter/NFKC           NFKD:_NFC:_NF-HANGUL-COMPOSITION
inter/NFKD-CASEFOLD  NFD:CASEFOLD:NFKD:CASEFOLD:NFKD
filter/01            UNICODE

Bsdconv :: Types
(01) Unicode
(02) CNS11643
(03) Byte
(04) Chinese components
(1B) ANSI control sequences
(00) Bsdconv special characters

Chinese components composition
https://github.com/buganini/chicomp
UTF-8:ZH-DECOMP:ZH-COMP:UTF-8
Input: 功夫不好不要艹我
Output: 巭孬嫑莪
UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:UTF-8
Output: ㄆㄨㄋㄠㄧㄠㄜ ˊ
UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:HAN-PINYIN:UTF-8
Output: pu nao yao [uh]2

Bsdconv :: Flags
FREE - memory management
MARK - identifier

Look-through (1/4)
Input (UTF-8 literal): %u03B1%CE%B2
Decoder: ESCAPE : ...
Internal data: [01 03 B1] [03 CE] [03 B2]
Entity Unicode UTF-8 Hex
α U+03B1 CEB1
β U+03B2 CEB2
Figure: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8

Look-through (2/4)
Internal data: [01 03 B1] [03 CE] [03 B2]
Encoder: ... : PASS#MARK&FOR=1,BYTE
Internal data: [01 03 B1] (MARK) CE B2
α U+03B1 CEB1
β U+03B2 CEB2

Look-through (3/4)
Internal data: [01 03 B1] (MARK) CE B2
Decoder: PASS#UNMARK,UTF-8 : ...
Internal data: [01 03 B1] [01 03 B2]
α U+03B1 CEB1
β U+03B2 CEB2

Look-through (4/4)
Internal data: [01 03 B1] [01 03 B2]
Encoder: ... : UTF-8
Internal data: CE B1 ("α") CE B2 ("β")
Output (UTF-8 literal): αβ
α U+03B1 CEB1
β U+03B2 CEB2
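The four look-through steps can be imitated with a small Python model. It is a conceptual sketch only: the tuple representation, the function names, and the restriction to `%uXXXX` and `%XX` escapes are assumptions of this example, not bsdconv internals. The type tags 01 (Unicode) and 03 (Byte) are the ones listed on the Types slide.

```python
import re

UNICODE, BYTE = 0x01, 0x03


def escape_decode(text):
    """Step 1 (ESCAPE): '%uXXXX' -> Unicode item, '%XX' -> byte item."""
    items = []
    for m in re.finditer(r"%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})", text):
        if m.group(1):
            items.append((UNICODE, int(m.group(1), 16), False))
        else:
            items.append((BYTE, int(m.group(2), 16), False))
    return items


def pass_mark_or_byte(items):
    """Step 2 (PASS#MARK&FOR=1,BYTE): mark Unicode items, emit raw bytes."""
    return [(kind, value, True) if kind == UNICODE else value
            for kind, value, _ in items]


def pass_unmark_or_utf8(stream):
    """Step 3 (PASS#UNMARK,UTF-8): unmark marked items, decode byte runs."""
    out, pending = [], bytearray()

    def flush():
        for ch in pending.decode("utf-8"):
            out.append((UNICODE, ord(ch), False))
        pending.clear()

    for elem in stream:
        if isinstance(elem, tuple):      # a marked item looks through
            flush()
            out.append((UNICODE, elem[1], False))
        else:                            # a bare int is a raw byte
            pending.append(elem)
    flush()
    return out


def utf8_encode(items):
    """Step 4 (UTF-8): Unicode items -> UTF-8 bytes."""
    return "".join(chr(value) for _, value, _ in items).encode("utf-8")


def look_through(text):
    return utf8_encode(pass_unmark_or_utf8(pass_mark_or_byte(escape_decode(text))))


print(look_through("%u03B1%CE%B2").decode("utf-8"))
```

The mark is what lets an already-decoded item cross the cascade boundary untouched while the raw bytes next to it get a second decoding pass.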

Unicode :: East Asian Width (1/2)
UAX#11
Categories shown: Narrow, Halfwidth, Wide, Fullwidth, Ambiguous, Neutral
Figure: Venn Diagram Showing the Set Relations for Six Categories

Unicode :: East Asian Width (2/2)
UAX#11
Class  Character  Resolved width
A      Я          Ambiguous
N      ऊ          Narrow
Na     A          Narrow
F      Ａ         Wide
H      ｶ          Narrow
W      カ, 咦     Wide
Table: Examples for Each Character Class and Their Resolved Widths
Na Narrow
N Neutral, usually treated as Narrow
W Wide
F Fullwidth
H Halfwidth
Table: Width attributes

String width measurement
echo "42(ˊ_>ˋ) 紅茶" | bsdconv UTF-8:WIDTH:NULL
FULL: 2
HALF: 7
AMBI: 2
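The same tally can be approximated with `unicodedata.east_asian_width` from the Python standard library. This is an illustration, not the WIDTH codec itself: grouping W/F as FULL and Na/H/N as HALF, and skipping control characters, are assumptions of this sketch.

```python
import unicodedata


def measure(text):
    """Count characters by resolved East Asian Width class."""
    counts = {"FULL": 0, "HALF": 0, "AMBI": 0}
    for ch in text:
        if unicodedata.category(ch) == "Cc":     # skip newline etc.
            continue
        width = unicodedata.east_asian_width(ch)
        if width in ("W", "F"):
            counts["FULL"] += 1
        elif width == "A":
            counts["AMBI"] += 1
        else:                                    # Na, H, N
            counts["HALF"] += 1
    return counts


# U+02CA and U+02CB are the two tone-mark-like characters in the face.
sample = "42(\u02CA_>\u02CB) 紅茶\n"
print(measure(sample))
```

How many columns the AMBI characters take is a decision for the terminal or font, which is why they are reported separately.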

Chinese charset encoding detection
https://github.com/buganini/chiconv
ENCODING:SCORE#WITH=CJK:COUNT:ZH-BONUS:ZHTW:ZH-BONUS-PHRASE:NULL
Score(s) = ($SCORE − $IERR × $COUNT × 0.01) / $COUNT
帥呆了 ⇒ UTF-8:SCORE#WITH=CJK:……
ENCODING SCORE COUNT IERR Score(s)
UTF-8 19 4 0 4.75
BIG5 8 3 2 -4.0
GBK 4 1 4 -36.0
CCCII 36 9 0 4.0
UTF-16LE 20 5 2 0.0

Khmer legacy font converter
https://github.com/buganini/khmerconv
Issues
Encodings without registered names, bound to fonts
Stored in CP1252 or UTF-8
Solution
Two-pass detection
Detect encoding
Detect font family (currently not working)
(High coverage in SBCS)
Algorithm ported from Khmer Converter [3]
Khmer Converter
Mapping
Reordering
Visual order vs. Unicode model
Unicode Model: baseCharacter [+ [Robat/Shifter] + [Coeng*] + [Shifter] + [Vowel] + [Sign]]
[3] http://www.khmeros.info/en/khmer-converter

Terminal transcoding
https://github.com/buganini/bug5
Issues
UAO: non-standard Big5 extension
Double color hack
ANSI control sequence in the middle of DBCS
Ambiguous width characters
luit/screen cannot help
Solution (tl;dr)
Big5 to Unicode
ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
Unicode to Big5
UTF-8,00,BYTE:ZHTW:AMBIGUOUS-UNPAD:BIG5,CP950-TRANS,UAO,00,ANY#3F

Bug5 explained (1/6)
Input (Big5 literal): ★\xC5\x1B[1m\xE5
Decoder: ANSI-CONTROL,BYTE : ...
Internal data: [03 A1] [03 B9] [03 C5] [1B 5B 31 6D] [03 E5]
Entity  Unicode  UTF-8 Hex  Big5 Hex
★       U+2605   E29885     A1B9
驚      U+9A5A   E9A99A     C5E5
[       U+005B   5B         5B
1       U+0031   31         31
m       U+006D   6D         6D
(NBSP)  U+00A0   C2A0       -
Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B

Bug5 explained (2/6)
Internal data: [03 A1] [03 B9] [03 C5] [1B 5B 31 6D] [03 E5]
Inter-conversion: ... : BIG5-DEFRAG : ...
Internal data: [03 A1] [03 B9] [03 C5] [03 E5] [1B 5B 31 6D]
★ U+2605 E29885 A1B9
[ U+005B 5B 5B
1 U+0031 31 31
m U+006D 6D 6D
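The defragmentation step can be modelled on plain bytes: when a control sequence sits between a DBCS lead byte and its trail byte, emit the two bytes together and move the control sequence after them. The sketch below is an assumption-laden illustration, not the BIG5-DEFRAG codec: it recognizes only simple CSI sequences (ESC, `[`, parameters, one letter) and treats 0x81 to 0xFE as lead bytes.

```python
import re

CSI = re.compile(rb"\x1b\[[0-9;]*[A-Za-z]")


def big5_defrag(data):
    """Move control sequences out of the middle of two-byte characters."""
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        if 0x81 <= byte <= 0xFE:                 # DBCS lead byte
            j = i + 1
            held = bytearray()
            while True:                          # collect wedged sequences
                m = CSI.match(data, j)
                if not m:
                    break
                held += m.group(0)
                j = m.end()
            out.append(byte)
            if j < len(data):                    # rejoin with the trail byte
                out.append(data[j])
                j += 1
            out += held                          # control sequence goes after
            i = j
        else:
            m = CSI.match(data, i)
            if m:
                out += m.group(0)
                i = m.end()
            else:
                out.append(byte)
                i += 1
    return bytes(out)


fragmented = b"\xA1\xB9\xC5\x1b[1m\xE5"
print(big5_defrag(fragmented))
```

After this step the text bytes form whole characters again, so an ordinary Big5 decoder can process them while the control sequence passes through untouched.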

Bug5 explained (3/6)
Internal data: [03 A1] [03 B9] [03 C5] [03 E5] [1B 5B 31 6D]
Encoder: ... : BYTE,PASS#MARK&FOR=1B
Internal data: A1 B9 C5 E5 [1B 5B 31 6D] (MARK)
★ U+2605 E29885 A1B9
[ U+005B 5B 5B
1 U+0031 31 31
m U+006D 6D 6D

Bug5 explained (4/6)
Internal data: A1 B9 C5 E5 [1B 5B 31 6D] (MARK)
Decoder: PASS#UNMARK,BIG5 : ...
Internal data: [01 26 05] [01 9A 5A] [1B 5B 31 6D]
★ U+2605 E29885 A1B9
[ U+005B 5B 5B
1 U+0031 31 31
m U+006D 6D 6D

Bug5 explained (5/6)
Internal data: [01 26 05] [01 9A 5A] [1B 5B 31 6D]
Inter-conversion: ... : AMBIGUOUS-PAD : ...
Internal data: [01 26 05] [01 A0] [01 9A 5A] [1B 5B 31 6D]
★ U+2605 E29885 A1B9
[ U+005B 5B 5B
1 U+0031 31 31
m U+006D 6D 6D
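The padding step shown here, and its inverse used in the Unicode to Big5 direction, can be sketched with `unicodedata.east_asian_width`. This is a model of the idea rather than the AMBIGUOUS-PAD and AMBIGUOUS-UNPAD codecs themselves; the function names are invented for this example.

```python
import unicodedata

NBSP = "\u00A0"


def is_ambiguous(ch):
    return unicodedata.east_asian_width(ch) == "A"


def ambiguous_pad(text):
    """Append NBSP after every ambiguous-width character."""
    out = []
    for ch in text:
        out.append(ch)
        if is_ambiguous(ch):
            out.append(NBSP)
    return "".join(out)


def ambiguous_unpad(text):
    """Drop one NBSP that directly follows an ambiguous-width character."""
    out = []
    after_ambiguous = False
    for ch in text:
        if ch == NBSP and after_ambiguous:
            after_ambiguous = False
            continue
        out.append(ch)
        after_ambiguous = is_ambiguous(ch)
    return "".join(out)


padded = ambiguous_pad("\u2605\u9A5A")           # black star + U+9A5A
print([f"U+{ord(c):04X}" for c in padded])
print(ambiguous_unpad(padded) == "\u2605\u9A5A")
```

With the pad in place, a terminal that draws the star one column wide still advances two columns in total, matching the two columns the Big5 side expects.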

Bug5 explained (6/6)
Internal data: [01 26 05] [01 A0] [01 9A 5A] [1B 5B 31 6D]
Encoder: ... : UTF-8,PASS#FOR=1B
Output (UTF-8 literal): ★ 驚\x1B[1m (the gap after ★ is the NBSP pad)
★ U+2605 E29885 A1B9
[ U+005B 5B 5B
1 U+0031 31 31
m U+006D 6D 6D

Bsdconv :: Bindings
Python/Ruby/Go/Perl/PHP
https://pypi.python.org/pypi/bsdconv
https://rubygems.org/gems/ruby-bsdconv
https://github.com/buganini/go-bsdconv
https://github.com/buganini/perl-bsdconv
https://github.com/buganini/php-bsdconv
PostgreSQL/MySQL
https://github.com/buganini/postgres-bsdconv
https://github.com/buganini/mysql-udf-bsdconv
Irssi
https://github.com/buganini/irssi-scripts/blob/master/irssi-bsdconv.pl

Bsdconv :: GUI
https://github.com/buganini/gbsdconv
Alternative to ConvertZ
Text
File name
File content
Meta tag

Thanks
ESCAPE,UTF-8:PASS#FOR=UNICODE&MARK,BYTE|PASS#UNMARK,UTF-8:NFC:ASCII,ESCAPE|
https://github.com/buganini/bsdconv
