BSDCONV
Buganini Q
Since 2009
Charset & Encoding
Character Set
Collection of characters
Encoding
Binary representation
Charset & Encoding
Figure: Character Sets
  Nested, outermost first: Unicode (32-bit address space) > Unicode up to U+10FFFF > Unicode BMP (Basic Multilingual Plane, up to U+FFFF)
  Also shown: GB18030, CNS11643, CP950, Latin1
Charset & Encoding
Character set                     Encoding
Unicode (32-bit address space)    UTF-32 / UCS4
Unicode up to U+10FFFF            UTF-8 [1] / UTF-16
Unicode BMP (up to U+FFFF)        UCS2
GB18030                           GB18030
CNS11643                          CNS11643
CP950                             CP950 (DBCS)
Latin1                            ISO-8859-1 / EBCDIC-037 [2]
[1] Could cover more but restricted by RFC 3629
[2] Aka. IBM-37; some control characters are different from ISO-8859-1
Encoding :: UTF-32 / UCS4
Fixed Length
4 bytes
Filesize *= 4 for ASCII text file
Incompatible with C-style string convention
Endianness concern
Encoding :: UCS2
Fixed Length
2 bytes
Filesize *= 2 for ASCII text file
Incompatible with C-style string convention
Endianness concern
BMP-only
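The size and C-string points for the fixed-length encodings can be checked with Python's built-in codecs. This is a minimal sketch; the sample text is arbitrary, and `utf-16-le` stands in for UCS2, which is byte-identical for BMP-only text.

```python
text = "hello"  # pure ASCII sample

utf8 = text.encode("utf-8")
ucs2 = text.encode("utf-16-le")   # same bytes as UCS2 for BMP-only text
utf32 = text.encode("utf-32-le")  # fixed 4 bytes per code point

print(len(utf8), len(ucs2), len(utf32))  # 5 10 20

# Embedded zero bytes are what break NUL-terminated (C-style) strings.
print(b"\x00" in utf8, b"\x00" in ucs2, b"\x00" in utf32)  # False True True

# Endianness changes the byte order inside each code unit.
print("A".encode("utf-16-le").hex(), "A".encode("utf-16-be").hex())  # 4100 0041
```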
Encoding :: UTF-16
Variable Length
2 bytes / 4 bytes (Surrogate pairs)
Surrogates
Using U+D800..U+DFFF
Incompatible with C-style string convention
Endianness concern
******** ********
110110** ******** 110111** ********
Table: UTF-16 Structure
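The surrogate layout in the table can be sketched in a few lines of Python. The helper name `to_surrogates` is made up for this illustration; the result is compared against Python's own UTF-16 codec.

```python
def to_surrogates(cp):
    """Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    v = cp - 0x10000             # 20 payload bits remain
    high = 0xD800 | (v >> 10)    # 110110** ******** carries the top 10 bits
    low = 0xDC00 | (v & 0x3FF)   # 110111** ******** carries the low 10 bits
    return high, low

high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))  # 0xd83d 0xde00

# Cross-check against the built-in codec.
print("\U0001F600".encode("utf-16-be").hex())  # d83dde00
```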
Encoding :: UTF-8
Variable Length
1~6 bytes (at most 4 under RFC 3629)
Compatible with C-style string convention
Self-synchronizing
Endian-neutral
Sorting order = Code point order
0******* (ASCII)
110***** 10******
1110**** 10****** 10******
11110*** 10****** 10****** 10******
111110** 10****** 10****** 10****** 10******
1111110* 10****** 10****** 10****** 10****** 10******
Table: UTF-8 Structure
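The bit layout in the table translates directly into an encoder. The sketch below follows the original 1 to 6 byte scheme shown above; `utf8_encode` is a name chosen for this example, and Python's own codec only accepts the range up to U+10FFFF.

```python
def utf8_encode(cp):
    """Encode one code point using the 1 to 6 byte layout from the table."""
    if cp < 0x80:
        return bytes([cp])  # 0******* (ASCII)
    # (exclusive upper limit, total bytes, lead-byte prefix) per table row
    rows = ((0x800, 2, 0xC0), (0x10000, 3, 0xE0), (0x200000, 4, 0xF0),
            (0x4000000, 5, 0xF8), (0x80000000, 6, 0xFC))
    for limit, n, prefix in rows:
        if cp < limit:
            lead = prefix | (cp >> (6 * (n - 1)))
            # Continuation bytes: 10****** with 6 payload bits each.
            tail = [0x80 | ((cp >> (6 * i)) & 0x3F) for i in range(n - 2, -1, -1)]
            return bytes([lead] + tail)
    raise ValueError("code point out of range")

print(utf8_encode(0x840C).hex())  # e8908c, the character 萌

# Byte-wise sorting of the encoded forms matches code point order.
cps = [0x1F600, 0x41, 0x840C, 0xE9]
print(sorted(cps) == sorted(cps, key=utf8_encode))  # True
```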
Encoding :: CNS11643 (全字庫) #issue
http://www.cns11643.gov.tw/
Used only by the Taiwan government
NOT a subset of Unicode
Not just a charset/encoding
Font
Pronunciation ㄇㄥ ˊ / méng
Radical 艸
Component 艹日月
Stroke
Tra/Sim mapping 萌蕄
Table: Examples for some information provided by 全字庫 for「萌」
Encoding :: CCCII
Variants
Variant glyph at different plane
Mostly used for library indexing
強 21 3D 48
彊 2D 3D 48
强 33 3D 48
Encoding :: Big5
Many incompatible variations (abusing the PUA); no standard
tool handles them all
http://moztw.org/docs/big5/
Scenario      Dominant encoding
Microsoft     CP950
Taiwan BBS    UAO (Unicode-at-Once)
gov.tw        Big5-2003
gov.hk        HKSCS (1999, 2001, 2004)
Special characters conflict
The second byte could be 0x5C (\), 0x7C (|), 0x7E (~), which
may have special meaning in certain contexts
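The 0x5C conflict is easy to observe with Python's built-in `big5` codec. This is only an illustration of the byte values, using the three characters commonly cited for the problem; it does not involve bsdconv.

```python
# Each of these characters has 0x5C (backslash) as its second Big5 byte.
for ch in "許功蓋":
    raw = ch.encode("big5")
    print(ch, raw.hex(), raw[1] == 0x5C)

encoded = "成功".encode("big5")
print(encoded)            # b'\xa6\xa8\xa5\\'

# A byte-oriented tool that treats backslash as an escape character
# sees one at the end of this perfectly ordinary word.
print(b"\\" in encoded)   # True
```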
Bsdconv :: Decoding and Encoding
Alternative to iconv
ISO-8859-1:UTF-8   (from:to)
Figure: Basic two-phase conversion
Bsdconv :: Codecs & Fallback
Optionally produce question mark (U+003F) as replacement
UTF-8,3F:ASCII,3F   (from:to)
Figure: Fallback codec
Transliteration
UTF-8:CP936,CP936-TRANS,3F   (from:to)
Figure: Multiple fallback codecs
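The idea of a comma-separated fallback chain can be mimicked in plain Python: each character is offered to the codecs in order, and the first one that accepts it wins. This is a conceptual sketch, not bsdconv's implementation; the function names and the tiny `TRANSLIT` table are hypothetical.

```python
def encode_with_fallbacks(text, codecs):
    """Encode character by character, trying each codec in order."""
    out = bytearray()
    for ch in text:
        for codec in codecs:
            encoded = codec(ch)
            if encoded is not None:
                out += encoded
                break
        else:
            raise ValueError(f"no codec accepts {ch!r}")
    return bytes(out)

def ascii_codec(ch):
    return ch.encode("ascii") if ord(ch) < 0x80 else None

# Hypothetical transliteration table, for illustration only.
TRANSLIT = {"é": b"e", "ï": b"i"}

def translit_codec(ch):
    return TRANSLIT.get(ch)

def question_mark(ch):
    return b"?"  # U+003F, accepts anything

result = encode_with_fallbacks("naïve café 萌",
                               [ascii_codec, translit_codec, question_mark])
print(result)  # b'naive cafe ?'
```

Python's own `str.encode("ascii", errors="replace")` covers the single question-mark fallback; the chain above adds the intermediate transliteration step.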
Big5 5C issue (許功蓋)
BIG5:BIG5-5C,BIG5
#              Input               Output
Big5 Literal   "成功"              "成功\"
ASCII/Hex      "\xA6\xA8\xA5\"     "\xA6\xA8\xA5\\"
BIG5-5C,BIG5:BIG5
#              Input               Output
Big5 Literal   "成功\"             "成功"
ASCII/Hex      "\xA6\xA8\xA5\\"    "\xA6\xA8\xA5\"
Traditional/Simplified Chinese
NOT one-to-one mapping
Traditional 乾幹干
vs.
Simplified 干干干
Context dependent
之後、夜之后、入夜之後
Variants
峰、峯
Project Chvar (1/2)
https://github.com/buganini/chvar
Compatibility group { Canonical group { 签 簽 }  Canonical group { 籖 籤 } }
Figure: Two level grouping in Chvar
          签    簽    籖    籤
TW        簽    -     籤    -
CN        -     签    -     籖
CP950     簽    -     籤    -
GB2312    -     签    ×     ×
Table: Canonical Group
          签    簽    籖    籤
TW        簽    -     簽    簽
CN        -     签    签    签
CP950     簽    -     簽    簽
GB2312    -     签    签    签
Table: Compatibility Group
Project Chvar (2/2)
https://github.com/buganini/chvar
Normalization
Canonical Equivalence
Transliteration
Converted
or Canonical Equivalence
or Compatibility Equivalence
Fuzzy character matching
Compatibility Equivalence
Bsdconv :: Phases
Traditional Chinese ⇔ Simplified Chinese
UTF-8:ZHTW:UTF-8   (from:inter:to)
Figure: Conversion with inter-mapping phase
Bsdconv :: Phases
Furthermore, phrase mapping
UTF-8:ZHTW:ZHTW-WORDS:UTF-8   (from:inter:inter:to)
Figure: Conversion with multiple inter-mapping phases
Unicode :: Casing
IS complicated
Lowercase Uppercase
a A
i I
Table: English
Lowercase Uppercase
ı I
i İ
Table: Turkic
Lowercase Uppercase
a A
à A
Table: French
Lowercase Uppercase
σ Σ
ς Σ
Table: Greek
Default Case Folding
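Python's string methods implement the language-neutral Unicode case mappings, which makes the irregularities in the tables above easy to see. A short sketch; note that Python applies no Turkic tailoring, so the Turkic rows are shown only through their default behaviour.

```python
# Greek: two lowercase sigmas share one uppercase form.
print("σ".upper(), "ς".upper())                # Σ Σ
print("ς".casefold() == "σ".casefold())        # True

# Default case folding can change the length of a string.
print("ß".casefold())                          # ss

# Without Turkic tailoring, dotless ı uppercases to plain I,
# and dotted İ lowercases to i followed by U+0307 (combining dot above).
print("ı".upper())                             # I
print([hex(ord(c)) for c in "İ".lower()])      # ['0x69', '0x307']
```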
Unicode :: Normalization Forms (1/2)
UAX#15
Indexing
Identification security
Username, Domain name
Combining sequence Ç C + ◌̧
Ordering of combining marks q+◌̇+◌̣ q+◌̣+◌̇
Hangul 가 ᄀ + ᅡ
Singleton Ω Ω
Table: Canonical Equivalence
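Each row of the canonical equivalence table can be reproduced with the standard `unicodedata` module. The code points are written as escapes so the example does not depend on how the source file itself is normalized.

```python
import unicodedata as ud

# Combining sequence: C + cedilla composes to a single code point.
print(ud.normalize("NFC", "C\u0327") == "\u00C7")                # True

# Ordering of combining marks: dot below (class 220) sorts before dot above (class 230).
print(ud.normalize("NFD", "q\u0307\u0323") == "q\u0323\u0307")   # True

# Hangul: the syllable U+AC00 decomposes into two jamo.
print(ud.normalize("NFD", "\uAC00") == "\u1100\u1161")           # True

# Singleton: OHM SIGN maps to GREEK CAPITAL LETTER OMEGA.
print(ud.normalize("NFC", "\u2126") == "\u03A9")                 # True
```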
Unicode :: Normalization Forms (2/2)
UAX#15
Font variants ℌ H
Breaking differences NBSP SP
Cursive forms ‫ﻧ‬ ‫ﻨ‬
Circled ① 1
Width, size, rotated
カ カ
︷ {
Superscripts/subscripts ⁹ 9
Squared characters ㍿ 株 + 式 + 会 + 社
Fractions ¾ 3 + / + 4
Others dž d + z + ◌̌
Table: Compatibility Equivalence
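The compatibility mappings can be checked the same way with `unicodedata`. In this sketch the dictionary pairs a few characters from the table with their expected NFKC form; canonical normalization (NFC) leaves all of them untouched.

```python
import unicodedata as ud

expected = {
    "\u210C": "H",            # font variant
    "\u00A0": " ",            # no-break space
    "\u2460": "1",            # circled digit
    "\uFF76": "\u30AB",       # halfwidth katakana KA to regular KA
    "\uFE37": "{",            # vertical presentation form
    "\u2079": "9",            # superscript
    "\u337F": "株式会社",      # squared characters
    "\u00BE": "3\u20444",     # fraction, joined by U+2044 FRACTION SLASH
}

for source, target in expected.items():
    nfkc = ud.normalize("NFKC", source)
    nfc = ud.normalize("NFC", source)
    print(f"U+{ord(source):04X}", repr(nfkc), nfkc == target, nfc == source)
```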
Normalization for fuzzy matching
UTF-8:UPPER:UTF-8
Input: aăⅷDžбⓐᾥ
Output: AĂⅧDŽБⒶᾭ
UTF-8:ZH-FUZZY-TW:KANA-PHONETIC:NFKD-CASEFOLD:UTF-8
Input: ¼ℌℍăDžⓐ⁹ 灣湾ド鬒鬒㊣ æß
Output: 1⁄4hhădža9 灣灣 do 鬒鬒正 æss
Composition Decomposition
Canonical NFC NFD
Compatibility NFKC NFKD
Table: The four Unicode normalization forms and the transformations
Bsdconv :: Cascade
Re-encode
UTF-8:ISO-8859-1|BIG5:UTF-8   (from:to | from:to)
Figure: Cascaded conversions
Input    Output
¥x¥_     台北
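The same re-encode repair can be expressed with Python's built-in codecs: the garbled text is turned back into the bytes it came from via ISO-8859-1, and those bytes are then decoded as Big5. A sketch of the idea rather than of bsdconv itself.

```python
garbled = "\u00A5x\u00A5_"   # the mojibake shown in the table: ¥x¥_

# Step 1 (UTF-8:ISO-8859-1): recover the original byte values.
raw = garbled.encode("iso-8859-1")
print(raw.hex())             # a578a55f

# Step 2 (BIG5:UTF-8): interpret those bytes as Big5.
fixed = raw.decode("big5")
print(fixed)                 # 台北
```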
Bsdconv :: Codec argument
Other than question mark
UTF-8,ANY#0121:ASCII,ANY#21   (from:to)
Figure: Codec argument
Or more than one character
UTF-8,ANY#013F.0121:ASCII,ANY#21   (from:to)
Figure: Data list, separated by dot
Bsdconv :: Alias
from/3F                ANY#013F&ERROR
to/3F                  ANY#3F&ERROR
from/UTF-8             ASCII,_UTF-8
inter/NFKD             _NFKD:_NF-HANGUL-DECOMPOSITION:_NF-ORDER
inter/NFKC             NFKD:_NFC:_NF-HANGUL-COMPOSITION
inter/NFKD-CASEFOLD    NFD:CASEFOLD:NFKD:CASEFOLD:NFKD
filter/01              UNICODE
Bsdconv :: Types
(01) Unicode
(02) CNS11643
(03) Byte
(04) Chinese components
(1B) ANSI control sequences
(00) Bsdconv special characters
Chinese components composition
https://github.com/buganini/chicomp
UTF-8:ZH-DECOMP:ZH-COMP:UTF-8
Input: 功夫不好不要艹我
Output: 巭孬嫑莪
UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:UTF-8
Input: 功夫不好不要艹我
Output: ㄆㄨ ㄋㄠ ㄧㄠ ㄜ ˊ
UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:HAN-PINYIN:UTF-8
Input: 功夫不好不要艹我
Output: pu nao yao [uh]2
Bsdconv :: Flags
FREE - memory management
MARK - identifier
Look-through (1/4)
Input (UTF-8 literal):  %u03B1%CE%B2
Decoder:                ESCAPE : ...
Internal data:          [01 03 B1] [03 CE] [03 B2]
Entity   Unicode   UTF-8 Hex
α        U+03B1    CEB1
β        U+03B2    CEB2
Figure: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8
Look-through (2/4)
Internal data (in):     [01 03 B1] [03 CE] [03 B2]
Encoder:                ... : PASS#MARK&FOR=1,BYTE
Internal data (out):    [01 03 B1] MARK CE B2
Figure: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8
Look-through (3/4)
Internal data (in):     [01 03 B1] MARK CE B2
Decoder:                PASS#UNMARK,UTF-8 : ...
Internal data (out):    [01 03 B1] [01 03 B2]
Figure: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8
Look-through (4/4)
Internal data (in):     [01 03 B1] [01 03 B2]
Encoder:                ... : UTF-8
Internal data (out):    CE B1 ("α")  CE B2 ("β")
Output (UTF-8 literal): αβ
Figure: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8
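The effect of the look-through conversion (already-decoded %uXXXX escapes pass through untouched while %XX byte escapes are collected and decoded as UTF-8) can be approximated in plain Python. This is a conceptual sketch only; `unescape` and its regular expression are assumptions made for this example and say nothing about how bsdconv is implemented.

```python
import re

# %uXXXX is a Unicode escape, %XX is a raw byte, anything else is literal.
TOKEN = re.compile(r"%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})|(.)", re.S)

def unescape(text):
    """Decode %uXXXX as code points and runs of %XX as UTF-8 bytes."""
    out = []
    pending = bytearray()  # raw bytes still waiting for UTF-8 decoding

    def flush():
        if pending:
            out.append(pending.decode("utf-8"))
            pending.clear()

    for m in TOKEN.finditer(text):
        if m.group(2) is not None:
            pending.append(int(m.group(2), 16))   # byte entry, decode later
        else:
            flush()
            if m.group(1) is not None:
                out.append(chr(int(m.group(1), 16)))  # already Unicode, passes through
            else:
                out.append(m.group(3))
    flush()
    return "".join(out)

print(unescape("%u03B1%CE%B2"))  # αβ
```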
Unicode :: East Asian Width (1/2)
UAX#11
Figure: Venn Diagram Showing the Set Relations for Six Categories (Narrow, Halfwidth, Wide, Fullwidth, Ambiguous, Neutral)
Unicode :: East Asian Width (2/2)
UAX#11
        Narrow    Ambiguous    Wide
                  Я
N       ऊ
Na      A                      Ａ    F
H       ｶ                      カ    W
                               咦    W
Table: Examples for Each Character Class and Their Resolved Widths
Na   Narrow
N    Neutral, usually treated as Narrow
W    Wide
F    Fullwidth
H    Halfwidth
Table: Width attributes
String width measurement
echo "42(ˊ_>ˋ) 紅茶" | bsdconv UTF-8:WIDTH:NULL
FULL: 2
HALF: 7
AMBI: 2
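The same measurement can be approximated with the East Asian Width property exposed by Python's `unicodedata` module. In this sketch the function name `measure` and the decision to count Neutral, Narrow and Halfwidth together as HALF are assumptions; the trailing newline that `echo` adds is left out.

```python
import unicodedata

def measure(text):
    """Count characters by resolved East Asian Width class."""
    counts = {"FULL": 0, "HALF": 0, "AMBI": 0}
    for ch in text:
        width = unicodedata.east_asian_width(ch)
        if width in ("W", "F"):
            counts["FULL"] += 1
        elif width == "A":
            counts["AMBI"] += 1
        else:  # Na, H and N are all treated as narrow here
            counts["HALF"] += 1
    return counts

# Same text as the command above, written with escapes for the non-ASCII parts.
sample = "42(\u02ca_>\u02cb) \u7d05\u8336"
print(measure(sample))  # {'FULL': 2, 'HALF': 7, 'AMBI': 2}
```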
Chinese charset encoding detection
https://github.com/buganini/chiconv
ENCODING:SCORE#WITH=CJK:COUNT:ZH-BONUS:ZHTW:ZH-BONUS-PHRASE:NULL
Score(s) = ($SCORE − $IERR ∗ $COUNT ∗ 0.01) / $COUNT
帥呆了 ⇒ UTF-8:SCORE#WITH=CJK:……
ENCODING SCORE COUNT IERR Score(s)
UTF-8 19 4 0 4.75
BIG5 8 3 2 -4.0
GBK 4 1 4 -36.0
CCCII 36 9 0 4.0
UTF-16LE 20 5 2 0.0
Khmer legacy font converter
https://github.com/buganini/khmerconv
Issues
Encoding without registered name, bound to fonts
Stored in CP1252 or UTF-8
Solution
Two-pass detection
Detect encoding
Detect font family (currently not working)
(High coverage in SBCS)
Algorithm ported from Khmer Converter [3]
Khmer Converter
Mapping
Reordering
Visual order vs. Unicode model
Unicode Model: baseCharacter [+ [Robat/Shifter] + [Coeng*]
+ [Shifter] + [Vowel] + [Sign]]
[3] http://www.khmeros.info/en/khmer-converter
Terminal transcoding
https://github.com/buganini/bug5
Issues
UAO: Non-standard Big5 extension
Double color hack
ANSI control sequence in the middle of DBCS
Ambiguous width characters
luit/screen cannot help
Solution (tl;dr)
Big5 to Unicode
ANSI-CONTROL,BYTE:BIG5-DEFRAG:
BYTE,PASS#MARK&FOR=1B|
PASS#UNMARK,BIG5:AMBIGUOUS-PAD:
UTF-8,PASS#FOR=1B
Unicode to Big5
UTF-8,00,BYTE:ZHTW:AMBIGUOUS-UNPAD:
BIG5,CP950-TRANS,UAO,00,ANY#3F
Bug5 explained (1/6)
Input (Big5 literal):   ★\xC5\x1B[1m\xE5
Decoder:                ANSI-CONTROL,BYTE : ...
Internal data:          [03 A1] [03 B9] [03 C5] [1B 5B 31 6D] [03 E5]
Entity   Unicode   UTF-8 Hex   Big5 Hex
★        U+2605    E29885      A1B9
驚       U+9A5A    E9A99A      C5E5
[        U+005B    5B          5B
1        U+0031    31          31
m        U+006D    6D          6D
(NBSP)   U+00A0    C2A0        -
Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|
PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
Bug5 explained (2/6)
Internal data (in):     [03 A1] [03 B9] [03 C5] [1B 5B 31 6D] [03 E5]
Inter-conversion:       ... : BIG5-DEFRAG : ...
Internal data (out):    [03 A1] [03 B9] [03 C5] [03 E5] [1B 5B 31 6D]
Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|
PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
Bug5 explained (3/6)
Internal data (in):     [03 A1] [03 B9] [03 C5] [03 E5] [1B 5B 31 6D]
Encoder:                ... : BYTE,PASS#MARK&FOR=1B
Internal data (out):    A1 B9 C5 E5 [1B 5B 31 6D MARK]
Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|
PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
Bug5 explained (4/6)
Internal data (in):     A1 B9 C5 E5 [1B 5B 31 6D MARK]
Decoder:                PASS#UNMARK,BIG5 : ...
Internal data (out):    [01 26 05] [01 9A 5A] [1B 5B 31 6D]
Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|
PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
Bug5 explained (5/6)
Internal data (in):     [01 26 05] [01 9A 5A] [1B 5B 31 6D]
Inter-conversion:       ... : AMBIGUOUS-PAD : ...
Internal data (out):    [01 26 05] [01 A0] [01 9A 5A] [1B 5B 31 6D]
Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|
PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
Bug5 explained (6/6)
Internal data (in):     [01 26 05] [01 A0] [01 9A 5A] [1B 5B 31 6D]
Encoder:                ... : UTF-8,PASS#FOR=1B
Output (UTF-8 literal): ★(NBSP)驚\x1B[1m
Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|
PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
Bsdconv :: Bindings
Python/Ruby/Go/Perl/PHP
https://pypi.python.org/pypi/bsdconv
https://rubygems.org/gems/ruby-bsdconv
https://github.com/buganini/go-bsdconv
https://github.com/buganini/perl-bsdconv
https://github.com/buganini/php-bsdconv
PostgreSQL/MySQL
https://github.com/buganini/postgres-bsdconv
https://github.com/buganini/mysql-udf-bsdconv
Irssi
https://github.com/buganini/irssi-scripts/blob/master/irssi-bsdconv.pl
Bsdconv :: GUI
https://github.com/buganini/gbsdconv
Alternative to ConvertZ
Text
File name
File content
Meta tag
Thanks
ESCAPE,UTF-8:PASS#FOR=UNICODE&MARK,BYTE|PASS#UNMARK,UTF-8:NFC:ASCII,ESCAPE|
https://github.com/buganini/bsdconv

Journey of Bsdconv

  • 2. Charset & Encoding
       Character Set: collection of characters
       Encoding: binary representation
  • 3. Charset & Encoding
       Figure: Character Sets. Sets shown: Unicode (32-bit address space), Unicode up to U+10FFFF, Unicode BMP (Basic Multilingual Plane, up to U+FFFF), GB18030, CNS11643, CP950, Latin1
  • 4. Charset & Encoding
       Character set                   Encoding
       Unicode (32-bit addr. space)    UTF-32 / UCS4
       Unicode up to U+10FFFF          UTF-8 [1] / UTF-16
       Unicode BMP (up to U+FFFF)      UCS2
       GB18030                         GB18030
       CNS11643                        CNS11643
       CP950                           CP950 (DBCS)
       Latin1                          ISO-8859-1 / EBCDIC-037 [2]
       [1] Could cover more but restricted by RFC 3629
       [2] Aka IBM-37; some control characters differ from ISO-8859-1
  • 5. Encoding :: UTF-32 / UCS4
       Fixed length: 4 bytes
       File size ×4 for an ASCII text file
       Incompatible with C-style string convention
       Endianness concern
  • 6. Encoding :: UCS2
       Fixed length: 2 bytes
       File size ×2 for an ASCII text file
       Incompatible with C-style string convention
       Endianness concern
       BMP-only
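The size and byte-order points on slides 5 and 6 can be checked with a few lines of Python. This is only an illustration using Python's built-in codecs, not bsdconv itself; the sample string is arbitrary.

```python
text = "bsdconv"  # 7 ASCII characters, arbitrary sample

utf8 = text.encode("utf-8")
ucs2_le = text.encode("utf-16-le")   # identical to UCS2 for BMP-only text
utf32_le = text.encode("utf-32-le")
utf32_be = text.encode("utf-32-be")

# File size grows by x2 and x4 for ASCII-only content.
print(len(utf8), len(ucs2_le), len(utf32_le))

# Embedded zero bytes are why these encodings clash with
# NUL-terminated C strings.
print(b"\x00" in utf8, b"\x00" in ucs2_le, b"\x00" in utf32_le)

# Same character, different byte order: the endianness concern.
print(utf32_le[:4].hex(), utf32_be[:4].hex())
```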
  • 7. Encoding :: UTF-16
       Variable length: 2 bytes / 4 bytes (surrogate pairs)
       Surrogates use U+D800..U+DFFF
       Incompatible with C-style string convention
       Endianness concern
       ******** ********                      (BMP code point, 2 bytes)
       110110** ******** 110111** ********    (surrogate pair, 4 bytes)
       Table: UTF-16 Structure
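The bit layout in the UTF-16 table can be sketched as follows. The function name `to_surrogates` is made up for this illustration; the result is cross-checked against Python's own UTF-16 codec.

```python
from typing import Tuple


def to_surrogates(cp: int) -> Tuple[int, int]:
    """Split a code point above the BMP into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    v = cp - 0x10000            # 20 payload bits
    high = 0xD800 | (v >> 10)   # 110110** ********  (upper 10 bits)
    low = 0xDC00 | (v & 0x3FF)  # 110111** ********  (lower 10 bits)
    return high, low


high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))
```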
  • 8. Encoding :: UTF-8
       Variable length: 1~6 bytes in the original design (RFC 3629 restricts it to 1~4 bytes)
       Compatible with C-style string convention
       Self-synchronizing
       Endian-neutral
       Byte-wise sorting order = code point order
       0******* (ASCII)
       110***** 10******
       1110**** 10****** 10******
       11110*** 10****** 10****** 10******
       111110** 10****** 10****** 10****** 10******
       1111110* 10****** 10****** 10****** 10****** 10******
       Table: UTF-8 Structure
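A hand-rolled encoder makes the first four rows of the UTF-8 table concrete. This is a sketch limited to the RFC 3629 range (up to U+10FFFF); `utf8_encode` is a name chosen for the example, and it does not reject surrogate code points the way a production encoder would.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point following the UTF-8 bit patterns."""
    if cp < 0x80:                      # 0*******
        return bytes([cp])
    if cp < 0x800:                     # 110***** 10******
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 1110**** 10****** 10******
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:                 # 11110*** 10****** 10****** 10******
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("beyond U+10FFFF (not allowed by RFC 3629)")


print(utf8_encode(0x03B1).hex())   # Greek alpha
print(utf8_encode(ord("萌")).hex())

# Byte-wise order matches code point order.
words = ["萌", "a", "β", "Z", "\U0001F600"]
print(sorted(words) == sorted(words, key=lambda w: w.encode("utf-8")))
```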
  • 9. Encoding :: CNS11643 (全字庫)
       http://www.cns11643.gov.tw/
       Only used by Taiwan government
       NOT a subset of Unicode
       Not just a charset/encoding
       Font
       Pronunciation     ㄇㄥˊ / méng
       Radical           艸
       Component         艹日月
       Stroke
       Tra/Sim mapping   萌蕄
       Table: Examples of information provided by 全字庫 for「萌」
  • 10. Encoding :: CCCII
       Variants: variant glyphs at different planes
       Mostly used for library indexing
       強   21 3D 48
       彊   2D 3D 48
       强   33 3D 48
  • 11. Encoding :: Big5
       Many incompatible variations (abusing PUA); no standard tool handles them all
       http://moztw.org/docs/big5/
       Scenario      Dominating encoding
       Microsoft     CP950
       Taiwan BBS    UAO (Unicode-at-Once)
       gov.tw        Big5-2003
       gov.hk        HKSCS (1999, 2001, 2004)
       Special characters conflict: the second byte could be 0x5C (\), 0x7C (|), or 0x7E (~), which may have special meaning in certain contexts
  • 12. Bsdconv :: Decoding and Encoding
       Alternative to iconv
       ISO-8859-1:UTF-8    (from:to)
       Figure: Basic two-phase conversion
  • 13. Bsdconv :: Codecs & Fallback
       Optionally produce question mark (U+003F) as replacement
       UTF-8,3F:ASCII,3F    (from:to)
       Figure: Fallback codec
       Transliteration
       UTF-8:CP936,CP936-TRANS,3F    (from:to)
       Figure: Multiple fallback codecs
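The idea of a comma-separated fallback chain can be imitated in Python for illustration. This is not how bsdconv is implemented; `encode_with_fallback` is a hypothetical helper that tries each codec in order for every character and emits a replacement byte (0x3F, the question mark) when none of them succeeds.

```python
def encode_with_fallback(text, codecs, replacement=b"?"):
    """Encode character by character, trying codecs left to right."""
    out = bytearray()
    for ch in text:
        for name in codecs:
            try:
                out += ch.encode(name)
                break                      # first codec that works wins
            except UnicodeEncodeError:
                continue                   # fall through to the next codec
        else:
            out += replacement             # nothing matched: emit 0x3F
    return bytes(out)


print(encode_with_fallback("aé萌", ["ascii"]))
print(encode_with_fallback("aé萌", ["ascii", "latin-1"]))
```

With only `ascii` in the chain, both non-ASCII characters fall back to `?`; adding `latin-1` rescues `é` and leaves only `萌` replaced.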
  • 15. Big5 5C issue (許功蓋)
       BIG5:BIG5-5C,BIG5
                       Input                 Output
       Big5 literal    "成功"                "成功\"
       ASCII/Hex       "\xA6\xA8\xA5\\"      "\xA6\xA8\xA5\\\\"
       BIG5-5C,BIG5:BIG5
                       Input                 Output
       Big5 literal    "成功\"               "成功"
       ASCII/Hex       "\xA6\xA8\xA5\\\\"    "\xA6\xA8\xA5\\"
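The 5C problem is easy to reproduce with Python's built-in `big5` codec (used here only as a stand-in for whichever Big5 variant is in play). Each character of 許功蓋 has 0x5C, the ASCII backslash, as its second byte, so a byte-oriented tool that escapes backslashes changes the text.

```python
name = "許功蓋"
raw = name.encode("big5")

# Second byte of every double-byte character.
trail_bytes = [raw[i + 1] for i in range(0, len(raw), 2)]
print([hex(b) for b in trail_bytes])


def escape_backslashes(data: bytes) -> bytes:
    """What a naive, encoding-unaware escaper does to the byte stream."""
    return data.replace(b"\\", b"\\\\")


damaged = escape_backslashes(raw)
# Decoding the escaped bytes as Big5 yields stray backslashes.
print(damaged.decode("big5", errors="replace"))
```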
  • 16. Traditional/Simplified Chinese
       NOT a one-to-one mapping: Traditional 乾幹干 vs. Simplified 干干干
       Context dependent: 之後、夜之后、入夜之後
       Variants: 峰、峯
  • 17. Project Chvar (1/2)
       https://github.com/buganini/chvar
       Figure: Two-level grouping in Chvar. Canonical group {签, 簽} and canonical group {籖, 籤} sit inside one compatibility group.
                 签   簽   籖   籤
       TW        簽   -    籤   -
       CN        -    签   -    籖
       CP950     簽   -    籤   -
       GB2312    -    签   ×    ×
       Table: Canonical Group
                 签   簽   籖   籤
       TW        簽   -    簽   簽
       CN        -    签   签   签
       CP950     簽   -    簽   簽
       GB2312    -    签   签   签
       Table: Compatibility Group
  • 18. Project Chvar (2/2)
       https://github.com/buganini/chvar
       Normalization: Canonical Equivalence
       Transliteration: Converted, or Canonical Equivalence, or Compatibility Equivalence
       Fuzzy character matching: Compatibility Equivalence
       (Tables: Canonical Group and Compatibility Group, same as on slide 17)
  • 19. Bsdconv :: Phases
       Traditional Chinese ⇔ Simplified Chinese
       UTF-8:ZHTW:UTF-8    (from:inter:to)
       Figure: Conversion with inter-mapping phase
  • 20. Bsdconv :: Phases
       Furthermore, phrase mapping
       UTF-8:ZHTW:ZHTW-WORDS:UTF-8    (from:inter:inter:to)
       Figure: Conversion with multiple inter-mapping phases
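Why a phrase table is needed on top of a character table can be shown with the 后/後 example from slide 16. The sketch below is a toy greedy longest-match converter; `PHRASES` and `CHARS` are tiny hypothetical tables built from the slides' examples, not the data shipped with bsdconv, and the algorithm is an assumption rather than the ZHTW-WORDS implementation.

```python
# Toy tables (hypothetical): Simplified -> Traditional.
PHRASES = {
    "入夜之后": "入夜之後",   # "after nightfall"
    "夜之后": "夜之后",       # "queen of the night": 后 must stay
    "之后": "之後",           # "after"
}
CHARS = {"签": "簽"}


def convert(text: str) -> str:
    """Greedy longest-match phrase mapping, then per-character mapping."""
    out = []
    i = 0
    max_len = max(len(p) for p in PHRASES)
    while i < len(text):
        # Try the longest candidate phrase first.
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in PHRASES:
                out.append(PHRASES[chunk])
                i += n
                break
        else:
            out.append(CHARS.get(text[i], text[i]))
            i += 1
    return "".join(out)


for sample in ["之后", "夜之后", "入夜之后"]:
    print(sample, "->", convert(sample))
```

Without the four-character entry, 入夜之后 would be split as 入 + 夜之后 and keep the wrong character, which is the context dependence the slide points out.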
  • 21. Unicode :: Casing
       IS complicated
       Lowercase   Uppercase
       a           A
       i           I
       Table: English
       Lowercase   Uppercase
       ı           I
       i           İ
       Table: Turkic
       Lowercase   Uppercase
       a           A
       à           A
       Table: French
       Lowercase   Uppercase
       σ           Σ
       ς           Σ
       Table: Greek
       Default Case Folding
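Some of these casing rules can be observed through Python's locale-independent string methods. This only illustrates the Unicode default mappings; Turkic tailoring (i ↔ İ) is not applied by Python, which is itself a demonstration of why casing is complicated.

```python
# Greek: two lowercase sigmas share one uppercase form.
print("σ".upper(), "ς".upper())

# Lowercasing is context sensitive: word-final Σ becomes ς.
print("ΑΣ".lower())

# Default (non-Turkic) mappings: dotless ı uppercases to plain I,
# and dotted İ lowercases to "i" plus a combining dot above.
print("ı".upper(), [hex(ord(c)) for c in "İ".lower()])

# Default case folding, meant for caseless comparison.
print("ß".casefold(), "ς".casefold())
```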
  • 22. Unicode :: Normalization Forms (1/2)
       UAX#15
       Indexing
       Identification security: username, domain name
       Combining sequence            Ç ⇔ C + ◌̧
       Ordering of combining marks   q + ◌̇ + ◌̣ ⇔ q + ◌̣ + ◌̇
       Hangul                        가 ⇔ ᄀ + ᅡ
       Singleton                     Ω (U+2126 OHM SIGN) ⇔ Ω (U+03A9)
       Table: Canonical Equivalence
  • 23. Unicode :: Normalization Forms (2/2)
       UAX#15
       Font variants             ℌ → H
       Breaking differences      NBSP → SP
       Cursive forms             ﻧ, ﻨ
       Circled                   ① → 1
       Width, size, rotated      ｶ → カ, ︷ → {
       Superscripts/subscripts   ⁹ → 9
       Squared characters        ㍿ → 株 + 式 + 会 + 社
       Fractions                 ¾ → 3 + ⁄ + 4
       Others                    dž → d + z + ◌̌
       Table: Compatibility Equivalence
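Most rows of the two equivalence tables can be reproduced with Python's standard `unicodedata` module, shown here as an illustration independent of bsdconv.

```python
import unicodedata as ud

# Canonical equivalence (NFC / NFD).
print(ud.normalize("NFD", "\u00C7") == "C\u0327")          # Ç -> C + cedilla
print(ud.normalize("NFC", "q\u0307\u0323") ==
      ud.normalize("NFC", "q\u0323\u0307"))                # mark order
print(ud.normalize("NFD", "가") == "\u1100\u1161")          # Hangul jamo
print(ud.normalize("NFC", "\u2126") == "\u03A9")           # Ohm -> Omega

# Compatibility equivalence (NFKC / NFKD).
compat = {
    "\u210C": "H",          # font variant
    "\u00A0": " ",          # NBSP -> SP
    "\u2460": "1",          # circled digit one
    "\uFF76": "\u30AB",     # halfwidth katakana ka -> fullwidth
    "\uFE37": "{",          # rotated bracket
    "\u2079": "9",          # superscript nine
    "\u337F": "株式会社",    # squared characters
    "\u00BE": "3\u20444",   # fraction, with U+2044 FRACTION SLASH
}
for src, expected in compat.items():
    print(repr(src), ud.normalize("NFKC", src) == expected)
```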
  • 24. Normalization for fuzzy matching
       UTF-8:UPPER:UTF-8
       Input:  aăⅷDžбⓐᾥ
       Output: AĂⅧDŽБⒶᾭ
       UTF-8:ZH-FUZZY-TW:KANA-PHONETIC:NFKD-CASEFOLD:UTF-8
       Input:  ¼ℌℍăDžⓐ⁹ 灣湾ド鬒鬒㊣ æß
       Output: 1⁄4hhădža9 灣灣 do 鬒鬒正 æss
                       Composition   Decomposition
       Canonical       NFC           NFD
       Compatibility   NFKC          NFKD
       Table: The four Unicode normalization forms and the transformations
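The NFKD plus case-folding part of that pipeline has a rough Python counterpart, sketched below. `fuzzy_key` is a name invented for this example; it covers neither the ZH-FUZZY-TW nor the KANA-PHONETIC step, and a strict implementation would follow the Unicode NFKC_Casefold definition rather than this two-step shortcut.

```python
import unicodedata


def fuzzy_key(text: str) -> str:
    """Compatibility-decompose, then case-fold, for loose matching."""
    return unicodedata.normalize("NFKD", text).casefold()


print(fuzzy_key("ℌℍⓐ⁹"))
print(fuzzy_key("æß"))
print(fuzzy_key("Ⓐ") == fuzzy_key("a"))
```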
  • 25. Bsdconv :: Cascade
       Re-encode
       UTF-8:ISO-8859-1|BIG5:UTF-8    (from:to | from:to)
       Figure: Cascaded conversions
       Input    Output
       ¥x¥_     台北
  • 26. Bsdconv :: Codec argument
       Other than question mark
       UTF-8,ANY#0121:ASCII,ANY#21    (from:to)
       Figure: Codec argument
       Or more than one character
       UTF-8,ANY#013F.0121:ASCII,ANY#21    (from:to)
       Figure: Data list, separated by dot
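The cascaded conversion on slide 25 undoes a classic mojibake: Big5 bytes that were wrongly decoded as ISO-8859-1. The same two steps can be spelled out with Python's built-in codecs, as an illustration of the idea rather than of bsdconv's internals.

```python
mojibake = "¥x¥_"   # Big5 bytes for 台北, misread as ISO-8859-1

# Step 1 (UTF-8:ISO-8859-1): get the original bytes back.
raw = mojibake.encode("iso-8859-1")
print(raw.hex())

# Step 2 (BIG5:UTF-8): decode those bytes with the right codec.
fixed = raw.decode("big5")
print(fixed)
```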
  • 29. Bsdconv :: Types
       (01) Unicode
       (02) CNS11643
       (03) Byte
       (04) Chinese components
       (1B) ANSI control sequences
       (00) Bsdconv special characters
  • 31. Chinese components composition
       https://github.com/buganini/chicomp
       UTF-8:ZH-DECOMP:ZH-COMP:UTF-8
       Input:  功夫不好不要艹我
       Output: 巭孬嫑莪
       UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:UTF-8
       Input:  功夫不好不要艹我
       Output: ㄆㄨ ㄋㄠ ㄧㄠ ㄜˊ
       UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:HAN-PINYIN:UTF-8
       Input:  功夫不好不要艹我
       Output: pu nao yao [uh]2
  • 32. Bsdconv :: Flags
       FREE: memory management
       MARK: identifier
  • 34. Look-through (1/4)
       Conversion: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8
       Entity   Unicode   UTF-8 Hex
       α        U+03B1    CEB1
       β        U+03B2    CEB2
       Input (UTF-8 literal): %u03B1%CE%B2
       Decoder: ESCAPE
       Internal data: [01 03 B1] [03 CE] [03 B2]
  • 35. Look-through (2/4)
       Internal data: [01 03 B1] [03 CE] [03 B2]
       Encoder: PASS#MARK&FOR=1,BYTE
       Internal data: [01 03 B1] MARK CE B2
  • 36. Look-through (3/4)
       Internal data: [01 03 B1] MARK CE B2
       Decoder: PASS#UNMARK,UTF-8
       Internal data: [01 03 B1] [01 03 B2]
  • 37. Look-through (4/4)
       Internal data: [01 03 B1] [01 03 B2]
       Encoder: UTF-8
       Internal data: CE B1 "α", CE B2 "β"
       Output (UTF-8 literal): αβ
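The end-to-end effect of the look-through pipeline, where `%uXXXX` escapes become code points directly while `%XX` escapes are collected as bytes and decoded as UTF-8 afterwards, can be approximated in plain Python. This is a sketch of the observable behaviour only; `unescape_mixed` is a hypothetical helper and does not model bsdconv's typed internal data or its MARK flag.

```python
import re

TOKEN = re.compile(r"%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})")


def unescape_mixed(text: str) -> str:
    """Resolve %uXXXX as code points and runs of %XX as UTF-8 bytes."""
    out = []
    pending = bytearray()          # raw bytes waiting to be decoded

    def flush():
        if pending:
            out.append(pending.decode("utf-8"))
            pending.clear()

    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:              # ordinary character
            flush()
            out.append(text[pos])
            pos += 1
        elif m.group(1):           # %uXXXX: already a code point
            flush()
            out.append(chr(int(m.group(1), 16)))
            pos = m.end()
        else:                      # %XX: one byte of a UTF-8 sequence
            pending.append(int(m.group(2), 16))
            pos = m.end()
    flush()
    return "".join(out)


print(unescape_mixed("%u03B1%CE%B2"))
```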
  • 38. Unicode :: East Asian Width (1/2)
       UAX#11
       Figure: Venn diagram showing the set relations for six categories (Narrow, Halfwidth, Wide, Fullwidth, Ambiguous, Neutral)
  • 39. Unicode :: East Asian Width (2/2)
       UAX#11
       Narrow Ambiguous Wide Я N ऊ Na A A F H カ カ W 咦 W
       Table: Examples for Each Character Class and Their Resolved Widths
       Na   Narrow
       N    Neutral, usually treated as Narrow
       W    Wide
       F    Fullwidth
       H    Halfwidth
       Table: Width attributes
  • 40. String width measurement
       echo "42(ˊ_>ˋ) 紅茶" | bsdconv UTF-8:WIDTH:NULL
       FULL: 2
       HALF: 7
       AMBI: 2
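A comparable count can be obtained from the East Asian Width property exposed by Python's `unicodedata` module. The grouping below (W and F counted as full, A as ambiguous, everything else as half) is an assumption made for this sketch and is not taken from the bsdconv source.

```python
import unicodedata


def measure(text: str) -> dict:
    """Count characters by resolved East Asian Width class."""
    counts = {"FULL": 0, "HALF": 0, "AMBI": 0}
    for ch in text:
        width = unicodedata.east_asian_width(ch)
        if width in ("W", "F"):
            counts["FULL"] += 1
        elif width == "A":
            counts["AMBI"] += 1
        else:                      # Na, H, N
            counts["HALF"] += 1
    return counts


# Same text as the slide, without the trailing newline added by echo.
print(measure("42(\u02ca_>\u02cb) 紅茶"))
```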
  • 41. Chinese charset encoding detection
       https://github.com/buganini/chiconv
       ENCODING:SCORE#WITH=CJK:COUNT:ZH-BONUS:ZHTW:ZH-BONUS-PHRASE:NULL
       Score(s) = ($SCORE − $IERR∗$COUNT∗0.01) / $COUNT
       帥呆了 ⇒ UTF-8:SCORE#WITH=CJK:……
       ENCODING   SCORE   COUNT   IERR   Score(s)
       UTF-8      19      4       0      4.75
       BIG5       8       3       2      -4.0
       GBK        4       1       4      -36.0
       CCCII      36      9       0      4.0
       UTF-16LE   20      5       2      0.0
  • 42. Khmer legacy font converter
       https://github.com/buganini/khmerconv
       Issues
         Encoding without registered name, bound to fonts
         Stored in CP1252 or UTF-8
       Solution
         Two-pass detection
           Detect encoding
           Detect font family (currently not working) (high coverage in SBCS)
         Algorithm ported from Khmer Converter [3]
           Mapping
           Reordering: visual order vs. Unicode model
           Unicode model: baseCharacter [+ [Robat/Shifter] + [Coeng*] + [Shifter] + [Vowel] + [Sign]]
       [3] http://www.khmeros.info/en/khmer-converter
  • 46. Terminal transcoding
       https://github.com/buganini/bug5
       Issues
         UAO: non-standard Big5 extension
         Double color hack: ANSI control sequence in the middle of a DBCS character
         Ambiguous width characters
         luit/screen cannot help
       Solution (tl;dr)
         Big5 to Unicode: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
         Unicode to Big5: UTF-8,00,BYTE:ZHTW:AMBIGUOUS-UNPAD:BIG5,CP950-TRANS,UAO,00,ANY#3F
  • 47. Bug5 explained (1/6)
       Conversion: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
       Entity   Unicode   UTF-8 Hex   Big5 Hex
       ★        U+2605    E29885      A1B9
       驚       U+9A5A    E9A99A      C5E5
       [        U+005B    5B          5B
       1        U+0031    31          31
       m        U+006D    6D          6D
       (NBSP)   U+00A0    C2A0        -
       Input (Big5 literal): ★\xC5\x1B[1m\xE5
       Decoder: ANSI-CONTROL,BYTE
       Internal data: [03 A1] [03 B9] [03 C5] [1B 5B 31 6D] [03 E5]
  • 48. Bug5 explained (2/6)
       Internal data: [03 A1] [03 B9] [03 C5] [1B 5B 31 6D] [03 E5]
       Inter-conversion: BIG5-DEFRAG
       Internal data: [03 A1] [03 B9] [03 C5] [03 E5] [1B 5B 31 6D]
  • 49. Bug5 explained (3/6)
       Internal data: [03 A1] [03 B9] [03 C5] [03 E5] [1B 5B 31 6D]
       Encoder: BYTE,PASS#MARK&FOR=1B
       Internal data: A1 B9 C5 E5 [1B 5B 31 6D] MARK
  • 50. Bug5 explained (4/6)
       Internal data: A1 B9 C5 E5 [1B 5B 31 6D] MARK
       Decoder: PASS#UNMARK,BIG5
       Internal data: [01 26 05] [01 9A 5A] [1B 5B 31 6D]
  • 51. Bug5 explained (5/6)
       Internal data: [01 26 05] [01 9A 5A] [1B 5B 31 6D]
       Inter-conversion: AMBIGUOUS-PAD
       Internal data: [01 26 05] [01 A0] [01 9A 5A] [1B 5B 31 6D]
  • 52. Bug5 explained (6/6)
       Internal data: [01 26 05] [01 A0] [01 9A 5A] [1B 5B 31 6D]
       Encoder: UTF-8,PASS#FOR=1B
       Output (UTF-8 literal): ★(NBSP)驚\x1B[1m
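The defragmentation step (slide 48) is the heart of the double color fix: an ANSI control sequence that splits a double-byte character is moved behind the completed character. The function below is a byte-level sketch of that idea; `big5_defrag` and the `CSI` pattern are written for this illustration, the lead-byte range 0x81..0xFE is an assumption, and bsdconv itself works on typed internal data rather than raw bytes.

```python
import re

# ESC [ parameters final-byte, e.g. b"\x1b[1m" (simplified CSI pattern).
CSI = re.compile(rb"\x1b\[[0-9;]*[@-~]")


def big5_defrag(data: bytes) -> bytes:
    """Move ANSI sequences out of the middle of double-byte characters."""
    out = bytearray()
    i = 0
    while i < len(data):
        lead = data[i]
        if 0x81 <= lead <= 0xFE:           # assumed Big5 lead-byte range
            j = i + 1
            moved = bytearray()
            while True:                     # collect interleaved sequences
                m = CSI.match(data, j)
                if m is None:
                    break
                moved += m.group(0)
                j = m.end()
            if j < len(data):               # rejoin lead and trail byte
                out += bytes([lead, data[j]])
                out += moved
                i = j + 1
            else:                           # truncated input: keep as is
                out.append(lead)
                out += moved
                i = j
        else:
            out.append(lead)
            i += 1
    return bytes(out)


fragmented = b"\xa1\xb9\xc5\x1b[1m\xe5"     # input from slide 47
print(big5_defrag(fragmented))
```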
  • 54. Bsdconv :: GUI
       https://github.com/buganini/gbsdconv
       Alternative to ConvertZ
       Text
       File name
       File content
       Meta tag