Encodings - Ruby 1.8 and Ruby 1.9
1. Encodings
Ruby 1.8 and 1.9
Vlad ZLOTEANU
#ParisRB Software Engineer @ Dimelo
December 12, 2001
@vladzloteanu
Copyright Dimelo SA www.dimelo.com
2. Motto:
“ There Ain't No Such Thing
As Plain Text ”
Joel Spolsky
3. ASCII (1963)
historically: from telegraphic codes
7 bits to encode 128 chars
included: English alphabet, digits, punctuation
marks, control chars
what about chars from other languages?
"A".unpack("C*")
=> [65]
"a".unpack("C*")
=> [97]
"c".unpack("C*")
=> [99]
4. iso-8859-X
idea: use the 8th bit -> 128 new positions
8-bit encoding -> 256 chars
iso-8859-1 (Latin-1), windows-1252
slots 160 to 255 for other chars
covers most Western European languages: French, German, etc.
default charset in many browsers
iso-8859-2
covers most Eastern European languages
5. Issues
can't combine 2 different languages from 2
different encodings
most Asian languages have more than 256 chars
"café".encode('ISO-8859-1').unpack("C*")
=> [99, 97, 102, 233]
"Ionuţ".encode('ISO-8859-2').unpack("C*")
=> [73, 111, 110, 117, 254]
"Ionuţ aime le café".encode('ISO-8859-1').unpack("C*")
Encoding::UndefinedConversionError:
U+0163 from UTF-8 to ISO-8859-1
6. Unicode
the goal of Unicode was literally to provide a
character set that includes all characters in use today
each letter maps to a code point (theoretical symbol)
A is the same character whatever the typeface (regular, bold or italic), but different from a
uppercase, lowercase, rules for normalization,
decomposition, etc.
codespace of 1.1M code points (from 0 to 10FFFF), of which
about 110k are assigned characters
code points 0 to 255 are the same as in Latin-1 (we can
think of Unicode as a superset of Latin-1)
7. Unicode (2)
Unicode enables processing, storage and interchange
of text data no matter what the platform, no matter
what the program, no matter the language
.. but how should we store those magical ‘code
points’?
"café".codepoints.to_a
=> [99, 97, 102, 233]
"café".encode('ISO-8859-1').unpack("C*")
=> [99, 97, 102, 233]
"Ionuţ 愛して le καφές".codepoints.to_a
=> [73, 111, 110, 117, 355, 32, 24859, 12375, 12390, 32, 108, 101, 32, 954,
945, 966, 941, 962]
8. UTF-8
encoding scheme for Unicode
every code point from 0-127 is stored in a single byte.
code points 128 and above are stored using 2 to 4 bytes
"Café".unpack("U*")
=> [67, 97, 102, 233]
"Café".encode("UTF-8").unpack("C*")
=> [67, 97, 102, 195, 169]
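The byte layout behind those numbers can be worked out by hand. The sketch below is only an illustration: utf8_bytes is a hypothetical helper (not part of Ruby), and it ignores the surrogate range that real encoders reject. It packs the bits of a code point into the 1 to 4 byte patterns UTF-8 defines.

```ruby
# Encode one Unicode code point into UTF-8 bytes by hand.
# utf8_bytes is a hypothetical helper written for this illustration;
# it does not reject surrogates (0xD800..0xDFFF) as a real encoder would.
def utf8_bytes(codepoint)
  case codepoint
  when 0x0000..0x007F      # 1 byte:  0xxxxxxx
    [codepoint]
  when 0x0080..0x07FF      # 2 bytes: 110xxxxx 10xxxxxx
    [0xC0 | (codepoint >> 6),
     0x80 | (codepoint & 0x3F)]
  when 0x0800..0xFFFF      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    [0xE0 | (codepoint >> 12),
     0x80 | ((codepoint >> 6) & 0x3F),
     0x80 | (codepoint & 0x3F)]
  when 0x10000..0x10FFFF   # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    [0xF0 | (codepoint >> 18),
     0x80 | ((codepoint >> 12) & 0x3F),
     0x80 | ((codepoint >> 6) & 0x3F),
     0x80 | (codepoint & 0x3F)]
  else
    raise ArgumentError, "not a Unicode code point: #{codepoint}"
  end
end

p utf8_bytes(233)    # é  => [195, 169]
p utf8_bytes(12467)  # コ => [227, 130, 179]
```

The results agree with what Ruby itself produces for "é" and "コ" when the string is UTF-8.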
9. UTF-8 pluses & minuses
ASCII extension
can encode any Unicode char
self-synchronising, efficient to search with byte-oriented
algorithms, efficient to encode
RFC 2277: (Internet) protocols MUST declare (supported)
charsets, protocols MUST support at least UTF-8
"コーヒー".unpack('U*')
=> [12467, 12540, 12498, 12540]
"コーヒー".unpack('C*')
=> [227, 130, 179, 227, 131, 188, 227, 131, 146,
227, 131, 188] # 3 bytes per char: these take 1.5x more space than in UTF-16
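"Self-synchronising" means a reader can land on any byte and still find the character boundary, because continuation bytes always have the form 10xxxxxx. A minimal sketch of that idea follows; char_start is a hypothetical helper, not a Ruby built-in.

```ruby
# Find the index of the first byte of the character that contains bytes[index].
# char_start is a hypothetical helper, not part of Ruby's String API.
def char_start(bytes, index)
  # Continuation bytes always look like 10xxxxxx (0x80..0xBF), so step back
  # until we reach a byte that is not a continuation byte.
  index -= 1 while index > 0 && (bytes[index] & 0xC0) == 0x80
  index
end

bytes = "コーヒー".bytes.to_a
p char_start(bytes, 5)  # => 3 (byte 5 is the last byte of the second char)
p char_start(bytes, 3)  # => 3 (byte 3 is already a lead byte)
```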
10. What you should remember
Text CONTENT and ENCODING are two different
concepts
Unicode is a map: “symbol” -> integer code point
Latin-1 is a single byte encoding for Western
languages
UTF-8 is a multibyte encoding for Unicode
USE UTF-8!
11. Ruby 1.8 Unicode Support
a String is just a collection of bytes --> dealing with
encodings is left to the developer
issues: index retrieval, slicing, regexp, etc
String#size always counts bytes (affects validates_size_of …)
limited Unicode support (/u regexp modifier)
"Café".size
=> 5
"Café".reverse
=> "\251\303faC"
"Café".scan(/./)
=> ["C", "a", "f", "\303", "\251"]
"Café".scan(/./u)
=> ["C", "a", "f", "é"]
12. Ruby 1.8 Unicode Support (2)
regex - aware of 4 encodings: none, EUC, Shift_JIS,
UTF-8
ways to set source encoding:
command-line -K option
RUBYOPT
ruby -e "puts 'Café'.scan(/./).inspect"
["C", "a", "f", "\303", "\251"]
ruby -Ku -e "puts 'Café'.scan(/./).inspect"
["C", "a", "f", "é"]
export RUBYOPT='-Ku'
ruby -e "puts 'Café'.scan(/./).inspect"
["C", "a", "f", "é"]
13. Ruby 1.8 - Transcoding
Iconv library – ships with Ruby, handles transcoding
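//TRANSLIT option: approximate characters that have no equivalent in the target encoding
//IGNORE option: silently drop characters that cannot be converted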
require 'iconv'
utf8_coffee = "Café"
=> "Café"
utf8_to_latin1 = Iconv.new("LATIN1//TRANSLIT//IGNORE", "UTF8")
=> #<Iconv:0x007f8ba1930060>
utf8_to_latin1.iconv(utf8_coffee).size
=> 4
utf8_to_latin1.iconv("On and on… and on…")
=> "On and on... and on..."
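Iconv was later dropped from Ruby's standard library (from 2.0 on), so a rough equivalent with String#encode may be useful. This is a sketch, not a drop-in replacement: the undef/replace options stand in for //IGNORE, and the hand-written fallback table only approximates //TRANSLIT for the characters listed in it. The variable names are illustrative.

```ruby
utf8_text = "On and on… and on… café"

# Characters with no Latin-1 equivalent are replaced by "?" instead of raising
# Encoding::UndefinedConversionError (similar in spirit to //IGNORE).
latin1 = utf8_text.encode("ISO-8859-1", undef: :replace, replace: "?")
p latin1.bytesize

# A fallback table is consulted for each character that cannot be converted;
# here it maps the ellipsis to three dots (a hand-rolled //TRANSLIT).
translit = utf8_text.encode("ISO-8859-1", fallback: { "…" => "..." })
p translit.encode("UTF-8")  # => "On and on... and on... café"
```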
14. Ruby 1.9 & M17N
multilingualization (M17N) - a CSI (Code Set Independent) approach
It should be possible to localize a single piece of software
for more than one language
More than one language should be usable at the
same time
differs from conventional languages (Java, Python,
Perl), which follow the UCS (Universal Character Set) philosophy
1. Source encoding: all source files have an encoding
new __ENCODING__ keyword
in irb:
__ENCODING__
=> #<Encoding:UTF-8>
15. Ruby 1.9 – source encoding
New way to set encoding: magic comment
Priority:
.rb files:
magic comment > command-line -K option > RUBYOPT -K >
shebang -K > US-ASCII
command line / standard input:
magic comment > command-line -K option > RUBYOPT -K >
system locale
# encoding: UTF-8
puts __ENCODING__
=> UTF-8
16. Ruby 1.9 – String class
String – a collection of encoded data
each String object has an encoding
size counts characters, not bytes (multibyte-aware)
3 new enumerator methods: each_byte, each_char, each_codepoint
"café".size
=> 4
"café".bytesize
=> 5
"café".each_byte.map{|byte| byte}
=> [99, 97, 102, 195, 169]
"café".each_char.map{|char| char}
=> ["c", "a", "f", "é"]
"café".each_codepoint.map{|codepoint| codepoint}
=> [99, 97, 102, 233]
17. Ruby 1.9 – String class (Transcoding)
Strings with different encodings can ‘coexist’ in the
same program, and can be merged when their encodings are compatible
New way to transcode: String#encode
latin_1_coffee = "café".encode('ISO-8859-1')
=> "caf\xE9"
latin_1_coffee.bytesize
=> 4
wrong_encoded_coffee = latin_1_coffee.force_encoding('UTF-8')
=> "caf\xE9"
latin_1_coffee.encoding # force_encoding relabels the receiver in place
=> #<Encoding:UTF-8>
wrong_encoded_coffee.scan /./
ArgumentError: invalid byte sequence in UTF-8
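The difference between encode (which changes the bytes) and force_encoding (which only changes the label) is easy to mix up. The sketch below uses illustrative variable names and shows how to detect a mislabeled string, how to repair it, and what happens when incompatible strings are merged.

```ruby
utf8_coffee   = "café"
latin1_coffee = utf8_coffee.encode("ISO-8859-1")  # transcodes: bytes change

# dup first, because force_encoding modifies its receiver in place.
mislabeled = latin1_coffee.dup.force_encoding("UTF-8")  # relabels: same bytes

p latin1_coffee.bytes.to_a    # => [99, 97, 102, 233]
p mislabeled.valid_encoding?  # => false

# Repair: restore the correct label, then transcode to UTF-8.
repaired = mislabeled.force_encoding("ISO-8859-1").encode("UTF-8")
p repaired == utf8_coffee     # => true

# Merging non-ASCII strings whose encodings are incompatible raises an error.
merge_failed = false
begin
  utf8_coffee + latin1_coffee
rescue Encoding::CompatibilityError => e
  merge_failed = true
  puts e.message
end
```

Checking valid_encoding? at the boundary where data enters the program is usually cheaper than chasing an ArgumentError raised later by a regexp.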
18. Ruby 1.9 - Internal and external encoding
What about non-literal Strings (that come from I/O)?
2. Encoding.default_external:
default for external encoding
derived from LANG on Unix/Linux
derived from legacy system encoding on Windows
3. Encoding.default_internal:
default for internal encoding
by default undefined (≊ default external)
> cat show_encodings.rb
open(__FILE__, "r:UTF-8:UTF-32") do |file|
  puts file.external_encoding.name
  puts file.internal_encoding.name
  file.each do |line|
    p [line.encoding.name, line[0..3]]
  end
end
> ruby show_encodings.rb
UTF-8
UTF-32
["UTF-32", "\uFEFF"]
["UTF-32", "\x00\x00\x00\x20"]
["UTF-32", "\x00\x00\x00\x20"]
["UTF-32", "\x00\x00\x00\x20"]
["UTF-32", "\x00\x00\x00\x20"]
["UTF-32", "\x00\x00\x00\x20"]
["UTF-32", "\x00\x00\x00\x65"]
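A more common use of the "r:EXTERNAL:INTERNAL" mode string is reading a legacy Latin-1 file and getting UTF-8 strings back. The sketch below assumes a throwaway temporary file; the file name prefix and variable names are illustrative.

```ruby
require "tempfile"

# Write "café\n" as raw Latin-1 bytes (é is the single byte 233).
tmp = Tempfile.new("latin1_coffee")
tmp.binmode
tmp.write([99, 97, 102, 233, 10].pack("C*"))
tmp.close

# External encoding: what is on disk. Internal encoding: what the program gets.
line = File.open(tmp.path, "r:ISO-8859-1:UTF-8") { |file| file.gets }
tmp.unlink

p line.encoding  # => #<Encoding:UTF-8>
p line.bytesize  # => 6 (é now takes two bytes)

# The process-wide defaults apply when no encodings are given in the mode string.
p Encoding.default_external
p Encoding.default_internal
```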
19. What you should remember
Ruby 1.8 has limited (regexp-only) support for
Unicode
watch out for slices, sizes, reverse, etc.
transcode with Iconv
Ruby 1.9 is encoding-aware
each source file has an Encoding
each String has an Encoding
IO: internal and external encoding
New iterators on String
21. HTML – Encoding chars
Encoding types
directly in declared encoding
"é"
named char entities
"&eacute;"
numeric char entities
"&#233;"
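Numeric character entities are simply the Unicode code point written in decimal, which makes them easy to generate. A minimal sketch, assuming UTF-8 input: to_numeric_entities is a hypothetical helper, and it does not escape &, < or >.

```ruby
# Replace every non-ASCII character with its numeric HTML entity.
# to_numeric_entities is a hypothetical helper for illustration only;
# it leaves &, < and > untouched, so it is not a full HTML escaper.
def to_numeric_entities(text)
  text.gsub(/[^\x00-\x7F]/) { |char| format("&#%d;", char.ord) }
end

p to_numeric_entities("café")  # => "caf&#233;"
```

The output is pure ASCII, so it survives even when the page's declared encoding is wrong.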
22. Conclusion
Use UTF-8
Document (declare) encodings
Write encoding-safe code
23. References
James Gray’s Encodings series
Joel Spolsky’s blog post about encodings
Design and implementation of Ruby M17N
Internationalization in Ruby 1.9