Encodings
                          Ruby 1.8 and 1.9


                                           Vlad ZLOTEANU
      #ParisRB                      Software Engineer @ Dimelo
      December 12, 2001
                                                 @vladzloteanu


Copyright Dimelo SA                                  www.dimelo.com
Motto:
                      “ There Ain't No Such Thing
                             As Plain Text ”
                                          Joel Spolsky




ASCII (1963)


                      historically: from telegraphic codes
                      7 bits to encode 128 chars
                      included: English alphabet, digits, punctuation
                      marks, control chars
                      what about chars from other languages?
    "A".unpack("C*")
     => [65]

    "a".unpack("C*")
    => [97]

    "c".unpack("C*")
     => [99]
iso-8859-X


                      idea: use the 8th bit -> 128 new positions
                      8-bit encoding -> 256 chars

                 iso-8859-1 (Latin-1), windows-1252 (a close variant)
                    slots 160 to 255 for other chars
                    covers most Western European languages: French, German, etc.
                    default charset in many browsers

                 iso-8859-2
                    most Eastern European languages
Issues


                      can't combine 2 different languages from 2
                      different encodings
                      most Asian languages have more than 256 chars



    "café".encode('ISO-8859-1').unpack("C*")
              => [99, 97, 102, 233]

    "Ionuţ".encode('ISO-8859-2').unpack("C*")
              => [73, 111, 110, 117, 254]

    "Ionuţ aime le café".encode('ISO-8859-1').unpack("C*")
           Encoding::UndefinedConversionError:
           U+0163 from UTF-8 to ISO-8859-1
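
The same issue can be seen from the other direction: one byte value means different characters depending on which 8-bit encoding is assumed. A minimal sketch (variable names are illustrative, not from the slides) that decodes byte 254 under both encodings:

```ruby
# One raw byte, 254 (0xFE); pack("C") yields a binary (ASCII-8BIT) String.
raw = [254].pack("C")

# Label the same byte as two different 8-bit encodings, then transcode to
# UTF-8 so the resulting Unicode code points can be compared.
as_latin1 = raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")
as_latin2 = raw.dup.force_encoding("ISO-8859-2").encode("UTF-8")

p as_latin1.codepoints.to_a  # => [254]  (U+00FE, thorn)
p as_latin2.codepoints.to_a  # => [355]  (U+0163, t with cedilla)
```

Without knowing the encoding, the byte alone cannot tell us which of the two characters was meant.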
Unicode

            the goal of Unicode is to provide a single
            character set that includes all characters in use today

            each character maps to a code point (an abstract symbol, not a glyph)
                  A is the same character in any font or style, but different from a
                  the standard also covers uppercase, lowercase, rules for
                  normalization, decomposition, etc.
                  codespace of 1.1M code points (from 0 to 10FFFF)
                  (about 110k chars assigned when these slides were written)


            code points 0 to 255 are the same as in Latin-1 (we can
            think of Unicode as a superset of Latin-1)
Unicode (2)


            Unicode enables processing, storage and interchange
            of text data no matter what the platform, no matter
            what the program, no matter the language
            .. but how should we store those magical ‘code
            points’?
    "café".codepoints.to_a
        => [99, 97, 102, 233]

    "café".encode('ISO-8859-1').unpack("C*")
        => [99, 97, 102, 233]

    "Ionuţ 愛して le καφές".codepoints.to_a
        => [73, 111, 110, 117, 355, 32, 24859, 12375, 12390, 32, 108, 101, 32, 954,
    945, 966, 941, 962]
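
Code points are usually written in U+XXXX notation. A small sketch, assuming nothing beyond core Ruby, that prints them that way and rebuilds the String from the integers:

```ruby
# "Ionu\u0163" is "Ionuţ" written with an escape, so the snippet does not
# depend on the encoding of the source file.
name = "Ionu\u0163"

labels = name.codepoints.to_a.map { |cp| format("U+%04X", cp) }
p labels  # => ["U+0049", "U+006F", "U+006E", "U+0075", "U+0163"]

# pack("U*") goes the other way: code points -> UTF-8 encoded String.
rebuilt = [73, 111, 110, 117, 355].pack("U*")
p rebuilt == name  # => true
```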
UTF-8

            encoding scheme for Unicode
            every code point from 0-127 is stored in a single byte
            code points 128 and above are stored using 2 to 4 bytes



    "Café".unpack("U*")
          => [67, 97, 102, 233]

    "Café".encode("UTF-8").unpack("C*")
          => [67, 97, 102, 195, 169]
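
To make the byte layout concrete, here is a hand-written sketch of the UTF-8 encoding rule. The method name utf8_bytes is hypothetical; in real code pack("U") or String#encode does this. It ignores the surrogate range, which a strict encoder would reject.

```ruby
# Encode one Unicode code point into its UTF-8 byte values.
def utf8_bytes(cp)
  case cp
  when 0..0x7F            # 0xxxxxxx
    [cp]
  when 0x80..0x7FF        # 110xxxxx 10xxxxxx
    [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
  when 0x800..0xFFFF      # 1110xxxx 10xxxxxx 10xxxxxx
    [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
  when 0x10000..0x10FFFF  # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    [0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
     0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
  else
    raise ArgumentError, "code point out of range: #{cp}"
  end
end

p utf8_bytes(67)     # => [67]             "C"
p utf8_bytes(233)    # => [195, 169]       "é"
p utf8_bytes(12467)  # => [227, 130, 179]  "コ"
```

The results match what unpack("C*") reports in the slides: 233 becomes the two bytes 195 and 169.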




UTF-8 pluses & minuses

            ASCII extension
            can encode any Unicode char
            self-synchronising, efficient to search with byte-oriented
            algorithms, efficient to encode
            rfc2277: (inet) protocols MUST declare (supported)
            charsets, protocols MUST support at least UTF-8

    "コーヒー".unpack('U*')
       => [12467, 12540, 12498, 12540]

    "コーヒー".unpack('C*')
       => [227, 130, 179, 227, 131, 188, 227, 131, 146,
    227, 131, 188] # 3 bytes per char: Asian languages take 1.5x more space than in UTF-16
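
"Self-synchronising" means a reader dropped at any byte offset can find the start of the current character, because continuation bytes always have the bit pattern 10xxxxxx. A minimal sketch of that idea, with a hypothetical helper char_start, followed by the size comparison behind the 1.5x remark:

```ruby
# UTF-8 bytes of "コーヒー" (the values from the slide), as plain integers.
bytes = [227, 130, 179, 227, 131, 188, 227, 131, 146, 227, 131, 188]

# Step back from any offset to the lead byte of the character containing it.
def char_start(bytes, index)
  index -= 1 while index > 0 && (bytes[index] & 0xC0) == 0x80
  index
end

p char_start(bytes, 5)  # => 3  (offset 5 is inside the second character)
p char_start(bytes, 6)  # => 6  (offset 6 is already a lead byte)

coffee = bytes.pack("C*").force_encoding("UTF-8")
p coffee.bytesize                     # => 12
p coffee.encode("UTF-16LE").bytesize  # => 8
```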

What you should remember


            Text CONTENT and ENCODING are two different
            concepts
            Unicode is a map: "symbol" -> "integer code point"
            Latin-1 is a single byte encoding for Western
            languages
            UTF-8 is a multibyte encoding for Unicode


            USE UTF-8!


Ruby 1.8 Unicode Support

         a String is just a collection of bytes --> dealing with
         encodings is left to the developer
         issues: index retrieval, slicing, regexp, etc.
             "".size always counts bytes (affects validates_size_of …)
         limited Unicode support (/u regexp modifier)
   "Café".size
    => 5

   "Café".reverse
    => "\251\303faC"

   "Café".scan(/./)
    => ["C", "a", "f", "\303", "\251"]

   "Café".scan(/./u)
    => ["C", "a", "f", "é"]
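
The 1.8 behaviour can be approximated on Ruby 1.9 and later by relabelling a String as raw bytes. This is only a sketch of the effect, not how 1.8 was implemented:

```ruby
# "Caf\u00E9" is "Café"; the escape keeps the snippet independent of the
# source file's encoding.
utf8 = "Caf\u00E9"

# Relabel a copy as raw bytes: roughly what every String was in Ruby 1.8.
bytes_view = utf8.dup.force_encoding("ASCII-8BIT")

p utf8.size                      # => 4  characters
p bytes_view.size                # => 5  bytes
p bytes_view.reverse.bytes.to_a  # => [169, 195, 102, 97, 67]

# Reversing byte by byte swaps the two bytes of "é", which breaks the text.
broken = bytes_view.reverse.force_encoding("UTF-8")
p broken.valid_encoding?         # => false
p utf8.reverse == "\u00E9faC"    # => true  (character-aware reverse)
```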
Ruby 1.8 Unicode Support (2)

         regexp - aware of 4 encodings: none, EUC, Shift_JIS,
         UTF-8
         ways to set source encoding:
            -K command-line option
            RUBYOPT environment variable

   ruby -e "puts 'Café'.scan(/./).inspect"
   ["C", "a", "f", "\303", "\251"]

   ruby -Ku -e "puts 'Café'.scan(/./).inspect"
   ["C", "a", "f", "é"]

    export RUBYOPT='-Ku'
    ruby -e "puts 'Café'.scan(/./).inspect"
    ["C", "a", "f", "é"]
Ruby 1.8 - Transcoding


            Iconv library – ships with Ruby, handles transcoding
               //TRANSLIT option: approximate chars missing from the target encoding
               //IGNORE option: drop chars that cannot be converted
    require 'iconv'

    utf8_coffee = "Café"
    => "Café"

    utf8_to_latin1 = Iconv.new("LATIN1//TRANSLIT//IGNORE", "UTF8")
    => #<Iconv:0x007f8ba1930060>

    utf8_to_latin1.iconv(utf8_coffee).size
    => 4

    utf8_to_latin1.iconv("On and on… and on…")
     => "On and on... and on..."
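
Iconv was later removed from Ruby's standard library (in 2.0), so on 1.9 and newer the same kind of lossy conversion is usually written with String#encode. A sketch under that assumption; the mapping for the ellipsis is hand-written here, whereas //TRANSLIT picks approximations itself:

```ruby
# Ends with U+2026 HORIZONTAL ELLIPSIS, which has no Latin-1 equivalent.
text = "On and on\u2026"

# Comparable to //IGNORE: replace unconvertible chars with an empty string.
ignored = text.encode("ISO-8859-1", :undef => :replace, :replace => "")

# Comparable to //TRANSLIT, but with an explicit mapping supplied by us.
translit = text.encode("ISO-8859-1", :fallback => { "\u2026" => "..." })

p ignored   # => "On and on"
p translit  # => "On and on..."
```

Without either option, encode raises Encoding::UndefinedConversionError, as in the "Ionuţ aime le café" example earlier.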
Ruby 1.9 & M17N

            multilingualization (M17N) - a CSI (Code Set Independent) approach
                  one program should be localizable for more than one
                  language
                  more than one language should be usable at the
                  same time
                  differs from conventional languages (Java, Python,
                  Perl), which follow the UCS (Universal Character Set) philosophy


            1. Source encoding: all source files have an encoding
                  new __ENCODING__ keyword

    irb
    > __ENCODING__
     => #<Encoding:UTF-8>
Ruby 1.9 – source encoding

            New way to set encoding: magic comment

            Priority:
               .rb files:
             magic comment > command-line -K option > RUBYOPT -K >
               shebang -K > US-ASCII

               command line / standard input:
             magic comment > command-line -K option > RUBYOPT -K >
               system locale
    # encoding: UTF-8
    puts __ENCODING__
        => UTF-8
Ruby 1.9 – String class

        String – a collection of encoded data
            each String object has an encoding
            size method -> counts characters, not bytes (multibyte-aware)
            3 new enumerator methods
    "café".size
     => 4
    "café".bytesize
     => 5

    "café".each_byte.map{|byte| byte}
     => [99, 97, 102, 195, 169]

    "café".each_char.map{|char| char}
     => ["c", "a", "f", "é"]

    "café".each_codepoint.map{|codepoint| codepoint}
     => [99, 97, 102, 233]
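
The gap between size and bytesize matters whenever a limit is expressed in bytes (a database column, a protocol field). A minimal sketch with a hypothetical helper, truncate_bytes, that cuts on character boundaries:

```ruby
# Hypothetical helper: longest prefix of str that fits in max_bytes
# without cutting a multibyte character in half.
def truncate_bytes(str, max_bytes)
  out = "".dup.force_encoding(str.encoding)
  str.each_char do |char|
    break if out.bytesize + char.bytesize > max_bytes
    out << char
  end
  out
end

coffee = "caf\u00E9"              # "café": 4 characters, 5 bytes
p truncate_bytes(coffee, 5).size  # => 4
p truncate_bytes(coffee, 4)       # => "caf"
```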
Ruby 1.9 – String class (Transcoding)

                 Strings with different encodings can 'coexist' in the
                 same program, and can be merged when the encodings are compatible
                 New way to transcode: String#encode
                 (force_encoding only relabels the bytes, in place)
    latin_1_coffee = "café".encode('ISO-8859-1')
     => "caf\xE9"

    latin_1_coffee.bytesize
     => 4

    wrong_encoded_coffee = latin_1_coffee.force_encoding('UTF-8')
         => "caf\xE9"
    latin_1_coffee.encoding   # same object as wrong_encoded_coffee
         => #<Encoding:UTF-8>
    wrong_encoded_coffee.scan /./
    ArgumentError: invalid byte sequence in UTF-8
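
A short sketch contrasting the two operations: encode changes the bytes and keeps the text, force_encoding keeps the bytes and changes the label. Variable names are illustrative. It also shows how a mislabeled String can be detected and repaired when the real encoding is known:

```ruby
utf8   = "caf\u00E9"                # "café" in UTF-8
latin1 = utf8.encode("ISO-8859-1")  # transcoded: different bytes, same text

# dup first: force_encoding changes the receiver itself.
mislabeled = latin1.dup.force_encoding("UTF-8")
p mislabeled.valid_encoding?  # => false

# Repair by restoring the true label, then transcoding.
repaired = mislabeled.force_encoding("ISO-8859-1").encode("UTF-8")
p repaired == utf8            # => true

# Merging non-ASCII Strings with different encodings raises.
begin
  utf8 + latin1
rescue Encoding::CompatibilityError => e
  p e.class                   # => Encoding::CompatibilityError
end
```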
Ruby 1.9 - Internal and external encoding
            What about non-literal Strings (that come from I/O)?

         2. Encoding.default_external:
              default for external encoding
              derived from LANG on Unix/Linux
              derived from legacy system encoding on Windows

         3. Encoding.default_internal:
              default for internal encoding (≊ default external)
              by default undefined

      > cat show_encodings.rb
      open(__FILE__, "r:UTF-8:UTF-32") do |file|
        puts file.external_encoding.name
        puts file.internal_encoding.name
        file.each do |line|
          p [line.encoding.name, line[0..3]]
        end
      end

      > ruby show_encodings.rb
      UTF-8
      UTF-32
      ["UTF-32", "\uFEFF"]
      ["UTF-32", "\x00\x00\x00\x20"]
      ["UTF-32", "\x00\x00\x00\x20"]
      ["UTF-32", "\x00\x00\x00\x20"]
      ["UTF-32", "\x00\x00\x00\x20"]
      ["UTF-32", "\x00\x00\x00\x20"]
      ["UTF-32", "\x00\x00\x00\x65"]
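
The more common direction is a legacy file on disk and UTF-8 inside the program. A self-contained sketch, assuming a temporary file we create ourselves (the file name prefix is arbitrary):

```ruby
require "tempfile"

# Write "café\n" as Latin-1 bytes (0xE9 is é in ISO-8859-1).
file = Tempfile.new("latin1_sample")
file.binmode
file.write([99, 97, 102, 233, 10].pack("C*"))
file.close

# "r:EXTERNAL:INTERNAL": bytes on disk are Latin-1, Strings we get are UTF-8.
encodings = nil
line = nil
File.open(file.path, "r:ISO-8859-1:UTF-8") do |f|
  encodings = [f.external_encoding.name, f.internal_encoding.name]
  line = f.gets
end
file.unlink

p encodings           # => ["ISO-8859-1", "UTF-8"]
p line.encoding.name  # => "UTF-8"
p line.bytes.to_a     # => [99, 97, 102, 195, 169, 10]
```

The single byte 233 on disk arrives as the two UTF-8 bytes 195 and 169: the transcoding happened during the read.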
What you should remember

            Ruby 1.8 has limited (regexp-only) support for
            Unicode
              watch out for slices, sizes, reverse, etc.
              transcode with Iconv



            Ruby 1.9 is encoding-aware
              each source file has an Encoding
              each String has an Encoding
              IO: internal and external encoding
              New iterators on String
HTML/HTTP – declare encoding

            HTML/HTTP
              HTTP header
              Meta tags



    Content-Type: text/html; charset=ISO-8859-1 # HTTP Header



    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

    <meta charset="utf-8"/>

    <?xml version="1.0" encoding="ISO-8859-1"?>
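
On the receiving side, a declared charset is what tells us how to label incoming bytes. A minimal sketch; charset_from_content_type is a hypothetical helper, and a real HTTP client would also handle quoting rules and a missing or unknown charset:

```ruby
# Hypothetical helper: pull the charset parameter out of a Content-Type value.
# Returns nil when no charset is declared.
def charset_from_content_type(value)
  value[/charset\s*=\s*"?([^\s;"]+)/i, 1]
end

header = "text/html; charset=ISO-8859-1"
body   = [99, 97, 102, 233].pack("C*")  # raw bytes as they arrive, unlabeled

charset = charset_from_content_type(header)
p charset  # => "ISO-8859-1"

# Label the bytes with the declared encoding, then transcode to UTF-8.
text = body.force_encoding(Encoding.find(charset)).encode("UTF-8")
p text.codepoints.to_a  # => [99, 97, 102, 233]
```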
HTML – Encoding chars


                      Encoding types
                         directly in declared encoding
                            "é"
                         named char entities
                            "&eacute;"
                         numeric char entities
                            "&#233;"
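
Numeric entities are just the decimal code point wrapped in &# and ;, so they can be produced directly from a String. A small sketch with a hypothetical helper name:

```ruby
# Hypothetical helper: replace every non-ASCII character with &#N;,
# where N is its decimal Unicode code point.
def to_numeric_entities(text)
  text.gsub(/[^\x00-\x7F]/) { |char| format("&#%d;", char.ord) }
end

p to_numeric_entities("caf\u00E9")  # => "caf&#233;"
```

The output is pure ASCII, so it survives even when the declared page encoding is wrong.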




Conclusion


                      Use UTF-8

                      Document (declare) encodings

                      Write encoding-safe code




References


            James Gray’s Encodings series

            Joel Spolsky’s blog post about encodings

            Design and implementation of Ruby M17N

            Internationalization in Ruby 1.9




.end

                          Merci!
                        Thank you!
                        Mulţumesc
                        ありがとう




                           ?