4/30/2010                                   A guide to Unicode and Internationaliz…




   Unicode Primer for the Uninitiated
   Internationalization Articles                                                                             May 8th, 2008

   Among our friends and clients at Lingoport, we regularly see everything from confusion about Unicode to a complete
   lack of awareness of what it is. So for the less- or under-informed, perhaps this article will help. Unicode is a key
   underpinning for global software applications and websites, allowing them to support the world’s language scripts.
   So it’s a very important standard to be aware of, whether you’re in localization, an engineer or a business manager.


   Firstly, Unicode is a character set standard used for displaying and processing language data in computer
   applications. The Unicode character set aims to cover the entire world’s set of characters, including letters,
   numbers, currency signs, symbols and the like, and the standard defines a number of character encodings to make
   that all happen. Before your eyes glaze over, let me explain what character encoding means. You have to remember
   that for a computer, all information is represented in zeros and ones (i.e. binary values). So if you think of the
   letter A in the ASCII standard, in zeros and ones it would look like this: 1000001. That is, a 1, then five zeros,
   and a 1, for a total of 7 bits. The number that this pattern represents (65 in decimal) is called A’s code point,
   and the mapping between characters and the zeros and ones used to store them is called the character encoding. In
   the early days of computing, unless you did something very special, ASCII (7 bits per character) was how your data
   got managed. The problem is that ASCII’s 128 values don’t leave you enough room to represent extended characters,
   like accented letters and characters specific to non-English alphabets, such as you find in European languages. And
   you certainly can’t support the thousands of characters used to write Chinese, Korean and Japanese. European
   languages require at least an 8-bit (single-byte) encoding, while Chinese, Korean and Japanese require 16-bit
   (double-byte) or longer multi-byte encodings. One important note is that nearly all of these single- and double-byte
   encodings are supersets of 7-bit ASCII, which means that the code points for unaccented English letters, digits and
   basic punctuation stay the same regardless of the encoding.
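
   To make the bit arithmetic concrete, here is a minimal sketch in Python 3 (our choice for illustration; the
   article itself is language-neutral). It prints the code point and 7-bit pattern for the letter A and checks whether
   a piece of text fits in ASCII. The helper name fits_in_ascii is hypothetical.

```python
# Code point and 7-bit ASCII pattern for the letter A.
code_point = ord("A")               # the number assigned to the character
bits = format(code_point, "07b")    # the same number written as 7 binary digits
print(code_point, bits)


def fits_in_ascii(text: str) -> bool:
    """Hypothetical helper: True if every character has an ASCII code point (0-127)."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(fits_in_ascii("A"))        # plain English letter
print(fits_in_ascii("\u00e9"))   # e with acute accent: outside ASCII
```

   The accented letter fails because ASCII simply has no value assigned to it, which is the gap the encodings
   discussed in this article were created to fill.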

   The Bad Old Days

   In the early computing days, specific single- and double-byte character encodings were developed to support
   various languages. That was very bad, as it meant that software developers needed to build a version of their
   application for every language they wanted to support that used a different encoding. You’d have the Japanese
   version, the Western European language version, the English-only version and so on. You’d end up with a horde of
   individual software code bases, each needing its own testing, updating and ongoing maintenance and support,
   which is very expensive, and pretty near impossible for businesses to sustain without the various language versions
   seriously diverging over time. You don’t see this problem very often in newly developed applications, but there are
   plenty of holdovers. We typically see it when a new client has turned over their source code to a particular country
   partner or marketing agent that was responsible for adapting the code to multiple languages. The worst case I saw
   was in 2004, when a particular client, who I will leave unmentioned, had a legacy product with 18 separate language
   versions and no longer had any real idea how the functionality varied from language to language. That’s no way to
   grow a corporate empire!

   ISO Latin

   A single-byte character set that we often see in applications is ISO Latin 1, standardized as ISO-8859-1 and common
   on UNIX. Windows-1252 on Windows and MacRoman on, guess what, the Macintosh cover much the same repertoire, although
   their byte assignments are not identical. These character sets support the characters used in Western European
   languages such as French, Spanish, German, and U.K. English. Since each character requires only a single byte, they
   provide support for multiple languages while avoiding the work required to support either Unicode or a double-byte
   encoding. The trouble is that this still leaves out much of the world. For example, to support Eastern European
   languages you need to use a different character set, often referred to as Latin 2 (ISO-8859-2), which provides the
   characters that are uniquely needed for these languages. There are also separate character sets for Baltic
   languages, Turkish, Arabic, Hebrew, and on and on. When having to internationalize software for the first time,
   companies will sometimes start with just supporting ISO Latin 1 if it meets their immediate marketing requirements
   and deal with the more extensive work of supporting other languages later. The reason is that going beyond ISO Latin
   support will likely require major reworking of the encoding support in their database and in the functions, methods
   and classes within their source code, which means more time and more money, often cascading into later releases and
   foregone revenues. However, if the software company has truly global ambitions, they will need to take that plunge
   and provide Unicode support. I’ll argue that companies supporting global customers, even if they are not
   translating or localizing a bit of the interface, still need to support Unicode so they can process their
   customers’ global data.
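
   The limits of single-byte character sets can be illustrated with a short Python 3 sketch. It assumes that
   "Latin 1" and "Latin 2" correspond to Python’s iso-8859-1 and iso-8859-2 codecs; the helper can_encode and the
   sample city name are ours, not from any product.

```python
# The same byte value means a different character in each single-byte set.
raw = bytes([0xB3])
as_latin1 = raw.decode("iso-8859-1")   # SUPERSCRIPT THREE in Latin 1
as_latin2 = raw.decode("iso-8859-2")   # LATIN SMALL LETTER L WITH STROKE in Latin 2
print(repr(as_latin1), repr(as_latin2))


def can_encode(text: str, encoding: str) -> bool:
    """Hypothetical helper: True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


city = "\u0141\u00f3d\u017a"   # the Polish city name Lodz, with its diacritics
for name in ("iso-8859-1", "iso-8859-2", "utf-8"):
    print(name, can_encode(city, name))
```

   A Polish name fits in Latin 2 but not in Latin 1, and the byte 0xB3 changes meaning depending on which set you
   assume, which is why data tagged with the wrong single-byte encoding quietly turns into the wrong characters.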

   Unicode

   We come back to Unicode, which, as we mentioned above, is a character set created to enable support of any
   written language worldwide. Now you might find a language or two whose script still lacks Unicode support, but that
   is becoming extremely rare. For instance, as of this writing, Javanese, Loma, and Tai Viet are among the scripts not
   yet supported. Arcane until you need them, I suppose. I remember a few years ago when we were developing a
   multilingual site which needed support for Khmer and Armenian, and we were thankful that Unicode had just added
   their support a few months prior. If you have a marketing requirement for your software to support Japanese or
   Chinese, think Unicode. That’s because you will need to move to a double-byte encoding at the very least, and once
   you go to the trouble of doing that, you might as well support Unicode and get the added benefit of support for
   all languages.

   UTF-8

   Once you’ve chosen to support Unicode, you must decide on the specific character encoding you want to use,
   which will depend on the application requirements and technologies. UTF-8 is one of the commonly used character
   encodings defined within the Unicode Standard. It uses a single byte for each character unless it needs more, in
   which case it can expand up to 4 bytes. People refer to this as a variable-width encoding, since the width of the
   character in bytes varies depending upon the character. The advantage of this encoding is that all ASCII
   characters, which cover unaccented English, remain single bytes, saving data space. This is especially desirable
   for web content, since the underlying HTML markup remains in single-byte ASCII. In general, UNIX platforms are
   optimized for UTF-8. Concerning databases, where large amounts of data are integral to the application, a developer
   may choose UTF-8 to save space if most of the data in the database does not need translation and so can remain in
   English (which requires only a single byte per character in UTF-8). Note that not every database has supported
   UTF-8: Microsoft’s SQL Server, for example, offered no UTF-8 storage until SQL Server 2019.
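
   As a rough illustration of the variable width, the following Python 3 sketch counts the bytes UTF-8 needs for
   four sample characters. The samples are our own picks, chosen to land in the 1-, 2-, 3- and 4-byte ranges.

```python
# One sample character for each UTF-8 width.
samples = {
    "A": "LATIN CAPITAL LETTER A",
    "\u00e9": "LATIN SMALL LETTER E WITH ACUTE",
    "\u4e2d": "CJK ideograph U+4E2D",
    "\U0001F600": "emoji U+1F600",
}

# Number of bytes each character occupies once encoded as UTF-8.
widths = {ch: len(ch.encode("utf-8")) for ch in samples}
for ch, label in samples.items():
    print(f"{label}: {widths[ch]} byte(s)")

# ASCII-only markup is byte-for-byte identical in ASCII and UTF-8.
markup = "<p class='intro'>"
same_bytes = markup.encode("utf-8") == markup.encode("ascii")
print(same_bytes)
```

   English text and HTML tags cost exactly what they cost in ASCII, while other scripts pay for extra bytes only
   where they occur.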

   UTF-16

   UTF-16 is another widely adopted encoding within the Unicode standard. It assigns two bytes to each character
   whether you need them or not. So the letter A is 00000000 01000001, or 9 zeros, a one, followed by 5 zeros and a
   one. If a character needs more than 2 bytes, two 16-bit units are combined into a four-byte sequence (known as a
   surrogate pair); however, you must adapt your software to be capable of handling this four-byte combination. Java
   and .NET internally process strings (text and messages) as UTF-16.
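
   A short Python 3 sketch of both cases: the two-byte pattern for A, and a character outside the 16-bit range that
   UTF-16 stores as two 16-bit units. The big-endian codec utf-16-be is used here only so the bytes print in the
   same order as in the text above; the emoji is our own sample.

```python
# The letter A in UTF-16 (big-endian): two bytes, the first one all zeros.
a_bytes = "A".encode("utf-16-be")
a_bits = " ".join(format(b, "08b") for b in a_bytes)
print(a_bits)

# A character beyond the 16-bit range needs four bytes: two 16-bit units.
emoji = "\U0001F600"
emoji_bytes = emoji.encode("utf-16-be")
units = [
    int.from_bytes(emoji_bytes[i:i + 2], "big")
    for i in range(0, len(emoji_bytes), 2)
]
print(len(emoji_bytes), [hex(u) for u in units])
```

   Code that assumes "one character equals two bytes" will split such a pair in the middle, which is the kind of
   four-byte handling the paragraph above warns about.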

   For many applications, you can actually support multiple Unicode encodings so that, for example, your data is
   stored in your database as UTF-8 but is handled within your code as UTF-16, or vice versa. There are various reasons
   to do this, such as software limitations (different software components supporting different Unicode encodings) or
   storage or performance advantages. But whether that’s a good idea is one of those “it depends” kinds of questions.
   Implementation can be tricky, and clients pay us good money to solve this.
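
   Converting between the two encodings is lossless as long as each boundary declares the right one. The Python 3
   sketch below is only an illustration under that assumption: the variable names and the sample greeting are
   hypothetical, and a real application would do this through its database driver and string libraries.

```python
# Sample text mixing German and Chinese, written with escapes to keep the source ASCII.
text = "Gr\u00fc\u00dfe, \u4e16\u754c"

# What a UTF-8 database column might hold.
stored = text.encode("utf-8")

# Decode once at the storage boundary, then hand off to a UTF-16 consumer.
in_memory = stored.decode("utf-8")
for_utf16_component = in_memory.encode("utf-16-le")

# Going back the other way reproduces the stored bytes exactly.
round_trip = for_utf16_component.decode("utf-16-le").encode("utf-8")

print(len(stored), len(for_utf16_component), round_trip == stored)
```

   The byte counts differ between the two forms, so buffer and column sizes need to be planned per encoding rather
   than per character.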

   Microsoft’s SQL Server is a bit of a special case, in that it supports UCS-2, which is like UTF-16 but without the
   4-byte characters (only the 16-bit characters are supported).
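
   One way to reason about the UCS-2 limitation is to check whether text stays within the 16-bit range before
   sending it to such a system. This is a minimal Python 3 sketch; fits_in_ucs2 is a hypothetical helper, not part
   of any database API.

```python
def fits_in_ucs2(text: str) -> bool:
    """Hypothetical helper: True if every code point is at most U+FFFF (fits one 16-bit unit)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


bmp_text = "A\u4e2d"             # Latin letter plus a common CJK ideograph
beyond_bmp = "\U00020000"        # a CJK ideograph outside the 16-bit range

print(fits_in_ucs2(bmp_text))    # representable in UCS-2
print(fits_in_ucs2(beyond_bmp))  # needs UTF-16's four-byte form
```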

   GB 18030

   There’s also a special-case character set for software intended for sale in China (PRC), where support for it is
   required by the Chinese government. This character set is GB 18030, and it covers the entire Unicode repertoire,
   including both simplified and traditional Chinese characters. Similarly to UTF-16, the GB 18030 character encoding
   allows 4 bytes per character to support characters beyond Unicode’s “basic” (16-bit) range, and in practice
   supporting UTF-16 (or UTF-8) is considered an acceptable approach to supporting GB 18030 (the UCS-2 encoding just
   mentioned is not, however).
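
   Python 3 ships a gb18030 codec, so the variable width can be observed directly. This sketch uses three sample
   characters of our own choosing and checks that converting to GB 18030 and back loses nothing.

```python
# ASCII letter, common Chinese character, and a character beyond the 16-bit range.
samples = ["A", "\u4e2d", "\U0001F600"]

# Bytes needed for each character in GB 18030.
gb_widths = [len(ch.encode("gb18030")) for ch in samples]
print(gb_widths)

# Round trip: Unicode text -> GB 18030 bytes -> Unicode text.
text = "".join(samples)
restored = text.encode("gb18030").decode("gb18030")
print(restored == text)
```

   Because the conversion is lossless in both directions, an application can keep UTF-8 or UTF-16 internally and
   convert at the edges where GB 18030 is expected.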

   Now, all of this considered, a converse question might be: what happens when you try to make your application
   handle complex scripts that need Unicode, and the support isn’t there? Depending upon your system, you get
   anything from data or messages garbled into meaningless characters or weird square boxes, to an application crash
   that forces a restart. Not good.
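
   The failure modes just described can be reproduced in a few lines of Python 3. This is a sketch with a sample
   word of our own; the exact symptoms in a real application depend on its platform and libraries.

```python
original = "caf\u00e9"   # "cafe" with an acute accent on the e

# 1. Garbling: UTF-8 bytes read back as if they were ISO Latin 1.
utf8_bytes = original.encode("utf-8")
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)

# 2. Data loss: forcing the text into ASCII replaces the accented letter.
lossy = original.encode("ascii", errors="replace")
print(lossy)

# 3. Hard failure: Latin 1 bytes handed to a strict UTF-8 decoder raise an error.
try:
    original.encode("iso-8859-1").decode("utf-8")
    crashed = False
except UnicodeDecodeError:
    crashed = True
print(crashed)
```

   In an application that does not catch the third case, the exception is what turns into the crash and restart.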

   If your application supports Unicode, you are ready to take on the world.






Unicode Primer for the Uninitiated: A Guide to Unicode and Internationalization (i18n) ~ Character Encoding Software Internationalization

  • 1. 4/30/2010 A guide to Unicode and Internationaliz… Unicode Primer for the Uninitiated Internationalization Articles May 8th, 2008 Among our friends and clients at Lingoport, we regularly see ranges of confusion, to complete lack of awareness of what Unicode is. So for the less- or under-informed, perhaps this article will help. The advent of Unicode is a key underpinning for global software applications and websites so that they can support worldwide language scripts. So it’s a very important standard to be aware of, whether you’re in localization, an engineer or a business manager. Firstly, Unicode is a character set standard used for displaying and processing language data in computer applications. The Unicode character set is the entire world’s set of characters, including letters, numbers, currencies, symbols and the like, supporting a number of character encodings to make that all happen. Before your eyes glaze over, let me explain what character encoding means. You have to remember that for a computer, all information is represented in zeros and ones (i.e. binary values). So if you think of the letter A in the ASCII standard of zeros and ones it would look like this: 1000001. That is, a 1 then five zeros and a 1 to make a total of 7 bits. This binary representation for A is called A’s code point, and this mapping of zeros and ones to characters is called the character encoding. In the early days of computing, unless you did something very special, ASCII (7 bits per character) was how your data got managed. The problem is that ASCII doesn’t leave you enough zeros and ones to represent extended characters, like accents and characters specific to non-English alphabets, such as you find in European languages. You certainly can’t support the complex characters that make up Chinese, Korean and Japanese languages. These languages require 8-bit (single-byte) or 16-bit (double- byte) character encodings. 
One important note on all of these single- and double-byte encodings is that they are supersets of 7-bit ASCII, which means that English code points will always be the same regardless of the encoding.

The Bad Old Days

In the early computing days, specific single- and double-byte character encodings were developed to support various languages. That was very bad, as it meant that software developers needed to build a version of their application for every language they wanted to support that used a different encoding. You'd have the Japanese version, the Western European language version, the English-only version and so on. You'd end up with a horde of individual software code bases, each needing its own testing, updating and ongoing maintenance and support. That is very expensive, and pretty near impossible for businesses to sustain without the language versions seriously diverging from one another over time.

You don't see this problem very often in newly developed applications, but there are plenty of holdovers. We typically see it when a new client has turned over their source code to a country partner or marketing agent that was responsible for adapting the code to multiple languages. The worst case I saw was in 2004, when a particular client, who I will leave unmentioned, had a legacy product with 18 separate language versions and no longer had any real idea how much the functionality varied from language to language. That's no way to grow a corporate empire!

lingoport.com/unicode-primer-for-the-… 1/4

  • 2.

ISO Latin

A single-byte character set that we often see in applications is ISO Latin 1, which appears in several closely related encodings: ISO-8859-1 on UNIX, Windows-1252 on Windows and MacRoman on guess what platform. (These cover largely the same characters but are not byte-for-byte identical.) This character set supports characters used in Western European languages such as French, Spanish, German, and U.K. English. Since each character requires only a single byte, it provides support for multiple languages while avoiding the work required to support either Unicode or a double-byte encoding.

Trouble is, that still leaves out much of the world. For example, to support Eastern European languages you need to use a different character set, often referred to as Latin 2, which provides the characters that are uniquely needed for these languages. There are also separate character sets for Baltic languages, Turkish, Arabic, Hebrew, and on and on.

When internationalizing software for the first time, companies will sometimes start by supporting just ISO Latin 1 if it meets their immediate marketing requirements, and deal with the more extensive work of supporting other languages later. The reason is that going beyond ISO Latin support is likely to require major reworking of the encoding support in the database and in the functions, methods and classes within the source code, which means more time and more money, often cascading into later releases and foregone revenues. However, if the software company has truly global ambitions, it will need to take that plunge and provide Unicode support. I'll argue that if companies are supporting global customers, even if they are not translating or localizing the interface at all, they still need to support Unicode so they can process their customers' global data.
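The limits of the single-byte Latin character sets can be seen in a short Python sketch. The helper names `decode_byte` and `fits` are invented for this illustration, and it assumes Python's built-in `iso-8859-1` (Latin 1) and `iso-8859-2` (Latin 2) codecs. The same byte value means different characters under the two sets, while the ASCII range stays the same.

```python
def decode_byte(value: int, encoding: str) -> str:
    """Decode a single byte under the given single-byte encoding."""
    return bytes([value]).decode(encoding)


def fits(text: str, encoding: str) -> bool:
    """Report whether every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# 0x41 is in the ASCII range; 0xB1 and 0xF5 are in the upper, language-specific half.
for value in (0x41, 0xB1, 0xF5):
    latin1 = decode_byte(value, "iso-8859-1")
    latin2 = decode_byte(value, "iso-8859-2")
    print(f"byte 0x{value:02X}: Latin 1 -> {latin1}   Latin 2 -> {latin2}")

# A Polish place name fits in Latin 2 but not in Latin 1.
print(fits("Łódź", "iso-8859-2"))  # True
print(fits("Łódź", "iso-8859-1"))  # False
```

This is why data written under one single-byte character set turns into the wrong letters when it is read under another.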
Unicode

We come back to Unicode, which, as mentioned above, is a character set created to enable support of any written language worldwide. Now you might find a language or two whose script lacks Unicode support, but that is becoming extremely rare. For instance, as of this writing Javanese, Loma, and Tai Viet are among the scripts not yet supported. Arcane until you need them, I suppose. I remember a few years ago when we were developing a multilingual site which needed support for Khmer and Armenian, and we were thankful that Unicode had added their support a few months prior.

If you have a marketing requirement for your software to support Japanese or Chinese, think Unicode. That's because you will need to move to a double-byte encoding at the very least, and as soon as you go through the trouble to do that, you might as well support Unicode and get the added benefit of support for all languages.

UTF-8

Once you've chosen to support Unicode, you must decide on the specific character encoding you want to use, which will depend on the application requirements and technologies. UTF-8 is one of the commonly used character encodings defined within the Unicode Standard. It uses a single byte for each character unless the character needs more, in which case it can expand up to 4 bytes. People sometimes refer to this as a variable-width encoding, since the width of the character in bytes varies depending upon the character. The advantage of this character encoding is that all English (ASCII) characters remain single bytes, saving data space. This is especially desirable for web content, since the underlying HTML markup will remain in single-byte ASCII. In general, UNIX platforms are optimized for UTF-8 character encoding.
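The variable width of UTF-8 described above can be checked with a few lines of Python. This is only a sketch: the helper name `utf8_length` and the sample characters are chosen for illustration, and `str.encode("utf-8")` is the standard library call.

```python
def utf8_length(char: str) -> int:
    """Return how many bytes UTF-8 uses for the given character."""
    return len(char.encode("utf-8"))


# An ASCII letter, an accented letter, a Chinese character and an emoji.
for char in ("A", "é", "中", "😀"):
    print(f"{char}: {utf8_length(char)} byte(s)")

# ASCII markup stays at one byte per character; only the Japanese text grows.
snippet = "<p>日本語</p>"
print(len(snippet), "characters,", len(snippet.encode("utf-8")), "bytes in UTF-8")
```

The four sample characters take 1, 2, 3 and 4 bytes respectively, and the HTML snippet needs 16 bytes for its 10 characters because only the three Japanese characters take more than one byte each.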
Concerning databases, where large amounts of data are integral to the application, a developer may choose UTF-8 to save space if most of the data does not need translation and so can remain in English (which requires only a single byte per character in UTF-8). Note that some databases do not support UTF-8, Microsoft's SQL Server being one example as of this writing.

UTF-16

UTF-16 is another widely adopted encoding within the Unicode standard. It assigns at least two bytes to each character whether the character needs them or not. So the letter A is 00000000 01000001, or 9 zeros, a one, followed by 5 zeros and a one. If a character needs more than 2 bytes, two 2-byte units are combined into a 4-byte sequence (a surrogate pair); however, you must adapt your software to be capable of handling this four-byte combination.

  • 3.

Java and .Net internally process strings (text and messages) as UTF-16.

For many applications, you can actually support multiple Unicode encodings so that, for example, your data is stored in your database as UTF-8 but is handled within your code as UTF-16, or vice versa. There are various reasons to do this, such as software limitations (different software components supporting different Unicode encodings), storage or performance advantages, etc. But whether that's a good idea is one of those "it depends" kinds of questions. Implementing it can be tricky, and clients pay us good money to solve this.

Microsoft's SQL Server is a bit of a special case, in that it supports UCS-2, which is like UTF-16 but without the 4-byte characters (only the 16-bit characters are supported).

GB 18030

There's also a special-case character set for software intended for sale in China (PRC), where support for it is required by the Chinese Government. This character set is GB 18030, and it covers the full range of Unicode, supporting both simplified and traditional Chinese. Similarly to UTF-16, the GB 18030 character encoding allows up to 4 bytes per character to support characters beyond Unicode's "basic" (16-bit) range, and in practice supporting UTF-16 (or UTF-8) is considered an acceptable approach to supporting GB 18030 (the UCS-2 encoding just mentioned is not, however).

Now, all of this considered, a converse question might be: what happens when you try to make your application support complex scripts that need Unicode, and the support isn't there? Depending upon your system, you get anything from meaningless gibberish, where data or messages become corrupted characters or weird square boxes, to an application crash that forces a restart. Not good.

If your application supports Unicode, you are ready to take on the world.
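The UTF-16 and GB 18030 points above can be illustrated with a short Python sketch. The helper names `utf16_units` and `byte_length` are invented for this example, and it assumes Python's built-in `utf-16-be` and `gb18030` codecs. It shows the 16-bit pattern for A, the two-unit (4-byte) form of a character outside the 16-bit range, and the kind of gibberish that appears when bytes are read with the wrong encoding.

```python
def utf16_units(char: str) -> list:
    """Return the 16-bit units of a character's UTF-16 form as binary strings."""
    data = char.encode("utf-16-be")  # big-endian, no byte order mark
    return [
        format(int.from_bytes(data[i:i + 2], "big"), "016b")
        for i in range(0, len(data), 2)
    ]


def byte_length(text: str, encoding: str) -> int:
    """Return the number of bytes text occupies in the given encoding."""
    return len(text.encode(encoding))


# The letter A fits in one 16-bit unit: 9 zeros, a one, 5 zeros and a one.
print(utf16_units("A"))

# A character beyond the 16-bit range needs two units, i.e. 4 bytes.
print(utf16_units("😀"))

# GB 18030 also grows from 1 to 4 bytes depending on the character.
for char in ("A", "中", "😀"):
    print(char, byte_length(char, "gb18030"), "byte(s) in GB 18030")

# Reading UTF-8 bytes as if they were Latin 1 produces garbled text.
print("é".encode("utf-8").decode("iso-8859-1"))
```

Software that assumes every character is exactly one 16-bit unit, as UCS-2 does, will mishandle the second case, which is the adaptation the UTF-16 section refers to.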