Binary-coded decimal
In computing and electronic systems, binary-coded decimal (BCD) is an encoding for decimal numbers in
which each digit is represented by its own binary sequence. Its main virtue is that it allows easy conversion to
decimal digits for printing or display and faster decimal calculations. Its drawbacks are the increased complexity
of circuits needed to implement mathematical operations and a relatively inefficient encoding—it occupies more
space than a pure binary representation.

To BCD-encode a decimal number using the common encoding, each decimal digit is stored in a four-bit nibble.

Decimal:    0         1       2      3       4       5       6      7       8       9
BCD:     0000      0001    0010   0011    0100    0101    0110   0111    1000    1001
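
The digit-by-digit encoding in the table above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `to_bcd` and `from_bcd` are hypothetical, and the space-separated string output is an assumption chosen for readability rather than a storage format.

```python
def to_bcd(number: int) -> str:
    """Encode a non-negative integer as BCD, one 4-bit nibble per decimal digit."""
    if number < 0:
        raise ValueError("this sketch handles non-negative integers only")
    return " ".join(format(int(digit), "04b") for digit in str(number))


def from_bcd(nibbles: str) -> int:
    """Decode a space-separated string of 4-bit nibbles back to an integer."""
    digits = []
    for nibble in nibbles.split():
        value = int(nibble, 2)
        if value > 9:
            # 1010..1111 do not correspond to any decimal digit
            raise ValueError(f"{nibble} is not a valid BCD digit")
        digits.append(str(value))
    return int("".join(digits))


print(to_bcd(1995))        # 0001 1001 1001 0101  (16 bits)
print(format(1995, "b"))   # 11111001011          (11 bits in pure binary)
```

The two printed lines illustrate the space overhead mentioned above: the same value needs 16 bits in BCD but only 11 bits in pure binary.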




Extended Binary Coded Decimal Interchange Code
Extended Binary Coded Decimal Interchange Code (EBCDIC) is an 8-bit character encoding (code page)
used on IBM mainframe operating systems such as z/OS, OS/390, VM and VSE, as well as IBM midrange
computer operating systems such as OS/400 and i5/OS (see also Binary Coded Decimal). It is also employed on
various non-IBM platforms such as Fujitsu-Siemens' BS2000/OSD, HP MPE/iX, and Unisys MCP. It descended
from punched cards and the corresponding six bit binary-coded decimal code that most of IBM's computer
peripherals of the late 1950s and early 1960s used.


ASCII
American Standard Code for Information Interchange (ASCII), pronounced /ˈæski/,[1] is a character encoding
based on the English alphabet. ASCII codes represent text in computers, communications equipment, and other
devices that work with text. Most modern character encodings—which support many more characters than did the
original—have a historical basis in ASCII.

Historically, ASCII developed from telegraphic codes and its first commercial use was as a seven-bit teleprinter
code promoted by Bell data services. Work on ASCII formally began October 6, 1960 with the first meeting of
the ASA X3.2 subcommittee. The first edition of the standard was published in 1963,[2][3] a major revision in
1967,[4] and the most recent update in 1986.[5] Compared to earlier telegraph codes, the proposed Bell code and
ASCII were both ordered for more convenient sorting (i.e., alphabetization) of lists, and added features for
devices other than teleprinters. Some ASCII features, including the "ESCape sequence",[6] were due to Robert
Bemer.

ASCII includes definitions for 128 characters: 33 are non-printing, mostly obsolete control characters that affect
how text is processed; 94 are printable characters, and the space is considered an invisible graphic.[7] The ASCII
character encoding[8]—or a compatible extension—is used on nearly all common computers, especially personal
computers and workstations.



The Operation of Combinational Logic Systems
We have looked extensively at the combinations of logic gates, and how we can make circuits with a
single gate as a unit. What use is this, other than an academic exercise? Logic gates are used
extensively in calculators and computers. Logic gates can be used to add binary numbers. Computers
are adding machines; they do subtraction by a process of complementary addition, while they multiply
by serial addition.

 The circuits they use are based on the half-adder. This copes with the rules for binary addition
which are:
                   0 + 0 = 0

                   0 + 1 = 1

                   1 + 0 = 1

                   1 + 1 = 0 carry 1

                 (1 + 1 + 1 = 1 carry 1)

 The circuit has two outputs, a sum and a carry. The sum is the output of an exclusive OR gate (a plain
OR gate would give 1 + 1 = 1, which is wrong), while the carry output is that of an AND gate. The Boolean
algebra is:

                         sum = A ⊕ B
                        carry = A.B

 [Figure: half-adder circuit, with an exclusive OR gate producing the sum and an AND gate producing the carry]
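
The half-adder's behaviour can also be checked in software. The sketch below models the two gates with Python's bitwise operators. The `full_adder` function is an assumption added for illustration (the text only gives the rule 1 + 1 + 1 = 1 carry 1); it uses the common construction of two half-adders plus an OR gate.

```python
def half_adder(a: int, b: int) -> tuple[int, int]:
    """Return (sum, carry) for two one-bit inputs: XOR gives the sum, AND the carry."""
    return a ^ b, a & b


def full_adder(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """Add three bits by chaining two half-adders; an OR gate merges the carries."""
    partial_sum, carry1 = half_adder(a, b)
    total, carry2 = half_adder(partial_sum, carry_in)
    return total, carry1 | carry2


# Reproduce the rules for binary addition listed above.
for a in (0, 1):
    for b in (0, 1):
        s, c = half_adder(a, b)
        print(f"{a} + {b} = {s} carry {c}")

print(full_adder(1, 1, 1))  # (1, 1), i.e. 1 + 1 + 1 = 1 carry 1
```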
Duality Principle
     • Duality principle – every valid Boolean identity remains
       valid when the operators and the identity elements
       are interchanged:
            +  ↔  .
            1  ↔  0
     • Example: given the expression
            a+(b.c)=(a+b).(a+c)
       its dual is
            a.(b+c)=(a.b)+(a.c)


                              MOHD. YAMANI IDRIS/                    16
                           NOORZAILY MOHAMED NOOR




     • The duality principle gives theorems on a “buy one, get one
       free” basis: you only need to prove one theorem, and its
       dual follows for free.
     • If (x+y+z)’=x’.y’.z’ holds, then its dual
       (x.y.z)’=x’+y’+z’ also holds.
     • If x+1=1 holds, then its dual x.0=0 also holds.
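
The identities on these slides can be checked by brute force over every combination of inputs. The sketch below is illustrative only: `holds_for_all_inputs` and `NOT` are hypothetical helper names, and exhaustive truth-table checking is one possible way to confirm an identity, not a proof of the duality principle itself.

```python
from itertools import product


def holds_for_all_inputs(lhs, rhs, arity: int) -> bool:
    """Return True if two Boolean functions agree on every combination of 0/1 inputs."""
    return all(lhs(*bits) == rhs(*bits) for bits in product((0, 1), repeat=arity))


def NOT(x: int) -> int:
    return 1 - x


# (x+y+z)' = x'.y'.z' and its dual (x.y.z)' = x'+y'+z'
theorem = holds_for_all_inputs(
    lambda x, y, z: NOT(x | y | z),
    lambda x, y, z: NOT(x) & NOT(y) & NOT(z),
    arity=3,
)
dual = holds_for_all_inputs(
    lambda x, y, z: NOT(x & y & z),
    lambda x, y, z: NOT(x) | NOT(y) | NOT(z),
    arity=3,
)

# x+1 = 1 and its dual x.0 = 0 (operators and constants both swapped)
or_with_one = holds_for_all_inputs(lambda x: x | 1, lambda x: 1, arity=1)
and_with_zero = holds_for_all_inputs(lambda x: x & 0, lambda x: 0, arity=1)

print(theorem, dual, or_with_one, and_with_zero)  # True True True True
```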








Unicode
In computing, Unicode is an industry standard allowing computers to consistently represent and manipulate
text expressed in most of the world's writing systems. Developed in tandem with the Universal Character Set
standard and published in book form as The Unicode Standard, Unicode consists of a repertoire of more
than 100,000 characters, a set of code charts for visual reference, an encoding methodology and set of
standard character encodings, an enumeration of character properties such as upper and lower case, a set of
reference data computer files, and a number of related items, such as character properties, rules for
normalization, decomposition, collation, rendering and bidirectional display order (for the correct display of
text containing both right-to-left scripts, such as Arabic or Hebrew, and left-to-right scripts).[1]

The Unicode Consortium, the non-profit organization that coordinates Unicode's development, has the
ambitious goal of eventually replacing existing character encoding schemes with Unicode and its standard
Unicode Transformation Format (UTF) schemes, as many of the existing schemes are limited in size and scope
and are incompatible with multilingual environments.
Unicode's success at unifying character sets has led to its widespread and predominant use in the
internationalization and localization of computer software. The standard has been implemented in many recent
technologies, including XML, the Java programming language, the Microsoft .NET Framework and modern
operating systems.

Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8
(which uses 1 byte for all ASCII characters, which have the same code values as in the standard ASCII encoding,
and up to 4 bytes for other characters), the now-obsolete UCS-2 (which uses 2 bytes for all characters, but does
not include every character in the Unicode standard), and UTF-16 (which extends UCS-2, using 4 bytes to encode
characters missing from UCS-2).
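
The byte counts described above can be observed with Python's built-in codecs. This is a small sketch; the four sample characters are arbitrary choices, and `utf-16-be` is used only so that no byte order mark is added to the output.

```python
# U+0041 (ASCII), U+00E9, U+20AC, and U+1F600 (outside the range UCS-2 can express)
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
utf16_lengths = [len(ch.encode("utf-16-be")) for ch in samples]

for ch, n8, n16 in zip(samples, utf8_lengths, utf16_lengths):
    print(f"U+{ord(ch):04X}  UTF-8: {n8} bytes  UTF-16: {n16} bytes")

# ASCII text is byte-for-byte identical in UTF-8.
print("plain text".encode("ascii") == "plain text".encode("utf-8"))  # True
```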



Unicode Transformation Format and Universal Character Set

Unicode defines two mapping methods: the Unicode Transformation Format (UTF) encodings, and the Universal
Character Set (UCS) encodings. An encoding maps (possibly a subset of) the range of Unicode code points to
sequences of values in some fixed-size range, termed code values. The numbers in the names of the encodings
indicate the number of bits in one code value (for UTF encodings) or the number of bytes per code value (for
UCS encodings). UTF-8 and UTF-16 are probably the most commonly used encodings. UCS-2 is an obsolete
subset of UTF-16; UCS-4 and UTF-32 are functionally equivalent.

UTF encodings include:

      UTF-1 — a retired predecessor of UTF-8, maximizes compatibility with ISO 2022, no longer part of The
       Unicode Standard
      UTF-7 — a relatively unpopular 7-bit encoding, often considered obsolete (not part of The Unicode
       Standard but rather an RFC)
      UTF-8 — an 8-bit, variable-width encoding, which maximizes compatibility with ASCII.
      UTF-EBCDIC — an 8-bit variable-width encoding, which maximizes compatibility with EBCDIC. (not
       part of The Unicode Standard)
      UTF-16 — a 16-bit, variable-width encoding
      UTF-32 — a 32-bit, fixed-width encoding

UTF-8 uses one to four bytes per code point and, being compact for Latin scripts and ASCII-compatible, provides
the de facto standard encoding for interchange of Unicode text. It is also used by most recent Linux distributions
as a direct replacement for legacy encodings in general text handling.

The UCS-2 and UTF-16 encodings specify the Unicode Byte Order Mark (BOM) for use at the beginnings of text
files, which may be used for byte ordering detection (or byte endianness detection). Some software developers
have adopted it for other encodings, including UTF-8, which does not need an indication of byte order. In this
case it attempts to mark the file as containing Unicode text. The BOM, code point U+FEFF, has the important
property of unambiguity on byte reorder, regardless of the Unicode encoding used; U+FFFE (the result of byte-
swapping U+FEFF) does not equate to a legal character, and U+FEFF in other places, other than the beginning of
text, conveys the zero-width no-break space (a character with no appearance and no effect other than preventing
the formation of ligatures). Also, the units FE and FF never appear in UTF-8. The same character converted to
UTF-8 becomes the byte sequence EF BB BF.
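
The byte patterns mentioned above can be reproduced with Python's standard `codecs` module. The `detect_bom` helper is a hypothetical name introduced for this sketch; it only inspects the leading bytes and makes no attempt to guess the encoding of data that has no BOM.

```python
import codecs
from typing import Optional

bom = "\ufeff"
print(bom.encode("utf-16-be").hex())  # feff
print(bom.encode("utf-16-le").hex())  # fffe
print(bom.encode("utf-8").hex())      # efbbbf


def detect_bom(data: bytes) -> Optional[str]:
    """Guess a codec name from a leading BOM; return None if no BOM is present."""
    # UTF-32 LE is tested before UTF-16 LE because its BOM also starts with FF FE.
    candidates = (
        ("utf-32-le", codecs.BOM_UTF32_LE),
        ("utf-32-be", codecs.BOM_UTF32_BE),
        ("utf-8-sig", codecs.BOM_UTF8),
        ("utf-16-le", codecs.BOM_UTF16_LE),
        ("utf-16-be", codecs.BOM_UTF16_BE),
    )
    for name, mark in candidates:
        if data.startswith(mark):
            return name
    return None


print(detect_bom(b"\xef\xbb\xbfhello"))  # utf-8-sig
print(detect_bom(b"hello"))              # None
```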

In UTF-32 and UCS-4, one 32-bit code value serves as a fairly direct representation of any character's code point
(although the endianness, which varies across different platforms, affects how the code value actually manifests
as an octet sequence). In the other cases, each code point may be represented by a variable number of code
values. UTF-32 is widely used as an internal representation of text in programs (as opposed to stored or transmitted
text), since every Unix operating system which uses the gcc compilers to generate software uses it as the standard
"wide character" encoding. Recent versions of the Python programming language (beginning with 2.2) may also
be configured to use UTF-32 as the representation for unicode strings, effectively disseminating such encoding in
high-level coded software.

Punycode, another encoding form, enables the encoding of Unicode strings into the limited character set
supported by the ASCII-based Domain Name System. The encoding is used as part of IDNA, which is a system
enabling the use of Internationalized Domain Names in all scripts that are supported by Unicode. Earlier and now
historical proposals include UTF-5 and UTF-6.
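
As a brief illustration, Python ships with `punycode` and `idna` codecs in its standard library. The sample label is an arbitrary choice, and note that the built-in `idna` codec follows the older IDNA 2003 rules, so this is a sketch rather than a reference for current registry behaviour.

```python
label = "m\u00fcnchen"  # "münchen", a label containing one non-ASCII letter

# Raw Punycode: ASCII letters first, then a delimiter and the encoded extras.
print(label.encode("punycode"))          # b'mnchen-3ya'

# IDNA adds the "xn--" prefix to each label that needed encoding.
encoded_domain = (label + ".de").encode("idna")
print(encoded_domain)                    # b'xn--mnchen-3ya.de'

# Decoding restores the original Unicode domain name.
print(encoded_domain.decode("idna") == label + ".de")  # True
```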

GB18030 is another encoding form for Unicode, from the Standardization Administration of China. It is the
official character set of the People's Republic of China (PRC). BOCU-1 and SCSU are Unicode compression
schemes. The April Fools' Day RFC of 2005 specified two parody UTF encodings, UTF-9 and UTF-18.




Mapping codepoints to Unicode encoding forms
Peter Constable, 2001-06-13


Note:


This is an Appendix to “Understanding Unicode™”.

See also A review of characters with compatibility decompositions.

In Section 4 of “Understanding Unicode™”, we examined each of the three character encoding forms defined
within Unicode. This appendix describes in detail the mappings from Unicode codepoints to the code unit
sequences used in each encoding form.

In this description, the mapping will be expressed in alternate forms, one of which is a mapping of bits between
the binary representation of a Unicode scalar value and the binary representation of a code unit. Even though a
coded character set encodes characters in terms of numerical values that have no specific computer representation
or data type associated with them, for purposes of describing this mapping, we are considering codepoints in the
Unicode codespace to have a width of 21 bits. This is the number of bits required for binary representation of the
entire numerical range of Unicode scalar values, 0x0 to 0x10FFFF.

1 UTF-32
The UTF-32 encoding form was formally incorporated into Unicode as part of TUS 3.1. The definitions for
UTF-32 are specified in TUS 3.1 and in UAX#19 (Davis 2001). The mapping for UTF-32 is, essentially, the identity
mapping: the 32-bit code unit used to encode a codepoint has the same integer value as the codepoint itself. Thus
if U represents the Unicode scalar value for a character and C represents the value of the 32-bit code unit then:

U=C

The mapping can also be expressed in terms of the relationships between bits in the binary representations of the
Unicode scalar values and the 32-bit code units, as shown in Table 1.

    Codepoint range                         Unicode scalar value (binary) Code units (binary)

     U+0000..U+D7FF, U+E000..U+10FFFF       xxxxxxxxxxxxxxxxxxxxx         00000000000xxxxxxxxxxxxxxxxxxxxx
Table 1 UTF-32 USV to code unit mapping
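
The identity mapping can be confirmed with a short Python check. This sketch assumes big-endian byte order purely so that the four bytes can be read back as a single integer; the sample codepoint U+10011 is the one used later in Table 5.

```python
ch = "\U00010011"                      # a supplementary-plane character
encoded = ch.encode("utf-32-be")       # exactly one 32-bit code unit, big-endian
code_unit = int.from_bytes(encoded, "big")

print(encoded.hex())                   # 00010011
print(code_unit == ord(ch))            # True: U = C
```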


2 UTF-16
The UTF-16 encoding form was formally incorporated into Unicode as part of TUS 2.0. The current definitions for UTF-16 are specified
in TUS 3.0.1


For a supplementary-plane character encoded as a surrogate pair, let CH and CL be the values of the high and low
surrogate code units and U the Unicode scalar value. Then:

U = (CH – 0xD800) * 0x400 + (CL – 0xDC00) + 0x10000

Likewise, determining the high and low surrogate values for a given Unicode scalar value is fairly
straightforward. Assuming the variables CH, CL and U as above, and that U is in the range U+10000..U+10FFFF,

CH = (U – 0x10000) div 0x400 + 0xD800

CL = (U – 0x10000) mod 0x400 + 0xDC00

where “div” represents integer division (returns only integer portion, rounded down), and “mod” represents the
modulo operator.
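
The surrogate arithmetic can be written out directly in Python and compared against the built-in UTF-16 codec. The function names `to_surrogates` and `from_surrogates` are hypothetical, chosen for this sketch.

```python
def to_surrogates(u: int) -> tuple[int, int]:
    """Split a supplementary-plane scalar value into (high, low) surrogate code units."""
    if not 0x10000 <= u <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as surrogate pairs")
    offset = u - 0x10000
    return offset // 0x400 + 0xD800, offset % 0x400 + 0xDC00


def from_surrogates(ch: int, cl: int) -> int:
    """Recombine a (high, low) surrogate pair into the scalar value."""
    return (ch - 0xD800) * 0x400 + (cl - 0xDC00) + 0x10000


high, low = to_surrogates(0x10011)
print(f"{high:04X} {low:04X}")               # D800 DC11
print(hex(from_surrogates(high, low)))       # 0x10011

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
expected = chr(0x10011).encode("utf-16-be")
print(expected == high.to_bytes(2, "big") + low.to_bytes(2, "big"))  # True
```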

Expressing the mapping in terms of a mapping of bits between the binary representations of scalar values and
code units, the UTF-16 mapping is as shown in Table 2:

   Codepoint range                        Unicode scalar value (binary) Code units (binary)

    U+0000..U+D7FF, U+E000..U+FFFF        00000xxxxxxxxxxxxxxxx        xxxxxxxxxxxxxxxx
    U+10000..U+10FFFF                     uuuuuxxxxxxyyyyyyyyyy        110110wwwwxxxxxx 110111yyyyyyyyyy
                                                                       (where uuuuu = wwww + 1)
Table 2 UTF-16 USV to code unit mapping


3 UTF-8
The UTF-8 encoding form was formally incorporated into Unicode as part of TUS 2.0. The current definitions for
UTF-8 are specified in TUS 3.1.2 As with the other encoding forms, calculating a Unicode scalar value from the
8-bit code units in a UTF-8 sequence is a matter of simple arithmetic. In this case, however, the calculation
depends upon the number of bytes in the sequence. Similarly, the calculation of code units from a scalar value
must be expressed differently for different ranges of scalar values.

Let us consider first the relationship between bits in the binary representation of codepoints and code units. This
is shown for UTF-8 in Table 3:

   Codepoint range                          Scalar value (binary)        Byte 1       Byte 2     Byte 3     Byte 4

    U+0000..U+007F                          00000000000000xxxxxxx        0xxxxxxx
    U+0080..U+07FF                          0000000000yyyyyxxxxxx        110yyyyy     10xxxxxx
    U+0800..U+D7FF, U+E000..U+FFFF          00000zzzzyyyyyyxxxxxx        1110zzzz     10yyyyyy   10xxxxxx
    U+10000..U+10FFFF                       uuuzzzzzzyyyyyyxxxxxx        11110uuu     10zzzzzz   10yyyyyy   10xxxxxx
Table 3 UTF-8 USV to code unit mapping

Note


There is a slight difference between Unicode and ISO/IEC 10646 in how they define UTF-8 since Unicode limits
it to the roughly one million characters possible in Unicode’s codespace, while for the ISO/IEC standard, it can
access the entire 31-bit codespace. For all practical purposes, this difference is irrelevant since the ISO/IEC
codespace is effectively limited to match that of Unicode, but you may encounter differing descriptions on
occasion.

As mentioned in Section 4.2 of “Understanding Unicode™”, UTF-8 byte sequences have certain interesting
properties. These can be seen from the table above. Firstly, note the high-order bits in non-initial bytes as opposed
to sequence-initial bytes. By looking at the first two bits, you can immediately determine whether a code unit is
an initial byte in a sequence or is a following byte. Secondly, by looking at the number of non-zero high-order
bits of the first byte in the sequence, you can immediately tell how long the sequence is: if no high-order bits are
set to one, then the sequence contains exactly one byte. Otherwise, the number of non-zero high-order bits is
equal to the total number of bytes in the sequence.

Table 3 also reveals the other interesting characteristic of UTF-8 that was described in Section 4.2 of
“Understanding Unicode™”. Note that characters in the range U+0000..U+007F are represented using a single
byte. The characters in this range match ASCII codepoint for codepoint. Thus, any data encoded in ASCII is
automatically also encoded in UTF-8.
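
The bit patterns described above make it possible to classify any byte of a UTF-8 stream in isolation. The sketch below does so with bit masks; `classify_byte` is a hypothetical helper name, and the returned labels are informal descriptions rather than terms from the standard.

```python
def classify_byte(byte: int) -> str:
    """Classify one UTF-8 code unit by its high-order bits (see Table 3)."""
    if byte & 0b1000_0000 == 0b0000_0000:
        return "single byte (ASCII)"
    if byte & 0b1100_0000 == 0b1000_0000:
        return "following byte"
    if byte & 0b1110_0000 == 0b1100_0000:
        return "initial byte of 2"
    if byte & 0b1111_0000 == 0b1110_0000:
        return "initial byte of 3"
    if byte & 0b1111_1000 == 0b1111_0000:
        return "initial byte of 4"
    return "not used in UTF-8"


# "a" followed by U+20AC, which takes three bytes: 61 E2 82 AC
labels = [classify_byte(b) for b in "a\u20ac".encode("utf-8")]
for b, label in zip("a\u20ac".encode("utf-8"), labels):
    print(f"{b:08b}  {label}")
```

Because every byte identifies its own role, a decoder that starts in the middle of a stream can skip following bytes until it reaches the next initial byte.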

Having seen how the bits compare, let us consider how code units can be calculated from scalar values, and vice
versa. If U represents the value of a Unicode scalar value and C1, C2, C3 and C4 represent bytes in a UTF-8 byte
sequence (in order), then the value of a Unicode scalar value U can be calculated as follows:

If a sequence has one byte, then

U = C1
Else if a sequence has two bytes, then

U = (C1 – 192) * 64 + C2 – 128

Else if a sequence has three bytes, then

U = (C1 – 224) * 4,096 + (C2 – 128) * 64 + C3 – 128

Else

U = (C1 – 240) * 262,144 + (C2 – 128) * 4,096 + (C3 – 128) * 64 + C4 – 128

End if

Going the other way, given a Unicode scalar value U, then the UTF-8 byte sequence can be calculated as follows:

If U <= U+007F, then

C1 = U

Else if U+0080 <= U <= U+07FF, then

C1 = U div 64 + 192

C2 = U mod 64 + 128

Else if U+0800 <= U <= U+D7FF, or if U+E000 <= U <= U+FFFF, then

C1 = U div 4,096 + 224

C2 = (U mod 4,096) div 64 + 128

C3 = U mod 64 + 128

Else

C1 = U div 262,144 + 240

C2 = (U mod 262,144) div 4,096 + 128

C3 = (U mod 4,096) div 64 + 128

C4 = U mod 64 + 128

End if

where “div” represents integer division (returns only integer portion, rounded down), and “mod” represents the
modulo operator.

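
The two calculations above translate almost line for line into Python, where `//` is integer division and `%` is the modulo operator. The names `utf8_encode` and `utf8_decode_one` are hypothetical; the decoder is a sketch that assumes it is handed exactly one well-formed sequence and performs no validation.

```python
def utf8_encode(u: int) -> bytes:
    """Compute the UTF-8 byte sequence for a Unicode scalar value by arithmetic."""
    if u < 0 or u > 0x10FFFF or 0xD800 <= u <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if u <= 0x7F:
        return bytes([u])
    if u <= 0x7FF:
        return bytes([u // 64 + 192, u % 64 + 128])
    if u <= 0xFFFF:
        return bytes([u // 4096 + 224, (u % 4096) // 64 + 128, u % 64 + 128])
    return bytes([
        u // 262144 + 240,
        (u % 262144) // 4096 + 128,
        (u % 4096) // 64 + 128,
        u % 64 + 128,
    ])


def utf8_decode_one(seq: bytes) -> int:
    """Compute the scalar value from one UTF-8 sequence (no validation is done)."""
    if len(seq) == 1:
        return seq[0]
    if len(seq) == 2:
        return (seq[0] - 192) * 64 + seq[1] - 128
    if len(seq) == 3:
        return (seq[0] - 224) * 4096 + (seq[1] - 128) * 64 + seq[2] - 128
    return ((seq[0] - 240) * 262144 + (seq[1] - 128) * 4096
            + (seq[2] - 128) * 64 + seq[3] - 128)


print(utf8_encode(0x10011).hex(" "))                      # f0 90 80 91
print(hex(utf8_decode_one(bytes([0xE2, 0x82, 0xAC]))))    # 0x20ac
```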
If you examine the mapping in Table 3 carefully, you may notice that by ignoring the range constraints in the left-
hand column, certain codepoints can potentially be represented in more than one way. For example, substituting
U+0041 LATIN CAPITAL LETTER A into the table gives the following possibilities:


   Codepoint                      Pattern                 Byte 1     Byte 2     Byte 3     Byte 4

    000000000000001000001         00000000000000xxxxxxx   01000001
    000000000000001000001         0000000000yyyyyxxxxxx   11000001   10000001
    000000000000001000001         00000zzzzyyyyyyxxxxxx   11100000   10000001   10000001
    000000000000001000001         uuuzzzzzzyyyyyyxxxxxx   11110000   10000000   10000001   10000001
Table 4 “UTF-8” non-shortest sequences for U+0041
Obviously, having these alternate encoded representations for the same character is not desirable. Accordingly,
the UTF-8 specification stipulates that the shortest possible representation must be used. In TUS 3.1, this was
made more explicitly clear by specifying exactly what UTF-8 byte sequences are or are not legal. Thus, in the
example above, each of the sequences other than the first is an illegal code unit sequence.
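
As a quick check, the four candidate sequences for U+0041 can be passed to a strict UTF-8 decoder. The sketch below uses Python's built-in codec, which accepts only the shortest form; `decodes_as_utf8` is a hypothetical helper name.

```python
def decodes_as_utf8(seq: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the byte sequence."""
    try:
        seq.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


candidates = [
    bytes([0b01000001]),                                      # shortest form
    bytes([0b11000001, 0b10000001]),                          # 2-byte non-shortest
    bytes([0b11100000, 0b10000001, 0b10000001]),              # 3-byte non-shortest
    bytes([0b11110000, 0b10000000, 0b10000001, 0b10000001]),  # 4-byte non-shortest
]

results = [decodes_as_utf8(seq) for seq in candidates]
for seq, ok in zip(candidates, results):
    print(seq.hex(" "), "accepted" if ok else "rejected")
```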

Similarly, a supplementary-plane character can be encoded directly into a four-byte UTF-8 sequence, but
someone might (possibly from misunderstanding) choose to map the codepoint into a UTF-16 surrogate pair, and
then apply the UTF-8 mapping to each of the surrogate code units to get a pair of three-byte sequences. To
illustrate, consider the following:


    Supplementary-plane codepoint        U+10011
    Normal UTF-8 byte sequence           0xF0 0x90 0x80 0x91
    UTF-16 surrogate pair                0xD800 0xDC11
    “UTF-8” mapping of surrogates        0xED 0xA0 0x80 0xED 0xB0 0x91
Table 5 UTF-8-via-surrogates representation of supplementary-plane character


Again, the Unicode Standard expects the shortest representation to be used for UTF-8. For certain reasons, non-
shortest representations of supplementary-plane characters are referred to as irregular code unit sequences
rather than illegal code unit sequences. The distinction here is subtle: software that conforms to the Unicode
Standard is allowed to interpret these irregular sequences as the corresponding supplementary-plane characters,
but is not allowed to generate these irregular sequences. In certain situations, though, software will want to reject
such irregular UTF-8 sequences (for instance, where these might otherwise be used to avoid security systems),
and in these cases the Standard allows conformant software to ignore or reject these sequences, or remove them
from a data stream.
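
The two options described above, rejecting or interpreting, can both be demonstrated with Python's codecs. This is a sketch of one possible handling, not a statement of what the Standard requires: Python's strict decoder rejects the six-byte form from Table 5, while the `surrogatepass` error handler is used here as a lenient path that lets the surrogates through so they can be re-paired.

```python
normal_form = bytes([0xF0, 0x90, 0x80, 0x91])                # Table 5, normal UTF-8
via_surrogates = bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x91]) # Table 5, via surrogates


def strict_ok(data: bytes) -> bool:
    """Return True if the strict UTF-8 decoder accepts the data."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(strict_ok(normal_form))      # True
print(strict_ok(via_surrogates))   # False: the strict decoder rejects this form

# Lenient path: decode each 3-byte sequence to a lone surrogate code point,
# then round-trip through UTF-16 so the pair is combined into one character.
lone_surrogates = via_surrogates.decode("utf-8", errors="surrogatepass")
repaired = (lone_surrogates
            .encode("utf-16-be", errors="surrogatepass")
            .decode("utf-16-be"))

print(f"U+{ord(repaired):04X}")                # U+10011
print(repaired.encode("utf-8") == normal_form) # True
```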

The main motivation for making the distinction and for considering these 6-byte sequences to be irregular rather
than illegal is this: suppose a process is re-encoding a data stream from UTF-16 to UTF-8, and suppose that the
source data stream had been interrupted so that it ended with the beginning of a surrogate pair. It may be that this
segment of the data will later be re-united with the remainder of the data, it also having been re-encoded in
UTF-8. So, we are assuming that there are two segments of data out there: one ending with an unpaired high surrogate,
and one beginning with an unpaired low surrogate.

Now, as each segment of the data is being trans-coded from UTF-16 to UTF-8, the question arises as to what
should be done with the unpaired surrogate code units. If they are ignored, then the result after the data is
reassembled will be that a character has been lost. A more graceful way to deal with the data would be for the
trans-coding process to translate the unpaired surrogate into a corresponding 3-byte UTF-8 sequence, and then
leave it up to a later receiving process to decide what to do with it. Then, if the receiving process gets the data
segments assembled again, that character will still be part of the information content of the data. The only
problem is that now it is in a 6-byte pseudo-UTF-8 sequence. Defining these as irregular rather than illegal is
intended to allow that character to be retained over the course of this overall process in a form that conformant
software is allowed to interpret, even if it would not be allowed to generate it that way.

Z Score,T Score, Percential Rank and Box Plot Graph
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 

Comprehensive Exam - IT

  • 1. Binary-coded decimal
In computing and electronic systems, binary-coded decimal (BCD) is an encoding for decimal numbers in which each digit is represented by its own binary sequence. Its main virtue is that it allows easy conversion to decimal digits for printing or display and faster decimal calculations. Its drawbacks are the increased complexity of circuits needed to implement mathematical operations and a relatively inefficient encoding: it occupies more space than a pure binary representation.

To BCD-encode a decimal number using the common encoding, each decimal digit is stored in a four-bit nibble.

Decimal:  0     1     2     3     4     5     6     7     8     9
BCD:      0000  0001  0010  0011  0100  0101  0110  0111  1000  1001

Extended Binary Coded Decimal Interchange Code
Extended Binary Coded Decimal Interchange Code (EBCDIC) is an 8-bit character encoding (code page) used on IBM mainframe operating systems such as z/OS, OS/390, VM and VSE, as well as IBM midrange computer operating systems such as OS/400 and i5/OS (see also Binary Coded Decimal). It is also employed on various non-IBM platforms such as Fujitsu-Siemens' BS2000/OSD, HP MPE/iX, and Unisys MCP. It descended from punched cards and the corresponding six-bit binary-coded decimal code that most of IBM's computer peripherals of the late 1950s and early 1960s used.

ASCII
American Standard Code for Information Interchange (ASCII), pronounced /ˈæski/,[1] is a character encoding based on the English alphabet. ASCII codes represent text in computers, communications equipment, and other devices that work with text. Most modern character encodings, which support many more characters than did the original, have a historical basis in ASCII.

Historically, ASCII developed from telegraphic codes and its first commercial use was as a seven-bit teleprinter code promoted by Bell data services. Work on ASCII formally began October 6, 1960 with the first meeting of the ASA X3.2 subcommittee.
The first edition of the standard was published in 1963,[2][3] a major revision in 1967,[4] and the most recent update in 1986.[5] Compared to earlier telegraph codes, the proposed Bell code and ASCII were both ordered for more convenient sorting (i.e., alphabetization) of lists, and added features for devices other than teleprinters. Some ASCII features, including the "ESCape sequence",[6] were due to Robert Bemer.

ASCII includes definitions for 128 characters: 33 are non-printing, mostly obsolete control characters that affect how text is processed; 94 are printable characters, and the space is considered an invisible graphic.[7] The ASCII character encoding[8] (or a compatible extension) is used on nearly all common computers, especially personal computers and workstations.

The Operation of Combinational Logic Systems
We have looked extensively at the combinations of logic gates, and how we can make circuits with a single gate as a unit. What use is this, other than an academic exercise? Logic gates are used extensively in calculators and computers, where they can be used to add binary numbers. Computers are adding machines; they do subtraction by a process of complementary addition, while they multiply by serial addition. The circuits they use are based on the half-adder. This copes with the rules for binary addition, which are:
  • 2. 0 + 0 = 0
0 + 1 = 1
1 + 0 = 1
1 + 1 = 0 carry 1
(1 + 1 + 1 = 1 carry 1)

The circuit has two outputs, a sum and a carry. The sum is the output of an exclusive OR gate (we can’t have 1 + 1 = 1), while the carry output is that of an AND gate. The Boolean algebra is:
  - sum = A ⊕ B
  - carry = A.B
[Figure: half-adder circuit, with an exclusive OR gate producing the sum and an AND gate producing the carry]
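The addition rules above can be sketched in a few lines of Python. This is only an illustration: the function names `half_adder` and `full_adder` are made up here, and the full adder (two half-adders plus an OR gate) is one common way to handle the three-input case 1 + 1 + 1 that the rules mention.

```python
def half_adder(a: int, b: int) -> tuple[int, int]:
    """Return (sum, carry) for two one-bit inputs."""
    # sum is the exclusive OR of the inputs, carry is their AND
    return a ^ b, a & b


def full_adder(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """Add three one-bit inputs by chaining two half-adders."""
    s1, c1 = half_adder(a, b)
    s2, c2 = half_adder(s1, carry_in)
    # at most one of the two half-adders can produce a carry
    return s2, c1 | c2


# Print the truth table of the half-adder
for a in (0, 1):
    for b in (0, 1):
        s, c = half_adder(a, b)
        print(f"{a} + {b} = {s} carry {c}")
```

Calling `full_adder(1, 1, 1)` gives a sum of 1 and a carry of 1, matching the bracketed rule.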
  • 3. Duality Principle
• Duality principle: a valid Boolean identity remains valid when the operators and the identity elements are interchanged: + ↔ . and 1 ↔ 0.
• Example: given the expression a+(b.c)=(a+b).(a+c), the dual expression is a.(b+c)=(a.b)+(a.c).
Mohd. Yamani Idris / Noorzaily Mohamed Noor

Duality Principle
• The duality principle gives a free theorem ("buy one, free one"): you only need to prove one theorem to get another one free.
• If (x+y+z)’=x’.y’.z’ is valid, then its dual (x.y.z)’=x’+y’+z’ is also valid.
• If x+1=1 is valid, then its dual x.0=0 is also valid.

Unicode
In computing, Unicode is an industry standard allowing computers to consistently represent and manipulate text expressed in most of the world's writing systems. Developed in tandem with the Universal Character Set standard and published in book form as The Unicode Standard, Unicode consists of a repertoire of more than 100,000 characters, a set of code charts for visual reference, an encoding methodology and set of standard character encodings, an enumeration of character properties such as upper and lower case, a set of reference data computer files, and a number of related items, such as character properties, rules for normalization, decomposition, collation, rendering and bidirectional display order (for the correct display of text containing both right-to-left scripts, such as Arabic or Hebrew, and left-to-right scripts).[1]

The Unicode Consortium, the non-profit organization that coordinates Unicode's development, has the ambitious goal of eventually replacing existing character encoding schemes with Unicode and its standard Unicode Transformation Format (UTF) schemes, as many of the existing schemes are limited in size and scope and are incompatible with multilingual environments.
  • 4. Unicode's success at unifying character sets has led to its widespread and predominant use in the internationalization and localization of computer software. The standard has been implemented in many recent technologies, including XML, the Java programming language, the Microsoft .NET Framework and modern operating systems.

Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8 (which uses 1 byte for all ASCII characters, which have the same code values as in the standard ASCII encoding, and up to 4 bytes for other characters), the now-obsolete UCS-2 (which uses 2 bytes for all characters, but does not include every character in the Unicode standard), and UTF-16 (which extends UCS-2, using 4 bytes to encode characters missing from UCS-2).

Unicode Transformation Format and Universal Character Set
Unicode defines two mapping methods: the Unicode Transformation Format (UTF) encodings and the Universal Character Set (UCS) encodings. An encoding maps (possibly a subset of) the range of Unicode code points to sequences of values in some fixed-size range, termed code values. The numbers in the names of the encodings indicate the number of bits in one code value (for UTF encodings) or the number of bytes per code value (for UCS encodings). UTF-8 and UTF-16 are probably the most commonly used encodings. UCS-2 is an obsolete subset of UTF-16; UCS-4 and UTF-32 are functionally equivalent.

UTF encodings include:
  - UTF-1: a retired predecessor of UTF-8, maximizes compatibility with ISO 2022, no longer part of The Unicode Standard
  - UTF-7: a relatively unpopular 7-bit encoding, often considered obsolete (not part of The Unicode Standard but rather an RFC)
  - UTF-8: an 8-bit, variable-width encoding, which maximizes compatibility with ASCII
  - UTF-EBCDIC: an 8-bit, variable-width encoding, which maximizes compatibility with EBCDIC (not part of The Unicode Standard)
  - UTF-16: a 16-bit, variable-width encoding
  - UTF-32: a 32-bit, fixed-width encoding

UTF-8 uses one to four bytes per code point and, being compact for Latin scripts and ASCII-compatible, provides the de facto standard encoding for interchange of Unicode text. It is also used by most recent Linux distributions as a direct replacement for legacy encodings in general text handling.

The UCS-2 and UTF-16 encodings specify the Unicode Byte Order Mark (BOM) for use at the beginnings of text files, which may be used for byte ordering detection (or byte endianness detection). Some software developers have adopted it for other encodings, including UTF-8, which does not need an indication of byte order. In this case it attempts to mark the file as containing Unicode text. The BOM, code point U+FEFF, has the important property of unambiguity on byte reorder, regardless of the Unicode encoding used; U+FFFE (the result of byte-swapping U+FEFF) does not equate to a legal character, and U+FEFF in places other than the beginning of text conveys the zero-width no-break space (a character with no appearance and no effect other than preventing the formation of ligatures). Also, the units FE and FF never appear in UTF-8. The same character converted to UTF-8 becomes the byte sequence EF BB BF.

In UTF-32 and UCS-4, one 32-bit code value serves as a fairly direct representation of any character's code point (although the endianness, which varies across different platforms, affects how the code value actually manifests as an octet sequence). In the other cases, each code point may be represented by a variable number of code values. UTF-32 is widely used as internal representation of text in programs (as opposed to stored or transmitted text), since every Unix operating system which uses the gcc compilers to generate software uses it as the standard "wide character" encoding.
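The size and byte-order behaviour described above can be observed with Python's built-in codecs. This is a small sketch, not part of the original text: the helper name `encoded_lengths` and the sample characters are chosen here purely for illustration.

```python
def encoded_lengths(ch: str) -> tuple[int, int, int]:
    """Return the byte length of one character in UTF-8, UTF-16 and UTF-32."""
    # The "-be" codec variants encode without prepending a byte order mark
    return (len(ch.encode("utf-8")),
            len(ch.encode("utf-16-be")),
            len(ch.encode("utf-32-be")))


# An ASCII letter, a Latin-1 letter, the euro sign, and a supplementary-plane character
for ch in "A\u00e9\u20ac\U00010011":
    u8, u16, u32 = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {u8} bytes, UTF-16 {u16} bytes, UTF-32 {u32} bytes")

# The byte order mark U+FEFF in three encoding schemes
bom = "\ufeff"
print(bom.encode("utf-8").hex(" "))      # ef bb bf
print(bom.encode("utf-16-be").hex(" "))  # fe ff
print(bom.encode("utf-16-le").hex(" "))  # ff fe
```

The last two lines show why the BOM works as a byte-order detector: a reader that sees FF FE where it expected FE FF knows the stream is in the opposite byte order.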
Recent versions of the Python programming language (beginning with 2.2) may also be configured to use UTF-32 as the representation for unicode strings, effectively disseminating such encoding in high-level coded software.

Punycode, another encoding form, enables the encoding of Unicode strings into the limited character set supported by the ASCII-based Domain Name System. The encoding is used as part of IDNA, which is a system
  • 5. enabling the use of Internationalized Domain Names in all scripts that are supported by Unicode. Earlier and now historical proposals include UTF-5 and UTF-6.

GB18030 is another encoding form for Unicode, from the Standardization Administration of China. It is the official character set of the People's Republic of China (PRC). BOCU-1 and SCSU are Unicode compression schemes. The April Fools' Day RFC of 2005 specified two parody UTF encodings, UTF-9 and UTF-18.

Mapping codepoints to Unicode encoding forms
Peter Constable, 2001-06-13

Note: This is an Appendix to “Understanding Unicode™”. See also A review of characters with compatibility decompositions.

In Section 4 of “Understanding Unicode™”, we examined each of the three character encoding forms defined within Unicode. This appendix describes in detail the mappings from Unicode codepoints to the code unit sequences used in each encoding form.

In this description, the mapping will be expressed in alternate forms, one of which is a mapping of bits between the binary representation of a Unicode scalar value and the binary representation of a code unit. Even though a coded character set encodes characters in terms of numerical values that have no specific computer representation or data type associated with them, for purposes of describing this mapping, we are considering codepoints in the Unicode codespace to have a width of 21 bits. This is the number of bits required for binary representation of the entire numerical range of Unicode scalar values, 0x0 to 0x10FFFF.

1 UTF-32
The UTF-32 encoding form was formally incorporated into Unicode as part of TUS 3.1. The definitions for UTF-32 are specified in TUS 3.1 and in UAX#19 (Davis 2001). The mapping for UTF-32 is, essentially, the identity mapping: the 32-bit code unit used to encode a codepoint has the same integer value as the codepoint itself.
Thus if U represents the Unicode scalar value for a character and C represents the value of the 32-bit code unit, then:

    U = C

The mapping can also be expressed in terms of the relationships between bits in the binary representations of the Unicode scalar values and the 32-bit code units, as shown in Table 1.

Codepoint range                    Unicode scalar value (binary)   Code units (binary)
U+0000..U+D7FF, U+E000..U+10FFFF   xxxxxxxxxxxxxxxxxxxxx           00000000000xxxxxxxxxxxxxxxxxxxxx
Table 1 UTF-32 USV to code unit mapping

2 UTF-16
The UTF-16 encoding form was formally incorporated into Unicode as part of TUS 2.0. The current definitions for UTF-16 are specified in TUS 3.0.1

If CH and CL represent the high and low surrogate code units of a surrogate pair, then the Unicode scalar value U is given by (hexadecimal values written with the 0x prefix):

    U = (CH - 0xD800) * 0x400 + (CL - 0xDC00) + 0x10000

Likewise, determining the high and low surrogate values for a given Unicode scalar value is fairly straightforward. Assuming the variables CH, CL and U as above, and that U is in the range U+10000..U+10FFFF,

    CH = (U - 0x10000) \ 0x400 + 0xD800
    CL = (U - 0x10000) mod 0x400 + 0xDC00
  • 6. where “\” represents integer division (returns only integer portion, rounded down), and “mod” represents the modulo operator.

Expressing the mapping in terms of a mapping of bits between the binary representations of scalar values and code units, the UTF-16 mapping is as shown in Table 2:

Codepoint range                  Unicode scalar value (binary)   Code units (binary)
U+0000..U+D7FF, U+E000..U+FFFF   00000xxxxxxxxxxxxxxxx           xxxxxxxxxxxxxxxx
U+10000..U+10FFFF                uuuuuxxxxxxyyyyyyyyyy           110110wwwwxxxxxx 110111yyyyyyyyyy (where uuuuu = wwww + 1)
Table 2 UTF-16 USV to code unit mapping

3 UTF-8
The UTF-8 encoding form was formally incorporated into Unicode as part of TUS 2.0. The current definitions for UTF-8 are specified in TUS 3.1.2

As with the other encoding forms, calculating a Unicode scalar value from the 8-bit code units in a UTF-8 sequence is a matter of simple arithmetic. In this case, however, the calculation depends upon the number of bytes in the sequence. Similarly, the calculation of code units from a scalar value must be expressed differently for different ranges of scalar values.

Let us consider first the relationship between bits in the binary representation of codepoints and code units. This is shown for UTF-8 in Table 3:

Codepoint range                  Scalar value (binary)   Byte 1    Byte 2    Byte 3    Byte 4
U+0000..U+007F                   00000000000000xxxxxxx   0xxxxxxx
U+0080..U+07FF                   0000000000yyyyyxxxxxx   110yyyyy  10xxxxxx
U+0800..U+D7FF, U+E000..U+FFFF   00000zzzzyyyyyyxxxxxx   1110zzzz  10yyyyyy  10xxxxxx
U+10000..U+10FFFF                uuuzzzzzzyyyyyyxxxxxx   11110uuu  10zzzzzz  10yyyyyy  10xxxxxx
Table 3 UTF-8 USV to code unit mapping

Note: There is a slight difference between Unicode and ISO/IEC 10646 in how they define UTF-8, since Unicode limits it to the roughly one million characters possible in Unicode’s codespace, while for the ISO/IEC standard, it can access the entire 31-bit codespace.
For all practical purposes, this difference is irrelevant since the ISO/IEC codespace is effectively limited to match that of Unicode, but you may encounter differing descriptions on occasion.

As mentioned in Section 4.2 of “Understanding Unicode™”, UTF-8 byte sequences have certain interesting properties. These can be seen from the table above. Firstly, note the high-order bits in non-initial bytes as opposed to sequence-initial bytes. By looking at the first two bits, you can immediately determine whether a code unit is an initial byte in a sequence or is a following byte. Secondly, by looking at the number of non-zero high-order bits of the first byte in the sequence, you can immediately tell how long the sequence is: if no high-order bits are set to one, then the sequence contains exactly one byte. Otherwise, the number of non-zero high-order bits is equal to the total number of bytes in the sequence.

Table 3 also reveals the other interesting characteristic of UTF-8 that was described in Section 4.2 of “Understanding Unicode™”. Note that characters in the range U+0000..U+007F are represented using a single byte. The characters in this range match ASCII codepoint for codepoint. Thus, any data encoded in ASCII is automatically also encoded in UTF-8.

Having seen how the bits compare, let us consider how code units can be calculated from scalar values, and vice versa. If U represents the value of a Unicode scalar value and C1, C2, C3 and C4 represent bytes in a UTF-8 byte sequence (in order), then the value of a Unicode scalar value U can be calculated as follows:

If a sequence has one byte, then
    U = C1
  • 7. Else if a sequence has two bytes, then
    U = (C1 - 192) * 64 + C2 - 128
Else if a sequence has three bytes, then
    U = (C1 - 224) * 4,096 + (C2 - 128) * 64 + C3 - 128
Else
    U = (C1 - 240) * 262,144 + (C2 - 128) * 4,096 + (C3 - 128) * 64 + C4 - 128
End if

Going the other way, given a Unicode scalar value U, then the UTF-8 byte sequence can be calculated as follows:

If U <= U+007F, then
    C1 = U
Else if U+0080 <= U <= U+07FF, then
    C1 = U \ 64 + 192
    C2 = U mod 64 + 128
Else if U+0800 <= U <= U+D7FF, or if U+E000 <= U <= U+FFFF, then
    C1 = U \ 4,096 + 224
    C2 = (U mod 4,096) \ 64 + 128
    C3 = U mod 64 + 128
Else
    C1 = U \ 262,144 + 240
    C2 = (U mod 262,144) \ 4,096 + 128
    C3 = (U mod 4,096) \ 64 + 128
    C4 = U mod 64 + 128
End if

where “\” represents integer division (returns only integer portion, rounded down), and “mod” represents the modulo operator.

If you examine the mapping in Table 3 carefully, you may notice that by ignoring the range constraints in the left-hand column, certain codepoints can potentially be represented in more than one way. For example, substituting U+0041 LATIN CAPITAL LETTER A into the table gives the following possibilities:

Codepoint               Pattern                 Byte 1    Byte 2    Byte 3    Byte 4
000000000000001000001   00000000000000xxxxxxx   01000001
000000000000001000001   0000000000yyyyyxxxxxx   11000001  10000001
000000000000001000001   00000zzzzyyyyyyxxxxxx   11100000  10000001  10000001
000000000000001000001   uuuzzzzzzyyyyyyxxxxxx   11110000  10000000  10000001  10000001
Table 4 “UTF-8” non-shortest sequences for U+0041
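The arithmetic above translates almost line for line into Python, where `//` plays the role of integer division and `%` the role of “mod”. The following is a minimal sketch under stated assumptions: the function names `utf8_encode` and `utf8_decode` are invented for this illustration, and the decoder deliberately performs no validation so that it mirrors the formulas as written.

```python
def utf8_encode(u: int) -> bytes:
    """Encode one Unicode scalar value with the integer arithmetic above."""
    if not 0 <= u <= 0x10FFFF or 0xD800 <= u <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {u:#x}")
    if u <= 0x7F:
        return bytes([u])
    if u <= 0x7FF:
        return bytes([u // 64 + 192, u % 64 + 128])
    if u <= 0xFFFF:
        return bytes([u // 4096 + 224, (u % 4096) // 64 + 128, u % 64 + 128])
    return bytes([u // 262144 + 240, (u % 262144) // 4096 + 128,
                  (u % 4096) // 64 + 128, u % 64 + 128])


def utf8_decode(seq: bytes) -> int:
    """Apply the decoding formulas to one sequence; no validation is done."""
    c = list(seq)
    if len(c) == 1:
        return c[0]
    if len(c) == 2:
        return (c[0] - 192) * 64 + c[1] - 128
    if len(c) == 3:
        return (c[0] - 224) * 4096 + (c[1] - 128) * 64 + c[2] - 128
    return ((c[0] - 240) * 262144 + (c[1] - 128) * 4096
            + (c[2] - 128) * 64 + c[3] - 128)


print(utf8_encode(0x10011).hex(" "))      # f0 90 80 91
print(hex(utf8_decode(b"\xc1\x81")))      # 0x41
```

The second call feeds the two-byte row of Table 4 to the unvalidated formulas, which happily return U+0041. Python's own decoder refuses the same bytes: `b"\xc1\x81".decode("utf-8")` raises `UnicodeDecodeError`.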
  • 8. Obviously, having these alternate encoded representations for the same character is not desirable. Accordingly, the UTF-8 specification stipulates that the shortest possible representation must be used. In TUS 3.1, this was made more explicitly clear by specifying exactly what UTF-8 byte sequences are or are not legal. Thus, in the example above, each of the sequences other than the first is an illegal code unit sequence.

Similarly, a supplementary-plane character can be encoded directly into a four-byte UTF-8 sequence, but someone might (possibly from misunderstanding) choose to map the codepoint into a UTF-16 surrogate pair, and then apply the UTF-8 mapping to each of the surrogate code units to get a pair of three-byte sequences. To illustrate, consider the following:

Supplementary-plane codepoint     U+10011
Normal UTF-8 byte sequence        0xF0 0x90 0x80 0x91
UTF-16 surrogate pair             0xD800 0xDC11
“UTF-8” mapping of surrogates     0xED 0xA0 0x80 0xED 0xB0 0x91
Table 5 UTF-8-via-surrogates representation of supplementary-plane character

Again, the Unicode Standard expects the shortest representation to be used for UTF-8. For certain reasons, non-shortest representations of supplementary-plane characters are referred to as irregular code unit sequences rather than illegal code unit sequences. The distinction here is subtle: software that conforms to the Unicode Standard is allowed to interpret these irregular sequences as the corresponding supplementary-plane characters, but is not allowed to generate these irregular sequences. In certain situations, though, software will want to reject such irregular UTF-8 sequences (for instance, where these might otherwise be used to avoid security systems), and in these cases the Standard allows conformant software to ignore or reject these sequences, or remove them from a data stream.
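The rows of Table 5 can be reproduced with a short Python sketch. The helper names `to_surrogates` and `from_surrogates` are assumptions made for this illustration; they implement the UTF-16 surrogate formulas given earlier. Python's `surrogatepass` error handler is used here only as a convenient way to apply the three-byte UTF-8 pattern to each surrogate code unit.

```python
def to_surrogates(u: int) -> tuple[int, int]:
    """Split a supplementary-plane scalar value into (high, low) surrogates."""
    if not 0x10000 <= u <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    v = u - 0x10000
    return v // 0x400 + 0xD800, v % 0x400 + 0xDC00


def from_surrogates(ch: int, cl: int) -> int:
    """Recombine a (high, low) surrogate pair into a scalar value."""
    return (ch - 0xD800) * 0x400 + (cl - 0xDC00) + 0x10000


ch, cl = to_surrogates(0x10011)
print(f"{ch:#06x} {cl:#06x}")                   # 0xd800 0xdc11

# Normal four-byte UTF-8 sequence
print(chr(0x10011).encode("utf-8").hex(" "))    # f0 90 80 91

# Six-byte sequence obtained by encoding each surrogate separately
six_bytes = (chr(ch) + chr(cl)).encode("utf-8", "surrogatepass")
print(six_bytes.hex(" "))                       # ed a0 80 ed b0 91
```

Python's default UTF-8 decoder is one example of software that chooses to reject such input: `six_bytes.decode("utf-8")` raises `UnicodeDecodeError`.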
The main motivation for making the distinction and for considering these 6-byte sequences to be irregular rather than illegal is this: suppose a process is re-encoding a data stream from UTF-16 to UTF-8, and suppose that the source data stream had been interrupted so that it ended with the beginning of a surrogate pair. It may be that this segment of the data will later be re-united with the remainder of the data, it also having been re-encoded in UTF-8. So, we are assuming that there are two segments of data out there: one ending with an unpaired high surrogate, and one beginning with an unpaired low surrogate.

Now, as each segment of the data is being trans-coded from UTF-16 to UTF-8, the question arises as to what should be done with the unpaired surrogate code units. If they are ignored, then the result after the data is reassembled will be that a character has been lost. A more graceful way to deal with the data would be for the trans-coding process to translate the unpaired surrogate into a corresponding 3-byte UTF-8 sequence, and then to leave it up to a later receiving process to decide what to do with it. Then, if the receiving process gets the data segments assembled again, that character will still be part of the information content of the data. The only problem is that now it is in a 6-byte pseudo-UTF-8 sequence. Defining these as irregular rather than illegal is intended to allow that character to be retained over the course of this overall process in a form that conformant software is allowed to interpret, even if it would not be allowed to generate it that way.