Binary-coded decimal
In computing and electronic systems, binary-coded decimal (BCD) is an encoding for decimal numbers in
which each digit is represented by its own binary sequence. Its main virtue is that it allows easy conversion to
decimal digits for printing or display and faster decimal calculations. Its drawbacks are the increased complexity
of circuits needed to implement mathematical operations and a relatively inefficient encoding—it occupies more
space than a pure binary representation.
To BCD-encode a decimal number using the common encoding, each decimal digit is stored in a four-bit nibble.
Decimal: 0 1 2 3 4 5 6 7 8 9
BCD: 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001
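The digit-by-digit encoding in the table above can be sketched in a few lines of Python. The function names `to_bcd_nibbles` and `to_packed_bcd` are illustrative assumptions, and the packed form (two digits per byte) is one common layout rather than the only one.

```python
def to_bcd_nibbles(number):
    """Return the BCD encoding of a non-negative integer, one 4-bit group per digit."""
    if number < 0:
        raise ValueError("this sketch handles non-negative integers only")
    return " ".join(format(int(digit), "04b") for digit in str(number))


def to_packed_bcd(number):
    """Pack two BCD digits into each byte, padding with a leading zero digit."""
    if number < 0:
        raise ValueError("this sketch handles non-negative integers only")
    digits = str(number)
    if len(digits) % 2:
        digits = "0" + digits
    # A decimal digit 0-9 has the same bit pattern as the hex digit 0-9,
    # so parsing the digit string as hex yields the packed BCD bytes.
    return bytes.fromhex(digits)


print(to_bcd_nibbles(1995))       # 0001 1001 1001 0101
print(to_packed_bcd(1995).hex())  # 1995
```

Reading the packed bytes back in hexadecimal reproduces the decimal digits, which is the "easy conversion for display" property described above.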
Extended Binary Coded Decimal Interchange Code
Extended Binary Coded Decimal Interchange Code (EBCDIC) is an 8-bit character encoding (code page)
used on IBM mainframe operating systems such as z/OS, OS/390, VM and VSE, as well as IBM midrange
computer operating systems such as OS/400 and i5/OS (see also Binary Coded Decimal). It is also employed on
various non-IBM platforms such as Fujitsu-Siemens' BS2000/OSD, HP MPE/iX, and Unisys MCP. It descended
from punched cards and the corresponding six-bit binary-coded decimal code that most of IBM's computer
peripherals of the late 1950s and early 1960s used.
ASCII
American Standard Code for Information Interchange (ASCII), pronounced /ˈæski/,[1] is a character encoding
based on the English alphabet. ASCII codes represent text in computers, communications equipment, and other
devices that work with text. Most modern character encodings—which support many more characters than did the
original—have a historical basis in ASCII.
Historically, ASCII developed from telegraphic codes and its first commercial use was as a seven-bit teleprinter
code promoted by Bell data services. Work on ASCII formally began October 6, 1960 with the first meeting of
the ASA X3.2 subcommittee. The first edition of the standard was published in 1963,[2][3] a major revision in
1967,[4] and the most recent update in 1986.[5] Compared to earlier telegraph codes, the proposed Bell code and
ASCII were both ordered for more convenient sorting (i.e., alphabetization) of lists, and added features for
devices other than teleprinters. Some ASCII features, including the "ESCape sequence",[6] were due to Robert
Bemer.
ASCII includes definitions for 128 characters: 33 are non-printing, mostly obsolete control characters that affect
how text is processed; 94 are printable characters, and the space is considered an invisible graphic.[7] The ASCII
character encoding[8]—or a compatible extension—is used on nearly all common computers, especially personal
computers and workstations.
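The character counts quoted above can be checked with a short sketch. The names `CONTROL`, `PRINTABLE` and `SPACE` are illustrative; the ranges assume the usual layout in which codes 0 to 31 and 127 are control characters and code 32 is the space.

```python
# Partition the 128 ASCII codes into the three groups described in the text.
CONTROL = [code for code in range(128) if code < 32 or code == 127]
SPACE = 32
PRINTABLE = [code for code in range(128) if 33 <= code <= 126]

print(len(CONTROL), len(PRINTABLE))          # 33 94
print("".join(chr(code) for code in PRINTABLE))

# Every one of the 128 codes is valid input for Python's "ascii" codec.
assert len(bytes(range(128)).decode("ascii")) == 128
```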
The Operation of Combinational Logic Systems
We have looked extensively at the combinations of logic gates, and how we can make circuits with a
single gate as a unit. What use is this, other than an academic exercise? Logic gates are used
extensively in calculators and computers. Logic gates can be used to add binary numbers. Computers
are adding machines; they do subtraction by a process of complementary addition, while they multiply
by serial addition.
The circuits they use are based on the half-adder. This copes with the rules for binary addition
which are:
0 + 0 = 0
0 + 1 = 1
1 + 0 = 1
1 + 1 = 0 carry 1
(1 + 1 + 1 = 1 carry 1)
The circuit has two outputs, a sum and a carry. The sum is the output of an exclusive OR gate (we
can’t have 1 + 1 = 1), while the carry output is that of an AND gate. The Boolean algebra is:
sum = A ⊕ B (exclusive OR)
carry = A.B
This gives the arrangement shown below:
[Figure not reproduced: half-adder circuit, an exclusive OR gate producing the sum and an AND gate producing the carry.]
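The half-adder can be modelled in software to check the addition rules listed above. This is a minimal sketch: the function names are illustrative, Python's bitwise operators stand in for the gates, and the full adder built from two half-adders is one conventional construction, added here only to cover the bracketed 1 + 1 + 1 case.

```python
def half_adder(a, b):
    """Return (sum, carry) for two input bits: XOR gives the sum, AND the carry."""
    return a ^ b, a & b


def full_adder(a, b, carry_in):
    """Add three bits using two half-adders and an OR gate for the carry."""
    partial_sum, carry_one = half_adder(a, b)
    total, carry_two = half_adder(partial_sum, carry_in)
    return total, carry_one | carry_two


for a in (0, 1):
    for b in (0, 1):
        bit_sum, carry = half_adder(a, b)
        print(f"{a} + {b} = {bit_sum} carry {carry}")

print(full_adder(1, 1, 1))  # (1, 1), i.e. 1 + 1 + 1 = 1 carry 1
```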
Duality Principle
• Duality principle: every valid Boolean expression remains valid when the operators and the
identity elements are interchanged:
+ ↔ .
1 ↔ 0
• Example: given the expression
a+(b.c)=(a+b).(a+c)
its dual expression is
a.(b+c)=(a.b)+(a.c)
MOHD. YAMANI IDRIS/ 16
NOORZAILY MOHAMED NOOR
Duality Principle
• The duality principle gives a free theorem (“buy one, get one free”): you only need to prove one
theorem and you get its dual for free.
• If (x+y+z)’=x’.y’.z’ is valid, then its dual (x.y.z)’=x’+y’+z’ is also valid.
• If x+1=1 is valid, then its dual x.0=0 is also valid.
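Because Boolean variables take only the values 0 and 1, each theorem and its dual can be checked by exhaustive enumeration. The sketch below assumes the helper names `holds` and `NOT`, which do not come from the slides; `|` plays the role of + and `&` the role of the dot.

```python
from itertools import product


def NOT(x):
    """Complement of a single bit."""
    return 1 - x


def holds(lhs, rhs, arity):
    """True if lhs and rhs agree for every assignment of 0/1 to the variables."""
    return all(lhs(*bits) == rhs(*bits) for bits in product((0, 1), repeat=arity))


# A theorem and its dual: swap + with . (and 1 with 0).
theorem = holds(lambda x, y, z: NOT(x | y | z), lambda x, y, z: NOT(x) & NOT(y) & NOT(z), 3)
dual = holds(lambda x, y, z: NOT(x & y & z), lambda x, y, z: NOT(x) | NOT(y) | NOT(z), 3)
print(theorem, dual)  # True True

# x+1=1 and its dual x.0=0
print(holds(lambda x: x | 1, lambda x: 1, 1), holds(lambda x: x & 0, lambda x: 0, 1))
```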
Unicode
In computing, Unicode is an industry standard allowing computers to consistently represent and manipulate
text expressed in most of the world's writing systems. Developed in tandem with the Universal Character Set
standard and published in book form as The Unicode Standard, Unicode consists of a repertoire of more
than 100,000 characters, a set of code charts for visual reference, an encoding methodology and set of
standard character encodings, an enumeration of character properties such as upper and lower case, a set of
reference data computer files, and a number of related items, such as character properties, rules for
normalization, decomposition, collation, rendering and bidirectional display order (for the correct display of
text containing both right-to-left scripts, such as Arabic or Hebrew, and left-to-right scripts).[1]
The Unicode Consortium, the non-profit organization that coordinates Unicode's development, has the
ambitious goal of eventually replacing existing character encoding schemes with Unicode and its standard
Unicode Transformation Format (UTF) schemes, as many of the existing schemes are limited in size and scope
and are incompatible with multilingual environments.
Unicode's success at unifying character sets has led to its widespread and predominant use in the
internationalization and localization of computer software. The standard has been implemented in many recent
technologies, including XML, the Java programming language, the Microsoft .NET Framework and modern
operating systems.
Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8
(which uses 1 byte for all ASCII characters, which have the same code values as in the standard ASCII encoding,
and up to 4 bytes for other characters), the now-obsolete UCS-2 (which uses 2 bytes for all characters, but does
not include every character in the Unicode standard), and UTF-16 (which extends UCS-2, using 4 bytes to encode
characters missing from UCS-2).
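The byte counts described above can be observed with Python's built-in codecs. This is a small sketch; the helper name `encoded_lengths` is illustrative, and the sample characters are arbitrary choices from different ranges.

```python
def encoded_lengths(character):
    """Return the number of bytes the character occupies in UTF-8 and in UTF-16."""
    # "utf-16-be" is used so that no byte order mark is added to the output.
    return len(character.encode("utf-8")), len(character.encode("utf-16-be"))


samples = {
    "LATIN CAPITAL LETTER A": "\u0041",
    "LATIN SMALL LETTER E WITH ACUTE": "\u00e9",
    "EURO SIGN": "\u20ac",
    "GRINNING FACE (outside the 16-bit range)": "\U0001F600",
}
for name, character in samples.items():
    utf8_len, utf16_len = encoded_lengths(character)
    print(f"U+{ord(character):04X} {name}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```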
Unicode Transformation Format and Universal Character Set
Unicode defines two mapping methods: the Unicode Transformation Format (UTF) encodings and the Universal
Character Set (UCS) encodings. An encoding maps (possibly a subset of) the range of Unicode code points to
sequences of values in some fixed-size range, termed code values. The numbers in the names of the encodings
indicate the number of bits in one code value (for UTF encodings) or the number of bytes per code value (for
UCS encodings). UTF-8 and UTF-16 are probably the most commonly used encodings. UCS-2 is an obsolete
subset of UTF-16; UCS-4 and UTF-32 are functionally equivalent.
UTF encodings include:
UTF-1 — a retired predecessor of UTF-8, maximizes compatibility with ISO 2022, no longer part of The
Unicode Standard
UTF-7 — a relatively unpopular 7-bit encoding, often considered obsolete (not part of The Unicode
Standard but rather an RFC)
UTF-8 — an 8-bit, variable-width encoding, which maximizes compatibility with ASCII.
UTF-EBCDIC — an 8-bit variable-width encoding, which maximizes compatibility with EBCDIC. (not
part of The Unicode Standard)
UTF-16 — a 16-bit, variable-width encoding
UTF-32 — a 32-bit, fixed-width encoding
UTF-8 uses one to four bytes per code point and, being compact for Latin scripts and ASCII-compatible, provides
the de facto standard encoding for interchange of Unicode text. It is also used by most recent Linux distributions
as a direct replacement for legacy encodings in general text handling.
The UCS-2 and UTF-16 encodings specify the Unicode Byte Order Mark (BOM) for use at the beginnings of text
files, which may be used for byte ordering detection (or byte endianness detection). Some software developers
have adopted it for other encodings, including UTF-8, which does not need an indication of byte order. In this
case it attempts to mark the file as containing Unicode text. The BOM, code point U+FEFF, has the important
property of unambiguity on byte reorder, regardless of the Unicode encoding used; U+FFFE (the result of byte-
swapping U+FEFF) does not equate to a legal character, and U+FEFF in other places, other than the beginning of
text, conveys the zero-width no-break space (a character with no appearance and no effect other than preventing
the formation of ligatures). Also, the units FE and FF never appear in UTF-8. The same character converted to
UTF-8 becomes the byte sequence EF BB BF.
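The byte order mark behaviour described above can be illustrated with Python's `codecs` constants. The function `sniff_bom` is a hypothetical helper written for this sketch; it deliberately ignores UTF-32, whose little-endian BOM also begins with FF FE.

```python
import codecs

BOM = "\ufeff"

# The same code point serialised in each encoding form.
print(BOM.encode("utf-16-be").hex())  # feff
print(BOM.encode("utf-16-le").hex())  # fffe
print(BOM.encode("utf-8").hex())      # efbbbf


def sniff_bom(data):
    """Guess the encoding from a leading BOM, or return None if there is none."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    return None


print(sniff_bom(b"\xff\xfeh\x00i\x00"))  # utf-16-le
```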
In UTF-32 and UCS-4, one 32-bit code value serves as a fairly direct representation of any character's code point
(although the endianness, which varies across different platforms, affects how the code value actually manifests
as an octet sequence). In the other cases, each code point may be represented by a variable number of code
values. UTF-32 is widely used as internal representation of text in programs (as opposed to stored or transmitted
text), since every Unix operating system which uses the gcc compilers to generate software uses it as the standard
"wide character" encoding. Recent versions of the Python programming language (beginning with 2.2) may also
be configured to use UTF-32 as the representation for unicode strings, effectively disseminating such encoding in
high-level coded software.
Punycode, another encoding form, enables the encoding of Unicode strings into the limited character set
supported by the ASCII-based Domain Name System. The encoding is used as part of IDNA, which is a system
enabling the use of Internationalized Domain Names in all scripts that are supported by Unicode. Earlier and now
historical proposals include UTF-5 and UTF-6.
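As an illustration of the Punycode mapping, Python ships a `punycode` codec in its standard library. The sample label below is an arbitrary choice, and the `xn--` prefix shown is the marker IDNA places in front of a Punycode-encoded label.

```python
label = "m\u00fcnchen"  # "münchen", a label containing one non-ASCII letter

# Punycode moves the ASCII letters to the front and encodes the rest after a hyphen.
ascii_form = label.encode("punycode")
print(ascii_form)                     # b'mnchen-3ya'
print(ascii_form.decode("punycode"))  # münchen

# In a domain name the encoded label carries the IDNA prefix.
print("xn--" + ascii_form.decode("ascii"))  # xn--mnchen-3ya
```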
GB18030 is another encoding form for Unicode, from the Standardization Administration of China. It is the
official character set of the People's Republic of China (PRC). BOCU-1 and SCSU are Unicode compression
schemes. The April Fools' Day RFC of 2005 specified two parody UTF encodings, UTF-9 and UTF-18.
Mapping codepoints to Unicode encoding forms
Peter Constable, 2001-06-13
Note:
This is an Appendix to “Understanding Unicode™”.
See also A review of characters with compatibility decompositions.
In Section 4 of “Understanding Unicode™”, we examined each of the three character encoding forms defined
within Unicode. This appendix describes in detail the mappings from Unicode codepoints to the code unit
sequences used in each encoding form.
In this description, the mapping will be expressed in alternate forms, one of which is a mapping of bits between
the binary representation of a Unicode scalar value and the binary representation of a code unit. Even though a
coded character set encodes characters in terms of numerical values that have no specific computer representation
or data type associated with them, for purposes of describing this mapping, we are considering codepoints in the
Unicode codespace to have a width of 21 bits. This is the number of bits required for binary representation of the
entire numerical range of Unicode scalar values, 0x0 to 0x10FFFF.
1 UTF-32
The UTF-32 encoding form was formally incorporated into Unicode as part of TUS 3.1. The definitions for UTF-32
are specified in TUS 3.1 and in UAX#19 (Davis 2001). The mapping for UTF-32 is, essentially, the identity
mapping: the 32-bit code unit used to encode a codepoint has the same integer value as the codepoint itself. Thus
if U represents the Unicode scalar value for a character and C represents the value of the 32-bit code unit then:
U = C
The mapping can also be expressed in terms of the relationships between bits in the binary representations of the
Unicode scalar values and the 32-bit code units, as shown in Table 1.
Codepoint range                   Unicode scalar value (binary)  Code units (binary)
U+0000..U+D7FF, U+E000..U+10FFFF  xxxxxxxxxxxxxxxxxxxxx          00000000000xxxxxxxxxxxxxxxxxxxxx
Table 1 UTF-32 USV to code unit mapping
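The identity mapping can be sketched as follows. The function names are illustrative, and serialising the code unit as four big-endian bytes is an assumption made here only so the result can be compared with Python's own `utf-32-be` codec.

```python
import struct


def utf32_code_unit(scalar):
    """Return the 32-bit code unit C for scalar value U; the mapping is U = C."""
    if not 0 <= scalar <= 0x10FFFF or 0xD800 <= scalar <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    return scalar


def utf32_be_bytes(scalar):
    """Serialise the code unit as four bytes, most significant byte first."""
    return struct.pack(">I", utf32_code_unit(scalar))


print((0x10FFFF).bit_length())       # 21 bits cover the whole codespace
print(utf32_be_bytes(0x10011).hex()) # 00010011
```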
2 UTF-16
The UTF-16 encoding form was formally incorporated into Unicode as part of TUS 2.0. The current definitions for UTF-16 are specified
in TUS 3.0.1
If CH and CL represent the high and low surrogate code units of a surrogate pair, the Unicode scalar value U
that the pair encodes is:
U = (CH – 0xD800) * 0x400 + (CL – 0xDC00) + 0x10000
Likewise, determining the high and low surrogate values for a given Unicode scalar value is fairly
straightforward. Assuming the variables CH, CL and U as above, and that U is in the range U+10000..U+10FFFF,
CH = (U – 0x10000) \ 0x400 + 0xD800
CL = (U – 0x10000) mod 0x400 + 0xDC00
where “\” represents integer division (returns only integer portion, rounded down), and “mod” represents the
modulo operator.
Expressing the mapping in terms of a mapping of bits between the binary representations of scalar values and
code units, the UTF-16 mapping is as shown in Table 2:
Codepoint range                  Unicode scalar value (binary)  Code units (binary)
U+0000..U+D7FF, U+E000..U+FFFF   00000xxxxxxxxxxxxxxxx          xxxxxxxxxxxxxxxx
U+10000..U+10FFFF                uuuuuxxxxxxyyyyyyyyyy          110110wwwwxxxxxx 110111yyyyyyyyyy (where uuuuu = wwww + 1)
Table 2 UTF-16 USV to code unit mapping
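The surrogate arithmetic can be transcribed almost directly into Python, where `//` is integer division and `%` is the modulo operator. The function names below are illustrative assumptions; the sample codepoint U+10011 is the one used later in Table 5.

```python
def to_surrogates(scalar):
    """Split a supplementary-plane scalar value into (high, low) surrogates."""
    if not 0x10000 <= scalar <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = scalar - 0x10000
    return offset // 0x400 + 0xD800, offset % 0x400 + 0xDC00


def from_surrogates(high, low):
    """Recombine a surrogate pair into the scalar value it encodes."""
    return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000


high, low = to_surrogates(0x10011)
print(f"{high:04X} {low:04X}")            # D800 DC11
print(f"{from_surrogates(high, low):X}")  # 10011
```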
3 UTF-8
The UTF-8 encoding form was formally incorporated into Unicode as part of TUS 2.0. The current definitions for
UTF-8 are specified in TUS 3.1.[2] As with the other encoding forms, calculating a Unicode scalar value from the
8-bit code units in a UTF-8 sequence is a matter of simple arithmetic. In this case, however, the calculation
depends upon the number of bytes in the sequence. Similarly, the calculation of code units from a scalar value
must be expressed differently for different ranges of scalar values.
Let us consider first the relationship between bits in the binary representation of codepoints and code units. This
is shown for UTF-8 in Table 3:
Codepoint range                  Scalar value (binary)  Byte 1    Byte 2    Byte 3    Byte 4
U+0000..U+007F                   00000000000000xxxxxxx  0xxxxxxx
U+0080..U+07FF                   0000000000yyyyyxxxxxx  110yyyyy  10xxxxxx
U+0800..U+D7FF, U+E000..U+FFFF   00000zzzzyyyyyyxxxxxx  1110zzzz  10yyyyyy  10xxxxxx
U+10000..U+10FFFF                uuuzzzzzzyyyyyyxxxxxx  11110uuu  10zzzzzz  10yyyyyy  10xxxxxx
Table 3 UTF-8 USV to code unit mapping
Note
There is a slight difference between Unicode and ISO/IEC 10646 in how they define UTF-8 since Unicode limits
it to the roughly one million characters possible in Unicode’s codespace, while for the ISO/IEC standard, it can
access the entire 31-bit codespace. For all practical purposes, this difference is irrelevant since the ISO/IEC
codespace is effectively limited to match that of Unicode, but you may encounter differing descriptions on
occasion.
As mentioned in Section 4.2 of “Understanding Unicode™”, UTF-8 byte sequences have certain interesting
properties. These can be seen from the table above. Firstly, note the high-order bits in non-initial bytes as opposed
to sequence-initial bytes. By looking at the first two bits, you can immediately determine whether a code unit is
an initial byte in a sequence or is a following byte. Secondly, by counting the high-order bits set to one
before the first zero bit in the first byte of the sequence, you can immediately tell how long the sequence is: if
no high-order bits are set to one, then the sequence contains exactly one byte. Otherwise, the number of leading
one bits is equal to the total number of bytes in the sequence.
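The two properties just described can be sketched as a small classifier. The function name and the returned labels are illustrative assumptions; the sketch inspects only the high-order bit pattern of one byte and does not check whether a whole sequence is legal.

```python
def classify_utf8_byte(byte):
    """Return (role, sequence_length) based on the high-order bits of one byte."""
    if byte & 0b1000_0000 == 0b0000_0000:
        return "initial", 1          # 0xxxxxxx: single-byte sequence
    if byte & 0b1100_0000 == 0b1000_0000:
        return "following", None     # 10xxxxxx: non-initial byte
    if byte & 0b1110_0000 == 0b1100_0000:
        return "initial", 2          # 110yyyyy
    if byte & 0b1111_0000 == 0b1110_0000:
        return "initial", 3          # 1110zzzz
    if byte & 0b1111_1000 == 0b1111_0000:
        return "initial", 4          # 11110uuu
    return "invalid", None           # patterns not listed in Table 3


for byte in "\u20ac".encode("utf-8"):  # EURO SIGN, bytes E2 82 AC
    print(f"{byte:08b}", classify_utf8_byte(byte))
```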
Table 3 also reveals the other interesting characteristic of UTF-8 that was described in Section 4.2 of
“Understanding Unicode™”. Note that characters in the range U+0000..U+007F are represented using a single
byte. The characters in this range match ASCII codepoint for codepoint. Thus, any data encoded in ASCII is
automatically also encoded in UTF-8.
Having seen how the bits compare, let us consider how code units can be calculated from scalar values, and vice
versa. If U represents the value of a Unicode scalar value and C1, C2, C3 and C4 represent bytes in a UTF-8 byte
sequence (in order), then the value of a Unicode scalar value U can be calculated as follows:
If a sequence has one byte, then
    U = C1
Else if a sequence has two bytes, then
    U = (C1 – 192) * 64 + C2 – 128
Else if a sequence has three bytes, then
    U = (C1 – 224) * 4,096 + (C2 – 128) * 64 + C3 – 128
Else
    U = (C1 – 240) * 262,144 + (C2 – 128) * 4,096 + (C3 – 128) * 64 + C4 – 128
End if
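The decoding arithmetic above translates directly into Python. This is a sketch under the assumption that the input is already a single, well-formed UTF-8 sequence; the function name is illustrative and no validation of the bytes is attempted.

```python
def scalar_from_utf8(sequence):
    """Compute the scalar value U from the bytes C1..C4 of one UTF-8 sequence."""
    length = len(sequence)
    if length == 1:
        return sequence[0]
    if length == 2:
        c1, c2 = sequence
        return (c1 - 192) * 64 + c2 - 128
    if length == 3:
        c1, c2, c3 = sequence
        return (c1 - 224) * 4096 + (c2 - 128) * 64 + c3 - 128
    if length == 4:
        c1, c2, c3, c4 = sequence
        return (c1 - 240) * 262144 + (c2 - 128) * 4096 + (c3 - 128) * 64 + c4 - 128
    raise ValueError("a UTF-8 sequence has one to four bytes")


print(f"U+{scalar_from_utf8(bytes([0xF0, 0x90, 0x80, 0x91])):04X}")  # U+10011
```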
Going the other way, given a Unicode scalar value U, then the UTF-8 byte sequence can be calculated as follows:
If U <= U+007F, then
    C1 = U
Else if U+0080 <= U <= U+07FF, then
    C1 = U \ 64 + 192
    C2 = U mod 64 + 128
Else if U+0800 <= U <= U+D7FF, or if U+E000 <= U <= U+FFFF, then
    C1 = U \ 4,096 + 224
    C2 = (U mod 4,096) \ 64 + 128
    C3 = U mod 64 + 128
Else
    C1 = U \ 262,144 + 240
    C2 = (U mod 262,144) \ 4,096 + 128
    C3 = (U mod 4,096) \ 64 + 128
    C4 = U mod 64 + 128
End if
where “\” represents integer division (returns only integer portion, rounded down), and “mod” represents the
modulo operator.
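The encoding direction can be sketched the same way, with `//` for integer division and `%` for the modulo operator. The function name is an illustrative assumption; surrogate values are rejected because they are not scalar values, matching the gap in the third range above.

```python
def utf8_from_scalar(scalar):
    """Compute the UTF-8 byte sequence for one Unicode scalar value."""
    if scalar < 0 or scalar > 0x10FFFF or 0xD800 <= scalar <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if scalar <= 0x7F:
        return bytes([scalar])
    if scalar <= 0x7FF:
        return bytes([scalar // 64 + 192, scalar % 64 + 128])
    if scalar <= 0xFFFF:
        return bytes([scalar // 4096 + 224,
                      (scalar % 4096) // 64 + 128,
                      scalar % 64 + 128])
    return bytes([scalar // 262144 + 240,
                  (scalar % 262144) // 4096 + 128,
                  (scalar % 4096) // 64 + 128,
                  scalar % 64 + 128])


print(utf8_from_scalar(0x10011).hex())  # f0908091
```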
If you examine the mapping in Table 3 carefully, you may notice that by ignoring the range constraints in the left-
hand column, certain codepoints can potentially be represented in more than one way. For example, substituting
U+0041 LATIN CAPITAL LETTER A into the table gives the following possibilities:
Codepoint              Pattern                Byte 1    Byte 2    Byte 3    Byte 4
000000000000001000001  00000000000000xxxxxxx  01000001
000000000000001000001  0000000000yyyyyxxxxxx  11000001  10000001
000000000000001000001  00000zzzzyyyyyyxxxxxx  11100000  10000001  10000001
000000000000001000001  uuuzzzzzzyyyyyyxxxxxx  11110000  10000000  10000001  10000001
Table 4 “UTF-8” non-shortest sequences for U+0041
Obviously, having these alternate encoded representations for the same character is not desirable. Accordingly,
the UTF-8 specification stipulates that the shortest possible representation must be used. In TUS 3.1, this was
made more explicitly clear by specifying exactly what UTF-8 byte sequences are or are not legal. Thus, in the
example above, each of the sequences other than the first is an illegal code unit sequence.
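A strict decoder can be used to confirm that the non-shortest forms in Table 4 are rejected. The sketch below relies on the behaviour of Python 3's built-in `utf-8` codec in its default strict mode; the helper name `is_legal_utf8` is illustrative.

```python
def is_legal_utf8(data):
    """Return True if Python's strict UTF-8 decoder accepts the byte string."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


candidates = [
    bytes([0b01000001]),                                      # shortest form
    bytes([0b11000001, 0b10000001]),                          # two-byte form
    bytes([0b11100000, 0b10000001, 0b10000001]),              # three-byte form
    bytes([0b11110000, 0b10000000, 0b10000001, 0b10000001]),  # four-byte form
]
for candidate in candidates:
    print(candidate.hex(), is_legal_utf8(candidate))
```

Only the first candidate decodes, to U+0041; the other three raise a decoding error.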
Similarly, a supplementary-plane character can be encoded directly into a four-byte UTF-8 sequence, but
someone might (possibly from misunderstanding) choose to map the codepoint into a UTF-16 surrogate pair, and
then apply the UTF-8 mapping to each of the surrogate code units to get a pair of three-byte sequences. To
illustrate, consider the following:
Supplementary-plane codepoint  U+10011
Normal UTF-8 byte sequence     0xF0 0x90 0x80 0x91
UTF-16 surrogate pair          0xD800 0xDC11
“UTF-8” mapping of surrogates  0xED 0xA0 0x80 0xED 0xB0 0x91
Table 5 UTF-8-via-surrogates representation of supplementary-plane character
Again, the Unicode Standard expects the shortest representation to be used for UTF-8. For certain reasons, non-
shortest representations of supplementary-plane characters are referred to as irregular code unit sequences
rather than illegal code unit sequences. The distinction here is subtle: software that conforms to the Unicode
Standard is allowed to interpret these irregular sequences as the corresponding supplementary-plane characters,
but is not allowed to generate these irregular sequences. In certain situations, though, software will want to reject
such irregular UTF-8 sequences (for instance, where these might otherwise be used to avoid security systems),
and in these cases the Standard allows conformant software to ignore or reject these sequences, or remove them
from a data stream.
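The six-byte sequence in Table 5 can be reproduced in Python 3 with the `surrogatepass` error handler, which lets the `utf-8` and `utf-16` codecs process surrogate code points that the strict mode refuses. The function names are illustrative assumptions; the sketch shows one way a process could generate and interpret such a sequence, while Python's default strict decoder takes the "reject" option.

```python
def encode_via_surrogates(scalar):
    """Apply the 3-byte UTF-8 mapping to each surrogate of a supplementary character."""
    offset = scalar - 0x10000
    pair = chr(0xD800 + offset // 0x400) + chr(0xDC00 + offset % 0x400)
    return pair.encode("utf-8", "surrogatepass")


def interpret_irregular(data):
    """Read a UTF-8-via-surrogates sequence and rejoin the surrogate pair."""
    pair = data.decode("utf-8", "surrogatepass")
    # Round-tripping through UTF-16 combines the two surrogates into one character.
    return pair.encode("utf-16-be", "surrogatepass").decode("utf-16-be")


irregular = encode_via_surrogates(0x10011)
print(irregular.hex())                           # eda080edb091
print(chr(0x10011).encode("utf-8").hex())        # f0908091
print(f"U+{ord(interpret_irregular(irregular)):X}")  # U+10011
```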
The main motivation for making the distinction and for considering these 6-byte sequences to be irregular rather
than illegal is this: suppose a process is re-encoding a data stream from UTF-16 to UTF-8, and suppose that the
source data stream had been interrupted so that it ended with the beginning of a surrogate pair. It may be that this
segment of the data will later be re-united with the remainder of the data, it also having been re-encoded in UTF-8.
So, we are assuming that there are two segments of data out there: one ending with an unpaired high surrogate,
and one beginning with an unpaired low surrogate.
Now, as each segment of the data is being trans-coded from UTF-16 to UTF-8, the question arises as to what
should be done with the unpaired surrogate code units. If they are ignored, then the result after the data is
reassembled will be that a character has been lost. A more graceful way to deal with the data would be for the
trans-coding process to translate the unpaired surrogate into a corresponding 3-byte UTF-8 sequence, and then
leave it up to a later receiving process to decide what to do with it. Then, if the receiving process gets the data
segments assembled again, that character will still be part of the information content of the data. The only
problem is that now it is in a 6-byte pseudo-UTF-8 sequence. Defining these as irregular rather than illegal is
intended to allow that character to be retained over the course of this overall process in a form that conformant
software is allowed to interpret, even if it would not be allowed to generate it that way.