Department of Electrical Engineering and Information Technology
Computer Engineering Group
Comparative Implementation of AES-based Crypto-Cores
Master's Thesis
Submitted to Prof. Dr. Sybille Hellebrand
Muhammad Asim Zahid
Declaration
I declare that I have developed and written the enclosed Master Thesis completely by
myself, and have not used sources or means without declaration in the text. Any
thoughts from others or literal quotations are clearly marked. The Master Thesis was not
used in the same or in a similar version to achieve an academic grading or is being
published elsewhere.
Location, Date _____________________ Signature ___________________________
Contents
1 Introduction
1.1 Motivation: Industry 4.0 and Data Security
1.2 Objectives
2 Theoretical and Mathematical Preliminaries
2.1 Essentials
2.2 Standards and Information
2.3 Mathematical Operations
2.3.1 Addition and Subtraction
2.3.2 Multiplication
3 Advanced Encryption Standard
3.1 Introduction
3.1.1 Features
3.1.2 Usage
3.2 Encryption
3.2.1 SubBytes Transformation
3.2.2 ShiftRows Transformation
3.2.3 MixColumns Transformation
3.2.4 AddRoundKey Transformation
3.2.4.1 Key Expansion
3.2.4.2 Round Key Addition
3.3 Decryption
3.3.1 InverseShiftRows Transformation
3.3.2 InverseSubBytes Transformation
3.3.3 InverseMixColumns Transformation
3.3.4 AddRoundKey Transformation
4 Encryption & Authentication Modes
4.1 Encryption Modes
4.1.1 Electronic Codebook (ECB)
4.1.2 Cipher Block Chaining (CBC)
4.1.3 Counter Mode (CTR)
4.2 Counter with CBC-MAC (CCM)
4.2.1 Introduction
4.2.2 Algorithm
4.3 Galois/Counter Mode (GCM)
4.3.1 Introduction
4.3.2 Algorithm
4.3.2.1 GHASH
4.3.2.2 GCTR
4.3.3 Authenticated Encryption
4.3.4 Authenticated Decryption
5 Literature Review
5.1 AES Designs
5.1.1 Iterative Loop Structure
5.1.2 Memory-Based AES Hardware Implementation Designs
5.1.3 Unfolded / Parallel Structure Based Designs
5.1.4 Sub-Pipelined or Stage-Level Pipelined Structure
5.2 Galois Field Multiplier (GFM) Designs
6 Design & Implementation Results
6.1 Design
6.1.1 AES Core
6.1.2 Galois Field Multiplier (GFM) Core
6.1.3 Complete Design
6.2 Implementation and Results
6.2.1 Implementation Platform
6.2.2 Tests & Results
6.3 Conclusion
Appendix
Bibliography
List of Figures
Figure 2.1 Diagrammatic Representation of Encryption/Decryption
Figure 2.2 Diagrammatic Representation of Authentication
Figure 3.1 Diagrammatic Representation of a State
Figure 3.2 Flow Diagram of Round Transformations in AES Encryption
Figure 3.3 SubBytes Transformation
Figure 3.4 ShiftRows Transformation
Figure 3.5 Diagrammatic Representation of MixColumns
Figure 3.6 Flow Diagram of Key Expansion Algorithm
Figure 3.7 Diagrammatic Representation of RotWord
Figure 3.8 Diagrammatic Representation of AddRoundKey Transformation
Figure 3.9 Flow Diagram Representation of AES Decryption
Figure 3.10 Diagrammatic Representation of InverseShiftRows Transformation
Figure 3.11 Diagrammatic Representation of InverseMixColumns
Figure 4.1 Diagrammatic Representation of ECB Mode
Figure 4.2 Diagrammatic Representation of CBC Mode
Figure 4.3 Diagrammatic Representation of CTR Mode
Figure 4.4 Diagrammatic Representation of CBC-MAC
Figure 4.5 Diagrammatic Representation of the GHASH Function
Figure 4.6 Diagrammatic Representation of GF(2^128) Multiplication
Figure 4.7 Diagrammatic Representation of the Hashing Sequence
Figure 4.8 Diagrammatic Representation of Authenticated Encryption
Figure 4.9 Diagrammatic Representation of Authenticated Decryption
Figure 5.1 Iterative Loop Implementation
Figure 5.2 Diagrammatic Representation of Memory-Based AES Structure
Figure 5.3 Unrolled AES Structure
Figure 5.4 (a) Pipelined AES Encryption Datapath, (b) Pipelined Key Scheduler
Figure 5.5 Stages for Sub-Pipelining
Figure 6.1 Internal Structure of AES Core
Figure 6.2 Diagrammatic Representation of Encryption Block
Figure 6.3 Diagrammatic Representation of Pipelined GF(2^128) Multiplier
Figure 6.4 Diagrammatic Representation of Authentication Block
Figure 6.5 State Diagram of Crypto-Core State Machine
Figure 6.6 Complete Design Layout
Figure 6.7 Graphical Comparison of Frequency and Throughput
Figure 6.8 Graphical Comparison of Area Usage
Figure 6.9 Graphical Comparison of Efficiency
List of Tables
Table 2.1 GF(2^8) Polynomial Representation
Table 3.1 AES Number of Rounds with respect to Key Length
Table 3.2 S-box: Substitution values for byte 'xy' (hexadecimal)
Table 3.3 Inverse S-box: Substitution values for byte 'xy' (hexadecimal)
Table 6.1 I/O Signals of Crypto-Core
Table 6.2 Comparison of Operational Frequency and Throughput
Table 6.3 Comparison of Area Usage and Efficiency
1 Introduction
As its title suggests, this thesis compares AES (Advanced Encryption Standard) crypto-cores in order to find a well-optimized approach to secure data transfer. There is a growing need for safe and secure data transfer, and data security and integrity are needed to prevent unauthorized access to data. Security protocols are used in everyday computing, and whenever the security of a standard protocol is compromised, it is either re-designed immediately or replaced with an improved standard; for example, the Data Encryption Standard (DES) was replaced by AES [6].
A good security protocol should have a short processing time and high throughput, should be reliable and, where required, should provide strong authentication. The purpose of encryption is that only someone who is authorized to read a message can read it, by decrypting it with a valid decryption algorithm.
Every sector aims to choose an encryption scheme which meets its security standards. A good encryption scheme is one for which deciphering the encrypted data is close to impossible, although in most cases even the most promising algorithm can be broken with a suitable attack. Therefore, choosing the right encryption scheme, keeping in mind the known threats to the data in question, is very important.
1.1 Motivation: Industry 4.0 and Data Security
The fourth industrial revolution has set the trend for the exchange of data, and it rests on four principles. The first is the principle of interoperability, according to which sensors, devices, people and machines are connected and communicate through the Internet of Things (IoT) and the Internet of People (IoP). The second is the principle of information transparency, i.e., information systems create a virtual copy of the physical world by combining raw sensor data with higher-value context information. The third is the principle of technical assistance, which covers two abilities: the first is to support humans in solving problems and making informed decisions in a very short time through comprehensible aggregation and visualization of information; the second is to support humans physically through cyber-physical systems. The fourth and last is the principle of decentralized decisions, in which cyber-physical systems perform their tasks autonomously and make their own decisions. Only in exceptional cases is a task delegated to a higher level. [3]
Industry 4.0 has created a growing need for systems with integrated cryptography, and many organizations have therefore proposed standards for data security and integrity that meet the Industry 4.0 requirements. Open Platform Communications Unified Architecture (OPC-UA), which is maintained by the non-profit OPC Foundation, is becoming more and more popular as a standard for safe, platform-independent and reliable data exchange. It is developed in association with many researchers, manufacturers and users, and it is continuously improved to remain competitive. In order to implement the vision of Industry 4.0, OPC-UA suggests that energy and resources must be used more effectively and in minimum time to reduce complexity. This requires certain activities, including the automation and optimization of systems and the digitalization of all industrial sectors. This research is based on the security standard set by OPC-UA, which includes scalable mechanisms and the introduction of fast but area-efficient crypto-cores. [1], [2]
1.2 Objectives
Secure data transfer requires the implementation of a good and reliable encryption standard. The Advanced Encryption Standard (AES) is a widely used standard which, in past years, has been considered one of the most reliable encryption standards available. AES has therefore been chosen as the encryption standard in this research, and the main objectives of this thesis are the following:
• Study various AES techniques to gain a better understanding for further optimization.
• Implement a fast pipelined AES technique as an IP-core with considerably lower area cost.
• Implement two AES modes, i.e., Galois/Counter Mode (GCM) and Counter with CBC-MAC (CCM), as IP-cores for an accurate comparison.
• Devise a method for implementing the Galois Field multiplication in GF(2^128) for GCM that has considerably lower area cost but is still fast.
2 Theoretical and Mathematical Preliminaries
This chapter discusses the theoretical and mathematical preliminaries that are used in the following chapters.
2.1 Essentials
The main idea behind this thesis is to provide a security platform in hardware that makes the transfer of information from one node to another much more secure at the hardware level. Before going into the details of how this is done, the basic essentials are defined.
Encryption/Decryption: Encryption is a process that protects data in such a way that only authorized units (people, machines, etc.) can access it. This is done by encoding the data with a special key so that it can only be decoded by whoever has that key. For example, if some information has to be exchanged between two units, A and B, and it has to be ensured that no unauthorized unit can access that data, encryption is used. The data might still be accessible, but it will be unreadable for anyone who doesn't have the authorization key. The process of decoding the encrypted data back into the original data is called decryption. Figure 2.1 illustrates encryption and decryption.
Figure 2.1 Diagrammatic Representation of Encryption/Decryption
[Figure: the plaintext "Hello" is encrypted to the unreadable ciphertext "f3#7r", which is decrypted back to "Hello".]
Authentication: Authentication is a process that verifies that the received data is identical to the data that was sent, i.e., that the data has not been changed or modified in any way during transfer. This is achieved by adding a verification tag, known as a Message Authentication Code (MAC), to the message. On the sender's end, the MAC algorithm takes a secret key and the message to be sent and generates a MAC, which is sent along with the message. On the receiver's end, the received message is passed through the same MAC algorithm with the same secret key to generate a MAC, which is compared with the received MAC to determine whether the message is authentic or has been tampered with. Figure 2.2 shows how authentication works.
Figure 2.2 Diagrammatic Representation of Authentication
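The sender/receiver procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses HMAC-SHA-256 from the Python standard library as a stand-in MAC algorithm rather than the CBC-MAC or GHASH constructions examined later in this thesis, and the function names, key and messages are hypothetical values chosen for the example.

```python
import hashlib
import hmac

def make_tag(key: bytes, message: bytes) -> bytes:
    """Sender side: derive a MAC tag from the secret key and the message."""
    return hmac.new(key, message, hashlib.sha256).digest()

def is_authentic(key: bytes, message: bytes, received_tag: bytes) -> bool:
    """Receiver side: recompute the tag and compare it with the received one."""
    expected = make_tag(key, message)
    # compare_digest avoids leaking information through comparison timing
    return hmac.compare_digest(expected, received_tag)

# Hypothetical values, chosen only for illustration
key = b"shared-secret-key"
message = b"Hello"
tag = make_tag(key, message)

print(is_authentic(key, message, tag))    # message arrives unmodified
print(is_authentic(key, b"Hellp", tag))   # message was changed on the channel
```

The receiver never needs the sender's copy of the message; possession of the shared key and the received tag is enough to detect a modification.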
Crypto-Core: An IP (intellectual property) core is a block of logic or data that is used in making a field-programmable gate array (FPGA) or application-specific integrated circuit (ASIC) design. Ideally, an IP core should be entirely portable, that is, easy to insert into any vendor technology or design methodology. Universal Asynchronous Receiver/Transmitters (UARTs), central processing units (CPUs), Ethernet controllers and PCI interfaces are all examples of IP cores. Crypto-cores are another example of IP cores; they embed encryption within the hardware.
2.2 Standards and Information
The AES algorithm takes sequences of bits (binary values), referred to as blocks, as input and output. The number of bits contained in a block is known as the block length. The AES algorithm requires a secret key, known as the Cipher Key, to encrypt and decrypt data. AES can be implemented with three different key lengths, i.e., 128, 192 and 256 bits [5]. The AES algorithm used in this thesis is the 128-bit variant.
Bits: Within such sequences, the bits are numbered from 0 to block length - 1. A sequence of 8 bits is known as a byte, which is the basic unit of the AES algorithm. A sequence of 4 bytes is known as a word.
Index: The index i is the number attached to a bit. It lies in the range 0 ≤ i < 128 for a 128-bit algorithm.
Galois Field: A field containing a finite set of elements is known as a Galois Field (GF). A Galois Field is written as GF(p^n), which denotes a Galois Field of p^n elements, where p is a prime number. A Galois Field of 256 elements is therefore written as GF(2^8). This thesis considers GF(2^8), since AES operations take elements of GF(2^8) as input.
Galois Field Polynomial: The elements of a Galois Field GF(p^n) can also be written as polynomials of degree less than n. Consider an element of GF(2^8) written in binary form as {00110101}. The GF polynomial for this element is x^5 + x^4 + x^2 + 1. Since GF(2^8) has 256 elements, Table 2.1 shows only some of the polynomial representations to provide an understanding.
Binary     Conversion                                                      GF Polynomial
00000000   0·x^7 + 0·x^6 + 0·x^5 + 0·x^4 + 0·x^3 + 0·x^2 + 0·x^1 + 0·x^0   0
00000001   0·x^7 + 0·x^6 + 0·x^5 + 0·x^4 + 0·x^3 + 0·x^2 + 0·x^1 + 1·x^0   1
00000010   0·x^7 + 0·x^6 + 0·x^5 + 0·x^4 + 0·x^3 + 0·x^2 + 1·x^1 + 0·x^0   x
00000011   0·x^7 + 0·x^6 + 0·x^5 + 0·x^4 + 0·x^3 + 0·x^2 + 1·x^1 + 1·x^0   x + 1
00000100   0·x^7 + 0·x^6 + 0·x^5 + 0·x^4 + 0·x^3 + 1·x^2 + 0·x^1 + 0·x^0   x^2
00000101   0·x^7 + 0·x^6 + 0·x^5 + 0·x^4 + 0·x^3 + 1·x^2 + 0·x^1 + 1·x^0   x^2 + 1
00000110   0·x^7 + 0·x^6 + 0·x^5 + 0·x^4 + 0·x^3 + 1·x^2 + 1·x^1 + 0·x^0   x^2 + x
00000111   0·x^7 + 0·x^6 + 0·x^5 + 0·x^4 + 0·x^3 + 1·x^2 + 1·x^1 + 1·x^0   x^2 + x + 1
...        ...                                                             ...
11111101   1·x^7 + 1·x^6 + 1·x^5 + 1·x^4 + 1·x^3 + 1·x^2 + 0·x^1 + 1·x^0   x^7 + x^6 + x^5 + x^4 + x^3 + x^2 + 1
11111110   1·x^7 + 1·x^6 + 1·x^5 + 1·x^4 + 1·x^3 + 1·x^2 + 1·x^1 + 0·x^0   x^7 + x^6 + x^5 + x^4 + x^3 + x^2 + x
11111111   1·x^7 + 1·x^6 + 1·x^5 + 1·x^4 + 1·x^3 + 1·x^2 + 1·x^1 + 1·x^0   x^7 + x^6 + x^5 + x^4 + x^3 + x^2 + x + 1
Table 2.1 GF(2^8) Polynomial Representation
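The mapping from a byte to its GF(2^8) polynomial can be illustrated with a short Python sketch. The function name byte_to_poly is an assumption introduced for this example; it simply lists the powers of x whose coefficient bit is 1.

```python
def byte_to_poly(value: int) -> str:
    """Return the GF(2^8) polynomial of a byte, e.g. 0b101 -> 'x^2 + 1'."""
    if not 0 <= value <= 0xFF:
        raise ValueError("value must fit in one byte")
    terms = []
    for power in range(7, -1, -1):      # highest power first
        if (value >> power) & 1:        # coefficient of x^power is 1
            if power == 0:
                terms.append("1")
            elif power == 1:
                terms.append("x")
            else:
                terms.append(f"x^{power}")
    return " + ".join(terms) if terms else "0"

for b in (0b00000000, 0b00000011, 0b00110101, 0b11111101):
    print(f"{b:08b} -> {byte_to_poly(b)}")
```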
2.3 Mathematical Operations
The elements in the AES algorithm are interpreted as elements of GF(2^8), as explained above. Finite field elements can be added and multiplied; however, these operations differ from those used for ordinary numbers. The mathematical concepts for finite field elements are explained in this section.
2.3.1 Addition and Subtraction
Although it differs slightly from ordinary algebraic addition or subtraction, the addition or subtraction of two GF polynomials is very simple. To add two GF(2^n) polynomials, the coefficients of like powers are added and each resulting coefficient is reduced modulo 2. Reducing modulo an integer or polynomial means dividing by that integer or polynomial and taking the remainder as the answer. Subtraction of two GF(2^n) polynomials is the same as addition. So, in the field GF(2^8), both operations are addition modulo 2 or, equivalently, subtraction modulo 2.
As an example, take two GF(2^8) polynomials, A and B. The addition of these two GF polynomials results in Equation 2.1:
(2.1)
Equation 2.1 shows that the same result is obtained by an exclusive-OR (XOR) operation between A and B, so the sum can also be written as A ⊕ B.¹
Thus, the addition of two GF(2^8) polynomials can also be carried out as an XOR operation between the two polynomials, which makes implementation much easier.
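As a small illustration of addition as XOR, the sketch below adds two bytes taken from the worked example in FIPS-197, {57} and {83}; these are not the polynomials A and B of Equation 2.1. The function name gf_add is an assumption made for this example.

```python
def gf_add(a: int, b: int) -> int:
    """Add (or subtract) two GF(2^8) elements: coefficient-wise addition mod 2."""
    return a ^ b

a = 0x57   # x^6 + x^4 + x^2 + x + 1
b = 0x83   # x^7 + x + 1
total = gf_add(a, b)
print(f"{total:02x}")   # d4, i.e. x^7 + x^6 + x^4 + x^2

# Subtraction is the same operation, so adding b again restores a
print(gf_add(total, b) == a)
```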
2.3.2 Multiplication
To multiply two polynomials in the Galois Field GF(2^n), their corresponding polynomials are first multiplied just as in ordinary algebra, except that the coefficients are only 0 and 1; many terms drop out because 1 + 1 = 0, which makes the calculation easier. The result is then reduced modulo an irreducible polynomial of degree n. For the AES algorithm in GF(2^8), the irreducible polynomial is shown in Equation 2.2:
$$m(x) = x^8 + x^4 + x^3 + x + 1 \tag{2.2}$$
The multiplication is denoted by ●. Implementing the multiplication of finite field elements is somewhat more complex than addition. Reduction modulo m(x) ensures that the resulting binary polynomial is of degree less than 8 and can therefore be represented by a byte. At the byte level, there is no simple operation for multiplication comparable to XOR for addition.
As an example, take two GF(2^8) polynomials, A and B; their multiplication is worked through in Equations 2.3 to 2.10.
(2.3)
This implies:
(2.4)
And,
(2.5)
The remainder of the product modulo m(x) is computed by polynomial long division. First, the leading term of the quotient is chosen and multiplied with m(x), which results in:
(2.6)
Subtracting this from the product gives:
(2.7)
as the remainder. The degree of the polynomial in this remainder is 10, so x^2 is selected as the next quotient term. Multiplying m(x) with x^2 gives:
(2.8)
Subtracting 2.8 from 2.7, the remainder is:
(2.9)
Since the terms with a factor of 2 in 2.9 are dropped, as mentioned previously, and since the final result is the absolute value of the remainder, the result is:
(2.10)
¹ ⊕ represents the XOR of two elements.
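The multiply-then-reduce procedure can be sketched in Python as follows. This is an illustrative sketch, not the multiplier design of this thesis: the names AES_MODULUS and gf_mul are assumptions, the test values {57}, {83} and {13} are the worked examples from FIPS-197 rather than the polynomials used above, and subtraction is carried out directly as XOR, so no terms with a factor of 2 ever appear.

```python
AES_MODULUS = 0x11B  # m(x) = x^8 + x^4 + x^3 + x + 1

def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^8) elements and reduce the product modulo m(x)."""
    # Step 1: polynomial multiplication; XOR plays the role of addition
    product = 0
    for shift in range(8):
        if (b >> shift) & 1:
            product ^= a << shift
    # Step 2: polynomial long division by m(x), keeping only the remainder
    for degree in range(14, 7, -1):     # the product has degree at most 14
        if (product >> degree) & 1:
            product ^= AES_MODULUS << (degree - 8)
    return product

print(f"{gf_mul(0x57, 0x83):02x}")   # c1
print(f"{gf_mul(0x57, 0x13):02x}")   # fe
```

Each pass of the second loop corresponds to choosing one quotient term x^(degree - 8) and subtracting the matching multiple of m(x), exactly as in the long division above.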
3 Advanced Encryption Standard
For the encryption of commercial and sensitive computer data, the US government adopted the Data Encryption Standard (DES) as an official Federal Information Processing Standard (FIPS). Since this was the first encryption algorithm approved by the US government, public and private industry requiring strong encryption welcomed it readily, and it was adopted in a wide variety of embedded systems, smart cards, SIM cards and network devices. For any cipher, the most basic method of attack is brute force, which involves trying each key until the right one is found. Encryption strength therefore depends directly on the key size. DES uses a 64-bit key, of which eight bits are used for parity checks, effectively limiting the key to 56 bits. DES uses the same key to encrypt and decrypt a message, and as the processing power of computers grew, 56-bit keys came to be considered too small. This made DES susceptible to brute-force attacks, and it soon began losing its usefulness. In 1997, the U.S. National Institute of Standards and Technology (NIST) started looking for a better alternative to DES. In 2001, it selected the Advanced Encryption Standard (AES) as a replacement.
3.1 Introduction
The Advanced Encryption Standard (AES) [5], also known as Rijndael after its two Belgian designers, the cryptographers Joan Daemen and Vincent Rijmen, was published by NIST in 2001. It is the most commonly used encryption standard throughout the world.
AES is a symmetric block cipher that operates on 128-bit blocks of input and output data. It is implemented in both software and hardware and is used to protect classified information and to encrypt sensitive data.
3.1.1 Features
AES is a mathematically more efficient and elegant cryptographic algorithm than its predecessor DES, but its main strength rests in its key length options. It is based on a design principle known as a substitution-permutation network, a combination of both substitution and permutation, and is fast in both software and hardware [31]. The algorithm encrypts and decrypts blocks using a secret key with a size of 128, 192 or 256 bits. One of the main features of AES is its simplicity, which is achieved by repeatedly combining substitution and permutation computations over several rounds, i.e., AES encrypts/decrypts a 128-bit plaintext/ciphertext by applying the same round transformation a number of times that depends on the key size.
Variant    Key Length   Block Length   Number of Rounds
AES-128    128-bit      128-bit        10
AES-192    192-bit      128-bit        12
AES-256    256-bit      128-bit        14
Table 3.1 AES Number of Rounds with respect to Key Length
The key length to use depends on the desired security level. Today, AES-128 is predominant and supported by most hardware implementations. It is also the variant that this implementation focuses on, since it is the preferred variant for the GCTR module of AES-GCM, which is used to provide authenticity.
3.1.2 Usage
The AES standard is used by government departments and agencies whenever unclassified sensitive information is considered important enough to be protected cryptographically.
Other cryptographic algorithms approved by FIPS are also available for use in addition to or in lieu of this standard. In past years, commercial and private organizations have also turned to this standard for the security of their information and systems.
3.2 Encryption
AES encryption is based on the design principle commonly referred to as a substitution-permutation network, a combination of both substitution and permutation, which is called the Cipher. In plain words, a cipher is any method of encrypting a text, known as plaintext, so that its readability and/or meaning is concealed. It is a coded or disguised way of writing a message, and this coding is known as encryption. Sometimes the encrypted text itself is also referred to as cipher, but the term generally used is ciphertext. The word is understood to originate from the Arabic word Sifr, which means empty or zero. AES operates on a 4 × 4 matrix of bytes, referred to as the State S, although certain variants of Rijndael operate on a larger block size with more columns in the State [5]. The majority of AES calculations are performed in a special finite field. For instance, if 16 bytes b0, b1, b2, ..., b15 are considered, they are represented by the State shown in Equation (3.1).
$$S = \begin{bmatrix} b_0 & b_4 & b_8 & b_{12} \\ b_1 & b_5 & b_9 & b_{13} \\ b_2 & b_6 & b_{10} & b_{14} \\ b_3 & b_7 & b_{11} & b_{15} \end{bmatrix} \tag{3.1}$$
The diagrammatic representation of a State is shown in Figure 3.1.
Figure 3.1 Diagrammatic Representation of a State
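The arrangement of the 16 input bytes in the State can be sketched in Python as shown below. The sketch assumes the column-by-column filling order defined in FIPS-197, and the function names bytes_to_state and state_to_bytes are hypothetical helpers introduced for this illustration.

```python
def bytes_to_state(block: bytes) -> list[list[int]]:
    """Arrange 16 input bytes as a 4x4 State, filling column by column."""
    if len(block) != 16:
        raise ValueError("AES operates on 16-byte blocks")
    # state[r][c] holds input byte b[r + 4*c]
    return [[block[r + 4 * c] for c in range(4)] for r in range(4)]

def state_to_bytes(state: list[list[int]]) -> bytes:
    """Inverse mapping: read the State back out column by column."""
    return bytes(state[r][c] for c in range(4) for r in range(4))

state = bytes_to_state(bytes(range(16)))   # b0 = 0, b1 = 1, ..., b15 = 15
for row in state:
    print(row)
```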
Initially, the input of the cipher is copied to the State array using the convention described above. After an initial Round Key Addition, the State array is transformed by applying a round function 10, 12 or 14 times, depending on the key length as discussed previously.
The cipher algorithm of a 128-bit cipher is shown as a flow diagram in Figure 3.2. The individual transformations (AddRoundKey, ShiftRows, SubBytes and MixColumns) are explained in detail later in the chapter.
As shown in Figure 3.2, all Nr rounds are identical with the exception of the final round (round Nr = 10), which does not include the MixColumns transformation.
Figure 3.2 Flow Diagram of Round Transformations in AES Encryption
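The order in which the transformations are applied can be written down as a short Python sketch. It only lists the step names and does not perform any encryption; the names ROUNDS_BY_KEY_BITS and cipher_schedule are assumptions introduced for this illustration.

```python
ROUNDS_BY_KEY_BITS = {128: 10, 192: 12, 256: 14}   # values from Table 3.1

def cipher_schedule(key_bits: int = 128) -> list[str]:
    """List the transformations of AES encryption in the order they are applied."""
    nr = ROUNDS_BY_KEY_BITS[key_bits]
    steps = ["AddRoundKey"]                            # initial round key addition
    for _ in range(nr - 1):                            # rounds 1 .. Nr-1
        steps += ["SubBytes", "ShiftRows", "MixColumns", "AddRoundKey"]
    steps += ["SubBytes", "ShiftRows", "AddRoundKey"]  # final round, no MixColumns
    return steps

schedule = cipher_schedule(128)
print(len(schedule))                   # total number of transformation steps
print(schedule.count("MixColumns"))    # MixColumns is skipped in the final round
print(schedule.count("AddRoundKey"))   # one round key per round plus the initial one
```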
3.2.1 SubBytes Transformation
The substitution of bytes using an 8-bit substitution table is known as the SubBytes transformation. It operates on each byte of the State independently, using a substitution table (S-box) [4].
Figure 3.3 shows how the SubBytes transformation is applied to the State.
Figure 3.3 SubBytes Transformation
[Figure 3.2 flow: Start → Key Expansion → AddRoundKey → for rounds Nr < 10: SubBytes → ShiftRows → MixColumns → AddRoundKey → for round Nr = 10: SubBytes → ShiftRows → AddRoundKey → End]
[Figure 3.3: each State byte a(i,j) is passed through the S-box to give the byte b(i,j) at the same position]
The S-box is invertible and is derived by taking the multiplicative inverse in GF(2^8), which has good non-linearity properties, followed by an affine transformation. In the inversion step, the element {00} is mapped to itself. Table 3.2 shows the substitution table used in the AES encryption algorithm [5].
x\y   0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
0    63 7c 77 7b f2 6b 6f c5 30 01 67 2b fe d7 ab 76
1    ca 82 c9 7d fa 59 47 f0 ad d4 a2 af 9c a4 72 c0
2    b7 fd 93 26 36 3f f7 cc 34 a5 e5 f1 71 d8 31 15
3    04 c7 23 c3 18 96 05 9a 07 12 80 e2 eb 27 b2 75
4    09 83 2c 1a 1b 6e 5a a0 52 3b d6 b3 29 e3 2f 84
5    53 d1 00 ed 20 fc b1 5b 6a cb be 39 4a 4c 58 cf
6    d0 ef aa fb 43 4d 33 85 45 f9 02 7f 50 3c 9f a8
7    51 a3 40 8f 92 9d 38 f5 bc b6 da 21 10 ff f3 d2
8    cd 0c 13 ec 5f 97 44 17 c4 a7 7e 3d 64 5d 19 73
9    60 81 4f dc 22 2a 90 88 46 ee b8 14 de 5e 0b db
a    e0 32 3a 0a 49 06 24 5c c2 d3 ac 62 91 95 e4 79
b    e7 c8 37 6d 8d d5 4e a9 6c 56 f4 ea 65 7a ae 08
c    ba 78 25 2e 1c a6 b4 c6 e8 dd 74 1f 4b bd 8b 8a
d    70 3e b5 66 48 03 f6 0e 61 35 57 b9 86 c1 1d 9e
e    e1 f8 98 11 69 d9 8e 94 9b 1e 87 e9 ce 55 28 df
f    8c a1 89 0d bf e6 42 68 41 99 2d 0f b0 54 bb 16
Table 3.2 S-box: Substitution values for byte 'xy' (hexadecimal)
The value of a byte is used as an index to find its substitution byte: the high nibble x selects the row and the low nibble y selects the column. For example, for the byte {6d} the substitution byte is found at the location where x = 6 and y = d, i.e., {3c}.
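The construction and lookup of the S-box can be sketched in Python. This sketch assumes the definition in FIPS-197, in which the multiplicative inverse in GF(2^8) is followed by a fixed affine transformation with the constant {63}. All function names are hypothetical, and the brute-force inverse is chosen for readability, not as a model of a hardware implementation.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo m(x) = x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:       # degree reached 8: subtract (XOR) m(x)
            a ^= 0x11B
        b >>= 1
    return result

def gf_inverse(a: int) -> int:
    """Brute-force multiplicative inverse; {00} is mapped to itself."""
    if a == 0:
        return 0
    return next(c for c in range(1, 256) if gf_mul(a, c) == 1)

def rotl8(value: int, shift: int) -> int:
    """Rotate an 8-bit value left by `shift` bits."""
    return ((value << shift) | (value >> (8 - shift))) & 0xFF

def sbox_entry(byte: int) -> int:
    """Inverse in GF(2^8) followed by the AES affine transformation."""
    inv = gf_inverse(byte)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63

SBOX = [sbox_entry(b) for b in range(256)]

def sub_byte(byte: int) -> int:
    """Look up row x (high nibble) and column y (low nibble) of the S-box."""
    x, y = byte >> 4, byte & 0x0F
    return SBOX[16 * x + y]

print(f"{sub_byte(0x6d):02x}")   # 3c, the entry at x = 6, y = d
```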
3.2.2 ShiftRows Transformation
The ShiftRows operation cyclically shifts the bytes in each row of the State by a row-dependent offset. In AES, the first row, r = 0, remains unchanged, and the bytes of the second row are shifted to the left by an offset of one. Similarly, the third and fourth rows are shifted to the left by offsets of two and three, respectively [5].
Figure 3.4 ShiftRows Transformation
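A minimal Python sketch of ShiftRows is given below. It assumes the State is stored as a list of four rows, and the function name shift_rows as well as the placeholder State values are assumptions made for this illustration.

```python
def shift_rows(state: list[list[int]]) -> list[list[int]]:
    """Rotate row r of a 4x4 State to the left by r positions."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]

# Placeholder State whose entries encode their own position: 0x<row><column>
state = [[(r << 4) | c for c in range(4)] for r in range(4)]
shifted = shift_rows(state)
for row in shifted:
    print([f"{value:02x}" for value in row])
```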
3.2.3 MixColumns Transformation
The MixColumns transformation operates on the State column by column. Each column is treated as a four-term polynomial, so the transformation takes four bytes as input and gives four bytes as output, and each input byte affects all four output bytes. The columns, taken as polynomials over GF(2^8), are multiplied modulo x^4 + 1 with a fixed polynomial q(x), where [5]
$$q(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\} \tag{3.2}$$
Here, {01}, {02} and {03} are the hexadecimal values 0x01, 0x02 and 0x03, respectively.
Let the original column of the State be a(x) and the new column be b(x). The MixColumns transformation can then be represented as:
$$b(x) = q(x) \cdot a(x) \bmod (x^4 + 1) \tag{3.3}$$
This can be written in matrix multiplication form [4]:
$$\begin{bmatrix} b_{0,j} \\ b_{1,j} \\ b_{2,j} \\ b_{3,j} \end{bmatrix} = \begin{bmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{bmatrix} \begin{bmatrix} a_{0,j} \\ a_{1,j} \\ a_{2,j} \\ a_{3,j} \end{bmatrix} \tag{3.4}$$
The four bytes of the new column after the MixColumns operation can be calculated using the expressions given in Equations (3.5) to (3.8).
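$b_{0,j} = (\{02\} \cdot a_{0,j}) \oplus (\{03\} \cdot a_{1,j}) \oplus a_{2,j} \oplus a_{3,j}$ (3.5)
$b_{1,j} = a_{0,j} \oplus (\{02\} \cdot a_{1,j}) \oplus (\{03\} \cdot a_{2,j}) \oplus a_{3,j}$ (3.6)
$b_{2,j} = a_{0,j} \oplus a_{1,j} \oplus (\{02\} \cdot a_{2,j}) \oplus (\{03\} \cdot a_{3,j})$ (3.7)
$b_{3,j} = (\{03\} \cdot a_{0,j}) \oplus a_{1,j} \oplus a_{2,j} \oplus (\{02\} \cdot a_{3,j})$ (3.8)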
The diagrammatic representation of the MixColumns transformation is given in Figure 3.5.
Figure 3.5 Diagrammatic Representation of MixColumns
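As a sketch of how one column is transformed, the following Python fragment applies the four MixColumns expressions to a single column. The names `xtime` and `mix_column` are hypothetical, and the multiplication by {03} is formed as a multiplication by {02} followed by an XOR with the original byte.

```python
def xtime(a):
    """Multiply a byte by {02} in GF(2^8), reducing modulo 0x11B."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a & 0xFF


def mul3(a):
    """Multiply a byte by {03}: ({02} * a) XOR a."""
    return xtime(a) ^ a


def mix_column(column):
    """Transform one column [a0, a1, a2, a3] into [b0, b1, b2, b3]."""
    a0, a1, a2, a3 = column
    return [
        xtime(a0) ^ mul3(a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ mul3(a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ mul3(a3),
        mul3(a0) ^ a1 ^ a2 ^ xtime(a3),
    ]
```

For the column (db, 13, 53, 45) the sketch yields (8e, 4d, a1, bc).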
3.2.4 AddRoundKey Transformation
In simple terms, the AddRoundKey transformation XORs the output of the previous step (MixColumns in the first nine rounds and ShiftRows in the final round) with a Round Key generated by the Key Expansion algorithm [4]. The transformation consists of two steps, Key Expansion and Round Key addition, which are described below:
3.2.4.1 Key Expansion
Considering AES-128, the Key Expansion algorithm takes a 128-bit key as input and generates a key schedule. Expanding the input key into the key schedule requires two operations, namely SubWord and RotWord [5], which are explained in detail later in this section. The 16-byte input cipher key is first copied into a word array w[i] following the pseudo-code shown below [5].
i = 0
while (i < 4)
    w[i] = word(key[4*i], key[4*i+1], key[4*i+2], key[4*i+3])
    i = i + 1
The flow diagram of the Key Expansion algorithm is given in Figure 3.6.
Figure 3.6 Flow Diagram of Key Expansion Algorithm
RotWord: Performs a cyclic permutation on a 4-byte word, as depicted in Figure 3.7.
Figure 3.7 Diagrammatic Representation of RotWord
SubWord: Takes four bytes as input and applies the S-box substitution to each of them to give a 4-byte output. The S-box is the same one used in the SubBytes transformation.
From the pseudo-code and the flow diagram the following can be deduced:
• The first four words of the expanded key are filled with the input Cipher Key.
• Every following word is the XOR of the previous word (w[i-1]) and the word four positions earlier (w[i-4]).
• For words whose position is a multiple of four, RotWord and SubWord are applied to w[i-1] and the result is XORed with the round constant Rcon[i/4] before the final XOR with w[i-4].
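The three observations above can be put together in a compact Python sketch of the AES-128 key expansion. It is an illustration only: the S-box is generated on the fly to keep the block self-contained, and all function names are hypothetical.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo 0x11B."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = (a << 1) ^ (0x11B if a & 0x80 else 0)
        b >>= 1
    return result


def build_sbox():
    """Multiplicative inverse followed by the FIPS 197 affine transformation."""
    sbox = []
    for value in range(256):
        inv = 0
        if value:
            inv = 1
            for _ in range(254):
                inv = gf_mul(inv, value)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()


def rot_word(word):
    """Cyclic permutation [a0, a1, a2, a3] -> [a1, a2, a3, a0]."""
    return word[1:] + word[:1]


def sub_word(word):
    """Apply the S-box to each of the four bytes."""
    return [SBOX[b] for b in word]


def key_expansion(key):
    """Expand a 16-byte AES-128 key into 44 four-byte words w[0..43]."""
    assert len(key) == 16
    w = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rcon = 0x01
    for i in range(4, 44):
        temp = w[i - 1]
        if i % 4 == 0:
            temp = sub_word(rot_word(temp))
            temp[0] ^= rcon              # Rcon[i/4] = (rcon, 00, 00, 00)
            rcon = gf_mul(rcon, 0x02)    # next round constant
        w.append([a ^ b for a, b in zip(w[i - 4], temp)])
    return w


key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")
words = key_expansion(key)
```

Each group of four consecutive words forms one Round Key, so the 44 words provide the initial key plus the ten Round Keys of AES-128.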
3.2.4.2 Round Key Addition
Once the Round Key, $k_{i,j}$, has been generated, it is added to the output of the previous transformation, $a_{i,j}$, with a simple bitwise XOR, giving $b_{i,j} = a_{i,j} \oplus k_{i,j}$. The diagrammatic representation of the Round Key addition is shown in Figure 3.8:
Figure 3.8 Diagrammatic Representation of AddRoundKey Transformation
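A minimal Python sketch of the Round Key addition is shown below. The function name and the example byte values are arbitrary and chosen only for illustration; the point is that applying the same Round Key twice restores the original State.

```python
def add_round_key(state, round_key):
    """XOR every State byte a[i][j] with the Round Key byte k[i][j]."""
    return [[a ^ k for a, k in zip(state_row, key_row)]
            for state_row, key_row in zip(state, round_key)]


# Arbitrary illustrative values, not taken from a real key schedule.
state = [[0x32, 0x88, 0x31, 0xe0],
         [0x43, 0x5a, 0x31, 0x37],
         [0xf6, 0x30, 0x98, 0x07],
         [0xa8, 0x8d, 0xa2, 0x34]]
round_key = [[0x2b, 0x28, 0xab, 0x09],
             [0x7e, 0xae, 0xf7, 0xcf],
             [0x15, 0xd2, 0x15, 0x4f],
             [0x16, 0xa6, 0x88, 0x3c]]

mixed = add_round_key(state, round_key)
restored = add_round_key(mixed, round_key)
```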
3.3 Decryption
To decrypt data with the AES algorithm, the Cipher transformations described above are inverted and applied in reverse order. The transformations used in the decryption algorithm, or Inverse Cipher, are InverseSubBytes, InverseShiftRows, InverseMixColumns and AddRoundKey.
The overall flow of decryption mirrors that of encryption, except that each transformation is replaced by its inverse [5]. The flow diagram of the Inverse Cipher is given in Figure 3.9.
Figure 3.9 Flow Diagram Representation of AES Decryption
3.3.1 InverseShiftRows Transformation
As its name suggests, this is the inverse of the ShiftRows transformation: the bytes are shifted cyclically to the right instead of to the left. The first row, r = 0, is not shifted. The bottom three rows are shifted to the right by offsets of 1, 2 and 3, respectively. Figure 3.10 shows the diagrammatic representation of the InverseShiftRows transformation.
Figure 3.10 Diagrammatic Representation of InverseShiftRows Transformation
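The relationship between the two row-shifting operations can be checked with a small Python sketch. It assumes the State is held as a list of four rows; both function names are hypothetical.

```python
def shift_rows(state):
    """Cyclically shift row r to the left by r positions."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def inv_shift_rows(state):
    """Cyclically shift row r to the right by r positions."""
    return [row[len(row) - r:] + row[:len(row) - r]
            for r, row in enumerate(state)]


state = [[4 * r + c for c in range(4)] for r in range(4)]
round_trip = inv_shift_rows(shift_rows(state))
```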
3.3.2 InverseSubBytes Transformation
The inverse of the SubBytes transformation requires an inverse S-box, which is used for one-to-one byte substitution. Table 3.3 shows the inverse S-box [5].
x\y 0 1 2 3 4 5 6 7 8 9 a b c d e f
0 52 09 6a d5 30 36 a5 38 bf 40 a3 9e 81 f3 d7 fb
1 7c e3 39 82 9b 2f ff 87 34 8e 43 44 c4 de e9 cb
2 54 7b 94 32 a6 c2 23 3d ee 4c 95 0b 42 fa c3 4e
3 08 2e a1 66 28 d9 24 b2 76 5b a2 49 6d 8b d1 25
4 72 f8 f6 64 86 68 98 16 d4 a4 5c cc 5d 65 b6 92
5 6c 70 48 50 fd ed b9 da 5e 15 46 57 a7 8d 9d 84
6 90 d8 ab 00 8c bc d3 0a f7 e4 58 05 b8 b3 45 06
7 d0 2c 1e 8f ca 3f 0f 02 c1 af bd 03 01 13 8a 6b
8 3a 91 11 41 4f 67 dc ea 97 f2 cf ce f0 b4 e6 73
9 96 ac 74 22 e7 ad 35 85 e2 f9 37 e8 1c 75 df 6e
a 47 f1 1a 71 1d 29 c5 89 6f b7 62 0e aa 18 be 1b
b fc 56 3e 4b c6 d2 79 20 9a db c0 fe 78 cd 5a f4
c 1f dd a8 33 88 07 c7 31 b1 12 10 59 27 80 ec 5f
d 60 51 7f a9 19 b5 4a 0d 2d e5 7a 9f 93 c9 9c ef
e a0 e0 3b 4d ae 2a f5 b0 c8 eb bb 3c 83 53 99 61
f 17 2b 04 7e ba 77 d6 26 e1 69 14 63 55 21 0c 7d
Table 3.3 Inverse S-box: Substitution values for byte 'xy' (hexadecimal)
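Because the S-box is a one-to-one mapping, the inverse table can be obtained by swapping index and value. The sketch below regenerates the forward S-box first so that it is self-contained; all names are hypothetical and the construction follows FIPS 197.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo 0x11B."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = (a << 1) ^ (0x11B if a & 0x80 else 0)
        b >>= 1
    return result


def build_sbox():
    """Forward S-box: multiplicative inverse, then affine transformation."""
    sbox = []
    for value in range(256):
        inv = 0
        if value:
            inv = 1
            for _ in range(254):
                inv = gf_mul(inv, value)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


def build_inverse_sbox(sbox):
    """Swap index and value: if sbox[x] == y then inv_sbox[y] == x."""
    inv_sbox = [0] * 256
    for index, value in enumerate(sbox):
        inv_sbox[value] = index
    return inv_sbox


SBOX = build_sbox()
INV_SBOX = build_inverse_sbox(SBOX)
```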
3.3.3 InverseMixColumns Transformation
Like the MixColumns transformation, the InverseMixColumns transformation operates column by column, with each column treated as a four-term polynomial. Each of the four input bytes affects all four output bytes. The columns, taken as polynomials over GF(2^8), are multiplied modulo x^4 + 1 with a fixed polynomial, say $q^{-1}(x)$, which can be represented as shown in Equation (3.9) [5]:
$q^{-1}(x) = \{0b\}x^3 + \{0d\}x^2 + \{09\}x + \{0e\}$ (3.9)
Here {0b}, {0d}, {09} and {0e} are the hexadecimal values 0x0b, 0x0d, 0x09 and 0x0e, respectively.
Let the new column (in the State) be $b^{-1}(x)$ and the original column be $a^{-1}(x)$. The InverseMixColumns transformation can be represented as:
$b^{-1}(x) = q^{-1}(x) \otimes a^{-1}(x)$ (3.10)
This can be written in matrix multiplication form [4]:
$\begin{bmatrix} b^{-1}_{0,j} \\ b^{-1}_{1,j} \\ b^{-1}_{2,j} \\ b^{-1}_{3,j} \end{bmatrix} = \begin{bmatrix} 0e & 0b & 0d & 09 \\ 09 & 0e & 0b & 0d \\ 0d & 09 & 0e & 0b \\ 0b & 0d & 09 & 0e \end{bmatrix} \begin{bmatrix} a^{-1}_{0,j} \\ a^{-1}_{1,j} \\ a^{-1}_{2,j} \\ a^{-1}_{3,j} \end{bmatrix}$ (3.11)
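The four bytes in the new columns after the InverseMixColumns operation can be calculated by the expressions given in Equations (3.12) to (3.15) [5].
$b^{-1}_{0,j} = (\{0e\} \cdot a^{-1}_{0,j}) \oplus (\{0b\} \cdot a^{-1}_{1,j}) \oplus (\{0d\} \cdot a^{-1}_{2,j}) \oplus (\{09\} \cdot a^{-1}_{3,j})$ (3.12)
$b^{-1}_{1,j} = (\{09\} \cdot a^{-1}_{0,j}) \oplus (\{0e\} \cdot a^{-1}_{1,j}) \oplus (\{0b\} \cdot a^{-1}_{2,j}) \oplus (\{0d\} \cdot a^{-1}_{3,j})$ (3.13)
$b^{-1}_{2,j} = (\{0d\} \cdot a^{-1}_{0,j}) \oplus (\{09\} \cdot a^{-1}_{1,j}) \oplus (\{0e\} \cdot a^{-1}_{2,j}) \oplus (\{0b\} \cdot a^{-1}_{3,j})$ (3.14)
$b^{-1}_{3,j} = (\{0b\} \cdot a^{-1}_{0,j}) \oplus (\{0d\} \cdot a^{-1}_{1,j}) \oplus (\{09\} \cdot a^{-1}_{2,j}) \oplus (\{0e\} \cdot a^{-1}_{3,j})$ (3.15)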
The diagrammatic representation of the InverseMixColumns transformation is given in Figure 3.11.
Figure 3.11 Diagrammatic Representation of InverseMixColumns
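The inverse column transformation can be sketched in Python with a general GF(2^8) multiplication. The names `gf_mul` and `inv_mix_column` are hypothetical; the coefficients are those of the matrix form given above.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = (a << 1) ^ (0x11B if a & 0x80 else 0)
        b >>= 1
    return result


def inv_mix_column(column):
    """Apply the InverseMixColumns matrix to one four-byte column."""
    a0, a1, a2, a3 = column
    return [
        gf_mul(a0, 0x0e) ^ gf_mul(a1, 0x0b) ^ gf_mul(a2, 0x0d) ^ gf_mul(a3, 0x09),
        gf_mul(a0, 0x09) ^ gf_mul(a1, 0x0e) ^ gf_mul(a2, 0x0b) ^ gf_mul(a3, 0x0d),
        gf_mul(a0, 0x0d) ^ gf_mul(a1, 0x09) ^ gf_mul(a2, 0x0e) ^ gf_mul(a3, 0x0b),
        gf_mul(a0, 0x0b) ^ gf_mul(a1, 0x0d) ^ gf_mul(a2, 0x09) ^ gf_mul(a3, 0x0e),
    ]
```

Applied to the column (8e, 4d, a1, bc), which is the MixColumns output for (db, 13, 53, 45), the sketch recovers the original column.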
3.3.4 AddRoundKey Transformation
The AddRoundKey transformation is the same for decryption and encryption, since the Key Expansion algorithm does not change and the addition of the Round Key is a simple XOR, which is its own inverse. The Round Keys are simply applied in reverse order. See Section 3.2.4 for a detailed explanation.
4 Encryption & Authentication Modes
A Cipher encrypts or decrypts a single block of data; a scheme for applying a Cipher repeatedly to messages longer than one block is known as a Mode of Operation for that Cipher. Many modes of operation for AES have been introduced over the years, and those relevant to this thesis are discussed in this chapter. Along with modes of operation for encryption, authentication modes are also discussed.
Authentication has long been an integral part of secure information exchange. Many attacks involve an attacker injecting messages into the data in question, so the receiver needs to verify whether the data was sent by the claimed sender or by someone else. A mode of operation that provides both encryption and authentication is known as Authenticated Encryption (AE) [32]. AES has various modes that provide AE. In this chapter two AE modes of AES, CCM (Counter with CBC-MAC) and GCM (Galois/Counter Mode), are discussed; they are the main comparison platforms for the Crypto-Cores implemented in this thesis.
4.1 Encryption Modes
4.1.1 Electronic Codebook (ECB)
Electronic Codebook (ECB) is the simplest mode of operation for AES. A large message is divided into blocks of the cipher's block size, i.e. 128 bits for AES, and each block is encrypted or decrypted separately [32]. For example, consider Figure 4.1, which illustrates the ECB mode for a message of m blocks. Each 128-bit block of Plaintext/Ciphertext is passed through the Cipher/Inverse Cipher separately, with an identical key, to produce a 128-bit output block of Ciphertext/Plaintext.
Figure 4.1 Diagrammatic Representation of ECB Mode
4.1.2 Cipher Block Chaining (CBC)
Cipher Block Chaining (CBC) is an AES mode of operation in which each Plaintext block is XORed with the previous Ciphertext block before encryption. For decryption, the output of the Inverse Cipher is XORed with the previous Ciphertext block to recover the Plaintext block. Figure 4.2 shows the diagrammatic representation. For the first block the Plaintext is XORed with an Initialization Vector (IV). An IV is a fixed-size input, in this case 128 bits, which can take any random value.
Figure 4.2 Diagrammatic Representation of CBC mode
4.1.3 Counter Mode (CTR)
Counter Mode (CTR) is a mode of operation for AES which turns a Block Cipher into a Stream Cipher. A 96-bit IV is given as input to a counter function, which appends a 32-bit counter value to it and so generates a new 128-bit string for each iteration. The counter function can be any function that generates a non-repeating sequence of numbers, but usually an increment-by-one counter is used. The 128-bit strings are then used as input to the cipher to generate a keystream, a stream of pseudo-random values, which is XORed with the Plaintext to generate the Ciphertext. For decryption the same keystream is XORed with the Ciphertext to recover the Plaintext. Figure 4.3 shows the diagrammatic representation of the CTR mode.
Figure 4.3 Diagrammatic Representation of CTR Mode
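The structure of CTR mode can be sketched in Python as follows. To keep the example short, the block cipher is replaced by a stand-in, `toy_block_encrypt`, built from SHA-256; it is NOT AES and serves only to produce a keyed 16-byte output. All names, the key and the IV are hypothetical.

```python
import hashlib


def toy_block_encrypt(key, block):
    """Stand-in for the AES Cipher (NOT AES): keyed 16-byte pseudo-random output."""
    return hashlib.sha256(key + block).digest()[:16]


def ctr_crypt(key, iv96, data, block_encrypt=toy_block_encrypt):
    """Encrypt or decrypt data in CTR mode (the operation is its own inverse)."""
    assert len(iv96) == 12
    out = bytearray()
    for n, offset in enumerate(range(0, len(data), 16), start=1):
        counter_block = iv96 + n.to_bytes(4, "big")   # 96-bit IV || 32-bit counter
        keystream = block_encrypt(key, counter_block)
        chunk = data[offset:offset + 16]
        out += bytes(d ^ k for d, k in zip(chunk, keystream))
    return bytes(out)


key = bytes(range(16))
iv = bytes(12)
plaintext = b"CTR mode turns a block cipher into a stream cipher"
ciphertext = ctr_crypt(key, iv, plaintext)
recovered = ctr_crypt(key, iv, ciphertext)
```

Only the forward direction of the block function is needed, and the message length does not have to be a multiple of the block size.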
4.2 Counter with CBC-MAC (CCM)
4.2.1 Introduction
As its name indicates, the Counter with CBC-MAC (CCM) mode uses the CBC mode to generate a MAC and then applies the CTR mode to the message and the MAC (tag) to encrypt them. CCM is thus a mode that provides Authenticated Encryption (AE) for the message. CCM can only be applied to block ciphers with a block size of 128 bits.
4.2.2 Algorithm
The MAC is generated using CBC-MAC, i.e. by applying the CBC mode of encryption to the message with an IV of 128 bits of 0's. Each block in the CBC mode depends on the encryption of the previous block, so a change to any intermediate block propagates to the last block. The last block is used as the MAC, which is sent along with the message and is used to check whether the message is authentic. Figure 4.4 shows the working of the CBC-MAC.
Figure 4.4 Diagrammatic Representation of CBC-MAC
The generated MAC is encrypted along with the message using the CTR mode.
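The chaining behaviour of CBC-MAC can be sketched in Python. As an assumption made for brevity, the block cipher is replaced by a SHA-256 based stand-in (`toy_block_encrypt`, NOT AES), and the sketch follows the simplified description above; the block formatting rules of the full CCM specification are not modelled.

```python
import hashlib


def toy_block_encrypt(key, block):
    """Stand-in for the AES Cipher (NOT AES): keyed 16-byte pseudo-random output."""
    return hashlib.sha256(key + block).digest()[:16]


def cbc_mac(key, message, block_encrypt=toy_block_encrypt):
    """Return the last CBC block over the message, starting from a zero IV."""
    assert len(message) % 16 == 0
    chain = bytes(16)                                   # IV of 128 zero bits
    for offset in range(0, len(message), 16):
        block = message[offset:offset + 16]
        chain = block_encrypt(key, bytes(c ^ b for c, b in zip(chain, block)))
    return chain


key = bytes(range(16))
message = b"A" * 16 + b"B" * 16 + b"C" * 16
tampered = b"A" * 16 + b"X" * 16 + b"C" * 16
mac = cbc_mac(key, message)
```

Changing the middle block changes the chaining value and therefore the final MAC.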
4.3 Galois/Counter Mode (GCM)
4.3.1 Introduction
Galois/Counter Mode (GCM) is a block cipher mode of operation that uses universal hashing over a binary Galois field to provide Authenticated Encryption (AE). It can be implemented in hardware to achieve high speeds with low cost and low latency. There is a growing need for a mode of operation that can efficiently provide authenticated encryption at high speeds without excessive area cost, and that is free of Intellectual Property (IP) restrictions [33]. Since the intended use case for this thesis is Industry 4.0, achieving AE at high data rates is essential. To be useful at high data rates, the mode must admit pipelined implementations and have minimal computational latency. GCM has the added advantage that it can act as a stand-alone MAC when encryption is not required. This is a feature which is not available in any of the other proposed AE implementations.
4.3.2 Algorithm
GCM implements the Galois mode of authentication on top of an underlying Cipher, usually AES, which is also used in this research. The underlying AES is operated in CTR mode [33]. The GCM algorithm has two core functions, GHASH and GCTR, which are explained below.
4.3.2.1 GHASH
The GHASH function is essentially a finite field multiplication of the input with a hashing key H over GF(2^128). The hashing key H can be treated as a fixed 128-bit constant, since it does not change as long as the Cipher Key does not change [7]. It is calculated by applying the AES block to 128 bits of 0's.
Algorithmically, take $X$ as the input bit string, where the length of $X$ is 128*m for some integer m, $H$ as the hash subkey and the block $Y_m$ as the output. The following steps explain the algorithm [33]:
• Let $X_1, X_2, \ldots, X_m$ represent the unique sequence of blocks such that $X = X_1 || X_2 || \ldots || X_m$, where || represents the concatenation of two elements.
• Let $Y_0$ be the "zero block", i.e. a bit string of 128 binary 0's.
• For $i = 1, \ldots, m$, let $Y_i = (Y_{i-1} \oplus X_i) \cdot H$, where "$\cdot$" indicates multiplication over the finite field.
• $Y_m$, obtained at the end, is the output block that is used as the MAC.
The block diagram in Figure 4.5 illustrates this algorithm.
Figure 4.5 Diagrammatic Representation of the GHASH function
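The GHASH recurrence can be sketched in Python on top of a bit-serial GF(2^128) multiplication. The function names are hypothetical. The check values are taken from a widely published AES-GCM test vector (all-zero key and IV, one all-zero plaintext block), in which the hashed input is the Ciphertext block followed by a block encoding the lengths of the AAD and the Ciphertext, as specified in NIST SP 800-38D.

```python
R = 0xE1 << 120  # the constant 11100001 || 0^120


def gf128_mul(x, h):
    """Multiply two 128-bit blocks (as integers) in GF(2^128), GCM bit order."""
    z, u = 0, h
    for i in range(128):
        if (x >> (127 - i)) & 1:      # bit x_i, where x_0 is the leftmost bit
            z ^= u
        u = (u >> 1) ^ R if u & 1 else u >> 1
    return z


def ghash(h, data):
    """Y_0 = 0; Y_i = (Y_{i-1} XOR X_i) * H over the 128-bit blocks of data."""
    assert len(data) % 16 == 0
    y = 0
    for offset in range(0, len(data), 16):
        x_i = int.from_bytes(data[offset:offset + 16], "big")
        y = gf128_mul(y ^ x_i, h)
    return y


# H = AES_K(0^128) for the all-zero 128-bit key.
H = int("66e94bd4ef8a2c3b884cfa59ca342b2e", 16)
ciphertext_block = bytes.fromhex("0388dace60b6a392f328c2b971b2fe78")
length_block = (0).to_bytes(8, "big") + (128).to_bytes(8, "big")
digest = ghash(H, ciphertext_block + length_block)
```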
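The multiplication over the finite field GF(2^128) can be explained by the following algorithm [33]:
• Let $X$ be the 128-bit block that has to be hashed, consisting of the bits $x_0, x_1, \ldots, x_{127}$.
• Let $H$ be the 128-bit Hash Key, i.e. 128 bits of 0's encrypted by the AES block.
• Let $Z_0$ be 128 bits of 0's, $U_0 = H$, and $R$ be a constant 128-bit string with the value $11100001 || 0^{120}$.
• For i = 0 to 127:
$Z_{i+1} = \begin{cases} Z_i & \text{if } x_i = 0 \\ Z_i \oplus U_i & \text{if } x_i = 1 \end{cases}$ (4.1)
$U_{i+1} = \begin{cases} U_i \gg 1 & \text{if } LSB(U_i) = 0 \\ (U_i \gg 1) \oplus R & \text{if } LSB(U_i) = 1 \end{cases}$ (4.2)
• After these operations the 128 bits of $Z_{128}$ are the output of the multiplication.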
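The bit-serial multiplication can be written as a short Python sketch, with 128-bit blocks held as integers whose most significant bit corresponds to $x_0$. The function name `gf128_mul` is hypothetical. The checks use only algebraic properties of the field: the block with only its leftmost bit set is the multiplicative identity, and the product is commutative and distributes over XOR.

```python
R = 0xE1 << 120  # the constant 11100001 || 0^120


def gf128_mul(x, h):
    """Return Z_128 for operands X and H following Equations (4.1) and (4.2)."""
    z = 0          # Z_0 = 0^128
    u = h          # U_0 = H
    for i in range(128):
        if (x >> (127 - i)) & 1:          # x_i = 1: Z_{i+1} = Z_i XOR U_i
            z ^= u
        if u & 1:                         # LSB(U_i) = 1: shift and reduce with R
            u = (u >> 1) ^ R
        else:                             # LSB(U_i) = 0: shift only
            u >>= 1
    return z


ONE = 1 << 127     # field element "1": only the leftmost bit is set
a = 0x66e94bd4ef8a2c3b884cfa59ca342b2e
b = 0x0388dace60b6a392f328c2b971b2fe78
c = 0x0123456789abcdef0123456789abcdef
```

A bit-parallel multiplier computes the same result but unrolls the 128 iterations into combinational logic.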
Figure 4.6 shows the diagrammatic representation of the GF(2^128) multiplication operation.
Figure 4.6 Diagrammatic Representation of GF(2^128) Multiplication
The finite field multiplication in GHASH has two possible implementations: bit-serial and bit-parallel. The core of the GHASH architecture is a 128-bit multiplier over GF(2^128), which multiplies two 128-bit operands to generate a 128-bit output. One operand of the GF multiplier is the hash subkey H, which can be treated as a fixed 128-bit constant because it does not change as long as the 128-bit key does not change. The second operand is formed from two sequences: the sequence of 128-bit additional authenticated data (AAD) blocks and the sequence of Ciphertext blocks [7]. Figure 4.7 shows the diagrammatic representation of the hashing sequence.
Figure 4.7 Diagrammatic Representation of the Hashing sequence
The 128-bit AAD blocks, $A_1, \ldots, A_m$, are fed to the GHASH through one of the two inputs of the XOR gates. The 128-bit Ciphertext block sequence, $C_1, \ldots, C_n$, is fed to the same input of the XOR gates following the AAD. Meanwhile, the intermediate hash value is fed back to the other input of the XOR gates to generate the second operand for the GF multiplier. Assuming that the AAD hashing takes m clock cycles and the Ciphertext block hashing takes n clock cycles, the latency is m+n+1 clock cycles for a bit-parallel multiplier and 128*(m+n+1) clock cycles for a bit-serial multiplier [33]. The advantage of the bit-serial multiplier over the bit-parallel multiplier is that it uses fewer logic elements, but at the same time it adds more latency to the system.
4.3.2.2 GCTR
GCTR is an implementation of the previously explained CTR mode with a particular incrementing function for generating the necessary sequence of counter blocks. GCM consists of an underlying block cipher and a Galois Field multiplier, with which authenticated encryption and authenticated decryption are realized. The cipher needs to have a block size of 128 bits. For encryption, an initial counter is first derived from an Initialization Vector (IV). The initial counter value is incremented, encrypted and XORed with the first plaintext block. For each subsequent plaintext block the counter is incremented again and then encrypted. The underlying cipher is only used in the encryption direction. GCM allows pre-computation of the block cipher function if the IV is known ahead of time [33].
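The derivation of the counter blocks can be sketched in Python for the common case of a 96-bit IV. The function names `initial_counter` and `inc32` are hypothetical; the incrementing function shown, which increments only the rightmost 32 bits modulo 2^32, is the one defined for GCM in NIST SP 800-38D.

```python
def initial_counter(iv96):
    """Form the initial counter block: 96-bit IV || 31 zero bits || one 1 bit."""
    assert len(iv96) == 12
    return iv96 + b"\x00\x00\x00\x01"


def inc32(block):
    """Increment the rightmost 32 bits modulo 2^32; leave the other 96 bits as is."""
    prefix = block[:12]
    counter = int.from_bytes(block[12:], "big")
    return prefix + ((counter + 1) % 2**32).to_bytes(4, "big")


j0 = initial_counter(bytes(12))
first_counter_block = inc32(j0)      # used for the first plaintext block
```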
4.3.3 Authenticated Encryption
Now that the workings of the GHASH and GCTR functions are understood, they can be combined to describe the authenticated encryption and authenticated decryption that take place inside GCM. Starting with authenticated encryption, consider a 128-bit AES as the underlying block cipher. The inputs are a Plaintext $P$, an initialization vector $IV$ and additional authenticated data $A$. The outputs are the Ciphertext $C$ and the authentication MAC. The following steps explain the authenticated encryption algorithm [33]:
• Let $H$ be the hash subkey, i.e. 128 bits of 0's encrypted by the block cipher: $H = E_K(0^{128})$.
• Define $J_0$ such that $J_0$ is a 128-bit string consisting of 96 bits of any value (the IV), 31 '0' bits and one '1' bit.
• Let $C_i = E_K(CB_i) \oplus P_i$, where $CB_i$ are the counter blocks in the GCTR and $P_i$ is the Plaintext data.
• Let $MAC = GHASH_H(A || C) \oplus E_K(J_0)$.
• The resulting $C$ is the Ciphertext and the resulting MAC is the authentication MAC.
The flow diagram in Figure 4.8 shows how the Authenticated Encryption algorithm works.
Figure 4.8 Diagrammatic Representation of Authenticated Encryption
4.3.4 Authenticated Decryption
The inputs for authenticated decryption are the 128-bit block cipher AES, the initialization vector $IV$, the Ciphertext $C$, the additional authenticated data $A$ and the MAC, whereas the output is either the Plaintext $P$ or the indication of inauthenticity, FAIL. The following steps explain the authenticated decryption algorithm [33]:
• Let $H$ be the hash subkey, i.e. 128 bits of 0's encrypted by the block cipher: $H = E_K(0^{128})$.
• Define $J_0$ such that $J_0$ is a 128-bit string consisting of 96 bits of any value (the IV), 31 '0' bits and one '1' bit.
• Let $P_i = E_K(CB_i) \oplus C_i$, where $CB_i$ are the counter blocks in the GCTR and $C_i$ is the ciphered data.
• Let $MAC' = GHASH_H(A || C) \oplus E_K(J_0)$.
• If $MAC' = MAC$, then return $P$, the resultant plaintext; else return FAIL.
Figure 4.9 shows the diagrammatic representation of Authenticated Decryption.
Figure 4.9 Diagrammatic Representation of Authenticated Decryption
5 Literature Review
The hardware implementation of the previously discussed algorithms in the form of Crypto-Cores has been studied in the recent past. In this chapter those studies are discussed. The hardware implementation of an AES-GCM crypto-core consists of two core blocks:
• An AES core
• A Galois Field Multiplier core
The previous studies on the implementation of these core blocks are discussed in the remainder of this chapter.
5.1 AES Designs
The AES algorithm has four transformations which, for a 128-bit key, have to be applied in 10 rounds as discussed earlier. When implemented in an iterative structure, this takes one clock cycle to complete each transformation of each round. Although this implementation is simple, the efficiency of the system is very low. Further studies introduced pipelining when implementing AES in hardware. A lot of research has been done in this area, and the studies have provided better solutions for hardware implementation of the AES architecture.
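An efficient hardware implementation shows good data throughput with very low area usage. Throughput can be defined as:
$\text{Throughput} = \dfrac{\text{block size (bits)} \times \text{clock frequency}}{\text{clock cycles per block}}$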
The existing designs proposed for the hardware implementation of the AES architecture can be classified into four groups:
• Iterative Loop Structure based Designs
• Memory-Based Designs
• Unfolded Structure or Parallel Structure based Designs
• Sub-Pipelined Structure based Designs
5.1.1 Iterative Loop Structure
A design based on the Iterative Loop Structure is the simplest form of implementing the AES architecture in hardware. It has very low hardware utilization, since a single hardware core implementing one round is reused for all ten rounds. If all of the round transformations (SubBytes, ShiftRows, MixColumns and AddRoundKey) are performed in a single clock cycle, so that encrypting a 128-bit block takes ten clock cycles, the long combinational path limits the system to a low clock frequency.
Figure 5.1 Iterative Loop Implementation
On the other hand, [18] discusses that if each round transformation is performed in a separate clock cycle using intermediate registers, a higher operating frequency is achievable, but the number of clock cycles required to encrypt a block also increases fourfold. Thus, in both cases, the achievable throughput is not impressive. Pipelined structures have been implemented in many other studies and compared with other structures.
5.1.2 Memory-Based AES Hardware Implementation Designs
One such structure is the memory-based structure of AES where, instead of FPGA logic units, memory blocks are used to perform the round transformations.
The S-box in the SubBytes transformation and the column multiplication in the MixColumns transformation are implemented in internal memory blocks instead of FPGA logic units. This is possible because each resulting element of these operations depends on a single element of the data block. ShiftRows is a fixed operation and can be achieved by routing values to the appropriate memory blocks. Only the addition part of the MixColumns transformation and the AddRoundKey transformation are implemented using logic units. Figure 5.2 shows the arrangement of the blocks during a round.
Figure 5.2 Diagrammatic Representation of Memory Based AES structure
In [9] a memory-based design was suggested for an unfolded AES core where all the rounds are processed in parallel. The logic unit utilization is reduced, but the overall frequency depends on the operating frequency of the internal memory, and the memory utilization of this architecture is quite high.
In [16], a fully synchronous, memory-based, single-chip FPGA implementation of the Rijndael encryption algorithm, the AES standard, is presented. The design partitioning allowed for an iterative loop structure in which the block cipher was implemented using the Electronic Codebook (ECB) mode of operation. The encryption RTL design focuses on a memory-based bite-sized arithmetic pipeline structure that processes one round at a time.
In [17] an AES hardware implementation comparison between a composite field algorithm and Block RAM for realizing the SubBytes and MixColumns modules was presented. With the composite field algorithm a lower efficiency (throughput/area) was obtained, whereas for the Block RAM based design the maximum frequency was limited to the maximum frequency of the Block RAMs.
5.1.3 Unfolded / Parallel Structure Based Designs
A parallel structure based design is used to achieve very high throughput, but at a correspondingly high area cost. All the rounds are performed in a single iteration, which means that the loops are unfolded (unrolled) and all ten round cores operate in parallel.
Figure 5.3 Unrolled AES Structure
The unrolled AES structures have been widely used for high-throughput implementations where hardware cost is not an issue. Studies such as [7] and [13] have implemented unrolled AES architectures for Galois/Counter Mode. Since the parallel structure of AES computes all the rounds in one clock cycle, it shows very high throughput, even though the achievable frequency is not that high.
5.1.4 Sub-Pipelined or Stage-Level Pipelined Structure
A sub-pipelined structure is one where the internal blocks of a round in an AES structure are pipelined. In [10], an architecture for pipelining the AES rounds efficiently is introduced. It is suggested that multiple packets be processed in parallel and that registers be introduced in the critical path of the stages. This increases the achievable operating frequency of the system, at the cost of higher latency, with significantly lower area usage and very little internal memory usage. The suggested system is considerably faster than a normal iterative structure and is known as a sub-pipelined structure. It was introduced because simple pipelining is not efficient for the CBC mode due to its feedback nature.
The sub-pipelined structure works by pipelining the internal transformations of the AES algorithm, which is done by adding registers to the critical paths. In [10], stage pipelining of an AES system was introduced for an AES-CCM based design. This allows 4 blocks of data to be encrypted concurrently by one AES core. The flow diagram of the encryption datapath is shown in Figure 5.4.
Figure 5.4 (a) Pipelined AES encryption datapath, (b) Pipelined Key Scheduler
In the stage for reg2, the SubBytes and ShiftRows transformations are implemented. SubBytes can be implemented in two ways: as a 256-byte lookup table (LUT), or by calculating the substitution logically. Calculating it logically would have a substantial area cost and a larger latency, so it is easier, cheaper and faster to implement a LUT. ShiftRows is implemented simply by routing the S-box outputs appropriately to the MixColumns transformation, and thus does not require a separate step.
In the MixColumns transformation, addition and multiplication occur in GF(2^8). As previously discussed, MixColumns requires multiplication of the polynomial by {02} and {03}. Multiplication by {03} can be achieved by multiplying the polynomial by {02} and then adding the original polynomial to the result; therefore only multiplication by {02} needs to be implemented. The MixColumns transformation is achieved in three stages:
• Multiplication of the polynomial by {02}
• Addition
• AddRoundKey
The 4-stage pipelined structure works in such a way that 4 blocks of data can be computed with a delay of one clock cycle. When one stage has computed the result for one block, it is ready to take the next block of data. In this way four pipes are working, as shown in Figure 5.5.
Figure 5.5 Stages for Sub-Pipelining
It takes 40 clock cycles to compute one block, but four blocks are processed at the same time. This increases the operating frequency of the system at very low area cost [10].
5.2 Galois Field Multiplier (GFM) Designs
In recent years GCM has been increasingly adopted in hardware, since it has proven to be fast and efficient. Because GCM can be parallelized and pipelined (unlike CBC, with its feedback nature), it is very desirable for hardware implementations. Many ideas have been suggested for implementing GCM as a Crypto-Core; the main computational complexity in GCM, compared to other modes, is usually the multiplication in the GF(2^128) field for hashing.
Conventionally two methods have been suggested for implementing the finite field multiplication required for hashing in hardware:
• Bit-Serial Multiplier
• Bit-Parallel Multiplier
The Bit-Serial Multiplier is an implementation of the multiplier in which each iteration of $Z_i$, as given in Equations (4.1) and (4.2), is calculated serially. This design trades latency for very low area usage on the hardware.
The Bit-Parallel Multiplier calculates the result $Z_{128}$ in one clock cycle, but the hardware cost of handling the 128-bit operands in GF(2^128) is very high. The high complexity of the multiplier also reduces the achievable operating frequency of the system.
Many studies have been carried out to find a suitable way of implementing Galois Field multiplication in hardware. In [8] a method for implementing a parallel multiplier, known as the Mastrovito multiplier, was introduced. It is the most widely used method for implementing Galois Field multiplication in hardware. The design is essentially a brute-force multiplier in the sense that the matrix-vector product, shown in Equations (4.1) and (4.2), is computed like a traditional matrix multiplication. The operands are elements of GF(2^m) with coefficients in GF(2), so AND and XOR gates are used for element-wise multiplication and addition, respectively. Although the Mastrovito multiplier is fast, its area usage is considerably large, which increases the cost of the hardware.
Another method was introduced in [29] for implementing GF multiplication on
hardware. The idea was to use the multiplication method introduced in [22] by A.
Karatsuba, thus naming the multiplier Karatsuba Multiplier. The idea behind the
Karatsuba multiplier was to decrease the number of multiplication operation while
increasing the addition operations. Since, the addition operation require less area as
compared to multiplication, the area cost for a Karatsuba multiplier is less than a
Mastrovito multiplier. This, however, comes with a delay cost due to which the
Karatsuba based multipliers are much slower than the Mastrovito multipliers.
Recently, a method which serves as a compromise between the Karatsuba and
Mastrovito multipliers was introduced, called the Fan-Hasan (FH) Multiplier [25], [27], [28].
The FH-Multiplier is considerably faster than the Karatsuba multiplier but still has a
larger delay than the Mastrovito multiplier. However, its area overhead is much lower
than that of the Mastrovito multiplier. A comparison between the three types of
multipliers is provided in [30].
Since the Mastrovito multiplier is still the most widely used multiplier for the GFM in
hardware implementations of GCM, this study also looks for a good GFM solution
using the Mastrovito multiplier. The problem with the Mastrovito multiplier is the large
area overhead due to the parallel matrix-vector product, which also causes a low
achievable operational frequency. In [11] a method was introduced to use Mastrovito
parallel multiplication in a much more efficient way. The idea was to introduce pipelining
in the multiplier and to perform the multiplication in multiple iterations instead of
completing it in one. With this, a higher operational frequency was achievable and the
area cost was reduced, but it also introduced latency in computing the result.
6 Design & Implementation Results
6.1 Design
The final design was made by reviewing various past studies and considering the best
route to take. The authentication mode chosen was the Galois/Counter Mode because of
its flexibility when implementing on hardware. Another advantage of GCM over other
modes is the fact that the authentication core (GMAC) can operate as a separate entity
for messages that just need authentication.
The design has two cores, an underlying AES core and a Galois Field Multiplier (GFM)
core, that are controlled by a state machine. Each core will be explained separately and
then the state machine will be explained³.
6.1.1 AES Core
The AES core is based on the stage-level pipelining design explained in [10].
Although that design was proposed to introduce pipelining for the AES-CCM mode, for
which normal pipelining is not possible, it has shown a good operating frequency for the
system. Because of the feedback nature of CBC mode, each block must wait for the
result of the previous block, which increases the complexity of the system. In AES-GCM,
since the mode of operation is the CTR mode, this can be simplified by using a single
Initialization Vector and an increment function for the subsequent iterations.
The AES core takes the Initialization Vector as input data along with a Cipher Key. A
data enable bit notifies the core that a data block is available for encryption. A Key
Generator core, as shown in Figure 5.4(b), generates the key for encryption. A counter
function keeps track of the number of pipelines that are being utilized.
³ See the Appendix for the source code.
When the core receives data it sends the data to the first stage of the round
transformation. Each stage takes 1 clock cycle, so in the next clock cycle another data
block can be sent to the round transformation. On each input data block the pipeline
counter is incremented (max. 4). Once all the pipelines have been computed, the AES
core sends a done signal to enable the next blocks of data to be computed. Figure 6.1
shows the internal structure of the AES core.
Figure 6.1 Internal Structure of AES core
The AES core is the central part of the Encryption block of the AES-GCM core.
Following are the characteristics of the Encryption Block:
• The mode of operation for the Encryption Block is the CTR mode.
• The inputs for the Encryption Block are an initialization vector IV, a Cipher Key
and the input plaintext P.
• The IV is passed through an increment block incr to generate a sequence of
counter values depending on the size of P.
• The AES core inside the Encryption Block takes the incremented IV and the Cipher
Key as input and generates a Keystream as an output.
• The output Keystream is then XORed with P and the result is given out as the
ciphertext C.
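The data flow of the Encryption Block can be sketched in a few lines of Python. Since the Python standard library contains no AES, the function stand_in_cipher below is a hash-based placeholder and explicitly NOT AES; it only takes the place of the AES core so that the counter handling and the XOR step can be shown. All function names are chosen here, and the counter layout (96-bit IV followed by a 32-bit counter) follows the GCM specification [33].

```python
import hashlib


def incr32(block: bytes) -> bytes:
    """GCM-style increment: add 1 to the rightmost 32 bits, modulo 2^32."""
    counter = (int.from_bytes(block[12:], "big") + 1) % (1 << 32)
    return block[:12] + counter.to_bytes(4, "big")


def stand_in_cipher(key: bytes, block: bytes) -> bytes:
    """Placeholder for the AES core (NOT AES): maps 16 bytes to 16 bytes."""
    return hashlib.sha256(key + block).digest()[:16]


def ctr_encrypt(key: bytes, iv96: bytes, data: bytes) -> bytes:
    """XOR the data with the keystream generated from incremented counters."""
    counter = iv96 + b"\x00\x00\x00\x01"   # Y0 for a 96-bit IV
    out = bytearray()
    for offset in range(0, len(data), 16):
        counter = incr32(counter)           # Y1, Y2, ... for the data blocks
        keystream = stand_in_cipher(key, counter)
        chunk = data[offset:offset + 16]
        out.extend(p ^ k for p, k in zip(chunk, keystream))
    return bytes(out)
```

Because the keystream does not depend on the data, applying ctr_encrypt a second time with the same key and IV restores the plaintext, and consecutive counter values can be fed into a pipelined core without waiting for earlier results.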
[Figure 6.1 labels. Blocks: Key Generator, AES Round, Pipeline Counter, Round Register.
Signals: Input Data, Data Enable, Round Enable, Round Data Input, Pipeline Count,
Intermediate Round Output, Generated Key, Last Round.]
Figure 6.2 shows the diagrammatic representation of the Encryption Block.
Figure 6.2 Diagrammatic Representation of Encryption Block
6.1.2 Galois Field Multiplier (GFM) Core
The GF(2^128) multiplier core takes in AAD and C sequentially and applies GF(2^128)
multiplication with the hashing key H. H is required to generate the Matrix U as shown in
Equation 4.2. Multiplying two 128-bit operands in a GF(2^128) field takes a lot of area and
lowers the achievable frequency due to its complexity.
In this design it is proposed that the Hashing Matrix U is computed and stored as a
memory block. This is done by using a constant key and calculating H beforehand. All
128 elements of Matrix U are calculated using Equation 4.2 and then stored as
memory blocks. This solves the problem of high area usage seen in a basic
Mastrovito bit-parallel multiplier and, because most of the complex operations are
removed, leaving just XOR and AND operations, a considerably better operational
frequency is achievable.
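The effect of storing the matrix can be sketched in software as follows. The table is built once for a fixed hashing key and every later multiplication only selects and XORs stored rows. This is a minimal model under the assumption that row i of the stored matrix corresponds to the key multiplied by the i-th power of the field variable, in the bit ordering of the GCM specification [33]; the exact layout of Matrix U from Equation 4.2 may differ, and the function names are chosen here.

```python
R = 0xE1 << 120  # GCM reduction constant for x^128 + x^7 + x^2 + x + 1


def build_hash_table(h: int) -> list:
    """Precompute the 128 rows h * x^i once; this models the stored matrix."""
    table, v = [], h
    for _ in range(128):
        table.append(v)
        v = (v >> 1) ^ R if v & 1 else v >> 1
    return table


def multiply_with_table(table: list, x: int) -> int:
    """Multiply x by the fixed key: only row selection (AND) and XOR remain."""
    z = 0
    for i, row in enumerate(table):
        if (x >> (127 - i)) & 1:   # bit i of x, counted from the left
            z ^= row
    return z
```

All reduction work is moved into build_hash_table, which runs only when the key is fixed, so the per-block path contains no shifts or conditional reductions.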
Figure 6.3 shows the diagrammatic representation of this design.
[Figure 6.2 labels: Encryption Block containing incr and AES Core; inputs IV, Key and P;
internal signal Keystream; output C.]
Figure 6.3 Diagrammatic Representation of Pipelined GF(2^128) Multiplier
Following are the characteristics of the Authentication Block:
• The GF multiplier core has a 128-bit input for the AAD and the C generated by the
encryption block, which are entered sequentially.
• The corresponding bits of Matrix U are read from the memory to be XORed
with the bits from the data input.
• All the XOR operations are done in 1 clock cycle (Mastrovito parallel multiplier).
• The output of the GF multiplier is looped back and XORed with the next block
to be sent as input.
• After the final block the 128-bit output is XORed with the keystream
generated by the AES core.
• The resultant value is sent out as the MAC.
Figure 6.4 shows the diagrammatic representation of the Authentication Block:
Figure 6.4 Diagrammatic Representation of Authentication Block
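The feedback loop of the Authentication Block can be modelled as below. The sketch follows the GHASH and tag computation of the GCM specification [33]; the final length block len(A) || len(C) is taken from that specification and is an addition to the steps listed above. The function names are chosen here, and the multiplier is the simple serial software model rather than the stored-matrix hardware.

```python
R = 0xE1 << 120  # GCM reduction constant


def gf128_mul(x: int, y: int) -> int:
    """GF(2^128) multiplication in GCM bit ordering (software model)."""
    z, v = 0, y
    for i in range(128):
        if (x >> (127 - i)) & 1:
            z ^= v
        v = (v >> 1) ^ R if v & 1 else v >> 1
    return z


def ghash(h: int, blocks: list) -> int:
    """Feedback loop: Z = (Z xor X_i) * H for every 128-bit input block."""
    z = 0  # models the Z register
    for x in blocks:
        z = gf128_mul(z ^ x, h)
    return z


def gcm_tag(h: int, ek_y0: int, aad: list, cipher: list,
            aad_bits: int, cipher_bits: int) -> int:
    """Hash AAD, C and the length block, then XOR with the keystream E(K, Y0)."""
    length_block = (aad_bits << 64) | cipher_bits  # len(A) || len(C), per [33]
    return ghash(h, aad + cipher + [length_block]) ^ ek_y0
```

With the values of the second test case in [33] (all-zero key, IV and one all-zero plaintext block), gcm_tag can be called with h = 0x66e94bd4ef8a2c3b884cfa59ca342b2e, ek_y0 = 0x58e2fccefa7e3061367f1d57a4e7455a and the single ciphertext block 0x0388dace60b6a392f328c2b971b2fe78 to check the model against the published tag.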
[Figure 6.3 labels: Matrix U Memory Block, GF(2^128) Multiplier, Input 128-bits,
Output 128-bits, Z register 128-bits.]
[Figure 6.4 labels: Authentication Block containing GF(2^128) Multiplier Core and Matrix U;
inputs AAD & C and Keystream; feedback Z; output MAC.]
6.1.3 Complete Design
The AES core and GFM core explained above are implemented together to form a
Crypto-Core. A state machine is implemented to utilize the two cores according to the
requirements. To understand the state machine of the Crypto-Core, the signals of
the Crypto-Core first have to be defined:
Signal               Width     Type     Description
data_in              128-bit   Input    128-bit data input interface
data_in_valid        4-bit     Input    Notifies that data is available on data_in
data_in_type         1-bit     Input    Notifies data type (1 = AAD, 0 = Plaintext)
data_in_not_ready    1-bit     Output   Output busy bit
data_in_last_word    1-bit     Input    Notifies the last block of input data
data_in_size         4-bit     Input    Notifies size of data (0 – 15 => 8 bit – 128 bit)
start                1-bit     Input    Notifies start operation
IV_valid             1-bit     Input    Notifies that data on data_in is IV
data_out             128-bit   Output   128-bit data output interface
data_out_valid       1-bit     Output   Notifies that data is available on data_out
data_out_size        4-bit     Output   Notifies size of data (0 – 15 => 8 bit – 128 bit)
data_out_last_word   1-bit     Output   Notifies the last block of output data
tag_valid            1-bit     Output   Notifies that Tag is available on data_out
Table 6.1 I/O Signals of Crypto-Core
The top layer of the AES-GCM core controls the operation of the two cores with the help
of a state machine. The state machine defines how the AES-GCM core operates on the
incoming data blocks. Each state of the state machine is described as follows:
IDLE: In the IDLE state the system checks the start and IV_valid signals. Once both
are high, it stores the IV available on the data_in port in a register (Yi) and changes the
state to INIT_COUNTER.
INIT_COUNTER: In the INIT_COUNTER state the system sends the value from Yi to
the AES core to generate the keystream. The state is changed to ENCRYPT_Y0.
ENCRYPT_Y0: The system waits for the AES core to generate the keystream, which is
stored in a register (EkY0). The state is changed to DATA_ACCEPT.
DATA_ACCEPT: The data_in_not_ready signal is set to '0'. The system checks
data_in_valid[0]; if it is high, it checks data_in_type. For data_in_type = 1 (AAD) it
starts the GFM count register and sends the data to the GFM input, and the state is
changed to GFM_MULT. For data_in_type = 0 (Plaintext) the system increments the Yi
register by 1 and changes the state to INC_COUNTER_1.
INC_COUNTER_1: The system sends the value from Yi to the input of the AES core and
data_in_valid[0] is sent to the data enable of the AES core. The data on data_in is stored
in a register. The system checks the data_in_valid[1] signal; if it is '1' the system
increments the Yi register by 1 and the stream register by 1, and the state is changed to
INC_COUNTER_2. If it is '0' the state is changed to ENCRYPT1.
INC_COUNTER_2: The system sends the value from Yi to the input of the AES core and
data_in_valid[1] is sent to the data enable of the AES core. The data on data_in is stored
in a register. The system checks the data_in_valid[2] signal; if it is '1' the system
increments the Yi register by 1 and the stream register by 1, and the state is changed to
INC_COUNTER_3. If it is '0' the state is changed to ENCRYPT1.
INC_COUNTER_3: The system sends the value from Yi to the input of the AES core and
data_in_valid[2] is sent to the data enable of the AES core. The data on data_in is stored
in a register. The system checks the data_in_valid[3] signal; if it is '1' the system
increments the Yi register by 1 and the stream register by 1, and the state is changed to
INC_COUNTER_4. If it is '0' the state is changed to ENCRYPT1.
INC_COUNTER_4: The system sends the value from Yi to the input of the AES core and
data_in_valid[3] is sent to the data enable of the AES core. The data on data_in is stored
in a register. The state is changed to ENCRYPT1.
ENCRYPT1: The system waits for the AES core to generate the keystream and the
generated keystream is XORed with the value in r_datain0. The data is sent to data_out
and data_out_valid is set to 1. The output data is also sent as input to the GFM. The
system checks the stream register. If it is not '0' it decrements the stream register by 1
and the state is changed to ENCRYPT2; otherwise it is changed to GFM_MULT.
ENCRYPT2: The system waits for the AES core to generate the keystream and the
generated keystream is XORed with the value in r_datain1. The data is sent to data_out
and data_out_valid is set to 1. The output data is also sent as input to the GFM. The
system checks the stream register. If it is not '0' it decrements the stream register by 1
and the state is changed to ENCRYPT3; otherwise it is changed to GFM_MULT.
ENCRYPT3: The system waits for the AES core to generate the keystream and the
generated keystream is XORed with the value in r_datain2. The data is sent to data_out
and data_out_valid is set to 1. The output data is also sent as input to the GFM. The
system checks the stream register. If it is not '0' it decrements the stream register by 1
and the state is changed to ENCRYPT4; otherwise it is changed to GFM_MULT.
ENCRYPT4: The system waits for the AES core to generate the keystream and the
generated keystream is XORed with the value in r_datain3. The data is sent to data_out
and data_out_valid is set to 1. This data is sent as input to the GFM and the state is
changed to GFM_MULT.
GFM_MULT: The system checks the data_in_last_word signal. If it is '0' the state is
changed back to DATA_ACCEPT. If it is '1' the state is changed to PRE_TAG_CALC.
PRE_TAG_CALC: The GFM count register is reset to '0' and the state is changed to
TAG_CALC.
TAG_CALC: The output of the multiplier is XORed with the value in EkY0 to generate
the Tag. The generated Tag is sent to the data_out port and the tag_valid signal is set to '1'.
The state diagram in Figure 6.5 shows the flow of the state machine.
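The transitions described above can be summarized as a small software model. The sketch below only captures the next-state logic, not the datapath or the wait cycles inside the cores. The inputs aes_done and stream are modelling names assumed here (the first stands for the done signal of the AES core, the second for the current value of the stream register), and the return from TAG_CALC to IDLE is an assumption, since the description does not state what follows TAG_CALC.

```python
def next_state(state: str, sig: dict) -> str:
    """Return the next state for the current state and a dict of signals.

    Signal names follow Table 6.1, except aes_done and stream, which are
    assumed for this model. Missing signals default to 0.
    """
    def get(name):
        return sig.get(name, 0)

    valid = get("data_in_valid")  # 4-bit vector held as an integer

    if state == "IDLE":
        return "INIT_COUNTER" if get("start") and get("IV_valid") else "IDLE"
    if state == "INIT_COUNTER":
        return "ENCRYPT_Y0"
    if state == "ENCRYPT_Y0":
        return "DATA_ACCEPT" if get("aes_done") else "ENCRYPT_Y0"
    if state == "DATA_ACCEPT":
        if not valid & 1:
            return "DATA_ACCEPT"
        return "GFM_MULT" if get("data_in_type") else "INC_COUNTER_1"
    if state.startswith("INC_COUNTER_"):
        n = int(state[-1])
        # INC_COUNTER_n checks data_in_valid[n] to see if another block follows
        if n < 4 and (valid >> n) & 1:
            return f"INC_COUNTER_{n + 1}"
        return "ENCRYPT1"
    if state.startswith("ENCRYPT"):
        n = int(state[-1])
        if not get("aes_done"):
            return state
        return f"ENCRYPT{n + 1}" if n < 4 and get("stream") else "GFM_MULT"
    if state == "GFM_MULT":
        return "PRE_TAG_CALC" if get("data_in_last_word") else "DATA_ACCEPT"
    if state == "PRE_TAG_CALC":
        return "TAG_CALC"
    if state == "TAG_CALC":
        return "IDLE"  # assumption: not stated in the description
    raise ValueError(f"unknown state: {state}")
```

Stepping this function with data_in_valid = 0b0011 and data_in_type = 0 from DATA_ACCEPT visits INC_COUNTER_1, INC_COUNTER_2 and then ENCRYPT1, which corresponds to two plaintext blocks being placed in the pipeline.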
Figure 6.5 State Diagram of Crypto-Core State Machine
Following are the characteristics of the final completed design:
• The inputs for the Crypto-Core are a 96-bit initialization vector IV, a 128-bit
input Plaintext P and 128-bit input additional authenticated data AAD.
• The outputs are a 128-bit output Ciphertext C and a 128-bit output MAC.
• The 128-bit Cipher Key, the 256-byte SBox and the 1 Kbyte Matrix U are memory
units and are accessed from the Memory Block.
• A Control Block controls all the operations of the Crypto-Core.
The diagrammatic representation of the complete design is shown in Figure 6.6:
[Figure 6.5 labels. States: IDLE, INIT_COUNTER, ENCRYPT_Y0, DATA_ACCEPT,
INC_COUNTER (x4), ENCRYPT (x4), GFM_MULT, PRE_TAG_CALC, TAG_CALC.
Transition labels: start = 1 and IV_valid = 1; Start AES core; AES Done;
data_in_valid = 1 and data_in_type = 1; data_in_valid = 1 and data_in_type = 0;
Start AES core for next block; data_in_last_word = 0.]
Figure 6.6 Complete Design Layout
6.2 Implementation and Results
6.2.1 Implementation Platform
The design explained above has been implemented and tested on Altera-based hardware.
The board used for the hardware implementation is the DB5CGXFC7 provided by
Devboards GmbH⁴. The DB5CGXFC7 board is based on the Altera Cyclone V GX
device and has an Altera EP5CGXFC7C6F23C7N FPGA chip, which contains 150K
Logic Elements and 7 Mbit of RAM. This is a low-end device of the Altera FPGA family
and was selected because the idea behind the implementation of this Crypto-Core was to
have a design that is suitable for low-end devices as well. The design was also simulated
using the ModelSim – Altera 10.4b software with test_bench.v (see Appendix).
⁴ http://www.devboards.de/en/home/boards/product-details/article/db5cgxfc7/
[Figure 6.6 labels: AES-GCM Crypto-Core containing Encryption Block, Authentication
Block, Control Block and Memory Block (Key, Sbox, Matrix U); inputs IV, P and AAD;
internal signals Keystream, C and AAD & C; outputs C and MAC.]
6.2.2 Tests & Results
The system has been tested and compared with previous studies to provide a better
understanding of its performance. The designs were tested for area usage, achievable
operational frequency, throughput and efficiency.
The proposed design is simulated on ModelSim to determine the total delay in receiving
the output. It is then compared with the design in [10], with both implemented on the
same hardware platform. A further implementation, denoted {1}, is also compared; it
implements the Mastrovito multiplier in four steps as described in [11], without storing
Matrix U in memory, and implements the AES round transformations in sub-pipelined mode.
Table 6.2 shows the comparison of the designs with respect to achievable
operational frequency, clock cycles taken for 4 blocks of data and throughput. The
graphical representation of the comparison is given in Figure 6.7.
Design            Operational Frequency   Clock Cycles per 4 Blocks   Throughput
Proposed Design   140.17 MHz              44                          1.63 Gbps
[10]              146.34 MHz              44                          1.87 Gbps
{1}               159.32 MHz              56                          1.46 Gbps
Table 6.2 Comparison of Operational Frequency and Throughput
Figure 6.7 Graphical Comparison of Frequency and Throughput
[Bar chart: Frequency (GHz) and Throughput (Gbps) for Proposed Work, [10] and {1};
values as listed in Table 6.2.]
The comparison of area usage, in Adaptive Logic Modules (ALMs) used, and of
efficiency (throughput/area) is shown in Table 6.3. Figure 6.8 shows the graphical
comparison of the area usage and Figure 6.9 shows the graphical comparison of the
efficiencies of the designs.
Design            Area (ALM)   Efficiency (Mbps/ALM)
Proposed Design   1176         1.39
[10]              1582         1.18
{1}               2784         0.52
Table 6.3 Comparison of Area Usage and Efficiency
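The relation between the columns of Tables 6.2 and 6.3 can be checked with a short calculation. The throughput formula below (frequency times the number of bits in 4 blocks, divided by the clock cycles per 4 blocks) is an assumption that is not stated explicitly in the text; it reproduces the reported throughput of the proposed design and of {1}. For [10] the reported throughput of 1.87 Gbps is used as given, since it is not reproduced by the listed 44 cycles under this assumed formula. The function names are chosen here.

```python
BLOCK_BITS = 128  # size of one AES block
BLOCKS = 4        # Table 6.2 counts clock cycles per 4 blocks


def throughput_mbps(freq_mhz: float, cycles_per_4_blocks: int) -> float:
    """Assumed formula: bits processed per second, in Mbps."""
    return freq_mhz * BLOCKS * BLOCK_BITS / cycles_per_4_blocks


def efficiency(throughput: float, alms: int) -> float:
    """Efficiency as throughput per area, in Mbps per ALM."""
    return throughput / alms


if __name__ == "__main__":
    proposed = throughput_mbps(140.17, 44)
    design_1 = throughput_mbps(159.32, 56)
    print(f"Proposed: {proposed / 1000:.2f} Gbps, {efficiency(proposed, 1176):.2f} Mbps/ALM")
    print(f"{{1}}: {design_1 / 1000:.2f} Gbps, {efficiency(design_1, 2784):.2f} Mbps/ALM")
    print(f"[10]: {efficiency(1870, 1582):.2f} Mbps/ALM (reported throughput)")
```

The calculation shows that the efficiency figure is driven mainly by the area column: {1} has the highest clock frequency but also more than twice the ALM count of the proposed design.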
Figure 6.8 Graphical Comparison of Area Usage
[Bar chart: Area (ALMs) for Proposed Work, [10] and {1}; values as listed in Table 6.3.]
Figure 6.9 Graphical Comparison of Efficiency
[Bar chart: Efficiency (Mbps/ALM) for Proposed Work, [10] and {1}; values as listed in
Table 6.3.]
6.3 Conclusion
The results in the previous section show that although {1} achieves a higher frequency,
its throughput is much lower because of the latency created by dividing the
multiplication into four iterations, which takes 4 clock cycles. Calculating Matrix U in
logic also increases the area usage, which decreases the overall efficiency of the system.
In the proposed design, on the other hand, Matrix U is pre-calculated and stored as a
memory block, thus decreasing the area cost. The memory usage, even after storing
Matrix U and the SBox as memory blocks, is nearly 1%. The efficiency of the proposed
design (1.39 Mbps/ALM) is higher than that of both designs in comparison (1.18 and
0.52 Mbps/ALM). The constant key does pose a security risk, but for a hardware-level
implementation in an automated system this design is very suitable, since the
cryptographic key can be kept constant and is not user provided.
Appendix
The source codes are provided in a CD along with the Thesis. Following is a brief
description of the source code files:
Name                        Description
AESGCM.v                    Top-level file for AES-GCM Crypto-Core
GFM.v                       Implementation of Galois Field Multiplier
AESConstants.v              Necessary definitions for the core
AESCore.v [10]              Top-level for AES Core
AESRound.v [10]             Implementation of Round Transformations
Counter.v [10]              Counter function to count round number
defineAES.v [10]            Pipeline definitions
lib.v [10]                  Other primitives
MixColumnAddKey.v [10]      MixColumn and AddRoundKey implementation
KeyGenerator.v [10]         Key Generation implementation
SBox1.v [10]                SBox as LUT
SBox2.v [10]                SBox and polynomial multiply by 02 as LUT
SBox1_2LUT.v [10]           SBox top-level
Bibliography
[1] OPC Unified Architecture, Interoperability for Industry 4.0 and the Internet of
Things, OPC Foundation Forum.
[2] Klaus Schwab, The Fourth Industrial Revolution, World Economic Forum, 2016.
[3] Mario Hermann, Tobias Pentek and Boris Otto, Design Principles for Industrie
4.0 Scenarios, System Sciences (HICSS), 2016.
[4] J. Daemen and V. Rijmen, AES Proposal: Rijndael, AES Algorithm Submission,
September 3, 1999.
[5] Federal Information Processing Standards (FIPS), Specification for the
Advanced Encryption Standard (AES), FIPS Publication 197, November 26,
2001.
[6] Harris Nover, Algebraic Cryptanalysis of AES: An Overview, Department of
Mathematics, University of Wisconsin.
[7] Sheng Wang, An Architecture for AES-GCM Security Standard, Master‟s Thesis
presented at University of Waterloo, Canada, 2006.
[8] E. D. Mastrovito, VLSI Designs for Multiplication over Finite Fields GF(2^m), in
Proc. Sixth International Conference, Applied Algebra, Algebraic Algorithms and
Error-Correcting Codes (AAECC-6), Rome, July 1988.
[9] Ricardo Chaves, Georgi Kuzmanov, Stamatis Vassiliadis and Leonel Sousa,
Reconfigurable Memory based AES Co-Processor, Parallel and Distributed
Processing Symposium, 2006.
[10] Haeyoung Rha and Hae-wook Choi, Efficient Pipelined Multistream AES CCMP
Architecture for Wireless LAN, Paper submitted at Korea Advanced Institute of
Science & Technology (KAIST), 2012.
[11] Bryce Barcelo and John Taylor, Crypto Acceleration Using Asynchronous
FPGAs, Submitted to the faculty of Worcester Polytechnic Institute.
[12] Cheng Wang and Howard M. Heys, Using a Pipelined SBox in Compact AES
Hardware Implementations, IEEE NEWCAS2010, pp. 101-104, 2010.
[13] Arash Reyhani-Masoleh, Mehran Mozaffari-Kermani, Efficient and High-
Performance Parallel Hardware Architectures for the AES-GCM, IEEE
Transactions on Computers, vol. 61, no. , pp. 1165-1178, Aug. 2012.
[14] Muhammad H. Rais and Syed M. Qasim, Efficient Hardware Realization of
Advanced Encryption Standard Algorithm using Virtex-5 FPGA, International
Journal of Computer Science and Network Security (IJCSNS) Vol. 9 No. 9,
September 2009.
[15] Abolfazl Soltani and Saeed Sharifian, An Ultra-High Throughput and fully
Pipelined Implementation of AES algorithm on FPGA, Journal:
Microprocessors and Microsystems Vol. 39 Issue 7, Amsterdam, October 2015.
[16] A. Brokalakis and H. Michail, A High-Speed and Area-Efficient Hardware
Implementation of AES-128 Encryption Standard, 5th WSEAS Conference on
Multimedia, Internet and Video Technologies, Greece, August 2005.
[17] D. Chen, G. Shou, Y. Hu and Z. Guo, Efficient Architecture and
Implementations of AES, IEEE ICACTE2010, pp. V6-295-V6-298, 2010.
[18] Nadia Nedjah, Luiza de Macedo Mourelle, Marco Paulo Cardoso, A Compact
Pipelined hardware Implementation of the AES-128 Cipher, IEEE ITNG2006,
2006.
[19] Kenneth Stevens, Otmane A. Mohamed, Single-chip FPGA Implementation of a
Pipelined, Memory-Based AES Rijndael Encryption Design, IEEE ECE2005, pp.
1296-1299, 2005.
[20] Deen Kotturi, Seong-Moo Yoo, and John Blizzard, AES Crypto Chip Utilizing
High-Speed Parallel Pipelined Architecture, IEEE ISCAS2005, pp. 4653-4656,
vol. 5, 2005.
[21] J. Guajardo, T. Güneysu, Sandeep S. Kumar, C. Paar and J. Pelzl, Efficient
Hardware Implementation of Finite Fields with Applications to Cryptography,
Acta Appl Math (2006) 93: 75–118, September 2006.
[22] A. Karatsuba and Y. Ofman, Multiplication of Multidigit Numbers on Automata,
Soviet Physics Doklady, 7:595, 1963.
[23] Emilia Käsper and Peter Schwabe, Faster and Timing-Attack Resistant AES-
GCM, Katholieke Universiteit Leuven.
[24] Bo Yang, Sambit Mishra and Ramesh Karri, High Speed Architecture for
Galois/Counter Mode of Operation (GCM), Polytechnic University, Brooklyn,
New York.
[25] H. Fan and M.A. Hasan, A New Approach to Subquadratic Space Complexity
Parallel Multipliers for Extended Binary Fields, IEEE Transactions on
Computers, 56(2):224–233, 2007.
[26] A. Satoh, High-Speed Parallel Hardware Architecture for Galois Counter
Mode, IEEE International Symposium on Circuits and Systems(ISCAS 2007),
pages 1863–1866, 2007.
[27] M.A. Hasan, Matrix-vector Product based Subquadratic Arithmetic Complexity
Schemes for Field Multiplication, Proceedings of SPIE, 6697:669702, 2007.
[28] H. Fan and Y. Dai, Fast Bit-Parallel GF(2^n) Multiplier for All Trinomials, IEEE
Transactions on Computers, 54(4):485–490, 2005.
[29] C. Paar, A new Architecture for a Parallel Finite Field Multiplier with low
Complexity based on Composite Fields, IEEE Transactions on Computers,
45(7):856–861, 1996.
[30] Pujan Patel, Parallel Multiplier Designs for the Galois/Counter Mode of
Operation, A thesis presented to the University of Waterloo, Waterloo, Ontario,
Canada, 2008.
[31] Bruce Schneier, John Kelsey, Doug Whiting, David Wagner, Chris Hall, Niels
Ferguson, Tadayoshi Kohno, The Twofish Team's Final Comments on AES
Selection, May 2000.
[32] NIST Computer Security Division's (CSD) Security Technology Group (STG),
Proposed modes, Cryptographic Toolkit, NIST, April 14, 2013.
[33] David A. McGrew, John Viega, The Galois/Counter Mode of Operation, NIST,
2005.
  • 6.
        4.1.2 Cipher Block Chaining (CBC)
        4.1.3 Counter Mode (CTR)
      4.2 Counter with CBC-MAC (CCM)
        4.2.1 Introduction
        4.2.2 Algorithm
      4.3 Galois/Counter Mode (GCM)
        4.3.1 Introduction
        4.3.2 Algorithm
          4.3.2.1 GHASH
          4.3.2.2 GCTR
        4.3.3 Authenticated Encryption
        4.3.4 Authenticated Decryption
    5 Literature Review
      5.1 AES Designs
        5.1.1 Iterative Loop Structure
        5.1.2 Memory-Based AES Hardware Implementation Designs
        5.1.3 Unfolded / Parallel Structure Based Designs
        5.1.4 Sub-Pipelined or Stage-Level Pipelined Structure
      5.2 Galois Field Multiplier (GFM) Designs
    6 Design & Implementation Results
      6.1 Design
        6.1.1 AES Core
        6.1.2 Galois Field Multiplier (GFM) Core
        6.1.3 Complete Design
      6.2 Implementation and Results
        6.2.1 Implementation Platform
        6.2.2 Tests & Results
      6.3 Conclusion
    Appendix
    Bibliography
  • 7. List of Figures
    Figure 2.1 Diagrammatic Representation of Encryption/Decryption
    Figure 2.2 Diagrammatic Representation of Authentication
    Figure 3.1 Diagrammatic Representation of a State
    Figure 3.2 Flow Diagram of Round Transformations in AES Encryption
    Figure 3.3 SubBytes Transformation
    Figure 3.4 ShiftRows Transformation
    Figure 3.5 Diagrammatic Representation of MixColumns
    Figure 3.6 Flow Diagram of Key Expansion Algorithm
    Figure 3.7 Diagrammatic Representation of RotWord
    Figure 3.8 Diagrammatic Representation of AddRoundKey Transformation
    Figure 3.9 Flow Diagram Representation of AES Decryption
    Figure 3.10 Diagrammatic Representation of InverseShiftRows Transformation
    Figure 3.11 Diagrammatic Representation of InverseMixColumns
    Figure 4.1 Diagrammatic Representation of ECB Mode
    Figure 4.2 Diagrammatic Representation of CBC Mode
    Figure 4.3 Diagrammatic Representation of CTR Mode
    Figure 4.4 Diagrammatic Representation of CBC-MAC
    Figure 4.5 Diagrammatic Representation of the GHASH function
    Figure 4.6 Diagrammatic Representation of GF(2^128) Multiplication
    Figure 4.7 Diagrammatic Representation of the Hashing sequence
    Figure 4.8 Diagrammatic Representation of Authenticated Encryption
    Figure 4.9 Diagrammatic Representation of Authenticated Decryption
    Figure 5.1 Iterative Loop Implementation
    Figure 5.2 Diagrammatic Representation of Memory Based AES structure
    Figure 5.3 Unrolled AES Structure
    Figure 5.4 (a) Pipelined AES encryption datapath, (b) Pipelined Key Scheduler
    Figure 5.5 Stages for Sub-Pipelining
    Figure 6.1 Internal Structure of AES core
    Figure 6.2 Diagrammatic Representation of Encryption Block
    Figure 6.3 Diagrammatic Representation of Pipelined GF(2^128) Multiplier
    Figure 6.4 Diagrammatic Representation of Authentication Block
  • 8.
    Figure 6.5 State Diagram of Crypto-Core State Machine
    Figure 6.6 Complete Design Layout
    Figure 6.7 Graphical Comparison of Frequency and Throughput
    Figure 6.8 Graphical Comparison of Area Usage
    Figure 6.9 Graphical Comparison of Efficiency
  • 9. List of Tables
    Table 2.1 GF(2^8) Polynomial Representation
    Table 3.1 AES Number of Rounds with respect to Key Length
    Table 3.2 S-box: Substitution values for byte 'xy' (hexadecimal)
    Table 3.3 Inverse S-box: Substitution values for byte 'xy' (hexadecimal)
    Table 6.1 I/O Signals of Crypto-Core
    Table 6.2 Comparison of Operational Frequency and Throughput
    Table 6.3 Comparison of Area usage and Efficiency
  • 10.
  • 11. 1 Introduction
    This thesis, as its title suggests, compares AES (Advanced Encryption Standard) crypto-cores in order to find a good and optimized way of transferring data securely. There is a growing need for safe and secure data transfer, and data security and integrity are needed to prevent unauthorized access to data. Security protocols are used in everyday computing, and whenever the security of one of the standard protocols is compromised, it is either redesigned immediately or replaced with an improved standard; for example, the Data Encryption Standard (DES) was replaced by AES [6]. A good security protocol should have a short processing time and a high throughput, should be reliable and, at times, must provide good authentication.
    The purpose of encryption is that only those who are authorized to read a message can read it, by decrypting it with a valid decryption algorithm. Every sector aims to choose an encryption scheme that meets its security standards. A relatively good encryption scheme is one for which deciphering the encrypted data is nearly impossible, but in most cases even the most promising algorithm can be broken with a suitable attack. Choosing the right encryption is therefore very important, keeping in mind the known threats to the data in question.
    1.1 Motivation: Industry 4.0 and Data Security
    The fourth industrial revolution has set the trend for the exchange of data and rests on four principles. First comes the principle of interoperability, according to which sensors, devices, people and machines are connected and communicate through the Internet of Things (IoT) and the Internet of People (IoP). Second is the principle of information transparency, i.e., information systems create a virtual copy of the physical world by combining higher-value context information with raw sensor data.
    Third is the principle of technical assistance, which covers two abilities. The first is to support humans in solving problems and making informed decisions at short notice through comprehensible aggregation and visualization of information. The second is to support humans physically through cyber-physical systems.
  • 12. The fourth and last principle is decentralized decisions: cyber-physical systems make their own decisions and perform their tasks autonomously, and only in exceptional cases is a task delegated to a higher level [3].
    Industry 4.0 has created a growing need for systems with integrated cryptography, and many organizations have therefore suggested standards for data security and integrity that meet the Industry 4.0 requirements. Open Platform Communications Unified Architecture (OPC-UA), maintained by the non-profit OPC Foundation, is becoming more and more popular as a safe, platform-independent and reliable data exchange standard. It is developed in association with many researchers, manufacturers and users and is continually improved to remain competitive. In order to implement the vision of Industry 4.0, OPC-UA has suggested that energy and resources must be used more effectively, in minimum time, to reduce complexity. This requires certain activities, which include automating and optimizing the system and shaping the digitalization of all industrial sectors. This research is based on the security standard set by OPC-UA, which includes scalable mechanisms and the introduction of fast but area-efficient crypto-cores [1], [2].
    1.2 Objectives
    Secure data transfer requires the implementation of a good and reliable encryption standard. The Advanced Encryption Standard (AES) is a widely used standard which, in past years, has been considered one of the most reliable encryption standards available. AES has therefore been chosen as the encryption standard in this research, and the main objectives of this thesis are the following:
    - Study various AES techniques to gain a better understanding for further optimization.
    - Implement a fast pipelined AES technique as an IP core with considerably less area cost.
    - Implement two AES modes, i.e., Galois/Counter Mode (GCM) and Counter with CBC-MAC (CCM), as IP cores for an accurate comparison.
  • 13.
    - Devise a method for implementing the Galois Field multiplication in GF(2^128) required for GCM that has considerably less area cost but is still fast.
  • 14.
  • 15. 2 Theoretical and Mathematical Preliminaries
    This chapter discusses the theoretical and mathematical preliminaries that are used in the following chapters.
    2.1 Essentials
    The main idea behind this thesis is to provide a hardware security platform that makes the transfer of information from one node to another much more secure at the hardware level. Before going into the details of how this is done, the basic essentials are defined.
    Encryption/Decryption: Encryption is a process that protects data in such a way that only authorized units (people, machines, etc.) can access it. This is done by encoding the data with a special key, so that it can only be decoded by whoever has that key. For example, if some information has to be exchanged between two units, A and B, and no unauthorized unit may access that data, encryption is used. The data might still be accessible, but it is unreadable for anyone who does not have the authorization key. The process of decoding the encrypted data into the original data is called decryption. Figure 2.1 illustrates encryption and decryption.
    Figure 2.1 Diagrammatic Representation of Encryption/Decryption (the message "Hello" is encrypted to the unreadable text "f3#7r" and decrypted back to "Hello")
  • 16. Authentication: Authentication is a process that checks whether the data received is the data that was sent, i.e., it ensures that the data has not been changed or modified in any way during the transfer. This is achieved by adding a verification tag to the message, known as a Message Authentication Code (MAC). On the sender's end, the MAC algorithm takes a secret key and the message to be sent and generates a MAC, which is sent along with the message. On the receiver's end, the received message is passed through the same MAC algorithm with the same secret key to generate a second MAC, which is compared with the received MAC to see whether the message is authentic or has been tampered with. Figure 2.2 shows how authentication works.
    Figure 2.2 Diagrammatic Representation of Authentication (sender, channel and receiver; the receiver decides whether the message is authentic or modified)
    Crypto-Core: An IP (intellectual property) core is a block of logic or data that is used in making a field programmable gate array (FPGA) or application-specific integrated circuit (ASIC) design. Ideally, an IP core should be entirely portable, that is, easy to insert into any vendor technology or design methodology. Universal Asynchronous Receiver/Transmitters (UARTs), central processing units (CPUs), Ethernet controllers and PCI interfaces are all examples of IP cores. Crypto-cores are another example of an IP core; they embed encryption within the hardware.
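The sender/receiver MAC check described under Authentication can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses HMAC-SHA256 from the Python standard library as a stand-in MAC algorithm, which is not the CBC-MAC or GHASH construction used later in the thesis, and the names `make_tag`, `is_authentic` and the example key are hypothetical.

```python
import hashlib
import hmac


def make_tag(key: bytes, message: bytes) -> bytes:
    """Sender side: derive a MAC from the secret key and the message."""
    # Assumption: HMAC-SHA256 stands in for the MAC algorithm.
    return hmac.new(key, message, hashlib.sha256).digest()


def is_authentic(key: bytes, message: bytes, received_tag: bytes) -> bool:
    """Receiver side: recompute the MAC and compare it with the received one."""
    expected_tag = make_tag(key, message)
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(expected_tag, received_tag)


shared_key = b"example shared secret"   # hypothetical key known to both ends
message = b"Hello"
tag = make_tag(shared_key, message)

print(is_authentic(shared_key, message, tag))    # unmodified message
print(is_authentic(shared_key, b"Hellp", tag))   # message changed in transit
```

The receiver never needs the sender's intermediate state, only the shared key, the message and the tag, which is why the MAC can simply travel alongside the message.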
  • 17. 2.2 Standards and Information
    The AES algorithm takes sequences of bits (binary values), referred to as blocks, as input and output. The number of bits contained in a block is known as the block length. The AES algorithm requires a secret key, known as the Cipher Key, to encrypt and decrypt data. AES can be implemented with three different key lengths, i.e., 128, 192 and 256 bits [5]. The AES algorithm used in this thesis is the 128-bit algorithm.
    Bits: Within such sequences, the bits are numbered from 0 to Block Length - 1. A sequence of 8 bits is known as a Byte, which is the basic unit of the AES algorithm. A sequence of 4 bytes is known as a Word.
    Index: The number attached to a bit. It ranges from 0 to 127 for a 128-bit algorithm.
    Galois Field: A field containing a finite set of elements is known as a Galois Field (GF). A Galois Field is written as GF(p^n), which denotes a Galois Field of p^n elements, where p is a prime number. A Galois Field of 256 elements is therefore written as GF(2^8). This thesis considers GF(2^8), since AES operations take GF(2^8) elements as input.
    Galois Field Polynomial: The elements of a Galois Field GF(p^n) can also be written as polynomials of degree less than n. Consider an element of GF(2^8) written in binary form as {00110101}. The GF polynomial for this element is x^5 + x^4 + x^2 + 1. Since GF(2^8) has 256 elements, Table 2.1 shows only some of the polynomial representations to provide an understanding.

    Binary     Conversion                                                       GF Polynomial
    00000000   0·x^7 + 0·x^6 + 0·x^5 + 0·x^4 + 0·x^3 + 0·x^2 + 0·x^1 + 0·x^0   0
    00000001   0·x^7 + 0·x^6 + 0·x^5 + 0·x^4 + 0·x^3 + 0·x^2 + 0·x^1 + 1·x^0   1
    00000010   0·x^7 + 0·x^6 + 0·x^5 + 0·x^4 + 0·x^3 + 0·x^2 + 1·x^1 + 0·x^0   x
    00000011   0·x^7 + 0·x^6 + 0·x^5 + 0·x^4 + 0·x^3 + 0·x^2 + 1·x^1 + 1·x^0   x + 1
    00000100   0·x^7 + 0·x^6 + 0·x^5 + 0·x^4 + 0·x^3 + 1·x^2 + 0·x^1 + 0·x^0   x^2
    00000101   0·x^7 + 0·x^6 + 0·x^5 + 0·x^4 + 0·x^3 + 1·x^2 + 0·x^1 + 1·x^0   x^2 + 1
    00000110   0·x^7 + 0·x^6 + 0·x^5 + 0·x^4 + 0·x^3 + 1·x^2 + 1·x^1 + 0·x^0   x^2 + x
    00000111   0·x^7 + 0·x^6 + 0·x^5 + 0·x^4 + 0·x^3 + 1·x^2 + 1·x^1 + 1·x^0   x^2 + x + 1
    ...        ...                                                              ...
    11111101   1·x^7 + 1·x^6 + 1·x^5 + 1·x^4 + 1·x^3 + 1·x^2 + 0·x^1 + 1·x^0   x^7 + x^6 + x^5 + x^4 + x^3 + x^2 + 1
    11111110   1·x^7 + 1·x^6 + 1·x^5 + 1·x^4 + 1·x^3 + 1·x^2 + 1·x^1 + 0·x^0   x^7 + x^6 + x^5 + x^4 + x^3 + x^2 + x
    11111111   1·x^7 + 1·x^6 + 1·x^5 + 1·x^4 + 1·x^3 + 1·x^2 + 1·x^1 + 1·x^0   x^7 + x^6 + x^5 + x^4 + x^3 + x^2 + x + 1

    Table 2.1 GF(2^8) Polynomial Representation
  • 18. 2.3 Mathematical Operations
    The elements in the AES algorithm are treated as GF(2^8) elements, as explained above. Finite field elements can be added and multiplied, but these operations differ from those used for ordinary numbers. The mathematical concepts for finite field elements are explained in this section.
    2.3.1 Addition and Subtraction
    Although it differs slightly from ordinary algebraic addition and subtraction, the addition or subtraction of two GF polynomials is very simple. To add two GF(2^n) polynomials, the polynomials are added and the coefficients of the result are then reduced modulo 2. Taking a value modulo an integer or polynomial means dividing by that integer or polynomial and taking the remainder as the answer. Subtraction of two GF(2^n) polynomials is the same as addition, so in GF(2^8) both operations amount to addition modulo 2. As an example, take two GF(2^8) polynomials, A and B. Their addition is shown in Equation 2.1:
    (2.1)
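The binary-to-polynomial conversion of Table 2.1 and the coefficient-wise addition modulo 2 can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the function names are hypothetical, and the operands {57} and {83} are the addition example from FIPS 197, used here as an assumption because the thesis's own operands are not available in this text.

```python
def to_polynomial(byte: int) -> str:
    """Write a GF(2^8) element (0..255) as a polynomial in x."""
    terms = []
    for power in range(7, -1, -1):
        if (byte >> power) & 1:          # coefficient of x^power is 1
            if power == 0:
                terms.append("1")
            elif power == 1:
                terms.append("x")
            else:
                terms.append(f"x^{power}")
    return " + ".join(terms) if terms else "0"


def gf_add(a: int, b: int) -> int:
    """Add two GF(2^8) elements: add coefficients, reduce each modulo 2."""
    result = 0
    for power in range(8):
        coefficient = ((a >> power) & 1) + ((b >> power) & 1)
        result |= (coefficient % 2) << power
    return result


print(to_polynomial(0b00110101))   # x^5 + x^4 + x^2 + 1

a, b = 0x57, 0x83                  # assumed operands (FIPS 197 example)
total = gf_add(a, b)
print(to_polynomial(a))            # x^6 + x^4 + x^2 + x + 1
print(to_polynomial(b))            # x^7 + x + 1
print(to_polynomial(total))        # x^7 + x^6 + x^4 + x^2
```

Because a coefficient of 1 + 1 reduces to 0, terms that appear in both operands cancel, which is also why subtraction gives the same result as addition.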
  • 19. Equation 2.1 shows that the same result is obtained by an exclusive-OR (XOR) operation between A and B, so the sum can also be written as A ⊕ B, where ⊕ denotes the XOR of two elements. The addition of two GF(2^8) polynomials can therefore be performed as an XOR of the two polynomials, which makes the implementation much easier.
    2.3.2 Multiplication
    To multiply two polynomials in the Galois Field GF(2^n), the polynomials are first multiplied just as in ordinary algebra, except that their coefficients are only 0 and 1; many terms drop out because 1 + 1 = 0, which makes the calculation easier. The result is then reduced modulo an irreducible polynomial of degree n. For the AES algorithm in GF(2^8), the irreducible polynomial is shown in Equation 2.2:
    $$m(x) = x^8 + x^4 + x^3 + x + 1 \tag{2.2}$$
    The multiplication is denoted by ●. Implementing the multiplication of finite field elements is somewhat more complex than addition. Reduction modulo m(x) ensures that the resulting binary polynomial is of degree less than 8 and can therefore be represented by a byte. Unlike addition, multiplication has no simple operation at the byte level. As an example, take two GF(2^8) polynomials and follow their multiplication from Equation 2.3 onwards.
    (2.3)
    This implies:
    (2.4)
    And,
  • 20. (2.5)
    To compute this remainder by long division, the first quotient term is chosen and multiplied with m(x), which results in:
    (2.6)
    Subtracting this from the product gives:
    (2.7)
    as the remainder. The degree of the polynomial in this remainder is 10 and the degree of m(x) is 8, so x^2 is selected as the next quotient term. Multiplying m(x) with x^2, the result is:
    $$x^2 \cdot m(x) = x^{10} + x^6 + x^5 + x^3 + x^2 \tag{2.8}$$
    Subtracting 2.8 from 2.7, the remainder is:
    (2.9)
    All coefficients are reduced modulo 2, so the terms with a factor of 2 in 2.9 are dropped as mentioned previously, and a coefficient of -1 is equivalent to 1. The result is therefore:
    (2.10)
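The two stages of the multiplication described above, polynomial multiplication with coefficients modulo 2 followed by long division by m(x), can be sketched as follows. This is an illustrative sketch only: `gf_mul` and `AES_MODULUS` are hypothetical names, and the operands {57}, {83} and {13} are the worked examples from FIPS 197, assumed here because the thesis's own example polynomials are not available in this text.

```python
AES_MODULUS = 0x11B   # m(x) = x^8 + x^4 + x^3 + x + 1 as a bit pattern


def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^8) elements and reduce the product modulo m(x)."""
    # Stage 1: polynomial multiplication; XOR adds coefficients modulo 2.
    product = 0
    for power in range(8):
        if (b >> power) & 1:
            product ^= a << power

    # Stage 2: long division by m(x), from the highest possible degree (14)
    # down to degree 8. Each step subtracts (XORs) m(x) * x^(degree - 8).
    for degree in range(14, 7, -1):
        if (product >> degree) & 1:
            product ^= AES_MODULUS << (degree - 8)
    return product


print(hex(gf_mul(0x57, 0x83)))   # 0xc1, i.e. x^7 + x^6 + 1
print(hex(gf_mul(0x57, 0x13)))   # 0xfe
```

In hardware the same reduction is usually interleaved with the multiplication, but keeping the two stages separate mirrors the long-division walk-through in the text.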
  • 21. 3 Advanced Encryption Standard
    For the encryption of commercial and sensitive computer data, the US government adopted the Data Encryption Standard (DES) as an official Federal Information Processing Standard (FIPS). Since this was the first encryption algorithm approved by the US government, public and private industry requiring strong encryption welcomed it readily, and it was adopted in a wide variety of embedded systems, smart cards, SIM cards and network devices. For any cipher, the most basic method of attack is brute force, which involves trying each key until the right one is found. Encryption strength therefore depends directly on the key size. DES uses a 64-bit key, of which eight bits are used for parity checks, effectively limiting the key to 56 bits. DES uses the same key to encrypt and decrypt a message, and its 56-bit keys came to be considered too small compared to the processing power of modern computers, which made DES susceptible to cyber-attacks, so that it soon began to lose its usefulness. In 1997, the U.S. National Institute of Standards and Technology (NIST) started looking for a better alternative to DES. In 2001, it selected the Advanced Encryption Standard (AES) as a replacement.
    3.1 Introduction
    The Advanced Encryption Standard (AES) [5], also known as Rijndael after its two Belgian designers, the cryptographers Joan Daemen and Vincent Rijmen, was published by NIST in 2001. It is the most commonly used encryption standard throughout the world. AES is a symmetric block cipher that operates on 128-bit blocks of input and output data and is used, in software and hardware implementations, to protect classified information and to encrypt sensitive data.
    3.1.1 Features
    AES data encryption is a more mathematically efficient and elegant cryptographic algorithm, but its main strength rests in the key length options. It is based on a design
  • 22. principle known as a substitution-permutation network, a combination of both substitution and permutation, and is fast in both software and hardware [31]. The algorithm can encrypt and decrypt blocks using a secret key with a key size of 128, 192 or 256 bits. One of the main features of AES is its simplicity, which is achieved by repeatedly combining substitution and permutation computations over several rounds, i.e., AES encrypts or decrypts a 128-bit plaintext or ciphertext by applying the same round transformation a number of times that depends on the key size.

               Key Length   Block Length   Number of Rounds
    AES-128    128-bit      128-bit        10
    AES-192    192-bit      128-bit        12
    AES-256    256-bit      128-bit        14

    Table 3.1 AES Number of Rounds with respect to Key Length
    The actual key length depends on the desired security level. Today, AES-128 is predominant and supported by most hardware implementations. It is also the variant that this implementation focuses on, since it is the preferred standard for the GCTR module of AES-GCM to provide authenticity.
    3.1.2 Usage
    The AES standard is used by the concerned departments and agencies whenever unclassified sensitive information is considered important and has to be protected cryptographically. Other cryptographic algorithms approved by FIPS are also available for use in addition to or in lieu of this standard. In past years, commercial and private organizations have also turned to this standard for the security of their information and systems.
    3.2 Encryption
    AES encryption, referred to as the Cipher, is based on the design principle commonly known as a substitution-permutation network, a combination of both substitution and permutation. In plain words, a cipher may be any method of encrypting a text, known as plaintext, so that its readability and/or
  • 23. meaning is concealed. It is a coded or disguised way of writing a message, and this coding is known as encryption. Sometimes the encrypted text itself is also referred to as a cipher, but the term generally used is ciphertext. The word is understood to originate from the Arabic word Sifr, which means empty or zero.
    AES operates on a 4 × 4 matrix of bytes, referred to as the State S, although certain variants of Rijndael operate on a larger block size with more columns in the State [5]. The majority of AES calculations are performed in a special finite field. For instance, if 16 bytes b0, b1, b2, b3, b4, ..., b15 are considered, they are represented by the matrix shown in Equation (3.1).
    $$S = \begin{bmatrix} b_0 & b_4 & b_8 & b_{12} \\ b_1 & b_5 & b_9 & b_{13} \\ b_2 & b_6 & b_{10} & b_{14} \\ b_3 & b_7 & b_{11} & b_{15} \end{bmatrix} \tag{3.1}$$
    The diagrammatic representation of a State is shown in Figure 3.1.
    Figure 3.1 Diagrammatic Representation of a State
    Initially, the input of the Cipher is copied to the State array using the conventional method. After an initial Round Key addition, the State array is transformed by applying a round function 10, 12 or 14 times, depending on the key length as discussed previously. The Cipher algorithm for a 128-bit key is explained in the form of a flow diagram in Figure 3.2. The individual transformations (AddRoundKey, ShiftRows, SubBytes and MixColumns) are explained in detail later in this chapter. As shown in Figure 3.2, all rounds (Nr) are identical with the exception of the final round (Nr = 10), which does not include the MixColumns transformation.
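The copy of a 16-byte input block into the State can be sketched as follows. The sketch assumes the column-by-column mapping defined in FIPS 197, in which byte b(r + 4c) is placed in row r and column c; the names `bytes_to_state`, `state_to_bytes` and `ROUNDS_BY_KEY_LENGTH` are hypothetical and not taken from the thesis.

```python
# Number of rounds for each key length in bits, as listed in Table 3.1.
ROUNDS_BY_KEY_LENGTH = {128: 10, 192: 12, 256: 14}


def bytes_to_state(block: bytes) -> list:
    """Copy a 16-byte block into a 4 x 4 State, filling column by column."""
    if len(block) != 16:
        raise ValueError("an AES block is exactly 16 bytes long")
    # state[row][col] holds input byte number row + 4 * col
    return [[block[row + 4 * col] for col in range(4)] for row in range(4)]


def state_to_bytes(state: list) -> bytes:
    """Read the State back out column by column (inverse of bytes_to_state)."""
    return bytes(state[row][col] for col in range(4) for row in range(4))


block = bytes(range(16))           # b0 = 0, b1 = 1, ..., b15 = 15
state = bytes_to_state(block)
for row in state:
    print(row)
# [0, 4, 8, 12]
# [1, 5, 9, 13]
# [2, 6, 10, 14]
# [3, 7, 11, 15]
```

Storing the State as a list of rows keeps ShiftRows a simple per-row rotation, while the column-wise operations index across the rows.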
  • 24. Figure 3.2 Flow Diagram of Round Transformations in AES Encryption (Start, Key Expansion and an initial AddRoundKey; for rounds Nr < 10: SubBytes, ShiftRows, MixColumns, AddRoundKey; for the final round Nr = 10: SubBytes, ShiftRows, AddRoundKey; End)
    3.2.1 SubBytes Transformation
    The substitution of bytes using an 8-bit substitution table is known as the SubBytes transformation. It operates on each byte of the State independently, using a substitution table (S-box) [4]. Figure 3.3 shows how the SubBytes transformation is applied to the State.
    Figure 3.3 SubBytes Transformation (each byte a_{i,j} of the State is replaced by b_{i,j} through the S-box)
  • 25. The S-box is an invertible byte substitution. It is derived by taking the multiplicative inverse in GF(2^8), which has good non-linearity properties and in which the element {00} is mapped to itself, followed by an affine transformation over GF(2). Table 3.2 shows the substitution table used in the AES encryption algorithm [5].

            y
    x    0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
    0    63 7c 77 7b f2 6b 6f c5 30 01 67 2b fe d7 ab 76
    1    ca 82 c9 7d fa 59 47 f0 ad d4 a2 af 9c a4 72 c0
    2    b7 fd 93 26 36 3f f7 cc 34 a5 e5 f1 71 d8 31 15
    3    04 c7 23 c3 18 96 05 9a 07 12 80 e2 eb 27 b2 75
    4    09 83 2c 1a 1b 6e 5a a0 52 3b d6 b3 29 e3 2f 84
    5    53 d1 00 ed 20 fc b1 5b 6a cb be 39 4a 4c 58 cf
    6    d0 ef aa fb 43 4d 33 85 45 f9 02 7f 50 3c 9f a8
    7    51 a3 40 8f 92 9d 38 f5 bc b6 da 21 10 ff f3 d2
    8    cd 0c 13 ec 5f 97 44 17 c4 a7 7e 3d 64 5d 19 73
    9    60 81 4f dc 22 2a 90 88 46 ee b8 14 de 5e 0b db
    a    e0 32 3a 0a 49 06 24 5c c2 d3 ac 62 91 95 e4 79
    b    e7 c8 37 6d 8d d5 4e a9 6c 56 f4 ea 65 7a ae 08
    c    ba 78 25 2e 1c a6 b4 c6 e8 dd 74 1f 4b bd 8b 8a
    d    70 3e b5 66 48 03 f6 0e 61 35 57 b9 86 c1 1d 9e
    e    e1 f8 98 11 69 d9 8e 94 9b 1e 87 e9 ce 55 28 df
    f    8c a1 89 0d bf e6 42 68 41 99 2d 0f b0 54 bb 16

    Table 3.2 S-box: Substitution values for byte 'xy' (hexadecimal)
    The value of the byte is used as an index to find the substitution byte. For example, the substitution for the byte {6d} is found at the location where x = 6 and y = d, i.e., {3c}.
    3.2.2 ShiftRows Transformation
    The ShiftRows operation cyclically shifts the bytes in each row of the State by a certain offset. In AES, the first row, r = 0, remains as it is, and the bytes of the second row are shifted to the left by an offset of 1. Similarly, the third and fourth rows are shifted to the left by offsets of 2 and 3, respectively [5].
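The S-box lookup and the ShiftRows offsets described above can be sketched as follows. Instead of typing in Table 3.2, the sketch regenerates the table from its definition in FIPS 197 (multiplicative inverse in GF(2^8) followed by the affine transformation with constant {63}); this is an assumption about how the table is derived, not a description of the thesis's hardware, and all function names are hypothetical. The State is assumed to be a list of four rows.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^8) elements modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:            # degree 8 reached: subtract m(x)
            a ^= 0x11B
    return result


def gf_inverse(a: int) -> int:
    """Multiplicative inverse in GF(2^8); {00} is mapped to itself."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):         # a^254 equals a^-1 because a^255 = 1
        result = gf_mul(result, a)
    return result


def sbox_entry(byte: int) -> int:
    """Inverse followed by the affine transformation of FIPS 197."""
    inv = gf_inverse(byte)
    out = 0x63
    for shift in range(5):       # inv XOR its left-rotations by 1, 2, 3, 4
        out ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
    return out


SBOX = [sbox_entry(value) for value in range(256)]


def sub_bytes(state: list) -> list:
    """Replace every byte of the State by its S-box entry."""
    return [[SBOX[byte] for byte in row] for row in state]


def shift_rows(state: list) -> list:
    """Rotate row r of the State to the left by r positions."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


print(hex(SBOX[0x6D]))           # 0x3c, row x = 6, column y = d of Table 3.2
print(shift_rows([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]))
```

A lookup table is the usual choice in software, whereas compact hardware designs often compute the inverse directly to save memory.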
  • 26. Figure 3.4 ShiftRows Transformation (row 0: no shift; row 1: shift 1; row 2: shift 2; row 3: shift 3)
    3.2.3 MixColumns Transformation
    The MixColumns transformation operates column by column. Each column is treated as a four-term polynomial, so the transformation takes four bytes as input and gives four bytes as output, and each input byte has an effect on all four output bytes. The columns, taken as polynomials over GF(2^8), are multiplied modulo x^4 + 1 with a fixed polynomial q(x), where [5]
    $$q(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\} \tag{3.2}$$
    Here {01}, {02} and {03} are the hexadecimal values 0x01, 0x02 and 0x03, respectively. Let the original column of the State be a(x) and the new column be b(x). The MixColumns transformation can then be represented as:
    $$b(x) = q(x) \otimes a(x) \bmod (x^4 + 1) \tag{3.3}$$
    This can be written in matrix multiplication form [4]:
    $$\begin{bmatrix} b_{0,j} \\ b_{1,j} \\ b_{2,j} \\ b_{3,j} \end{bmatrix} = \begin{bmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{bmatrix} \begin{bmatrix} a_{0,j} \\ a_{1,j} \\ a_{2,j} \\ a_{3,j} \end{bmatrix} \tag{3.4}$$
    The four bytes in the new columns after the MixColumns operation can be calculated by the expressions given in Equations (3.5) to (3.8).
  • 27.
    $$b_{0,j} = (\{02\} \bullet a_{0,j}) \oplus (\{03\} \bullet a_{1,j}) \oplus a_{2,j} \oplus a_{3,j} \tag{3.5}$$
    $$b_{1,j} = a_{0,j} \oplus (\{02\} \bullet a_{1,j}) \oplus (\{03\} \bullet a_{2,j}) \oplus a_{3,j} \tag{3.6}$$
    $$b_{2,j} = a_{0,j} \oplus a_{1,j} \oplus (\{02\} \bullet a_{2,j}) \oplus (\{03\} \bullet a_{3,j}) \tag{3.7}$$
    $$b_{3,j} = (\{03\} \bullet a_{0,j}) \oplus a_{1,j} \oplus a_{2,j} \oplus (\{02\} \bullet a_{3,j}) \tag{3.8}$$
    The diagrammatic representation of the MixColumns transformation is given in Figure 3.5.
    Figure 3.5 Diagrammatic Representation of MixColumns (each column a_{0,j} ... a_{3,j} of the State is multiplied with q(x) to give the column b_{0,j} ... b_{3,j})
    3.2.4 AddRoundKey Transformation
    In simple words, the AddRoundKey transformation XORs the output of the previous step (MixColumns in the first 9 rounds and ShiftRows in the final round) with a Round Key generated by the Key Expansion algorithm [4]. To understand AddRoundKey further, its two steps, Key Expansion and the addition of the Round Key, are important.
    3.2.4.1 Key Expansion
    For AES-128, the Key Expansion algorithm takes a 128-bit key as input and generates a key schedule. The expansion of the input key into the key schedule requires two processes, namely SubWord and RotWord [5], which are explained in detail later in this section. The 16-byte input cipher key is transferred to a word array w[i] following the pseudo-code shown below [5].
        i = 0
        while (i < 4)
            w[i] = word(key[4*i], key[4*i+1], key[4*i+2], key[4*i+3])
            i = i + 1
        end while
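Returning to the MixColumns transformation, the column expressions of Equations (3.5) to (3.8) can be sketched as follows. This is a minimal illustration in which multiplication by {02} is done with a shift and a conditional XOR with m(x), and {03} is computed as {02} plus the byte itself; the helper names are hypothetical, and the test column is taken from the cipher example in Appendix B of FIPS 197 rather than from the thesis.

```python
def mul2(value: int) -> int:
    """Multiply a GF(2^8) element by {02} modulo x^8 + x^4 + x^3 + x + 1."""
    value <<= 1
    if value & 0x100:
        value ^= 0x11B
    return value


def mul3(value: int) -> int:
    """Multiply by {03}: {03} = {02} + {01}, so add the byte to its double."""
    return mul2(value) ^ value


def mix_column(column: list) -> list:
    """Apply Equations (3.5) to (3.8) to one column [a0, a1, a2, a3]."""
    a0, a1, a2, a3 = column
    return [
        mul2(a0) ^ mul3(a1) ^ a2 ^ a3,     # b0
        a0 ^ mul2(a1) ^ mul3(a2) ^ a3,     # b1
        a0 ^ a1 ^ mul2(a2) ^ mul3(a3),     # b2
        mul3(a0) ^ a1 ^ a2 ^ mul2(a3),     # b3
    ]


def mix_columns(state: list) -> list:
    """Apply mix_column to every column of a State stored as four rows."""
    columns = [mix_column([state[r][c] for r in range(4)]) for c in range(4)]
    return [[columns[c][r] for c in range(4)] for r in range(4)]


mixed = mix_column([0xD4, 0xBF, 0x5D, 0x30])
print([hex(b) for b in mixed])   # ['0x4', '0x66', '0x81', '0xe5']
```

Since only the constants {01}, {02} and {03} occur, no general multiplier is needed, which keeps this step cheap in both software and hardware.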
  • 28. The flow diagram of the Key Expansion algorithm is given in Figure 3.6.
    Figure 3.6 Flow Diagram of Key Expansion Algorithm
    RotWord: Performs a cyclic permutation on a 4-byte word, so that [a0, a1, a2, a3] becomes [a1, a2, a3, a0], as depicted in Figure 3.7.
    Figure 3.7 Diagrammatic Representation of RotWord
    SubWord: Takes 4 bytes as input and applies the S-box substitution to each of the four bytes to give a 4-byte output. The S-box used is the same as for the SubBytes transformation.
    From the pseudo-code and the flow diagram it can be deduced:
  • 29.
- The first 4 words of the expanded key are filled with the input Cipher Key.
- Every following word is the XOR of the previous word (w[i-1]) and the word 4 positions earlier (w[i-4]).
- For words whose position is a multiple of 4, the RotWord and SubWord transformations are applied to w[i-1] and the result is XORed with the round constant Rcon[i/4] before the final XOR with w[i-4].

3.2.4.2 Round Key Addition

Once the Round Key, $k_{i,j}$, is generated, it is added to the output of the previous transformation, $a_{i,j}$, with a simple bitwise XOR, giving $b_{i,j} = a_{i,j} \oplus k_{i,j}$. The diagrammatic representation of the Round Key addition is shown in Figure 3.8.

Figure 3.8 Diagrammatic Representation of AddRoundKey Transformation

3.3 Decryption

To decrypt data using the AES algorithm, the Cipher transformations stated above are inverted and applied in reverse order. The transformations used in the decryption algorithm, or Inverse Cipher, are InverseSubBytes, InverseShiftRows, InverseMixColumns and AddRoundKey.
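Returning to the Key Expansion of Section 3.2.4.1, the three rules listed for the key schedule can be sketched in Python as follows. This is only an illustrative sketch, not the thesis implementation: the function names (`gf_mul`, `build_sbox`, `expand_key_128`) are assumptions made here, and the S-box is derived arithmetically (inverse in GF(2^8) followed by the standard affine map) instead of being stored as a table.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox():
    """Derive the AES S-box: multiplicative inverse, then the affine map."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):  # x^254 is the inverse of x in GF(2^8)
                inv = gf_mul(inv, x)
        s = inv
        for shift in range(1, 5):  # XOR with the four left rotations
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()


def rot_word(word):
    """Cyclic permutation (a0, a1, a2, a3) -> (a1, a2, a3, a0)."""
    return word[1:] + word[:1]


def sub_word(word):
    """Apply the S-box to each of the four bytes."""
    return [SBOX[b] for b in word]


def expand_key_128(key):
    """Return the 44 words of the AES-128 key schedule (hypothetical helper)."""
    assert len(key) == 16
    w = [list(key[4 * i:4 * i + 4]) for i in range(4)]  # first 4 words = key
    rcon = 1
    for i in range(4, 44):
        temp = w[i - 1]
        if i % 4 == 0:
            temp = sub_word(rot_word(temp))
            temp = [temp[0] ^ rcon] + temp[1:]  # XOR with Rcon[i/4]
            rcon = gf_mul(rcon, 2)
        w.append([a ^ b for a, b in zip(w[i - 4], temp)])
    return w
```

Words w[4k] to w[4k+3] together form the Round Key for round k. Computing the S-box at start-up keeps the sketch short; a hardware design would normally use a stored table instead.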
  • 30.
The overall flow of the Decryption is the same as that of the Encryption, except that all the transformations are the inverses of the transformations in the Encryption algorithm [5]. The flow diagram of the Inverse Cipher is given in Figure 3.9.

Figure 3.9 Flow Diagram Representation of AES Decryption

3.3.1 InverseShiftRows Transformation

As evident from its name, this is the inverse of the ShiftRows transformation. In the InverseShiftRows transformation the bytes are shifted cyclically to the right instead of to the left as in the ShiftRows transformation. The first row, r = 0, is not shifted. The bottom three rows are shifted right by offsets of 1, 2 and 3 bytes, respectively. The diagrammatic representation of the InverseShiftRows transformation is shown in Figure 3.10.
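The right shifts of the InverseShiftRows transformation can be illustrated with a short Python sketch. The State is assumed here to be a list of four rows of four bytes each, and the function names are chosen for illustration only.

```python
def shift_rows(state):
    """ShiftRows: row r is rotated cyclically to the left by r bytes."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def inv_shift_rows(state):
    """InverseShiftRows: row r is rotated cyclically to the right by r bytes."""
    result = []
    for r, row in enumerate(state):
        if r == 0:
            result.append(row[:])            # first row is not shifted
        else:
            result.append(row[-r:] + row[:-r])
    return result


# Example State; entry "rc" stands for row r, column c.
state = [
    [0, 1, 2, 3],
    [10, 11, 12, 13],
    [20, 21, 22, 23],
    [30, 31, 32, 33],
]
restored = inv_shift_rows(shift_rows(state))
```

Applying `inv_shift_rows` after `shift_rows` returns the original State, which is the property the Inverse Cipher relies on.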
  • 31.
Figure 3.10 Diagrammatic Representation of InverseShiftRows Transformation

3.3.2 InverseSubBytes Transformation

The inverse of the SubBytes transformation requires an inverse S-box, which is used for one-to-one byte substitution. Table 3.3 shows the inverse S-box [5].

x\y   0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
 0   52 09 6a d5 30 36 a5 38 bf 40 a3 9e 81 f3 d7 fb
 1   7c e3 39 82 9b 2f ff 87 34 8e 43 44 c4 de e9 cb
 2   54 7b 94 32 a6 c2 23 3d ee 4c 95 0b 42 fa c3 4e
 3   08 2e a1 66 28 d9 24 b2 76 5b a2 49 6d 8b d1 25
 4   72 f8 f6 64 86 68 98 16 d4 a4 5c cc 5d 65 b6 92
 5   6c 70 48 50 fd ed b9 da 5e 15 46 57 a7 8d 9d 84
 6   90 d8 ab 00 8c bc d3 0a f7 e4 58 05 b8 b3 45 06
 7   d0 2c 1e 8f ca 3f 0f 02 c1 af bd 03 01 13 8a 6b
 8   3a 91 11 41 4f 67 dc ea 97 f2 cf ce f0 b4 e6 73
 9   96 ac 74 22 e7 ad 35 85 e2 f9 37 e8 1c 75 df 6e
 a   47 f1 1a 71 1d 29 c5 89 6f b7 62 0e aa 18 be 1b
 b   fc 56 3e 4b c6 d2 79 20 9a db c0 fe 78 cd 5a f4
 c   1f dd a8 33 88 07 c7 31 b1 12 10 59 27 80 ec 5f
 d   60 51 7f a9 19 b5 4a 0d 2d e5 7a 9f 93 c9 9c ef
 e   a0 e0 3b 4d ae 2a f5 b0 c8 eb bb 3c 83 53 99 61
 f   17 2b 04 7e ba 77 d6 26 e1 69 14 63 55 21 0c 7d

Table 3.3 Inverse S-box: Substitution values for byte 'xy' (hexadecimal)
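Since the inverse S-box simply undoes the forward S-box, it can be derived by swapping inputs and outputs of the forward table. The sketch below assumes the standard arithmetic construction of the forward S-box (inverse in GF(2^8) plus affine map); the names `build_sbox` and `build_inverse_sbox` are illustrative and not taken from the thesis.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox():
    """Forward S-box: multiplicative inverse followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):  # x^254 is the inverse of x
                inv = gf_mul(inv, x)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


def build_inverse_sbox(sbox):
    """Swap inputs and outputs of the forward S-box."""
    inverse = [0] * 256
    for value, substituted in enumerate(sbox):
        inverse[substituted] = value
    return inverse


SBOX = build_sbox()
INV_SBOX = build_inverse_sbox(SBOX)


def inv_sub_bytes(state_bytes):
    """Apply the inverse S-box to every byte of the State."""
    return [INV_SBOX[b] for b in state_bytes]
```

The entry for byte 'xy' is found at index 16*x + y, matching the row/column layout of Table 3.3.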
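  • 32.
3.3.3 InverseMixColumns Transformation

As for the MixColumns transformation, the InverseMixColumns transformation operates column-by-column, with each column being treated as a four-term polynomial. Each of the four input bytes has an effect on all four output bytes. These columns, taken as polynomials over GF(2^8), are multiplied modulo $x^4 + 1$ with a fixed polynomial, say $q^{-1}(x)$, which is given in Equation 3.9 [5]:

$$q^{-1}(x) = \{0b\}x^3 + \{0d\}x^2 + \{09\}x + \{0e\} \qquad (3.9)$$

where {0b}, {0d}, {09} and {0e} are the hexadecimal values 0x0b, 0x0d, 0x09 and 0x0e, respectively. Let the new column (in the State) be $b^{-1}(x)$ and the original column be $a^{-1}(x)$. The InverseMixColumns transformation can then be represented as:

$$b^{-1}(x) = q^{-1}(x) \otimes a^{-1}(x) \qquad (3.10)$$

This can be written in matrix multiplication form [4]:

$$\begin{bmatrix} b_{0,j} \\ b_{1,j} \\ b_{2,j} \\ b_{3,j} \end{bmatrix} = \begin{bmatrix} 0e & 0b & 0d & 09 \\ 09 & 0e & 0b & 0d \\ 0d & 09 & 0e & 0b \\ 0b & 0d & 09 & 0e \end{bmatrix} \begin{bmatrix} a_{0,j} \\ a_{1,j} \\ a_{2,j} \\ a_{3,j} \end{bmatrix} \qquad (3.11)$$

The four bytes in the new column after the InverseMixColumns operation can be calculated by the expressions given in Equations (3.12) to (3.15) [5].

$$b_{0,j} = (\{0e\} \cdot a_{0,j}) \oplus (\{0b\} \cdot a_{1,j}) \oplus (\{0d\} \cdot a_{2,j}) \oplus (\{09\} \cdot a_{3,j}) \qquad (3.12)$$
$$b_{1,j} = (\{09\} \cdot a_{0,j}) \oplus (\{0e\} \cdot a_{1,j}) \oplus (\{0b\} \cdot a_{2,j}) \oplus (\{0d\} \cdot a_{3,j}) \qquad (3.13)$$
$$b_{2,j} = (\{0d\} \cdot a_{0,j}) \oplus (\{09\} \cdot a_{1,j}) \oplus (\{0e\} \cdot a_{2,j}) \oplus (\{0b\} \cdot a_{3,j}) \qquad (3.14)$$
$$b_{3,j} = (\{0b\} \cdot a_{0,j}) \oplus (\{0d\} \cdot a_{1,j}) \oplus (\{09\} \cdot a_{2,j}) \oplus (\{0e\} \cdot a_{3,j}) \qquad (3.15)$$

The diagrammatic representation of the InverseMixColumns transformation is given in Figure 3.11.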
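The matrix form of InverseMixColumns can be checked numerically with a few lines of Python. The sketch below is illustrative only: the helper names are assumptions, and the forward MixColumns is included solely to show that the two transformations cancel each other.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


MIX_MATRIX = [
    [0x02, 0x03, 0x01, 0x01],
    [0x01, 0x02, 0x03, 0x01],
    [0x01, 0x01, 0x02, 0x03],
    [0x03, 0x01, 0x01, 0x02],
]

INV_MIX_MATRIX = [
    [0x0E, 0x0B, 0x0D, 0x09],
    [0x09, 0x0E, 0x0B, 0x0D],
    [0x0D, 0x09, 0x0E, 0x0B],
    [0x0B, 0x0D, 0x09, 0x0E],
]


def matrix_times_column(matrix, column):
    """Matrix-vector product where addition is XOR and products are in GF(2^8)."""
    output = []
    for row in matrix:
        acc = 0
        for coefficient, byte in zip(row, column):
            acc ^= gf_mul(coefficient, byte)
        output.append(acc)
    return output


def mix_column(column):
    return matrix_times_column(MIX_MATRIX, column)


def inv_mix_column(column):
    return matrix_times_column(INV_MIX_MATRIX, column)


original = [0xDB, 0x13, 0x53, 0x45]
mixed = mix_column(original)
recovered = inv_mix_column(mixed)   # equals `original` again
```

The same `matrix_times_column` helper serves both directions; only the coefficient matrix changes.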
  • 33.
Figure 3.11 Diagrammatic Representation of InverseMixColumns

3.3.4 AddRoundKey Transformation

The AddRoundKey transformation is the same for decryption and encryption, since the Key Expansion algorithm does not change and the addition of the Round Key is a simple XOR, which is its own inverse. Please see Section 3.2.4 for a detailed explanation.
  • 34.
  • 35.
4 Encryption & Authentication Modes

A Cipher encrypts or decrypts a single block of data; the way in which a Cipher is applied repeatedly over a message consisting of many blocks is known as a Mode of Operation for that Cipher. Many modes of operation for AES have been introduced over the years, and those relevant to this Thesis are discussed in this Chapter. Along with modes of operation for encryption, authentication modes are also discussed.

Authentication has long been an integral part of reliable information exchange. Many attacks involve the attacker injecting messages into the data in question, so there is a need to verify whether the data was sent by the claimed sender or by someone else. A mode of operation that provides both encryption and authentication is known as Authenticated Encryption (AE) [32]. AES also has various modes which provide AE. In this chapter two AE modes of AES, namely CCM (Counter with CBC-MAC) and GCM (Galois/Counter Mode), are discussed; they are the main comparison platforms for the Crypto-Cores implemented in this Research Thesis.

4.1 Encryption Modes

4.1.1 Electronic Codebook (ECB)

Electronic Codebook (ECB) is the simplest mode of operation for AES: a large message is divided into blocks according to the block size of the cipher, i.e. 128 bits in this case, and each block is encrypted/decrypted separately [32]. For example, consider Figure 4.1, which explains the ECB mode diagrammatically. The message is divided into blocks, where each block of Plaintext/Ciphertext is 128 bits in size and is passed through the Cipher/Inverse Cipher separately with an identical key to produce an output Ciphertext/Plaintext block of 128 bits.
  • 36.
Figure 4.1 Diagrammatic Representation of ECB Mode

4.1.2 Cipher Block Chaining (CBC)

Cipher Block Chaining (CBC) is an AES mode of operation in which each Plaintext block is XORed with the previous Ciphertext block before encryption. For decryption, the output of the inverse cipher is XORed with the previous Ciphertext block to obtain the Plaintext block. Figure 4.2 shows the diagrammatic representation. For the first block, the Plaintext is XORed with an Initialization Vector (IV). An IV is a fixed-size input, in this case 128 bits, which can take any random value.

Figure 4.2 Diagrammatic Representation of CBC mode
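The difference between ECB and CBC can be made concrete with a small Python sketch. To keep it self-contained, AES is replaced by a deliberately trivial and insecure invertible 16-byte function (`toy_encrypt`, `toy_decrypt`); this stand-in and all function names are assumptions for illustration, and only the chaining logic corresponds to the modes described above.

```python
BLOCK = 16  # block size in bytes (128 bits)


def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def toy_encrypt(block, key):
    """Insecure stand-in for the Cipher: XOR with the key, rotate by one byte."""
    mixed = xor_bytes(block, key)
    return mixed[1:] + mixed[:1]


def toy_decrypt(block, key):
    """Inverse of toy_encrypt (stand-in for the Inverse Cipher)."""
    unrotated = block[-1:] + block[:-1]
    return xor_bytes(unrotated, key)


def split_blocks(data):
    assert len(data) % BLOCK == 0, "sketch assumes whole blocks, no padding"
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]


def ecb_encrypt(plaintext, key):
    """ECB: every block is encrypted independently with the same key."""
    return b"".join(toy_encrypt(b, key) for b in split_blocks(plaintext))


def cbc_encrypt(plaintext, key, iv):
    """CBC: each block is XORed with the previous ciphertext block first."""
    previous, out = iv, []
    for block in split_blocks(plaintext):
        previous = toy_encrypt(xor_bytes(block, previous), key)
        out.append(previous)
    return b"".join(out)


def cbc_decrypt(ciphertext, key, iv):
    """CBC decryption: inverse cipher, then XOR with the previous ciphertext."""
    previous, out = iv, []
    for block in split_blocks(ciphertext):
        out.append(xor_bytes(toy_decrypt(block, key), previous))
        previous = block
    return b"".join(out)


key = bytes(range(16))
iv = bytes(range(16, 32))
message = b"A" * 32                      # two identical plaintext blocks
ecb = ecb_encrypt(message, key)
cbc = cbc_encrypt(message, key, iv)
```

With two identical plaintext blocks, ECB produces two identical ciphertext blocks, whereas the chaining in CBC makes them differ.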
  • 37.
4.1.3 Counter Mode (CTR)

Counter Mode (CTR) is a mode of operation for AES which converts a Block Cipher into a Stream Cipher. A 96-bit IV is given as input to a counter function. This can be any function that generates a non-repeating sequence of numbers, although usually an increment-by-one counter is used. The 32-bit counter value is appended to the IV, giving a new 128-bit string for each iteration. These strings are then used as input to the cipher to generate a keystream, a stream of pseudo-random values, which is then XORed with the Plaintext to generate the Ciphertext. For decryption, the generated keystream is XORed with the Ciphertext to recover the Plaintext. Figure 4.3 shows the diagrammatic representation of the CTR mode.

Figure 4.3 Diagrammatic Representation of CTR Mode

4.2 Counter with CBC-MAC (CCM)

4.2.1 Introduction

As its name suggests, the Counter with CBC-MAC (CCM) mode uses the CBC mode to generate a MAC, and then the CTR mode is applied over the message and the MAC to encrypt them. CCM is therefore a mode that applies Authenticated Encryption (AE) to the message. CCM can only be applied to block ciphers with a block size of 128 bits.
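The keystream construction of the CTR mode in Section 4.1.3 can be sketched as follows. Because CTR only ever uses the cipher in the forward direction, the sketch substitutes a truncated SHA-256 call for AES so that it runs with the Python standard library alone. This substitution, the starting counter value of 1 and the function names are assumptions for illustration and not part of the thesis design.

```python
import hashlib


def block_function(key, block):
    """Stand-in for the AES Cipher: maps a 16-byte block to 16 bytes (not AES)."""
    return hashlib.sha256(key + block).digest()[:16]


def ctr_keystream(key, iv, n_blocks):
    """Yield one 16-byte keystream block per counter value."""
    assert len(iv) == 12, "96-bit IV"
    for counter in range(1, n_blocks + 1):
        counter_block = iv + counter.to_bytes(4, "big")  # IV || 32-bit counter
        yield block_function(key, counter_block)


def ctr_crypt(key, iv, data):
    """Encrypt or decrypt: XOR the data with the keystream (same operation)."""
    n_blocks = (len(data) + 15) // 16
    stream = b"".join(ctr_keystream(key, iv, n_blocks))
    return bytes(d ^ s for d, s in zip(data, stream))


key = b"k" * 16
iv = b"\x00" * 12
plaintext = b"counter mode turns a block cipher into a stream cipher"
ciphertext = ctr_crypt(key, iv, plaintext)
decrypted = ctr_crypt(key, iv, ciphertext)
```

Encryption and decryption are the same function, and the message length does not need to be a multiple of the block size because surplus keystream bytes are simply discarded.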
  • 38.
4.2.2 Algorithm

The MAC is generated using the CBC-MAC, i.e. by applying the CBC mode of encryption to the message with an IV of 128 bits of 0's. Each block in CBC mode depends on the correct encryption of the previous block, so if an intermediate block is changed, the change propagates to the last block. The last block is used as the MAC, which is sent along with the message and is used to check whether the message is authentic. The diagrammatic representation in Figure 4.4 shows the working of the CBC-MAC.

Figure 4.4 Diagrammatic Representation of CBC-MAC

The generated MAC is encrypted along with the message using the CTR mode.

4.3 Galois/Counter Mode (GCM)

4.3.1 Introduction

Galois/Counter Mode (GCM) is a block cipher mode of operation that uses universal hashing over a binary Galois field to provide Authenticated Encryption (AE). It can be implemented in hardware to achieve high speeds with low cost and low latency. There is a growing need for a mode of operation that can efficiently provide authenticated encryption at high speeds without too much area cost, and is free of Intellectual
  • 39.
Property (IP) restrictions [33]. Since the possible use case for this research thesis is Industry 4.0, achieving AE at high data rates is essential. The mode must admit pipelined implementations and have minimal computational latency in order to be useful at high data rates. GCM has the added advantage that it can act as a stand-alone MAC when encryption is not required. This is a feature which is not available in any of the other proposed AE implementations.

4.3.2 Algorithm

GCM implements the Galois mode of authentication with an underlying Cipher, usually AES, which is also used in this research. The underlying AES is operated in CTR mode [33]. The GCM algorithm has two core functions, namely GHASH and GCTR, which are explained below.

4.3.2.1 GHASH

The GHASH function is essentially the finite field multiplication of the input with a hashing key $H$ over GF(2^128). The hashing key $H$ can be treated as a fixed 128-bit constant, since it does not change as long as the Cipher Key does not change [7]. It is calculated by applying the AES block to 128 bits of 0's.

Algorithmically speaking, take $X$ as the input bit string, where the length of $X$ is 128·m for some integer m, $H$ as the hash subkey and the block $Y_m$ as the output. The following steps explain the algorithm [33]:

- Let $X_1, X_2, \ldots, X_{m-1}, X_m$ represent the unique sequence of blocks such that $X = X_1 \,\|\, X_2 \,\|\, \cdots \,\|\, X_{m-1} \,\|\, X_m$.²
- Let $Y_0$ be the "zero block", i.e. a bit string consisting of 128 binary 0's.
- For $i = 1, \ldots, m$, let $Y_i = (Y_{i-1} \oplus X_i) \cdot H$, where "$\cdot$" indicates multiplication over the finite field.
- The $Y_m$ obtained at the end is the output block, i.e. the MAC.

² || represents the concatenation of two elements.
  • 40.
The following block-diagram representation gives a better understanding of this algorithm.

Figure 4.5 Diagrammatic Representation of the GHASH function

The multiplication over the finite field GF(2^128) can be explained by the following algorithm [33]:

- Let $X$ be the 128-bit block that has to be hashed, containing the bits $x_0, x_1, \ldots, x_{127}$.
- Let $U_0 = H$ be the 128-bit Hash Key, i.e. 128 bits of 0's ciphered through the AES block.
- Let $Z_0$ be 128 bits of 0's, and $R$ be a constant 128-bit string with the value $11100001 \,\|\, 0^{120}$.
- For i = 0 to 127:

$$Z_{i+1} = \begin{cases} Z_i & \text{if } x_i = 0 \\ Z_i \oplus U_i & \text{if } x_i = 1 \end{cases} \qquad (4.1)$$

$$U_{i+1} = \begin{cases} U_i \gg 1 & \text{if } \mathrm{LSB}(U_i) = 0 \\ (U_i \gg 1) \oplus R & \text{if } \mathrm{LSB}(U_i) = 1 \end{cases} \qquad (4.2)$$

- After these operations are done, the 128 bits of $Z_{128}$ are the output of the multiplication.
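The bit-by-bit multiplication over GF(2^128) and the GHASH recurrence built on it can be sketched in Python as below. This is a minimal software sketch for illustration, not the hardware design of this thesis: blocks are represented as Python integers obtained from big-endian bytes, so that bit $x_0$ is the most significant bit, and the function names are assumptions.

```python
R = 0xE1 << 120  # the constant 11100001 || 0^120


def gf128_mul(x, h):
    """Multiply two 128-bit blocks in GF(2^128), one bit of x per iteration."""
    z = 0          # Z_0 = 0^128
    u = h          # U_0 = H
    for i in range(128):
        if (x >> (127 - i)) & 1:   # bit x_i, counted from the leftmost bit
            z ^= u                 # Z_{i+1} = Z_i xor U_i
        if u & 1:                  # LSB(U_i) = 1
            u = (u >> 1) ^ R
        else:
            u >>= 1
    return z                       # Z_128


def ghash(h, blocks):
    """Y_i = (Y_{i-1} xor X_i) * H, starting from the zero block Y_0."""
    y = 0
    for x in blocks:
        y = gf128_mul(y ^ x, h)
    return y


# In this bit ordering the field element "1" is the block with only x_0 set.
ONE = 1 << 127
h_example = int.from_bytes(bytes(range(16)), "big")
assert gf128_mul(ONE, h_example) == h_example
```

Each loop iteration corresponds to one clock cycle of a bit-serial multiplier; a bit-parallel multiplier evaluates all 128 iterations combinationally.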
  • 41.
Figure 4.6 shows the diagrammatic representation of the GF(2^128) multiplication operation.

Figure 4.6 Diagrammatic Representation of GF(2^128) Multiplication
  • 42.
The finite field multiplication in GHASH has two possible implementations, i.e. a bit-serial implementation and a bit-parallel implementation. Simply explained, the core of the GHASH architecture is a 128-bit multiplier over GF(2^128), which multiplies two 128-bit operands to generate a 128-bit output. One operand of the GF multiplier is the hash subkey H, which can be treated as a fixed 128-bit constant because it does not change as long as the 128-bit key does not change. For the second operand two values have to be considered: the 128-bit additional authenticated data (AAD) block sequence and the Ciphertext block sequence [7]. Figure 4.7 shows the diagrammatic representation of the hashing sequence.

Figure 4.7 Diagrammatic Representation of the Hashing sequence

The 128-bit AAD blocks are fed into the GHASH through one of the two inputs of the XOR gates. The 128-bit Ciphertext block sequence is fed to the same input of the XOR gates following the AAD. Meanwhile, the intermediate hash value is fed back to the other input of the XOR gates to generate the other operand for the GF multiplier. If the AAD hashing takes m clock cycles and the Ciphertext block hashing takes n clock cycles, then the latency is m+n+1 cycles for a bit-parallel multiplier and 128*(m+n+1) cycles for a bit-serial multiplier [33]. The advantage of the bit-serial multiplier over the bit-parallel multiplier is that it uses fewer logic elements, but at the same time it adds more latency to the system.
  • 43.
4.3.2.2 GCTR

GCTR is the implementation of the previously explained CTR mode with a particular incrementing function for generating the necessary sequence of counter blocks. GCM consists of an underlying block cipher and a Galois Field Multiplier, with which authenticated encryption and authenticated decryption are realized. The cipher needs to have a block size of 128 bits. For encryption, an initial counter is first derived from an Initialization Vector (IV). The initial counter value is then incremented, encrypted and XORed with the first plaintext block. For subsequent plaintext blocks, the counter is incremented again and then encrypted. The underlying cipher is only used in the encryption direction. GCM allows pre-computation of the block cipher function if the IV is known ahead of time [33].

4.3.3 Authenticated Encryption

Now that the working of the GHASH and GCTR functions is understood, they can be combined to explain the authenticated encryption and authenticated decryption that take place inside GCM. For the authenticated encryption, consider a 128-bit AES as the underlying block cipher. The inputs are a Plaintext P, an initialization vector IV and additional authenticated data A. The outputs are the Ciphertext C and the authentication MAC. The following steps explain the authenticated encryption algorithm [33]:

- Let H be the hash subkey, i.e. the 128 bits of 0's ciphered through the block cipher.
- Define the initial counter block as a 128-bit string consisting of 96 bits of any value (the IV), followed by 31 '0' bits and one '1' bit.
- Let C be the XOR of the encrypted counter blocks of the GCTR with the Plaintext data P.
- Let the MAC be the GHASH of A || C, XORed with the encrypted initial counter block.
- The resulting C is the Ciphertext and the resulting MAC is the authentication MAC.

The flow diagram in Figure 4.8 shows how the Authenticated Encryption algorithm works.
  • 44.
Figure 4.8 Diagrammatic Representation of Authenticated Encryption

4.3.4 Authenticated Decryption

The inputs for the authenticated decryption are the 128-bit block cipher AES, the initialization vector IV, the Ciphertext C, the additional authenticated data A and the MAC, whereas the output is either the plaintext P or the indication of inauthenticity, FAIL. The following steps explain the algorithm of the authenticated decryption [33]:

- Let H be the hash subkey, i.e. the 128 bits of 0's ciphered through the block cipher.
- Define the initial counter block as a 128-bit string consisting of 96 bits of any value (the IV), followed by 31 '0' bits and one '1' bit.
- Let P be the XOR of the encrypted counter blocks of the GCTR with the ciphered data C.
- Let MAC' be the GHASH of A || C, XORed with the encrypted initial counter block.
- If MAC' = MAC, then return P, which is the resultant plaintext; else return FAIL.

Figure 4.9 shows the diagrammatic representation of Authenticated Decryption.
  • 45.
Figure 4.9 Diagrammatic Representation of Authenticated Decryption
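How GCTR and GHASH combine into authenticated encryption and decryption can be sketched end-to-end in Python. Several assumptions are made here and none of them is taken from the thesis design: AES is replaced by a truncated SHA-256 stand-in (GCM only uses the cipher in the forward direction), the GHASH input is padded and terminated with a length block as in the published GCM specification, and all function names are illustrative.

```python
import hashlib

R = 0xE1 << 120


def block_function(key, block):
    """Stand-in for the AES Cipher (not AES): 16 bytes in, 16 bytes out."""
    return hashlib.sha256(key + block).digest()[:16]


def gf128_mul(x, h):
    z, u = 0, h
    for i in range(128):
        if (x >> (127 - i)) & 1:
            z ^= u
        u = (u >> 1) ^ R if u & 1 else u >> 1
    return z


def ghash(h, data):
    y = 0
    for i in range(0, len(data), 16):
        y = gf128_mul(y ^ int.from_bytes(data[i:i + 16], "big"), h)
    return y


def pad16(data):
    return data + bytes(-len(data) % 16)


def inc32(block):
    """Increment the rightmost 32 bits of a counter block."""
    counter = (int.from_bytes(block[12:], "big") + 1) % 2**32
    return block[:12] + counter.to_bytes(4, "big")


def gctr(key, counter_block, data):
    out = bytearray()
    for i in range(0, len(data), 16):
        stream = block_function(key, counter_block)
        out += bytes(d ^ s for d, s in zip(data[i:i + 16], stream))
        counter_block = inc32(counter_block)
    return bytes(out)


def compute_mac(key, iv, aad, ciphertext):
    h = int.from_bytes(block_function(key, bytes(16)), "big")  # hash subkey H
    initial = iv + b"\x00\x00\x00\x01"       # 96-bit IV || 0^31 || 1
    lengths = (8 * len(aad)).to_bytes(8, "big") + (8 * len(ciphertext)).to_bytes(8, "big")
    s = ghash(h, pad16(aad) + pad16(ciphertext) + lengths)
    mask = block_function(key, initial)
    return bytes(a ^ b for a, b in zip(s.to_bytes(16, "big"), mask))


def gcm_encrypt(key, iv, plaintext, aad):
    initial = iv + b"\x00\x00\x00\x01"
    ciphertext = gctr(key, inc32(initial), plaintext)
    return ciphertext, compute_mac(key, iv, aad, ciphertext)


def gcm_decrypt(key, iv, ciphertext, aad, mac):
    if compute_mac(key, iv, aad, ciphertext) != mac:
        return None                          # FAIL: data is not authentic
    initial = iv + b"\x00\x00\x00\x01"
    return gctr(key, inc32(initial), ciphertext)
```

Decryption recomputes the MAC over the received Ciphertext before releasing any plaintext, so a single flipped bit in the Ciphertext, the additional authenticated data or the MAC leads to FAIL (`None` in this sketch).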
  • 46.
  • 47.
5 Literature Review

The implementation of the previously discussed algorithms in hardware, in the form of Crypto-Cores, has been studied in the recent past. In this chapter the previous studies are discussed. The hardware implementation of an AES-GCM crypto-core consists of two core blocks:

- An AES core
- A Galois Field Multiplier core

The previous studies on the implementation of these core blocks are discussed further in this chapter.

5.1 AES Designs

The AES algorithm itself has four transformations, which for a 128-bit key have to be carried out 10 times, as discussed earlier. When implemented in an iterative structure, this takes one clock cycle to complete each transformation of each round. Although this implementation is simple, the efficiency of the system is very low. Further studies were done to introduce pipelining when implementing AES in hardware. A lot of research has been done in this area, and the studies have provided better solutions for the hardware implementation of the AES architecture. An efficient hardware implementation shows good data throughput with very little area usage. Throughput can be defined as the number of data bits processed per unit of time.

The existing proposed designs for the hardware implementation of the AES architecture can be classified into four groups:

- Iterative Loop Structure based Designs
  • 48.
- Memory-Based Designs
- Unfolded Structure or Parallel Structure based Designs
- Sub-Pipelined Structure based Designs

5.1.1 Iterative Loop Structure

A design based on the Iterative Loop Structure is the simplest form of implementing the AES architecture in hardware. It has very low hardware utilization, since it uses a single hardware core for one round, which is reused for all ten rounds. If all of the round transformations (SubBytes, ShiftRows, MixColumns and AddRoundKey) are done in a single clock cycle, so that the encryption of a 128-bit block takes ten clock cycles, the system shows a low clock frequency.

Figure 5.1 Iterative Loop Implementation

On the other hand, it is discussed in [18] that if each round transformation is done in a separate clock cycle using intermediate registers, a higher operating frequency is achievable, but the number of clock cycles required to encrypt a block also increases 4 times. Thus, in both cases, the achievable throughput is not that impressive. Pipelined structures have been implemented in many other studies and compared with other structures.

5.1.2 Memory-Based AES Hardware Implementation Designs

One example of this is a Memory-Based structure of AES where, instead of utilizing FPGA logic units, memory blocks are utilized to perform the round transformations.
  • 49.
The S-box in the SubBytes transformation and the column multiplication in the MixColumns transformation are implemented in internal memory blocks instead of FPGA logic units. This is possible because each resultant element in these round transformations depends on a single element of the data block. ShiftRows is a fixed operation and can be achieved by routing the values to the appropriate memory blocks. Only the addition part of the MixColumns transformation and the AddRoundKey transformation are done using logic units. Figure 5.2 shows how the blocks are arranged within a round.

Figure 5.2 Diagrammatic Representation of Memory Based AES structure

In [9] a memory-based design was suggested for an unfolded AES core where all the rounds are processed in parallel. The logic unit utilization is reduced, but the overall frequency depends on the operating frequency of the internal memory, and furthermore the memory utilization in this architecture is quite high.

In [16], a fully synchronous, memory-based, single-chip FPGA implementation of the AES Standard, the Rijndael encryption algorithm, is presented. Design partitioning allowed for an iterative loop structure, where the block cipher was implemented using the Electronic Code Book (ECB) mode of operation. The encryption RTL design focuses on a memory-based bite-sized arithmetic pipeline structure that processes one round at a time.
  • 50.
In [17] a comparison of AES hardware implementations was introduced, using either a composite field algorithm or Block RAM to realize the SubBytes and MixColumns modules. With the composite field algorithm a lower efficiency (throughput/area) was realized, whereas for the Block RAM based design the maximum frequency was limited to the maximum frequency of the Block RAMs.

5.1.3 Unfolded / Parallel Structure Based Designs

A parallel structure based design is used to achieve very high throughput, but accordingly the area cost is very high as well. All the rounds are performed in a single iteration, which means that the loops are unfolded, or unrolled, and all ten cores operate in parallel.

Figure 5.3 Unrolled AES Structure

Unrolled AES structures have been widely used for high-throughput implementations where hardware cost is not an issue. Studies like [7] and [13] have implemented unrolled AES architectures for Galois/Counter Mode. Since the parallel structure of AES computes all the rounds in one clock cycle, it shows very high throughput, even though the achievable frequency is not that high.

5.1.4 Sub-Pipelined or Stage-Level Pipelined Structure

A sub-pipelined structure is one where the internal blocks of a round in an AES structure are pipelined. In [10], an architecture for pipelining the AES rounds efficiently is introduced. It is suggested that multiple packets be processed in parallel and that registers be introduced in the critical path of the stages. This increases the achievable operating frequency of the system, at the price of a higher latency, with significantly less area usage on the hardware and very little internal memory usage. The suggested system is considerably faster than a normal iterative structure and is known as a sub-pipelined structure. This system was
  • 51.
introduced because simple pipelining was not efficient for the CBC mode due to its feedback nature.

The sub-pipelined structure works by pipelining the internal transformations of the AES algorithm. This is done by adding registers to the critical paths. In [10], stage pipelining of an AES system was introduced for an AES-CCM based design. The basic idea is to insert registers into the critical path. This allows 4 blocks of data to be encrypted by one AES core. The flow diagram of the encryption data-path is shown in Figure 5.4.

Figure 5.4 (a) Pipelined AES encryption datapath, (b) Pipelined Key Scheduler

In the stage for reg2, the SubBytes and ShiftRows transformations are implemented. SubBytes can be implemented in two ways. The first way is a 256-byte lookup table (LUT) and the second way is to calculate the SubBytes transformation logically. Calculating it logically would have a substantial area cost and also a larger latency, therefore it is easier, cheaper and faster to implement an LUT. ShiftRows is implemented simply by routing the outputs of the S-box appropriately to the MixColumns transformation, and thus does not require a separate step.
  • 52.
In the MixColumns transformation, addition and multiplication take place in GF(2^8). As previously discussed, the MixColumns transformation requires the multiplication of the polynomial with {02} and {03}. The multiplication of the polynomial with {03} can be achieved by multiplying the polynomial with {02} and then adding the original polynomial to the result; therefore, only the multiplication with {02} needs to be implemented. The MixColumns transformation is achieved in three stages:

- Multiplication of the polynomial with {02}
- Addition process
- AddRoundKey

The 4-stage pipelined structure works in such a way that 4 blocks of data can be computed, each offset by one clock cycle. When one stage has computed the result for one block, it is ready to take the next block of data. In this way four pipelines are working, as shown in Figure 5.5.

Figure 5.5 Stages for Sub-Pipelining

It takes 40 clock cycles to compute one block (10 rounds of 4 stages each), but four blocks are processed at the same time. This increases the operating frequency of the system with very little area cost [10].
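The observation that only the multiplication with {02} needs dedicated logic can be illustrated in Python. The helper names `xtime` and `mul3` are assumptions chosen for this sketch; it models the arithmetic only, not the pipeline registers.

```python
def xtime(byte):
    """Multiply a byte by {02} in GF(2^8): shift left, reduce if bit 8 is set."""
    byte <<= 1
    if byte & 0x100:
        byte ^= 0x11B   # reduction by x^8 + x^4 + x^3 + x + 1
    return byte


def mul3(byte):
    """Multiply by {03} as ({02} * byte) XOR byte, reusing xtime."""
    return xtime(byte) ^ byte


def mix_column(column):
    """MixColumns for one column using only xtime and XOR."""
    a0, a1, a2, a3 = column
    return [
        xtime(a0) ^ mul3(a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ mul3(a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ mul3(a3),
        mul3(a0) ^ a1 ^ a2 ^ xtime(a3),
    ]


mixed = mix_column([0xDB, 0x13, 0x53, 0x45])
```

In a staged design the `xtime` results would be registered in the first stage and the XOR network would follow in the addition stage.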
  • 53.
5.2 Galois Field Multiplier (GFM) Designs

In recent years GCM has been increasingly adopted for hardware, since it has proven to be fast and efficient. Since GCM can be parallelized and pipelined (unlike CBC, due to its feedback nature), it is very desirable where hardware implementations are concerned. Many ideas have been suggested to implement GCM as a Crypto-Core, and the main computational complexity in GCM, compared to any other mode, is usually the multiplication in the GF(2^128) field for hashing. Conventionally two methods have been suggested to implement the finite field multiplication required for hashing in hardware:

- Bit-Serial Multiplier
- Bit-Parallel Multiplier

The Bit-Serial Multiplier is an implementation of the multiplier in which each iteration of Equations 4.1 and 4.2 is calculated serially. This design trades latency for very low area usage on the hardware.

The Bit-Parallel Multiplier calculates the complete product in 1 clock cycle, but the hardware cost of handling the 128-bit operands in GF(2^128) in parallel is very high. The high complexity of the multiplier also reduces the achievable operating frequency of the system.

Many studies have been done to find a suitable solution for implementing Galois Field multiplication in hardware. In [8] a method for implementing a parallel multiplier, known as the Mastrovito multiplier, was introduced. It is the most widely used method for implementing Galois Field multiplication in hardware. The design is essentially a brute-force multiplier in the sense that the matrix-vector product, shown in Equations 4.1 and 4.2, is computed like a traditional matrix multiplication. The elements are in GF(2^m), so AND and XOR gates are used for element-wise multiplication and addition, respectively. Although the Mastrovito multiplier is fast, its area usage on the hardware is considerably large, thus increasing the cost of the hardware.
  • 54.
Another method for implementing GF multiplication in hardware was introduced in [29]. The idea was to use the multiplication method introduced in [22] by A. Karatsuba, hence the name Karatsuba Multiplier. The idea behind the Karatsuba multiplier is to decrease the number of multiplication operations while increasing the number of addition operations. Since addition operations require less area than multiplications, the area cost of a Karatsuba multiplier is lower than that of a Mastrovito multiplier. This, however, comes with a delay cost, due to which Karatsuba based multipliers are much slower than Mastrovito multipliers.

Recently, a method which serves as a compromise between the Karatsuba and Mastrovito multipliers was introduced, called the Fan-Hasan (FH) Multiplier [25], [27], [28]. The FH Multiplier is considerably faster than the Karatsuba multiplier but still has a larger delay than the Mastrovito multiplier. However, its area overhead is much lower than that of the Mastrovito multiplier. A comparison between the three types of multipliers is provided in [30].

Since the Mastrovito multiplier is still the most widely used multiplier for the GFM in hardware implementations of GCM, this study also looks for a good GFM solution based on the Mastrovito multiplier. The problem with the Mastrovito multiplier is the large area overhead due to the parallel matrix-vector product. This also causes a low achievable operating frequency.

In [11] a method was introduced to use Mastrovito parallel multiplication in a much more efficient way. The idea was to introduce pipelining in the multiplier and to do the multiplication in multiple iterations instead of doing the complete multiplication in one. Using this, a higher operating frequency was achievable and the area cost was reduced, but this also introduced latency in computing the result.
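The trade-off behind the Karatsuba multiplier, fewer multiplications at the price of more additions, can be seen in a short Python sketch of polynomial multiplication over GF(2). The sketch shows one level of the Karatsuba split for 128-bit operands and compares it with the schoolbook method. It is a simplified illustration under stated assumptions: the reduction modulo the field polynomial is omitted, plain integer bit order is used, and the function names are chosen here.

```python
def clmul(a, b):
    """Schoolbook carry-less multiplication of two polynomials over GF(2)."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # addition over GF(2) is XOR
        a <<= 1
        b >>= 1
    return result


def karatsuba_clmul_128(a, b):
    """One Karatsuba level: three 64-bit products instead of four."""
    mask = (1 << 64) - 1
    a0, a1 = a & mask, a >> 64           # a = a1 * x^64 + a0
    b0, b1 = b & mask, b >> 64           # b = b1 * x^64 + b0
    low = clmul(a0, b0)
    high = clmul(a1, b1)
    # (a0 + a1)(b0 + b1) + low + high gives the two cross terms
    middle = clmul(a0 ^ a1, b0 ^ b1) ^ low ^ high
    return (high << 128) ^ (middle << 64) ^ low


a = 0x0123456789ABCDEFFEDCBA9876543210
b = 0xFFEEDDCCBBAA99887766554433221100
same = karatsuba_clmul_128(a, b) == clmul(a, b)
```

The schoolbook method would need four 64-bit sub-products for the same split; Karatsuba needs three, plus a handful of extra XORs, and the split can be applied recursively.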
  • 55.
6 Design & Implementation Results

6.1 Design

The final design was made by reviewing the various past studies and considering the best route to take. The authentication mode chosen was the Galois/Counter Mode because of its flexibility when implemented in hardware. Another advantage of GCM over any other mode is the fact that the authentication core (GMAC) can operate as a separate entity for messages that only need authentication. The design has two cores, an underlying AES core and a Galois Field Multiplier (GFM) core, that are controlled by a state machine. Each core is explained separately, followed by the state machine³.

6.1.1 AES Core

The AES core is based on the stage-level pipelining design explained in [10]. Although this design was suggested to introduce a pipelining method for the AES-CCM mode, for which normal pipelining is not possible, it has shown a good operating frequency for the system. Due to the feedback nature of the CBC mode, the result of the previous block has to be awaited, which increases the complexity of the system. In AES-GCM, since the mode of operation is the CTR mode, this can be simplified by using a single Initialization Vector and an increment function for the subsequent iterations.

The AES core takes the Initialization Vector as input data along with a Cipher Key. A data enable bit signals to the core that a data block is available for encryption. A Key Generator core, as shown in Figure 5.4(b), generates the key for encryption. A counter function is present to keep track of the number of pipelines that are being utilized.

³ See Appendix for source codes.
  • 56. 46 When the core receives data it sends the data to the first stage of the round transformation. Each stage takes 1 clock cycle, so in the next clock cycle another data block can be sent to the round transformation. On each input data block the pipeline counter is incremented (max. 4). Once all the pipelines have been computed, the AES core sends a done signal to enable the next blocks of data to be computed. Figure 6.1 shows the internal structure of the AES core.
[Figure 6.1: Internal Structure of AES core. Blocks: Key Generator, AES Round, Pipeline Counter, Round Register. Signals: Input Data, Data Enable, Round Enable, Round Data Input, Pipeline Count, Intermediate Round Output, Generated Key, Last Round.]
The AES core is the central part of the Encryption Block of the AES-GCM core. The characteristics of the Encryption Block are as follows:
- The mode of operation for the Encryption Block is the CTR mode.
- The inputs for the Encryption Block are an initialization vector IV, a Cipher Key and the input plaintext P.
- The IV is passed through an increment block incr to generate a sequence of values depending on the size of P.
- The AES core inside the Encryption Block takes the incremented IV and the Cipher Key as input and generates a Keystream as output.
- The output Keystream is then XORed with P and the result is given out as the ciphertext C.
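The Encryption Block characteristics listed above can be sketched in a few lines. This is an illustration under stated assumptions, not the thesis implementation: `toy_block_cipher` is a hypothetical stand-in for the AES core (it is NOT AES, only a keyed 128-bit function built from SHA-256 so the example runs with the standard library), and `incr` follows the usual GCM convention of incrementing the rightmost 32 bits of a counter formed from a 96-bit IV, which the text does not spell out.

```python
import hashlib

BLOCK_BYTES = 16  # 128-bit blocks


def toy_block_cipher(key: bytes, block: bytes) -> bytes:
    """Hypothetical stand-in for the AES core (NOT AES): keyed 128-bit function."""
    return hashlib.sha256(key + block).digest()[:BLOCK_BYTES]


def incr(counter: bytes) -> bytes:
    """Increment the rightmost 32 bits modulo 2^32 (GCM convention, assumed)."""
    head, tail = counter[:12], int.from_bytes(counter[12:], "big")
    return head + ((tail + 1) % (1 << 32)).to_bytes(4, "big")


def ctr_encrypt(key: bytes, iv96: bytes, plaintext: bytes) -> bytes:
    """CTR mode: C = P xor Keystream, one counter value per 128-bit block."""
    counter = iv96 + (1).to_bytes(4, "big")   # Y0, kept aside for the tag
    out = bytearray()
    for offset in range(0, len(plaintext), BLOCK_BYTES):
        counter = incr(counter)               # next counter value Yi
        keystream = toy_block_cipher(key, counter)
        chunk = plaintext[offset:offset + BLOCK_BYTES]
        out.extend(p ^ k for p, k in zip(chunk, keystream))
    return bytes(out)
```

Because only the forward direction of the block function is used, applying `ctr_encrypt` to the ciphertext with the same key and IV returns the plaintext. It also means that the counter blocks are independent of each other, which is what allows several of them to be in the pipeline at once.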
  • 57. 47 Figure 6.2 shows the diagrammatic representation of the Encryption Block.
[Figure 6.2: Diagrammatic Representation of Encryption Block. Blocks: AES Core, incr. Signals: IV, Key, P, Keystream, C.]
6.1.2 Galois Field Multiplier (GFM) Core
The GF(2^128) multiplier core takes in AAD and C sequentially and applies GF(2^128) multiplication with the hashing key. The hashing key is required to generate the Matrix U as shown in Equation 4.2. Multiplying two 128-bit operands in GF(2^128) takes a lot of area and, due to its complexity, lowers the achievable frequency. In this design it is proposed that the Hashing Matrix U is computed in advance and stored as a memory block. This is done by using a constant key and calculating the hashing key beforehand. All 128 elements of the Matrix U are calculated using Equation 4.2 and then stored as memory blocks. This solves the problem of high area usage seen in a basic Mastrovito bit-parallel multiplier. Because most of the complex operations are removed, leaving just XOR and AND operations, a considerably better operational frequency is achievable. Figure 6.3 shows the diagrammatic representation of this design.
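The idea of a precomputed multiplication matrix can be sketched as follows. This is a simplified model under stated assumptions, not Equation 4.2 itself: field elements are Python integers with bit i as the coefficient of x^i, the reduction polynomial is the GCM polynomial x^128 + x^7 + x^2 + x + 1, the bit-reflected ordering used by GCM is ignored, and the function names are hypothetical. For a constant hashing key h, entry i of the table holds h * x^i, so a multiplication reduces to AND (bit select) and XOR (accumulate).

```python
MASK128 = (1 << 128) - 1
POLY_LOW = 0x87  # x^7 + x^2 + x + 1, the low part of the field polynomial


def times_x(a: int) -> int:
    """Multiply a field element by x and reduce modulo the field polynomial."""
    a <<= 1
    if a >> 128:
        a = (a & MASK128) ^ POLY_LOW
    return a


def gf_mul(a: int, b: int) -> int:
    """Reference shift-and-add multiplication in GF(2^128)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = times_x(a)
        b >>= 1
    return result


def precompute_matrix(h: int) -> list:
    """Store h * x^i for i = 0..127; done once for a constant hashing key."""
    entries, value = [], h
    for _ in range(128):
        entries.append(value)
        value = times_x(value)
    return entries


def mul_with_matrix(entries: list, a: int) -> int:
    """Multiply a by the fixed h using only bit selection (AND) and XOR."""
    result = 0
    for i in range(128):
        if (a >> i) & 1:
            result ^= entries[i]
    return result
```

In hardware the loop in `mul_with_matrix` unrolls into a parallel AND/XOR network fed from memory, which is why no reduction logic remains in the data path.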
  • 58. 48 [Figure 6.3: Diagrammatic Representation of Pipelined GF(2^128) Multiplier. Blocks: Matrix U Memory Block, GF(2^128) Multiplier, Z register (128 bits). Signals: Input (128 bits), Output (128 bits).]
The characteristics of the Authentication Block are as follows:
- The GF multiplier core has a 128-bit input for the AAD and for C generated by the encryption block, which are entered sequentially.
- The corresponding bits of the Matrix U are read from memory to be XORed with the bits from the data input.
- All the XOR operations are done in 1 clock cycle (Mastrovito parallel multiplier).
- The output of the GF multiplier is looped back and XORed with the next block before it is sent as input.
- After the final block the 128-bit output is XORed with the keystream generated by the AES core.
- The resulting value is sent out as the MAC.
Figure 6.4 shows the diagrammatic representation of the Authentication Block.
[Figure 6.4: Diagrammatic Representation of Authentication Block. Blocks: GF(2^128) Multiplier Core, Matrix U, Z. Signals: AAD & C, Keystream, MAC.]
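The feedback loop of the Authentication Block can be sketched as below. The sketch follows only the characteristics listed above; the length block that standard GCM appends before the final multiplication is not mentioned in the text and is therefore left out. Integers stand for 128-bit blocks, the bit ordering is the plain polynomial one (not GCM's reflected order), and `auth_tag` and `ek_y0` are names chosen for illustration.

```python
MASK128 = (1 << 128) - 1
POLY_LOW = 0x87  # x^7 + x^2 + x + 1


def gf_mul(a: int, b: int) -> int:
    """Shift-and-add multiplication in GF(2^128) (plain bit order, assumed)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a >> 128:
            a = (a & MASK128) ^ POLY_LOW
        b >>= 1
    return result


def auth_tag(h: int, blocks: list, ek_y0: int) -> int:
    """Accumulate Z = (Z xor block) * h over AAD and C, then mask with ek_y0."""
    z = 0                               # Z register, cleared at start
    for block in blocks:                # AAD blocks first, then ciphertext blocks
        z = gf_mul(z ^ block, h)        # loop-back XOR followed by multiplication
    return z ^ ek_y0                    # final XOR with the stored keystream


# Usage: two AAD blocks and one ciphertext block, all values arbitrary.
tag = auth_tag(0x1234, [0xA, 0xB, 0xC], ek_y0=0xFFFF)
```

The multiplication by the fixed `h` is the step that the precomputed Matrix U replaces in the hardware design.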
  • 59. 49 6.1.3 Complete Design
The AES core and GFM core explained above are implemented together to form a Crypto-Core. A state machine is implemented to utilize the two cores according to the requirements. To understand the state machine of the Crypto-Core, the signals of the Crypto-Core first have to be defined:
Signal | Type | Description
data_in (128-bit) | Input | 128-bit data input interface
data_in_valid (4-bit) | Input | Notifies that data is available on data_in
data_in_type (1-bit) | Input | Notifies data type (1 = AAD, 0 = Plaintext)
data_in_not_ready (1-bit) | Output | Output busy bit
data_in_last_word (1-bit) | Input | Notifies the last block of input data
data_in_size (4-bit) | Input | Notifies size of data (0 to 15 => 8 bit to 128 bit)
start (1-bit) | Input | Notifies start of operation
IV_valid (1-bit) | Input | Notifies that data on data_in is the IV
data_out (128-bit) | Output | 128-bit data output interface
data_out_valid (1-bit) | Output | Notifies that data is available on data_out
data_out_size (4-bit) | Output | Notifies size of data (0 to 15 => 8 bit to 128 bit)
data_out_last_word (1-bit) | Output | Notifies the last block of output data
tag_valid (1-bit) | Output | Notifies that the Tag is available on data_out
Table 6.1 I/O Signals of Crypto-Core
The top layer of the AES-GCM core controls the operation of the two cores with the help of a state machine. The state machine defines how the AES-GCM core operates on the incoming data blocks. The states of the state machine are described as follows:
IDLE: In the IDLE state the system checks the start and IV_valid signals. Once both are high it stores the IV available on the data_in port in a register (Yi) and changes the state to INIT_COUNTER.
INIT_COUNTER: In the INIT_COUNTER state the system sends the value from Yi to the AES core to generate the keystream. The state is changed to ENCRYPT_Y0.
  • 60. 50 ENCRYPT_Y0: The system waits for the AES core to generate the keystream, which is stored in a register (EkY0). The state is changed to DATA_ACCEPT.
DATA_ACCEPT: The data_in_not_ready signal is set to '0'. The system checks data_in_valid[0]; if it is high, it checks data_in_type. For data_in_type = 1 (AAD) it starts the GFM count register and sends the data to the GFM input; the state is changed to GFM_MULT. For data_in_type = 0 (Plaintext) the system increments the Yi register by 1 and changes the state to INC_COUNTER_1.
INC_COUNTER_1: The system sends the value from Yi to the input of the AES core, and data_in_valid[0] is sent to the data enable of the AES core. The data on data_in is stored in a register. The system checks the data_in_valid[1] signal; if it is '1' the system increments the Yi register by 1 and the stream register by 1, and the state is changed to INC_COUNTER_2. If it is '0' the state is changed to ENCRYPT1.
INC_COUNTER_2: The system sends the value from Yi to the input of the AES core, and data_in_valid[1] is sent to the data enable of the AES core. The data on data_in is stored in a register. The system checks the data_in_valid[2] signal; if it is '1' the system increments the Yi register by 1 and the stream register by 1, and the state is changed to INC_COUNTER_3. If it is '0' the state is changed to ENCRYPT1.
INC_COUNTER_3: The system sends the value from Yi to the input of the AES core, and data_in_valid[2] is sent to the data enable of the AES core. The data on data_in is stored in a register. The system checks the data_in_valid[3] signal; if it is '1' the system increments the Yi register by 1 and the stream register by 1, and the state is changed to INC_COUNTER_4. If it is '0' the state is changed to ENCRYPT1.
INC_COUNTER_4: The system sends the value from Yi to the input of the AES core, and data_in_valid[3] is sent to the data enable of the AES core. The data on data_in is stored in a register. The state is changed to ENCRYPT1.
ENCRYPT1: The system waits for AES core to generate keystream and the generated keystream is XORed with the value in r_datain0. The data is sent to data_out and data_out_valid is set to 1. The output data is also sent as input to the GFM. The system
  • 61. 51 checks the stream register. If it is not '0' it decrements the stream register by 1 and the state is changed to ENCRYPT2; otherwise it is changed to GFM_MULT.
ENCRYPT2: The system waits for the AES core to generate the keystream, and the generated keystream is XORed with the value in r_datain1. The data is sent to data_out and data_out_valid is set to 1. The output data is also sent as input to the GFM. The system checks the stream register. If it is not '0' it decrements the stream register by 1 and the state is changed to ENCRYPT3; otherwise it is changed to GFM_MULT.
ENCRYPT3: The system waits for the AES core to generate the keystream, and the generated keystream is XORed with the value in r_datain2. The data is sent to data_out and data_out_valid is set to 1. The output data is also sent as input to the GFM. The system checks the stream register. If it is not '0' it decrements the stream register by 1 and the state is changed to ENCRYPT4; otherwise it is changed to GFM_MULT.
ENCRYPT4: The system waits for the AES core to generate the keystream, and the generated keystream is XORed with the value in r_datain3. The data is sent to data_out and data_out_valid is set to 1. This data is sent as input to the GFM and the state is changed to GFM_MULT.
GFM_MULT: The system checks the data_in_last_word signal. If it is '0' the state is changed back to DATA_ACCEPT. If it is '1' the state is changed to PRE_TAG_CALC.
PRE_TAG_CALC: The GFM count register is reset to '0' and the state is changed to TAG_CALC.
TAG_CALC: The output of the multiplier is XORed with the value in EkY0 to generate the Tag. The generated Tag is sent to the data_out port and the tag_valid signal is set to '1'.
The state diagram in Figure 6.5 shows the flow of the state machine.
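The state transitions described above can be summarised as a next-state function. The following is a behavioural sketch, not the Verilog in AESGCM.v: it models only the control flow and the stream register, not the data path. The `aes_done` flag (standing for "the AES core has produced its keystream"), the behaviour of staying in DATA_ACCEPT while data_in_valid[0] is low, and holding in TAG_CALC are assumptions, since the text does not specify them.

```python
from dataclasses import dataclass
from enum import Enum

State = Enum("State", "IDLE INIT_COUNTER ENCRYPT_Y0 DATA_ACCEPT "
                      "INC_COUNTER_1 INC_COUNTER_2 INC_COUNTER_3 INC_COUNTER_4 "
                      "ENCRYPT1 ENCRYPT2 ENCRYPT3 ENCRYPT4 "
                      "GFM_MULT PRE_TAG_CALC TAG_CALC")
INC = [State.INC_COUNTER_1, State.INC_COUNTER_2, State.INC_COUNTER_3, State.INC_COUNTER_4]
ENC = [State.ENCRYPT1, State.ENCRYPT2, State.ENCRYPT3, State.ENCRYPT4]


@dataclass
class Signals:
    start: int = 0
    iv_valid: int = 0
    data_in_valid: int = 0      # 4-bit field, one bit per pipelined block
    data_in_type: int = 0       # 1 = AAD, 0 = plaintext
    data_in_last_word: int = 0
    aes_done: int = 1           # assumed handshake from the AES core


def step(state, sig, stream):
    """Return (next_state, stream_register) for one clock cycle."""
    if state is State.IDLE:
        return (State.INIT_COUNTER if sig.start and sig.iv_valid else State.IDLE), 0
    if state is State.INIT_COUNTER:
        return State.ENCRYPT_Y0, stream
    if state is State.ENCRYPT_Y0:
        return (State.DATA_ACCEPT if sig.aes_done else state), stream
    if state is State.DATA_ACCEPT:
        if not sig.data_in_valid & 1:
            return state, stream                      # assumed: wait for data
        return (State.GFM_MULT if sig.data_in_type else State.INC_COUNTER_1), stream
    if state in INC:
        k = INC.index(state)
        if k < 3 and (sig.data_in_valid >> (k + 1)) & 1:
            return INC[k + 1], stream + 1             # another block follows
        return State.ENCRYPT1, stream
    if state in ENC:
        if not sig.aes_done:
            return state, stream
        k = ENC.index(state)
        if k < 3 and stream != 0:
            return ENC[k + 1], stream - 1             # more blocks to output
        return State.GFM_MULT, stream
    if state is State.GFM_MULT:
        return (State.PRE_TAG_CALC if sig.data_in_last_word else State.DATA_ACCEPT), stream
    return State.TAG_CALC, stream                     # PRE_TAG_CALC, then hold


def trace(sig, limit=32):
    """Run from IDLE with constant inputs until TAG_CALC; return state names."""
    state, stream, seen = State.IDLE, 0, [State.IDLE]
    while state is not State.TAG_CALC and len(seen) < limit:
        state, stream = step(state, sig, stream)
        seen.append(state)
    return [s.name for s in seen]
```

Calling `trace(Signals(start=1, iv_valid=1, data_in_valid=0b1111, data_in_last_word=1))` walks through all four INC_COUNTER and all four ENCRYPT states before reaching TAG_CALC, which matches the "(x4)" annotations in the state diagram.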
  • 62. 52 [Figure 6.5: State Diagram of Crypto-Core State Machine. States: IDLE, INIT_COUNTER, ENCRYPT_Y0, DATA_ACCEPT, INC_COUNTER (x4), ENCRYPT (x4), GFM_MULT, PRE_TAG_CALC, TAG_CALC. Transition labels: start = 1 and IV_valid = 1; Start AES core; data_in_valid = 1 and data_in_type = 1; data_in_valid = 1 and data_in_type = 0; Start AES core for next block; AES Done; data_in_last_word = 0.]
The characteristics of the final completed design are as follows:
- The inputs for the Crypto-Core are a 96-bit initialization vector IV, a 128-bit input Plaintext P and 128-bit input additional authenticated data AAD.
- The outputs are a 128-bit output Ciphertext C and a 128-bit output MAC.
- The 128-bit Cipher Key, the 256-byte SBox and the 1 Kbyte Matrix U are memory units and are accessed from the Memory Block.
- A Control Block controls all the operations of the Crypto-Core.
The diagrammatic representation of the complete design is shown in Figure 6.6.
  • 63. 53 [Figure 6.6: Complete Design Layout. AES-GCM Crypto-Core with Encryption Block, Authentication Block, Control Block and Memory Block (Key, Sbox, Matrix U). Signals: IV, P, AAD, Keystream, C, AAD & C, MAC.]
6.2 Implementation and Results
6.2.1 Implementation Platform
The design explained above has been implemented and tested on Altera-based hardware. The board used for the hardware implementation is the DB5CGXFC7 provided by Devboards GmbH (see footnote 4). The DB5CGXFC7 board is based on the Altera Cyclone V GX device and has an Altera EP5CGXFC7C6F23C7N FPGA chip, which contains 150K Logic Elements and 7 Mbit of RAM. This is a low-end device of the Altera FPGA family and was selected because the idea behind this Crypto-Core was to have a design that is suitable for low-end devices as well. The design was also simulated with the ModelSim - Altera 10.4b software using test_bench.v (see Appendix).
Footnote 4: http://www.devboards.de/en/home/boards/product-details/article/db5cgxfc7/
  • 64. 54 6.2.2 Tests & Results
The system has been tested and compared with previous studies to provide a better understanding of its performance. The designs were tested for area usage, achievable operational frequency, throughput and efficiency. The proposed design was simulated in ModelSim to determine the total delay until the output is received. It was then compared with the design in [10], with both implemented on the same hardware platform. A further implementation, {1}, is also compared: it implements the Mastrovito multiplier in four steps as described in [11], without storing the Matrix U in memory, and implements the AES round transformations in sub-pipelined mode.
Table 6.2 shows the comparison of the designs with respect to achievable operational frequency, clock cycles taken for 4 blocks of data, and throughput. The graphical representation of the comparison is given in Figure 6.7.
Design | Operational Frequency | Clock Cycles per 4 Blocks | Throughput
Proposed Design | 140.17 MHz | 44 | 1.63 Gbps
[10] | 146.34 MHz | 44 | 1.87 Gbps
{1} | 159.32 MHz | 56 | 1.46 Gbps
Table 6.2 Comparison of Operational Frequency and Throughput
[Figure 6.7: Graphical Comparison of Frequency and Throughput. Bar chart for Proposed Work, [10] and {1}; series: Frequency (GHz) and Throughput (Gbps).]
  • 65. 55 The comparison of area usage, in Adaptive Logic Modules (ALMs) used, and of efficiency (throughput/area) is shown in Table 6.3. Figure 6.8 shows the graphical comparison of the area usage and Figure 6.9 shows the graphical comparison of the efficiencies of the designs.
Design | Area (ALM) | Efficiency (Mbps/ALM)
Proposed Design | 1176 | 1.39
[10] | 1582 | 1.18
{1} | 2784 | 0.52
Table 6.3 Comparison of Area Usage and Efficiency
[Figure 6.8: Graphical Comparison of Area Usage. Bar chart of Area (ALMs) for Proposed Work, [10] and {1}.]
[Figure 6.9: Graphical Comparison of Efficiency. Bar chart of Efficiency (Mbps/ALM) for Proposed Work, [10] and {1}.]
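The arithmetic behind the two tables can be checked with a short calculation. The formulas are an inference, not taken from the text: throughput is assumed to be 4 blocks of 128 bits divided by the clock cycles per 4 blocks, multiplied by the operational frequency, with the throughput column read in Gbit/s; efficiency is assumed to be throughput divided by area, which is the reading that matches the Mbps/ALM unit. Under these assumptions the Proposed Design and {1} rows of Table 6.2 are reproduced. The [10] row is left out of the throughput check because its tabulated values do not follow from this assumed formula.

```python
BLOCK_BITS = 128


def throughput_gbps(freq_mhz: float, cycles_per_4_blocks: int) -> float:
    """Assumed formula: bits processed per cycle times clock frequency."""
    return 4 * BLOCK_BITS * freq_mhz / cycles_per_4_blocks / 1000.0


def efficiency_mbps_per_alm(throughput_in_gbps: float, area_alm: int) -> float:
    """Assumed formula: throughput in Mbit/s divided by area in ALMs."""
    return throughput_in_gbps * 1000.0 / area_alm


# Rows taken from Tables 6.2 and 6.3.
print(round(throughput_gbps(140.17, 44), 2))          # Proposed Design
print(round(throughput_gbps(159.32, 56), 2))          # {1}
print(round(efficiency_mbps_per_alm(1.63, 1176), 2))  # Proposed Design
print(round(efficiency_mbps_per_alm(1.87, 1582), 2))  # [10]
print(round(efficiency_mbps_per_alm(1.46, 2784), 2))  # {1}
```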
  • 66. 56 6.3 Conclusion
The results in the previous section show that although {1} achieves a higher operating frequency, its throughput is lower because dividing the multiplication into four iterations costs 4 clock cycles of latency. Computing the Matrix U in logic also increases the area usage, which lowers the overall efficiency of the system. In the proposed design, on the other hand, the Matrix U is pre-calculated and stored as a memory block, which decreases the area cost. The memory usage, even after storing the Matrix U and the SBox as memory blocks, is nearly 1%. The efficiency of the proposed design (1.39 Mbps/ALM) is better than that of the other two designs in the comparison (1.18 and 0.52 Mbps/ALM). The constant key does pose a security risk; however, for a hardware-level implementation in an automated system this design is very suitable, since the cryptographic key can be kept constant and is not user provided.
  • 67. 57 Appendix
The source codes are provided on a CD along with the thesis. The following is a brief description of the source code files:
Name | Description
AESGCM.v | Top-level file for AES-GCM Crypto-Core
GFM.v | Implementation of Galois Field Multiplier
AESConstants.v | Necessary definitions for the core
AESCore.v [10] | Top-level for AES Core
AESRound.v [10] | Implementation of Round Transformations
Counter.v [10] | Counter function to count round number
defineAES.v [10] | Pipeline definitions
lib.v [10] | Other primitives
MixColumnAddKey.v [10] | MixColumn and AddRoundKey implementation
KeyGenerator.v [10] | Key generation implementation
SBox1.v [10] | SBox as LUT
SBox2.v [10] | SBox and polynomial multiply by 02 as LUT
SBox1_2LUT.v [10] | SBox top-level
  • 69. 59 Bibliography
[1] OPC Unified Architecture, Interoperability for Industry 4.0 and the Internet of Things, OPC Foundation Forum.
[2] Klaus Schwab, The Fourth Industrial Revolution, World Economic Forum, 2016.
[3] Mario Hermann, Tobias Pentek and Boris Otto, Design Principles for Industrie 4.0 Scenarios, System Sciences (HICSS), 2016.
[4] J. Daemen and V. Rijmen, AES Proposal: Rijndael, AES Algorithm Submission, September 3, 1999.
[5] Federal Information Processing Standards (FIPS), Specification for the Advanced Encryption Standard (AES), FIPS Publication 197, November 26, 2001.
[6] Harris Nover, Algebraic Cryptanalysis of AES: An Overview, Department of Mathematics, University of Wisconsin.
[7] Sheng Wang, An Architecture for AES-GCM Security Standard, Master's Thesis presented at University of Waterloo, Canada, 2006.
[8] E. D. Mastrovito, VLSI Designs for Multiplication over Finite Fields GF(2^m), in Proc. Sixth International Conference, Applied Algebra, Algebraic Algorithms and Error-Correcting Codes (AAECC-6), Rome, July 1988.
[9] Ricardo Chaves, Georgi Kuzmanov, Stamatis Vassiliadis and Leonel Sousa, Reconfigurable Memory based AES Co-Processor, Parallel and Distributed Processing Symposium, 2006.
  • 70. 60 [10] Haeyoung Rha and Hae-wook Choi, Efficient Pipelined Multistream AES CCMP Architecture for Wireless LAN, Paper submitted at Korea Advanced Institute of Science & Technology (KAIST), 2012.
[11] Bryce Barcelo and John Taylor, Crypto Acceleration Using Asynchronous FPGAs, Submitted to the faculty of Worcester Polytechnic Institute.
[12] Cheng Wang and Howard M. Heys, Using a Pipelined SBox in Compact AES Hardware Implementations, IEEE NEWCAS2010, pp. 101-104, 2010.
[13] Arash Reyhani-Masoleh, Mehran Mozaffari-Kermani, Efficient and High-Performance Parallel Hardware Architectures for the AES-GCM, IEEE Transactions on Computers, vol. 61, no. , pp. 1165-1178, Aug. 2012.
[14] Muhammad H. Rais and Syed M. Qasim, Efficient Hardware Realization of Advanced Encryption Standard Algorithm using Virtex-5 FPGA, International Journal of Computer Science and Network Security (IJCSNS), Vol. 9 No. 9, September 2009.
[15] Abolfazl Soltani and Saeed Sharifian, An Ultra-High Throughput and fully Pipelined Implementation of AES algorithm on FPGA, Microprocessors and Microsystems, Vol. 39 Issue 7, Amsterdam, October 2015.
[16] A. Brokalakis and H. Michail, A High-Speed and Area-Efficient Hardware Implementation of AES-128 Encryption Standard, 5th WSEAS Conference on Multimedia, Internet and Video Technologies, Greece, August 2005.
[17] D. Chen, G. Shou, Y. Hu and Z. Guo, Efficient Architecture and Implementations of AES, IEEE ICACTE2010, pp. V6-295-V6-298, 2010.
[18] Nadia Nedjah, Luiza de Macedo Mourelle, Marco Paulo Cardoso, A Compact Pipelined hardware Implementation of the AES-128 Cipher, IEEE ITNG2006, 2006.
  • 71. 61 [19] Kenneth Stevens, Otmane A. Mohamed, Single-chip FPGA Implementation of a Pipelined, Memory-Based AES Rijndael Encryption Design, IEEE ECE2005, pp. 1296-1299, 2005.
[20] Deen Kotturi, Seong-Moo Yoo, and John Blizzard, AES Crypto Chip Utilizing High-Speed Parallel Pipelined Architecture, IEEE ISCAS2005, pp. 4653-4656 vol. 5, 2005.
[21] J. Guajardo, T. Güneysu, Sandeep S. Kumar, C. Paar and J. Pelzl, Efficient Hardware Implementation of Finite Fields with Applications to Cryptography, Acta Appl Math (2006) 93: 75–118, September 2006.
[22] A. Karatsuba and Y. Ofman, Multiplication of Multidigit Numbers on Automata, Soviet Physics Doklady, 7:595, 1963.
[23] Emilia Käsper and Peter Schwabe, Faster and Timing-Attack Resistant AES-GCM, Katholieke Universiteit Leuven.
[24] Bo Yang, Sambit Mishra and Ramesh Karri, High Speed Architecture for Galois/Counter Mode of Operation (GCM), Polytechnic University, Brooklyn, New York.
[25] H. Fan and M.A. Hasan, A New Approach to Subquadratic Space Complexity Parallel Multipliers for Extended Binary Fields, IEEE Transactions on Computers, 56(2):224–233, 2007.
[26] A. Satoh, High-Speed Parallel Hardware Architecture for Galois Counter Mode, IEEE International Symposium on Circuits and Systems (ISCAS 2007), pages 1863–1866, 2007.
[27] M.A. Hasan, Matrix-vector Product based Subquadratic Arithmetic Complexity Schemes for Field Multiplication, Proceedings of SPIE, 6697:669702, 2007.
  • 72. 62 [28] H. Fan and Y. Dai, Fast Bit-Parallel GF(2^n) Multiplier for All Trinomials, IEEE Transactions on Computers, 54(4):485–490, 2005.
[29] C. Paar, A new Architecture for a Parallel Finite Field Multiplier with low Complexity based on Composite Fields, IEEE Transactions on Computers, 45(7):856–861, 1996.
[30] Pujan Patel, Parallel Multiplier Designs for the Galois/Counter Mode of Operation, A thesis presented to the University of Waterloo, Waterloo, Ontario, Canada, 2008.
[31] Bruce Schneier, John Kelsey, Doug Whiting, David Wagner, Chris Hall, Niels Ferguson, Tadayoshi Kohno, The Twofish Team's Final Comments on AES Selection, May 2000.
[32] NIST Computer Security Division's (CSD) Security Technology Group (STG), Proposed modes, Cryptographic Toolkit, NIST, April 14, 2013.
[33] David A. McGrew, John Viega, The Galois/Counter Mode of Operation, NIST, 2005.