This document provides an introduction and overview of big data technologies. It begins with an outline that covers introductions to big data, NoSQL databases, MapReduce and Hadoop, and Hive, HBase and Sqoop. It then discusses relational databases and SQL before introducing NoSQL databases. Key reasons for using NoSQL databases are explained, including improved scalability, lower costs, flexibility in data structures, and high availability. Examples of big data applications and the internet of things are also presented.
http://dataminingtrend.com http://facebook.com/datacube.th
What is Big Data?
• Big Data is characterised by 3 Vs
• Volume
• The amount of data is growing enormously
• Velocity
• Data is being generated ever more rapidly
• Variety
• Data is becoming more varied in form
source: https://upxacademy.com/beginners-guide-to-big-data/
What is Big Data?
• Huge volume of data
• The data is extremely large, e.g., billions of rows or
millions of columns
• Speed of new data creation and growth
• Data is generated very quickly
• Complexity of data types and structures
• Data is highly varied and not limited to tables; it may take the
form of text, images, or video clips
Relational database & SQL
• Databases are made up of tables and each table is made up of
rows and columns
• SQL is a database interaction language that allows you to add,
retrieve, edit and delete information stored in databases
ID Mark Code Title
S103 72 DBS Database Systems
S103 58 IAI Intro to AI
S104 68 PR1 Programming 1
S104 65 IAI Intro to AI
S106 43 PR2 Programming 2
S107 76 PR1 Programming 1
S107 60 PR2 Programming 2
S107 35 IAI Intro to AI
Relational database & SQL
• SQL works with two main types of operations: reads and writes
• A read consists of the SELECT command, which has three
common clauses
• SELECT
• FROM
• WHERE
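To make the three clauses concrete, here is a minimal sketch using Python's built-in sqlite3 module against the example marks table shown above (the table and column names are assumptions for illustration):

```python
import sqlite3

# In-memory database populated with the example table from the slide.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE marks (id TEXT, mark INTEGER, code TEXT, title TEXT)")
conn.executemany("INSERT INTO marks VALUES (?, ?, ?, ?)", [
    ("S103", 72, "DBS", "Database Systems"),
    ("S103", 58, "IAI", "Intro to AI"),
    ("S104", 68, "PR1", "Programming 1"),
    ("S104", 65, "IAI", "Intro to AI"),
    ("S106", 43, "PR2", "Programming 2"),
    ("S107", 76, "PR1", "Programming 1"),
    ("S107", 60, "PR2", "Programming 2"),
    ("S107", 35, "IAI", "Intro to AI"),
])

# SELECT ... FROM ... WHERE: students who scored over 60 in any module.
rows = conn.execute(
    "SELECT id, mark, title FROM marks WHERE mark > 60"
).fetchall()
print(rows)
```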
image source: https://justbablog.files.wordpress.com/2017/03/sql_beginners.jpg
Why NoSQL?
• Relational databases have been the dominant type of database used
for applications for decades.
• With the advent of the Web, however, the limitations of relational
databases became increasingly problematic.
• Companies such as Google, LinkedIn, Yahoo! and Amazon found that
supporting large numbers of users on the Web was different from
supporting much smaller numbers of business users.
Why NoSQL?
• Web applications needed to support
• Large volumes of read and write operations
• Low latency response times
• High availability
• These requirements were difficult to realise using relational databases.
• There are limits to how many CPUs and how much memory can be supported in a
single server.
• Another option is to use multiple servers with a relational database.
• Operating a single RDBMS over multiple servers, however, is a complex
undertaking.
Why NoSQL?: Scalability
• Scalability is the ability to efficiently meet the needs for varying
workloads.
• For example, if there is a spike in traffic to a website, additional
servers can be brought online to handle the additional load.
• When the spike subsides and traffic returns to normal, some of
those additional servers can be shut down.
• Adding servers as needed is called scaling out.
Why NoSQL?: Scalability
• Scaling out is more flexible than scaling up.
• Servers can be added or removed as needed when scaling out.
• NoSQL databases are designed to utilise the servers available in a cluster with
minimal intervention by database administrators.
Why NoSQL?: Cost
• Commercial software vendors employ a variety of licensing
models that include charging by
• the size of the server running the RDBMS
• the number of concurrent users on the database
• the number of named users allowed to use the software
• The major NoSQL databases are available as open source, free to
use on as many servers, of whatever size, as needed.
Why NoSQL?: Flexibility
• Database designers expect to know at the start of a project all
the tables and columns that will be needed to support an
application.
• It is also commonly assumed that most of the columns in a table
will be needed by most of the rows.
• Unlike relational databases, some NoSQL databases do not
require a fixed table structure.
• For example, in a document database, a program could
dynamically add new attributes as needed without having to have a
database designer alter the database design.
Why NoSQL?: Availability
• Many of us have come to expect websites and web applications
to be available whenever we want to use them.
• NoSQL databases are designed to take advantage of multiple,
low-cost servers.
• When one server fails or is taken out of service for maintenance,
the other servers in the cluster can take on the entire workload.
Key-Value databases
• Key-value databases are the simplest form of NoSQL
databases.
• These databases are modelled on two components:
keys and values
• Data is stored as key-value pairs, where the attribute is the key
and the content is the value
• Data can be queried and retrieved using the key only.
Key-Value databases
• use cases
• caching data from
relational databases to
improve performance
• storing data from
sensors (IoT)
• software
• redis
• Amazon DynamoDB
Keys → Values
accountNumber → 3876941
name → Jane Washington
numItems → 3
custType → Loyalty Member
Key-Value databases
• Redis example (http://try.redis.io)
• Set or update value against a key:
• SET university "DPU" // set string
• GET university // get string
• HSET student firstName "Manee" // Hash – set field value
• HGET student firstName // Hash – get field value
• LPUSH "alice:sales" "10" "20" // List – create/append
• LSET "alice:sales" "0" "4" // List – update
• LRANGE "alice:sales" 0 1 // view list
Key-Value databases
• Set or update value against a key:
• SET quantities 1
• INCR quantities
• SADD "alice:friends" "f1" "f2" // Set – create/update
• SADD "bob:friends" "f2" "f1" // Set – create/update
• Set operations:
• intersection
• SINTER "alice:friends" "bob:friends"
• union
• SUNION "alice:friends" "bob:friends"
Document Databases
• A document store allows the inserting, retrieving, and
manipulating of semi-structured data.
• Compared to an RDBMS, the documents themselves act as
records (or rows); however, they are semi-structured rather than
rigidly schematised.
• A document store can hold records that have different sets of data fields
(columns)
• Most of the databases available under this category use XML or
JSON
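As a sketch of this flexibility, the following Python snippet uses plain dicts standing in for JSON documents in one collection; the field names are hypothetical:

```python
import json

# Two "documents" in the same collection with different sets of fields,
# something a rigid relational schema would need NULL columns to express.
docs = [
    {"name": "Manee", "type": "student", "courses": ["DBS", "IAI"]},
    {"name": "DPU", "type": "university", "city": "Bangkok"},
]

# A query simply inspects whichever fields each document happens to have.
students = [d for d in docs if d.get("type") == "student"]
print(json.dumps(students))
```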
Document Databases
• MongoDB examples
• Download MongoDB from https://www.mongodb.com/download-
center?jmp=nav#community
• MongoDB’s default data directory path is the absolute path \data\db
on the drive from which you start MongoDB
• You can specify an alternate path for data files using the --dbpath
option to mongod.exe
• Import example data
"C:Program FilesMongoDBServer3.4binmongod.exe"
--dbpath d:testmongodbdata
mongoimport --db test --collection restaurants --drop --file
downloads/primer-dataset.json
Column-oriented databases
• Store data as columns, as opposed to the rows that are prominent in an
RDBMS
• A relational database presents data as two-dimensional tables
comprising rows and columns, but stores, retrieves, and
processes it one row at a time
• A column-oriented database stores each column contiguously,
i.e. on disk or in memory each column is stored
in sequential blocks.
Column-oriented databases
• Advantages of column-based tables:
• Faster Data Access:
• Only affected columns have to be read during the selection
process of a query. Any of the columns can serve as an index.
• Better Compression:
• Columnar data storage allows highly efficient compression
because most columns contain only a few distinct
values (compared to the number of rows).
Column-oriented databases
• Advantages of column-based tables:
• Better parallel Processing:
• In a column store, data is already vertically partitioned. This
means that operations on different columns can easily be
processed in parallel.
• If multiple columns need to be searched or aggregated, each of
these operations can be assigned to a different processor core.
Column-oriented databases
• For analytic applications, where aggregations are used and
fast search and processing are required, row-based storage is a poor
fit.
• In row-based tables, all data stored in a row has to be read even
when only a few columns are required.
• Hence, such queries over huge amounts of data can take a long
time.
• In columnar tables, this information is stored physically next to each
other, which significantly speeds up certain data queries.
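A minimal Python sketch of the difference, with made-up data: summing one column in a row layout touches every record, while a columnar layout reads a single contiguous array.

```python
# Row-oriented: each record stored together; summing one column
# still has to walk every full row.
rows = [
    ("S103", 72, "DBS"),
    ("S104", 68, "PR1"),
    ("S107", 76, "PR1"),
]
row_total = sum(r[1] for r in rows)   # reads all fields of every row

# Column-oriented: each column stored contiguously; the aggregate
# touches only the one array it needs.
columns = {
    "id":   ["S103", "S104", "S107"],
    "mark": [72, 68, 76],
    "code": ["DBS", "PR1", "PR1"],
}
col_total = sum(columns["mark"])      # only the "mark" column is read

print(row_total, col_total)
```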
Column-oriented databases
• Column storage is most useful for OLAP queries (queries using
SQL aggregate functions), because these queries read just
a few attributes from every data entry.
• For traditional OLTP queries (queries not using SQL
aggregate functions), it is more advantageous to store all
attributes side-by-side in row tables.
Graph databases
• Graph databases are the most specialised of the four NoSQL database types.
• Instead of modelling data using columns and rows, a graph database uses
structures called nodes and relationships.
• In more formal discussions, they are called vertices and edges
• A node is an object that has an identifier and a set of attributes
• A relationship is a link between two nodes that contains attributes about that
relation.
• Graph databases are designed to model adjacency between objects. Every
node in the database contains pointers to adjacent objects in the database.
• This allows for fast operations that require following paths through a graph.
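A minimal sketch of path-following over adjacency, using a plain Python dict as a stand-in for a graph database (the node names are hypothetical):

```python
from collections import deque

# Hypothetical social graph: each node keeps direct pointers to its neighbours.
graph = {
    "Alice": ["Bob", "Carol"],
    "Bob":   ["Alice", "Dave"],
    "Carol": ["Alice"],
    "Dave":  ["Bob"],
}

def shortest_path(start, goal):
    """Follow adjacency pointers breadth-first; no table joins are needed."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in graph[path[-1]]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None

print(shortest_path("Carol", "Dave"))
```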
Hadoop architecture
• Hadoop is composed of two primary components that
implement the basic concepts of distributed storage and
computation: HDFS and YARN
• HDFS (sometimes shortened to DFS) is the Hadoop Distributed
File System, responsible for managing data stored on disks
across the cluster.
• YARN acts as a cluster resource manager, allocating
computational assets (processing availability and memory on
worker nodes) to applications that wish to perform a distributed
computation.
Hadoop architecture
• HDFS and YARN work in concert to minimize the amount of
network traffic in the cluster primarily by ensuring that data is
local to the required computation.
• A set of machines that is running HDFS and YARN is known as a
cluster, and the individual machines are called nodes.
• A cluster can have a single node, or many thousands of nodes,
but all clusters scale horizontally, meaning as you add more
nodes, the cluster increases in both capacity and performance
in a linear fashion.
Hadoop architecture
• Each node in the cluster is identified by the type of process that
it runs:
• Master nodes
• These nodes run coordinating services for Hadoop workers and
are usually the entry points for user access to the cluster.
• Worker nodes
• Worker nodes run services that accept tasks from master nodes
either to store or retrieve data or to run a particular application.
• A distributed computation is run by parallelizing the analysis
across worker nodes.
Hadoop architecture
• For HDFS, the master and worker services are as follows:
• NameNode (Master)
• Stores the directory tree of the file system, file metadata, and the
location of each file in the cluster.
• Clients wanting to access HDFS must first locate the appropriate
storage nodes by requesting information from the NameNode.
• DataNode (Worker)
• Stores and manages HDFS blocks on the local disk.
• Reports health and status of individual data stores back to the
NameNode
Hadoop architecture
• When data is accessed from HDFS
• a client application must first make a request to the NameNode to
locate the data on disk.
• The NameNode will reply with a list of DataNodes that store the
data.
• the client must then directly request each block of data from the
DataNode.
Hadoop architecture
• YARN has multiple master services and a worker service as
follows:
• ResourceManager (Master)
• Allocates and monitors available cluster resources (e.g.,
physical assets like memory and processor cores)
• handling scheduling of jobs on the cluster
• ApplicationMaster (Master)
• Coordinates a particular application being run on the cluster as
scheduled by the ResourceManager
Hadoop architecture
• Clients that wish to execute a job
• must first request resources from the ResourceManager, which
assigns an application-specific ApplicationMaster for the duration
of the job.
• the ApplicationMaster tracks the execution of the job.
• the ResourceManager tracks the status of the nodes
• each individual NodeManager creates containers and executes
tasks within them
Hadoop architecture
• Finally, one other type of cluster is important to note: a single node
cluster.
• In “pseudo-distributed mode” a single machine runs all Hadoop
daemons as though it were part of a cluster, but network traffic occurs
through the local loopback network interface.
• Hadoop developers typically work in a pseudo-distributed environment,
usually inside of a virtual machine to which they connect via SSH.
• Cloudera, Hortonworks, and other popular distributions of Hadoop
provide pre-built virtual machine images that you can download and
get started with right away.
Hadoop Distributed File System (HDFS)
• HDFS provides redundant storage for big data by storing that
data across a cluster of cheap, unreliable computers, thus
extending the amount of available storage capacity that a single
machine alone might have.
• HDFS performs best with a modest number of very large files
• millions of large files (100 MB or more) rather than billions of smaller
files that might occupy the same volume.
• It is not a good fit as a data backend for applications that require
updates in real-time, interactive data analysis, or record-based
transactional support.
Hadoop Distributed File System (HDFS)
• HDFS files are split into blocks, usually of either 64MB or
128MB.
• Blocks allow very large files to be split across and distributed to
many machines at run time.
• Additionally, blocks are replicated across the DataNodes.
• By default, the replication is three-fold
• Therefore, each block exists on three different machines and three
different disks, and even if two nodes fail, the data will not be lost.
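The block arithmetic can be sketched as follows (a hypothetical 1 GB file with the 128 MB block size and three-fold replication mentioned above):

```python
# Back-of-the-envelope sketch of how a file maps to HDFS blocks.
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB block size
REPLICATION = 3                  # default three-fold replication

file_size = 1024 * 1024 * 1024   # a hypothetical 1 GB file

blocks = -(-file_size // BLOCK_SIZE)   # ceiling division -> number of blocks
replicas = blocks * REPLICATION        # block copies stored across the cluster

print(blocks, replicas)   # 8 blocks, 24 stored block replicas
```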
Interacting with HDFS
• Interacting with HDFS is primarily performed from the command
line using the hadoop fs script, which has the following
usage:
• The -option argument is the name of a specific option for the
specified command, and <arg> is one or more arguments
specified for this option.
• For example, show help
$ hadoop fs [-option <arg>]
$ hadoop fs -help
Interacting with HDFS
• List directory contents
• use -ls command:
• Running the -ls command on a new cluster will not return any
results. This is because the -ls command, without any
arguments, will attempt to display the contents of the user’s
home directory on HDFS.
• Providing -ls with the forward slash (/) as an argument displays the
contents of the root of HDFS:
$ hadoop fs -ls
$ hadoop fs -ls /
Interacting with HDFS
• Creating a directory
• To create a directory within HDFS, use the -mkdir
command:
• For example, create a books directory in the home directory
• Use the -ls command to verify that the previous directories were
created:
$ hadoop fs -mkdir [directory name]
$ hadoop fs -mkdir books
$ hadoop fs -ls
Interacting with HDFS
• Copy Data onto HDFS
• After a directory has been created for the current user, data can
be uploaded to the user’s HDFS home directory with the -put
command:
• For example, copy book file from local to HDFS
• Use the -ls command to verify that pg20417.txt was copied to
HDFS:
$ hadoop fs -put [source file] [destination file]
$ hadoop fs -put pg20417.txt books/pg20417.txt
$ hadoop fs -ls books
Interacting with HDFS
• Retrieve (view) Data from HDFS
• Multiple commands allow data to be retrieved from HDFS.
• To simply view the contents of a file, use the -cat command. -cat
reads a file on HDFS and displays its contents to stdout.
• The following command uses -cat to display the contents of
pg20417.txt
$ hadoop fs -cat books/pg20417.txt
Interacting with HDFS
• Retrieve (view) Data from HDFS
• Data can also be copied from HDFS to the local filesystem using
the -get command. The -get command is the opposite of the -put
command:
• For example, this command copies pg20417.txt from HDFS to the
local filesystem:
$ hadoop fs -get [source file] [destination file]
$ hadoop fs -get books/pg20417.txt .
MapReduce
• MapReduce is a programming model that enables large volumes of data
to be processed and generated by dividing work into independent tasks
and executing the tasks in parallel across a cluster of machines.
• At a high level, every MapReduce program transforms a list of input data
elements into a list of output data elements twice, once in the map phase
and once in the reduce phase.
• The MapReduce framework is composed of three major phases: map,
shuffle and sort, and reduce.
Image source: "Data Analytics with Hadoop: An Introduction for Data Scientists", Benjamin Bengfort and Jenny Kim, 2016
MapReduce
• Map
• The first phase of a MapReduce application is the map phase.
Within the map phase, a function (called the mapper) processes a
series of key-value pairs.
• The mapper sequentially processes each key-value pair
individually, producing zero or more output key-value pairs
• As an example, consider a mapper whose purpose is to transform
sentences into words.
MapReduce
• Map
• The input to this mapper would be strings that contain sentences,
and the mapper’s function would be to split the sentences into
words and output the words
Image source: "Hadoop with Python", Zachary Radtka and Donald Miner, 2016
MapReduce
• Shuffle and Sort
• As the mappers begin completing, the intermediate outputs from
the map phase are moved to the reducers. This process of moving
output from the mappers to the reducers is known as shuffling.
• Shuffling is handled by a partition function, known as the
partitioner. The partitioner ensures that all of the values for the
same key are sent to the same reducer.
• The intermediate keys and values for each partition are sorted by
the Hadoop framework before being presented to the reducer.
MapReduce
• Reduce
• Within the reduce phase, an iterator of values is provided to a
function known as the reducer. The iterator of values is a non-unique
set of values for each unique key from the output of the map phase.
• The reducer aggregates the values for each unique key and
produces zero or more output key-value pairs
• As an example, consider a reducer whose purpose is to sum all of
the values for a key. The input to this reducer is an iterator of all of
the values for a key, and the reducer sums all of the values.
MapReduce examples: word count
• The word-counting application takes as input one or more text
files and produces a list of words and their frequencies as output.
Image source: "Data Analytics with Hadoop: An Introduction for Data Scientists", Benjamin Bengfort and Jenny Kim, 2016
MapReduce examples: word count
• Because Hadoop utilizes key/value pairs, the input key is a file
ID and line number and the input value is a string, while the
output key is a word and the output value is an integer.
• The following Python pseudocode shows how this algorithm is
implemented:
# emit is a function that performs Hadoop I/O
def map(dockey, line):
    for word in line.split():
        emit(word, 1)

def reduce(word, values):
    count = sum(values)
    emit(word, count)
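The three phases can be simulated end-to-end in plain Python (a local sketch, not the Hadoop framework itself), with a dict standing in for the shuffle-and-sort step:

```python
from collections import defaultdict

def mapper(dockey, line):
    """Map phase: emit (word, 1) for every word in the line."""
    for word in line.split():
        yield (word, 1)

def reducer(word, values):
    """Reduce phase: sum the counts collected for one word."""
    yield (word, sum(values))

# Shuffle and sort: group all intermediate values by key.
docs = [(27183, "the fast cat wears no hat"),
        (31416, "the cat in the hat ran fast")]
groups = defaultdict(list)
for dockey, line in docs:
    for word, one in mapper(dockey, line):
        groups[word].append(one)

counts = dict(kv for word in sorted(groups) for kv in reducer(word, groups[word]))
print(counts)
```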
MapReduce examples: word count
• Example (Map)
• Input:
• Mapper 1: (27183, “The fast cat wears no hat.”)
• Mapper 2: (31416, “The cat in the hat ran fast.”)
• Mapper 1 output: (“The”,1) (“fast”,1) (“cat”,1) (“wears”,1) (“no”,1) (“hat”,1) (“.”,1)
• Mapper 2 output: (“The”,1) (“cat”,1) (“in”,1) (“the”,1) (“hat”,1) (“ran”,1) (“fast”,1) (“.”,1)
MapReduce examples: IoT
• IoT applications create an enormous amount of data that has to
be processed. This data is generated by physical sensors that
take measurements, such as the room temperature at 8:00.
• Every measurement consists of
• a key (the timestamp when the measurement was taken) and
• a value (the actual value measured by the sensor).
• For example, (2016-05-01 01:02:03, 1).
• The goal of this exercise is to compute average daily values of that
sensor’s data.
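A minimal Python sketch of the map and reduce steps for this exercise, using a few made-up readings:

```python
from collections import defaultdict

# Hypothetical sensor readings: (timestamp, value) pairs as in the slide.
readings = [
    ("2016-05-01 01:02:03", 1),
    ("2016-05-01 13:00:00", 3),
    ("2016-05-02 08:00:00", 4),
]

# Map: re-key each measurement by its day.
by_day = defaultdict(list)
for timestamp, value in readings:
    day = timestamp.split()[0]
    by_day[day].append(value)

# Reduce: average the values collected for each day.
daily_avg = {day: sum(vals) / len(vals) for day, vals in by_day.items()}
print(daily_avg)
```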
MapReduce examples: shared friendship
• In the shared friendship task, the goal is to analyze a social
network to see which friend relationships users have in
common.
• Given an input data source where the key is the name of a user
and the value is a comma-separated list of friends.
MapReduce examples: shared friendship
• The following Python pseudocode demonstrates how to perform
this computation:
def map(person, friends):
    for friend in friends.split(","):
        pair = sort([person, friend])
        emit(pair, friends)

def reduce(pair, friends):
    # friends holds the two friend lists emitted for this pair
    shared = set(friends[0])
    shared = shared.intersection(friends[1])
    emit(pair, shared)
MapReduce examples: shared friendship
• The mapper creates an intermediate key for each of the possible
(friend, friend) tuples that exist in the initial dataset.
• This allows us to analyze the dataset on a per-relationship basis, as the
value is the list of associated friends.
• The pair is sorted, which ensures that the inputs (“Mike”,“Linda”)
and (“Linda”,“Mike”) end up being the same key during
aggregation in the reducer.
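The same computation can be sketched in runnable Python (a local simulation with a hypothetical friend list, not actual Hadoop):

```python
from collections import defaultdict

# Hypothetical friend lists keyed by user name (friendship is symmetric).
friend_lists = {
    "Mike":  ["Linda", "Sam"],
    "Linda": ["Mike", "Sam"],
    "Sam":   ["Mike", "Linda"],
}

# Map: emit one record per sorted pair so ("Mike","Linda") and
# ("Linda","Mike") land under the same key during shuffling.
groups = defaultdict(list)
for person, friends in friend_lists.items():
    for friend in friends:
        pair = tuple(sorted([person, friend]))
        groups[pair].append(set(friends))

# Reduce: intersect the two friend lists collected for each pair.
shared = {pair: fl[0] & fl[1] for pair, fl in groups.items() if len(fl) == 2}
print(shared[("Linda", "Mike")])
```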
Hadoop Streaming
• Hadoop streaming is a utility that comes packaged with the
Hadoop distribution and allows MapReduce jobs to be created
with any executable as the mapper and/or the reducer.
• The Hadoop streaming utility enables Python, shell scripts, or any
other language to be used as a mapper, reducer, or both.
• The mapper and reducer are both executables that
• read input, line by line, from the standard input (stdin),
• and write output to the standard output (stdout).
• The Hadoop streaming utility creates a MapReduce job, submits the job
to the cluster, and monitors its progress until it is complete.
Hadoop Streaming
• When the mapper is initialized, each map task launches the
specified executable as a separate process.
• The mapper reads the input file and presents each line to the
executable via stdin. After the executable processes each line
of input, the mapper collects the output from stdout and
converts each line to a key-value pair.
• The key consists of the part of the line before the first tab
character, and the value consists of the part of the line after the
first tab character.
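That split rule can be sketched in one line of Python:

```python
# Streaming splits each output line at the FIRST tab character:
# key = text before the tab, value = text after it.
line = "hat\t1"
key, _, value = line.partition("\t")
print(key, value)
```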
Hadoop Streaming
• When the reducer is initialized, each reduce task launches the
specified executable as a separate process.
• The reducer converts the input key-value pair to lines that are
presented to the executable via stdin.
• The reducer collects the executable's results from stdout and
converts each line to a key-value pair.
• Similar to the mapper, the executable specifies key-value pairs
by separating the key and value by a tab character.
Hadoop Streaming example
• The WordCount application can be implemented as two Python
programs: mapper.py and reducer.py.
• mapper.py is the Python program that implements the logic in
the map phase of WordCount.
• It reads data from stdin, splits the lines into words, and outputs
each word with its intermediate count to stdout.
Hadoop Streaming example
• mapper.py
#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)
Hadoop Streaming example
• reducer.py is the Python program that implements the logic in
the reduce phase of WordCount.
• It reads the results of mapper.py from stdin, sums the
occurrences of each word, and writes the result to stdout.
• reducer.py
#!/usr/bin/env python
from operator import itemgetter
import sys
current_word = None
current_count = 0
word = None
Hadoop Streaming example
• reducer.py (cont’)
# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
Hadoop Streaming example
• reducer.py (cont’)
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
Hadoop Streaming example
• Before attempting to execute the code, ensure that the
mapper.py and reducer.py files have execution permission.
• The following command will enable this for both files:
• Also ensure that the first line of each file contains the proper
path to Python. This line enables mapper.py and reducer.py to
execute as standalone executables.
• It is highly recommended to test all programs locally before
running them across a Hadoop cluster.
$ chmod +x mapper.py reducer.py
$ echo 'The fast cat wears no hat' | ./mapper.py | sort -k1,1 | ./reducer.py
Hadoop Streaming example
• Download 3 ebooks from Project Gutenberg
• The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson (659 KB)
• The Notebooks of Leonardo Da Vinci (1.4 MB)
• Ulysses by James Joyce (1.5 MB)
• Before we run the actual MapReduce job, we must first copy the
files from our local file system to Hadoop’s HDFS.
$ hadoop fs -put pg20417.txt books/pg20417.txt
$ hadoop fs -put 5000-8.txt books/5000-8.txt
$ hadoop fs -put 4300-0.txt books/4300-0.txt
$ hadoop fs -ls books
Hadoop Streaming example
• The mapper and reducer programs can be run as a
MapReduce application using the Hadoop streaming utility.
• The command to run the Python programs mapper.py and
reducer.py on a Hadoop cluster is as follows:
$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh*.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /user/hduser/books/* \
    -output /user/hduser/books/output
Hadoop Streaming example
• Options for Hadoop streaming
Option Description
-files A comma-separated list of files to be copied to the
MapReduce cluster
-mapper The command to be run as the mapper
-reducer The command to be run as the reducer
-input The DFS input path for the Map step
-output The DFS output directory for the Reduce step
Python MapReduce library: mrjob
• mrjob is a Python MapReduce library, created by Yelp, that
wraps Hadoop streaming, allowing MapReduce applications to
be written in a more Pythonic manner.
• mrjob enables multistep MapReduce jobs to be written in pure
Python.
• MapReduce jobs written with mrjob can be tested locally, run on
a Hadoop cluster, or run in the cloud using Amazon Elastic
MapReduce (EMR).
mrjob example
• word_count.py
• To run the job locally and count the frequency of words within a
file named pg20417.txt, use the following command:
from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, _, line):
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()
$ python word_count.py books/pg20417.txt
mrjob example
• The MapReduce job is defined as the class, MRWordCount. Within the
mrjob library, the class that inherits from MRJob contains the methods
that define the steps of the MapReduce job.
• The steps within an mrjob application are mapper, combiner, and
reducer. The class inheriting MRJob only needs to define one of these
steps.
• The mapper() method defines the mapper for the MapReduce job. It
takes key and value as arguments and yields tuples of (output_key,
output_value).
• In the WordCount example, the mapper ignored the input key and split
the input value to produce words and counts.
mrjob example
• The combiner is a process that runs after the mapper and before
the reducer.
• It receives, as input, all of the data emitted by the mapper, and the
output of the combiner is sent to the reducer. The combiner yields
tuples of (output_key, output_value) as output.
• The reducer() method defines the reducer for the MapReduce job.
• It takes a key and an iterator of values as arguments and yields
tuples of (output_key, output_value).
• In the WordCount example, the reducer sums the values for each key,
which represent the frequency of each word in the input.
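The mapper → combiner → reducer flow described above can be simulated in plain Python to see what a combiner buys you. This illustrates the semantics only, not mrjob's internals:

```python
from collections import Counter

def mapper(line):
    # emit one (word, 1) pair per word, as a mapper() method would
    return [(word, 1) for word in line.split()]

def combine(pairs):
    # per-mapper pre-aggregation: fewer pairs cross the network
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())

def reduce_all(pairs):
    # final aggregation across all mappers
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

mapped = mapper("the cat the hat")   # 4 pairs emitted by the mapper
combined = combine(mapped)           # 3 pairs after the local combine
assert reduce_all(combined) == reduce_all(mapped)  # same final result
```

The combiner shrinks the mapper's output before the shuffle without changing the final counts, which is exactly why it is safe for associative operations like summing.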
mrjob example
• The final component of a MapReduce job written with the mrjob
library is the two lines at the end of the file:
if __name__ == '__main__':
MRWordCount.run()
• These lines enable the execution of mrjob; without them, the
application will not work.
• Executing a MapReduce application with mrjob is similar to
executing any other Python program. The command line must
contain the name of the mrjob application and the input file:
$ python mr_job.py input.txt
mrjob example
• By default, mrjob runs locally, allowing code to be developed
and debugged before being submitted to a Hadoop cluster.
• To change how the job is run, specify the -r/--runner option.
$ python word_count.py -r hadoop hdfs:///user/hduser/books/pg20417.txt
Introduction
• The Hadoop ecosystem emerged as a cost-effective way of working
with large datasets
• It imposes a particular programming model, called MapReduce, for
breaking up computation tasks into units that can be distributed around
a cluster of commodity hardware
• Underneath this computation model is a distributed file system called
Hadoop Distributed Filesystem (HDFS)
• However, a challenge remains; how do you move an existing data
infrastructure to Hadoop, when that infrastructure is based on traditional
relational databases and the Structured Query Language (SQL)?
Introduction
• This is where Hive comes in. Hive provides an SQL dialect, called
Hive Query Language (abbreviated HiveQL or just HQL) for querying
data stored in a Hadoop cluster.
• SQL knowledge is widespread for a reason; it’s an effective,
reasonably intuitive model for organizing and using data.
• Mapping these familiar data operations to the low-level MapReduce
Java API can be daunting, even for experienced Java developers.
• Hive does this dirty work for you, so you can focus on the query itself.
Hive translates most queries to MapReduce jobs, thereby exploiting
the scalability of Hadoop, while presenting a familiar SQL abstraction.
Introduction
• Hive is most suited for data warehouse applications, where relatively
static data is analyzed, fast response times are not required, and when
the data is not changing rapidly.
• Apache Hive is a “data warehousing” framework built on top of
Hadoop.
• Hive provides data analysts with a familiar SQL-based interface to
Hadoop, which allows them to attach structured schemas to data in
HDFS and access and analyze that data using SQL queries.
• Hive has made it possible for developers who are fluent in SQL to
leverage the scalability and resilience of Hadoop without requiring them
to learn Java or the native MapReduce API.
Hive in the Hadoop Ecosystem
• There are several ways to interact with Hive
• CLI: command-line interface
• GUI: Graphic User Interface
• Karmasphere (http://karmasphere.com)
• Cloudera’s open source Hue (https://github.com/cloudera/hue)
• All commands and queries go to the Driver, which compiles the
input, optimizes the computation required, and executes the
required steps, usually with MapReduce jobs.
Hive in the Hadoop Ecosystem
• Hive communicates with the JobTracker to initiate the MapReduce job.
• Hive does not have to be running on the same master node with the
JobTracker. In larger clusters, it’s common to have edge nodes where
tools like Hive run.
• They communicate remotely with the JobTracker on the master node
to execute jobs. Usually, the data files to be processed are in HDFS,
which is managed by the NameNode.
• The Metastore is a separate relational database (usually a MySQL
instance) where Hive persists table schemas and other system
metadata.
Structured Data Queries with Hive
• Hive provides its own dialect of SQL called the Hive Query Language,
or HQL.
• HQL supports many commonly used SQL statements, including data
definition statements (DDL) (e.g., CREATE DATABASE/SCHEMA/TABLE),
data manipulation statements (DML) (e.g., INSERT, UPDATE, LOAD),
and data retrieval queries (e.g., SELECT).
• Hive commands and HQL queries are compiled into an execution plan
or a series of HDFS operations and/or MapReduce jobs, which are
then executed on a Hadoop cluster.
Structured Data Queries with Hive
• Additionally, Hive queries entail higher latency due to the overhead
required to generate and launch the compiled MapReduce jobs on the
cluster; even small queries that would complete within a few seconds
on a traditional RDBMS may take several minutes to finish in Hive.
• On the plus side, Hive provides the high scalability and high
throughput that you would expect from any Hadoop-based
application.
• It is very well suited to batch-level workloads for online analytical
processing (OLAP) of very large datasets at the terabyte and petabyte
scale.
The Hive Command-Line Interface (CLI)
• Hive’s installation comes packaged with a handy command-line
interface (CLI), which we will use to interact with Hive and run
our HQL statements.
• This will initiate the CLI and bootstrap the logger (if configured)
and Hive history file, and finally display a Hive CLI prompt:
• You can view the full list of Hive options for the CLI by using the
-H flag:
$ hive
hive>
$ hive -H
Creating a database
• Creating a database in Hive is very similar to creating a
database in a SQL-based RDBMS, by using the CREATE
DATABASE or CREATE SCHEMA statement:
• When Hive creates a new database, the schema definition data
is stored in the Hive metastore.
• Hive will raise an error if the database already exists in the
metastore; we can check for the existence of the database by
using IF NOT EXISTS:
• HQL: CREATE DATABASE IF NOT EXISTS flight_data;
Creating a database
• We can then run SHOW DATABASES to verify that our database has
been created. Hive will return all databases found in the
metastore, along with the default Hive database:
• HQL: SHOW DATABASES;
Creating tables
• Hive provides a SQL-like CREATE TABLE statement, which in its
simplest form takes a table name and column definitions:
• HQL: CREATE TABLE airlines (code INT,
description STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
• However, because Hive data is stored in the file system, usually
in HDFS or the local file system,
• the CREATE TABLE command also takes an optional ROW FORMAT
clause that tells Hive how to read each row in the file and map it
to our columns.
Loading data
• It’s important to note a key distinction between Hive and
traditional RDBMSs with regard to schema enforcement:
• Traditional relational databases enforce the schema on write,
rejecting any data that does not conform to the schema as
defined;
• Hive can only enforce the schema on read. If, when reading the
data file, the file structure does not match the defined
schema, Hive will generally return null values for missing fields
or type mismatches
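Schema-on-read can be mimicked in a few lines of Python. This is an analogy for the behavior described above, not Hive's actual implementation: the schema is applied only when a line is read, and missing or mismatched fields come back as None (null):

```python
def read_row(line, schema):
    """Apply a [(name, type), ...] schema to one tab-delimited line at read time."""
    fields = line.rstrip("\n").split("\t")
    row = {}
    for i, (name, cast) in enumerate(schema):
        try:
            row[name] = cast(fields[i])
        except (IndexError, ValueError):
            # missing field or type mismatch -> null, as Hive would return
            row[name] = None
    return row
```

A well-formed line yields every field; a malformed line simply yields None for the fields that cannot be read, rather than being rejected at load time.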
Loading data
• Data loading in Hive is done in batch-oriented fashion using a bulk LOAD
DATA command or by inserting results from another query with the
INSERT command.
• LOAD DATA is Hive’s bulk loading command. INPATH takes an argument
to a path on the default file system (in this case, HDFS).
• We can also specify a path on the local file system by using LOCAL
INPATH instead. Hive proceeds to move the file into the warehouse
location.
• If the OVERWRITE keyword is used, then any existing data in the target
table will be deleted and replaced by the data file input; otherwise, the
new data is added to the table.
Loading data
• Examples
• HQL: LOAD DATA LOCAL INPATH '/home/cloudera/
Downloads/flight_data/ontime_flights.tsv'
OVERWRITE INTO TABLE flights;
• HQL: LOAD DATA LOCAL INPATH '/home/cloudera/
Downloads/flight_data/airlines.tsv'
OVERWRITE INTO TABLE airlines;
• HQL: LOAD DATA LOCAL INPATH '/home/cloudera/
Downloads/flight_data/carriers.tsv'
OVERWRITE INTO TABLE carriers;
• HQL: LOAD DATA LOCAL INPATH '/home/cloudera/
Downloads/flight_data/cancellation_reasons.tsv'
OVERWRITE INTO TABLE cancellation_reasons;
Data Analysis with Hive
• Aggregations
• HQL:
SELECT airline_code,
       COUNT(1) AS num_flights,
       SUM(IF(depart_delay > 0, 1, 0)) AS num_depart_delays,
       SUM(IF(arrive_delay > 0, 1, 0)) AS num_arrive_delays,
       SUM(IF(is_cancelled, 1, 0)) AS num_cancelled
FROM flights
GROUP BY airline_code;
Data Analysis with Hive
• Aggregations
• HQL:
SELECT airline_code,
COUNT(1) AS num_flights,
SUM(IF(depart_delay > 0, 1, 0)) AS num_depart_delays,
ROUND(SUM(IF(depart_delay > 0, 1, 0))/COUNT(1), 2)
AS depart_delay_rate,
SUM(IF(arrive_delay > 0, 1, 0)) AS num_arrive_delays,
ROUND(SUM(IF(arrive_delay > 0, 1, 0))/COUNT(1), 2)
AS arrive_delay_rate,
SUM(IF(is_cancelled, 1, 0)) AS num_cancelled,
ROUND(SUM(IF(is_cancelled, 1, 0))/COUNT(1), 2)
AS cancellation_rate
FROM flights
GROUP BY airline_code
ORDER BY cancellation_rate DESC, arrive_delay_rate DESC,
         depart_delay_rate DESC;
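The SUM(IF(condition, 1, 0)) idiom above is simply a conditional count per group. The same aggregation can be sanity-checked in plain Python over a made-up sample; the record layout below is an assumption that mirrors the query's column names:

```python
from collections import defaultdict

# Tiny made-up sample mirroring the columns used in the HQL query.
flights = [
    {"airline_code": "AA", "depart_delay": 10, "arrive_delay": 5,  "is_cancelled": False},
    {"airline_code": "AA", "depart_delay": 0,  "arrive_delay": -3, "is_cancelled": True},
    {"airline_code": "DL", "depart_delay": -2, "arrive_delay": 0,  "is_cancelled": False},
]

stats = defaultdict(lambda: {"num_flights": 0, "num_depart_delays": 0,
                             "num_arrive_delays": 0, "num_cancelled": 0})
for f in flights:
    s = stats[f["airline_code"]]                    # GROUP BY airline_code
    s["num_flights"] += 1                           # COUNT(1)
    s["num_depart_delays"] += 1 if f["depart_delay"] > 0 else 0  # SUM(IF(depart_delay > 0, 1, 0))
    s["num_arrive_delays"] += 1 if f["arrive_delay"] > 0 else 0  # SUM(IF(arrive_delay > 0, 1, 0))
    s["num_cancelled"] += 1 if f["is_cancelled"] else 0          # SUM(IF(is_cancelled, 1, 0))

for s in stats.values():
    # ROUND(SUM(IF(is_cancelled, 1, 0)) / COUNT(1), 2)
    s["cancellation_rate"] = round(s["num_cancelled"] / s["num_flights"], 2)
```

Hive evaluates the same logic, but distributed across the cluster as a MapReduce job rather than in a single loop.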
Introduction to HBase
• While Hive provides a familiar data manipulation paradigm within
Hadoop, it doesn’t change the storage and processing paradigm,
which still utilizes HDFS and MapReduce in a batch-oriented fashion.
• Thus, for use cases that require random, real-time read/write access
to data, we need to look outside of standard MapReduce and Hive for
our data persistence and processing layer.
• The real-time applications need to record high volumes of time-based
events that tend to have many possible structural variations.
• The data may be keyed on a certain value, like User, but the value is
often represented as a collection of arbitrary metadata.
Introduction to HBase
• Consider, for example, two events, “Like” and “Share”, which require
different column values, as shown in the table.
• In a relational model, rows are sparse but columns are not. That is, upon
inserting a new row to a table, the database allocates storage for every column
regardless of whether a value exists for that field or not.
• However, in applications where data is represented as a collection of arbitrary
fields or sparse columns, each row may use only a subset of available columns,
which can make a standard relational schema both a wasteful and awkward fit.
Column-Oriented Databases
• NoSQL is a broad term that generally refers to non-relational
databases and encompasses a wide collection of data storage
models, including
• graph databases
• document databases
• key/value data stores
• column-family databases.
• HBase is classified as a column-family or column-oriented database,
modelled on Google’s Big Table architecture.
Column-Oriented Databases
• HBase organizes data into tables that contain rows. Within a
table, rows are identified by their unique row key, which does not
have a data type.
• Row keys are similar to primary keys in relational databases, in
that they are automatically indexed.
Column-Oriented Databases
• In HBase, table rows are sorted by their row key and because
row keys are byte arrays, almost anything can serve as a row
key from strings to binary representations of longs or even
serialized data structures.
• HBase stores its data as key/value pairs, where all table lookups
are performed via the table’s row key, the unique identifier of the
stored record data.
• Data within a row is grouped into column families, which consist
of related columns.
Column-Oriented Databases
• Storing data in columns rather than rows has particular benefits for
data warehouses and analytical databases where aggregates are
computed over large sets of data with potentially sparse values, where
not all column values are present.
• Another interesting feature of HBase and BigTable-based column-
oriented databases is that the table cells, or the intersection of row and
column coordinates, are versioned by timestamp.
• HBase is thus also described as being a multidimensional map where
time provides the third dimension
Column-Oriented Databases
• The time dimension is indexed in decreasing order, so that
when reading from an HBase store, the most recent values are
found first.
• The contents of a cell can be referenced by a
{rowkey, column, timestamp} tuple, or we can scan for a range of
cell values by time range.
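The {rowkey, column, timestamp} addressing can be modeled as a small Python class. This is a toy model of versioned cells with newest-first ordering, not how HBase actually stores data on disk:

```python
class VersionedTable:
    """Toy model: cells addressed by (rowkey, column), versioned by timestamp."""

    def __init__(self):
        # (rowkey, column) -> [(timestamp, value), ...] kept newest first
        self.cells = {}

    def put(self, rowkey, column, value, ts):
        versions = self.cells.setdefault((rowkey, column), [])
        versions.append((ts, value))
        # the time dimension is indexed in decreasing order
        versions.sort(key=lambda tv: tv[0], reverse=True)

    def get(self, rowkey, column, ts=None):
        # newest-first scan: the most recent value (<= ts, if given) wins
        for stamp, value in self.cells.get((rowkey, column), []):
            if ts is None or stamp <= ts:
                return value
        return None
```

Because versions are kept newest first, a plain get finds the most recent value immediately, and passing a timestamp reads the cell as of that point in time.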
Real-Time Analytics with HBase
• For the purposes of this HBase overview, we use the HBase shell to
design a schema for a linkshare tracker that tracks the number of
times a link has been shared.
• Generating a schema
• When designing schemas in HBase, it’s important to think in terms
of the column-family structure of the data model and how it affects
data access patterns.
• Furthermore, because HBase doesn’t support joins and provides
only a single indexed rowkey, we must be careful to ensure that the
schema can fully support all use cases.
Real-Time Analytics with HBase
• First, we need to declare the table name, and at least one
column-family name at the time of table definition.
• If no namespace is declared, HBase will use the default
namespace
• We just created a single table called linkshare in the default
namespace with one column-family, named link
• To alter the table after creation, such as changing or adding column
families, we need to first disable the table so that clients will not be able
to access the table during the alter operation:
hbase> create 'linkshare', 'link'
Real-Time Analytics with HBase
• Good row key design affects not only how we query the table, but the
performance and complexity of data access.
• By default, HBase stores rows in sorted order by row key, so that
similar keys are stored to the same RegionServer.
• Thus, in addition to enabling our data access use cases, we also need
to be mindful to account for row key distribution across regions.
• For the current example, let’s assume that we will use the unique
reversed link URL for the row key.
hbase> disable 'linkshare'
hbase> alter 'linkshare', 'statistics'
hbase> enable 'linkshare'
Real-Time Analytics with HBase
• In our linkshare application, we want to store descriptive data about
the link, such as its title, while maintaining a frequency counter that
tracks the number of times the link has been shared.
• We can insert, or put, a value in a cell at the specified table/row/
column and optionally timestamp coordinates.
• To put a cell value into table linkshare at row with row key
org.hbase.www under column-family link and column title marked with
the current timestamp
hbase> put 'linkshare', 'org.hbase.www', 'link:title', 'Apache HBase'
hbase> put 'linkshare', 'org.hadoop.www', 'link:title', 'Apache Hadoop'
hbase> put 'linkshare', 'com.oreilly.www', 'link:title', "O'Reilly.com"
Real-Time Analytics with HBase
• The put operation works great for inserting a value for a single cell, but for
incrementing frequency counters, HBase provides a special mechanism
to treat columns as counters.
• To increment a counter, we use the command incr instead of put.
• The last option passed is the increment value, which in this case is 1.
• Incrementing a counter will return the updated counter value, but you can
also access a counter’s current value any time using the get_counter
command, specifying the table name, row key, and column:
hbase> incr 'linkshare', 'org.hbase.www', 'statistics:share', 1
hbase> incr 'linkshare', 'org.hbase.www', 'statistics:like', 1
Real-Time Analytics with HBase
• HBase provides two general methods to retrieve data from a table:
• the get command performs lookups by row key to retrieve attributes
for a specific row,
• and the scan command, which takes a set of filter specifications and
iterates over multiple rows based on the indicated specifications.
• In its simplest form, the get command accepts the table name
followed by the row key, and returns the most recent version timestamp
and cell value for columns in the row.
hbase> incr 'linkshare', 'org.hbase.www', 'statistics:share', 1
hbase> get_counter 'linkshare', 'org.hbase.www', 'statistics:share'
hbase> get 'linkshare', 'org.hbase.www'
Real-Time Analytics with HBase
• The get command also accepts an optional dictionary of parameters to
specify the column(s), timestamp, timerange, and version of the cell
values we want to retrieve. For example, we can specify the column(s)
of interest
• A scan operation is akin to database cursors or iterators, and takes
advantage of the underlying sequentially sorted storage mechanism,
iterating through row data to match against the scanner specifications.
• With scan, we can scan an entire HBase table or specify a range of rows
to scan.
hbase> get 'linkshare', 'org.hbase.www', 'link:title'
hbase> get 'linkshare', 'org.hbase.www', 'link:title', 'statistics:share'
Real-Time Analytics with HBase
• You can specify an optional STARTROW and/or STOPROW parameter,
which can be used to limit the scan to a specific range of rows.
• If neither STARTROW nor STOPROW is provided, the scan operation
will scan through the entire table.
• You can, in fact, call scan with the table name to display all the
contents of a table.
hbase> scan 'linkshare'
hbase> scan 'linkshare', {COLUMNS => ['link:title'],
STARTROW => 'org.hbase.www'}
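Because rows are kept sorted by row key, a range scan is just an in-order traversal between two keys. A sketch of the idea in Python; the data and the scan function are illustrative, not HBase's API:

```python
def scan(rows, startrow=None, stoprow=None):
    """Yield (rowkey, row) pairs in sorted key order within [startrow, stoprow)."""
    for key in sorted(rows):
        if startrow is not None and key < startrow:
            continue
        if stoprow is not None and key >= stoprow:
            break  # sorted order lets us stop early, like a real scanner
        yield key, rows[key]

# Toy table keyed by reversed-domain row keys, as in the linkshare example.
rows = {
    "org.hbase.www":   {"link:title": "Apache HBase"},
    "org.hadoop.www":  {"link:title": "Apache Hadoop"},
    "com.oreilly.www": {"link:title": "O'Reilly"},
}
```

A scan with STARTROW 'org.' returns only the org.* rows, and omitting both bounds walks the whole table, mirroring the shell behavior described above.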
Introduction to Sqoop
• In cases where the input data is already structured because it
resides in a relational database, it is convenient to leverage this
known schema to import the data into Hadoop more efficiently than
uploading CSVs to HDFS and parsing them manually.
• Sqoop (SQL-to-Hadoop) is designed to transfer data between
relational database management systems (RDBMS) and Hadoop.
• It automates most of the data transfer process by reading the schema
information directly from the RDBMS.
• Sqoop then uses MapReduce to import and export the data to and
from Hadoop.
Introduction to Sqoop
• Sqoop gives us the flexibility to maintain our data in its production
state while copying it into Hadoop to make it available for further
analysis without modifying the production database.
• We’ll walk through a few ways to use Sqoop to import data from a
MySQL database into various Hadoop data stores, including HDFS,
Hive, and HBase.
• We will use MySQL as the source and target RDBMS for the examples
in this chapter, so we also assume that a MySQL database resides on
the same host as your Hadoop/Sqoop services and is accessible via
localhost and the default port, 3306.
Importing from MySQL to HDFS
• When importing data from relational databases like MySQL, Sqoop
reads the source database to gather the necessary metadata for the
data being imported.
• Sqoop then submits a map-only Hadoop job to transfer the actual table
data based on the metadata that was captured in the previous step.
• This job produces a set of serialized files, which may be delimited text
files, binary format, or SequenceFiles containing a copy of the imported
table or datasets.
• By default, the files are saved as comma-separated files to a directory
on HDFS with a name that corresponds to the source table name.
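A typical import invocation looks like the following; the database name, table, username, and target directory here are placeholders chosen to match the flight-data examples, not values from the slides:

```shell
# Import the MySQL table `flights` from the flight_data database into HDFS.
# Sqoop reads the table's schema via JDBC, then runs a map-only job whose
# output lands as comma-separated files under the target directory.
# -P prompts for the database password instead of putting it on the command line.
$ sqoop import \
    --connect jdbc:mysql://localhost:3306/flight_data \
    --username sqoop_user -P \
    --table flights \
    --target-dir /user/hduser/flights
```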