Introduction to Big Data Technologies
Eakasit Pacharawongsakda, Ph.D.
eakasit@datacubeth.ai
Data Cube / Quandatics
http://dataminingtrend.com http://facebook.com/datacube.th
Outline
• Part 1: Introduction to Big Data
• Part 2: Introduction to NoSQL
• Part 3: Introduction to MapReduce and Hadoop
• Part 4: Introduction to Hive, HBase and Sqoop
2
One working day
source:http://pad1.whstatic.com/images/thumb/a/aa/Reduce-Anxiety-About-Driving-if-You're-a-Teenager-Step-5-Version-2.jpg/
aid196018-728px-Reduce-Anxiety-About-Driving-if-You're-a-Teenager-Step-5-Version-2.jpg
07:00: leave home for work
source: http://www.clipartkid.com/images/259/research-and-report-writing-9-23-12-9-30-12-q2r0wg-clipart.jpg
07:45: still stuck in traffic
08:00: the boss calls to ask about work
source: https://d1ai9qtk9p41kl.cloudfront.net/assets/mc/psuderman/2011_07/text-drive.png
08:05: crash into another car
10:00: arrive at the office and continue working
source: http://stuffpoint.com/anime-and-manga/image/285181-anime-and-manga-girl-working-in-the-computer.jpg
18:00: stop to buy a few things on the way home
20:00: arrive home and spend the evening alone
One working day with Big Data technology
http://dataminingtrend.com http://facebook.com/datacube.th
Navigation systems
• The Waze application
12
http://dataminingtrend.com http://facebook.com/datacube.th
Navigation systems
• The Waze application
13
http://dataminingtrend.com http://facebook.com/datacube.th
Self-driving cars
• Waymo (Google self-driving car)
14
http://dataminingtrend.com http://facebook.com/datacube.th
Smart egg tray
• Egg Minder
15
http://dataminingtrend.com http://facebook.com/datacube.th
Shops with no checkout queues
• Amazon Go
16
http://dataminingtrend.com http://facebook.com/datacube.th
Technologies that make everyday life more convenient
17
http://dataminingtrend.com http://facebook.com/datacube.th
Why are women still single?
18
source: https://pishetshotisak.wordpress.com/2016/12/07/ทำไมผู้หญิงถึงขึ้นคาน-ค/
People tend to like big things
http://dataminingtrend.com http://facebook.com/datacube.th
Big Data & Analytics
• Big Bang
20
source:http://www.thetechy.com/science/exploring-universe-curiosity
http://dataminingtrend.com http://facebook.com/datacube.th
Big Data & Analytics
• Big Architecture (Great wall of China)
21
source: http://www.history.com/topics/great-wall-of-china
http://dataminingtrend.com http://facebook.com/datacube.th
Big Data & Analytics
• Big Data
22source: http://www.plmjim.com/?p=583
http://dataminingtrend.com http://facebook.com/datacube.th
Data Evolutions
23
source:Data Science and Big Data Analytics: Discovering, analyzing, visualizing and presenting data
http://dataminingtrend.com http://facebook.com/datacube.th
What is Big Data?
24
source: https://www.youtube.com/watch?v=TzxmjbL-i4Y
http://dataminingtrend.com http://facebook.com/datacube.th
What is Big Data?
25
source: http://www.intel.com/content/www/us/en/big-data/big-data-101-animation.html#
http://dataminingtrend.com http://facebook.com/datacube.th
What is Big Data?
• Big Data consists of the 3 Vs
• Volume
• the amount of data is growing enormously
• Velocity
• data is generated ever more rapidly
• Variety
• data comes in increasingly varied forms
26
source: https://upxacademy.com/beginners-guide-to-big-data/
http://dataminingtrend.com http://facebook.com/datacube.th
What is Big Data?
• Huge volume of data
• the data is extremely large, e.g. billions of rows or millions of columns
27
http://dataminingtrend.com http://facebook.com/datacube.th
Big Data: Volume
28
source:https://datafloq.com/read/infographic/226
http://dataminingtrend.com http://facebook.com/datacube.th
Big Data: Volume
29
source:https://www.adeptia.com
http://dataminingtrend.com http://facebook.com/datacube.th
What is Big Data?
• Huge volume of data
• the data is extremely large, e.g. billions of rows or millions of columns
• Speed of new data creation and growth
• new data is created very rapidly
30
http://dataminingtrend.com http://facebook.com/datacube.th
Big Data: Velocity
31
source: https://upxacademy.com/beginners-guide-to-big-data/
http://dataminingtrend.com http://facebook.com/datacube.th
What is Big Data?
• Huge volume of data
• the data is extremely large, e.g. billions of rows or millions of columns
• Speed of new data creation and growth
• new data is created very rapidly
• Complexity of data types and structures
• the data is varied and not limited to tables; it may be text, images, or video clips
32
http://dataminingtrend.com http://facebook.com/datacube.th
Big Data: Variety
33
source: https://upxacademy.com/beginners-guide-to-big-data/
http://dataminingtrend.com http://facebook.com/datacube.th
Big Data: Variety
34
source: https://upxacademy.com/beginners-guide-to-big-data/
http://dataminingtrend.com http://facebook.com/datacube.th
What is Big Data?
35
source: http://dataconomy.com/2014/08/infographic-how-to-explain-big-data-to-your-grandmother/
http://dataminingtrend.com http://facebook.com/datacube.th
Internet of Things
36source: http://www.postscapes.com/what-exactly-is-the-internet-of-things-infographic/
http://dataminingtrend.com http://facebook.com/datacube.th
Sensors
37source: http://www.postscapes.com/what-exactly-is-the-internet-of-things-infographic/
http://dataminingtrend.com http://facebook.com/datacube.th
IoT applications
38
http://dataminingtrend.com http://facebook.com/datacube.th
IoT applications
• Disney’s Magic Band
39
source:https://disneyworld.disney.go.com/plan/my-disney-experience/bands-cards/#?CMP=SEC-WDWShareEmailNGE-MDX-MagicBand-video&video=0/0/0/0
http://dataminingtrend.com http://facebook.com/datacube.th
IoT applications
• GlowCaps
40
source:http://www.vitality.net/glowcaps.html
http://dataminingtrend.com http://facebook.com/datacube.th
IoT applications
• Connected Toothbrush
41
source:https://www.youtube.com/watch?v=gLpUxDdh9iQ
http://dataminingtrend.com http://facebook.com/datacube.th
IoT applications
42
source:https://www.youtube.com/watch?v=TqRN7r7mGmk
http://dataminingtrend.com http://facebook.com/datacube.th
IoT applications
43
http://dataminingtrend.com http://facebook.com/datacube.th
IoT applications
• iBeacon
44
source: https://www.mallmaverick.com/system/site_images/photos/000/001/700/original/blog_ibeacon1.jpg?1391033561
http://dataminingtrend.com http://facebook.com/datacube.th
Outline
• Part 1: Introduction to Big Data
• Part 2: Introduction to NoSQL
• Part 3: Introduction to MapReduce and Hadoop
• Part 4: Introduction to Hive, HBase and Sqoop
45
http://dataminingtrend.com http://facebook.com/datacube.th
Relational database & SQL
• Databases are made up of tables and each table is made up of
rows and columns
• SQL is a database interaction language that allows you to add,
retrieve, edit and delete information stored in databases
46
ID Mark Code Title
S103 72 DBS Database Systems
S103 58 IAI Intro to AI
S104 68 PR1 Programming 1
S104 65 IAI Intro to AI
S106 43 PR2 Programming 2
S107 76 PR1 Programming 1
S107 60 PR2 Programming 2
S107 35 IAI Intro to AI
http://dataminingtrend.com http://facebook.com/datacube.th
Relational database & SQL
• SQL primarily works with two types of operations to query data
• Read consists of the SELECT command, which has three
common clauses
• SELECT
• FROM
• WHERE
47image source:https://justbablog.files.wordpress.com/2017/03/sql_beginners.jpg
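To make the three clauses concrete, here is a minimal Python sketch using the standard-library sqlite3 module to load the marks table shown above and query it. The table name results is an assumption for illustration; the slides do not name the table.

import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.execute("CREATE TABLE results (ID TEXT, Mark INTEGER, Code TEXT, Title TEXT)")
conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?)", [
    ("S103", 72, "DBS", "Database Systems"),
    ("S103", 58, "IAI", "Intro to AI"),
    ("S104", 68, "PR1", "Programming 1"),
    ("S104", 65, "IAI", "Intro to AI"),
    ("S106", 43, "PR2", "Programming 2"),
    ("S107", 76, "PR1", "Programming 1"),
    ("S107", 60, "PR2", "Programming 2"),
    ("S107", 35, "IAI", "Intro to AI"),
])

# SELECT = which columns, FROM = which table, WHERE = which rows
for row in conn.execute("SELECT Title, Mark FROM results WHERE ID = 'S103'"):
    print(row)   # ('Database Systems', 72) and ('Intro to AI', 58)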
http://dataminingtrend.com http://facebook.com/datacube.th
Relational database & SQL
48image source:https://justbablog.files.wordpress.com/2017/03/sql_beginners.jpg
http://dataminingtrend.com http://facebook.com/datacube.th
Why NoSQL?
• Relational databases have been the dominant type of database used for applications for decades.
• With the advent of the Web, however, the limitations of relational
databases became increasingly problematic.
• Companies such as Google, LinkedIn, Yahoo! and Amazon found that
supporting large numbers of users on the Web was different from
supporting much smaller numbers of business users.
49
http://dataminingtrend.com http://facebook.com/datacube.th
Why NoSQL?
50image source:https://www.slideshare.net/up1/introduction-to-nosql-61023856?qid=8519a104-f1d8-4955-a58b-a1eb61f61a8c
http://dataminingtrend.com http://facebook.com/datacube.th
Why NoSQL?
• Web application needed to support
• Large volumes of read and write operations
• Low latency response times
• High availability
• These requirements were difficult to realise using relational databases.
• There are limits to how many CPUs and how much memory can be supported in a single server.
• Another option is to use multiple servers with a relational database.
• However, operating a single RDBMS over multiple servers is a complex operation.
51
http://dataminingtrend.com http://facebook.com/datacube.th
Why NoSQL?
• NoSQL is “Not Only SQL”
• Four characteristics of data management for large-scale data
management tasks are
• Scalability
• Cost
• Flexibility
• Availability
52
http://dataminingtrend.com http://facebook.com/datacube.th
Why NoSQL?: Scalability
• Scalability is the ability to efficiently meet the needs for varying
workloads.
• For example, if there is a spike in traffic to a website, additional
servers can be brought online to handle the additional load.
• When the spike subsides and traffic returns to normal, some of
those additional servers can be shut down.
• Adding servers as needed is called scaling out.
53
http://dataminingtrend.com http://facebook.com/datacube.th
Why NoSQL?: Scalability
• Scaling Up
• Scaling Out
54
http://dataminingtrend.com http://facebook.com/datacube.th
Why NoSQL?: Scalability
• Scaling out is more flexible than scaling up.
• Servers can be added or removed as needed when scaling out.
• NoSQL databases are designed to utilise the servers available in a cluster with minimal intervention by database administrators.
55
http://dataminingtrend.com http://facebook.com/datacube.th
Why NoSQL?: Cost
• Commercial software vendors employ a variety of licensing
models that include charging by
• the size of the server running the RDBMS
• the number of concurrent users on the database
• the number of named users allowed to use the software
• The major NoSQL databases are available as open source, so they are free to use on as many servers, of whatever size, as needed.
56
http://dataminingtrend.com http://facebook.com/datacube.th
Why NoSQL?: Cost
57image source:https://www.slideshare.net/up1/introduction-to-nosql-61023856?qid=8519a104-f1d8-4955-a58b-a1eb61f61a8c
http://dataminingtrend.com http://facebook.com/datacube.th
Why NoSQL?: Flexibility
• Database designers expect to know at the start of a project all
the tables and columns that will be needed to support an
application.
• It is also commonly assumed that most of the columns in a table
will be needed by most of the rows.
• Unlike relational databases, some NoSQL databases do not
require a fixed table structure.
• For example, in a document database, a program could
dynamically add new attributes as needed without having to have a
database designer alter the database design.
58
http://dataminingtrend.com http://facebook.com/datacube.th
Why NoSQL?: Availability
• Many of us have come to expect websites and web applications
to be available whenever we want to use them.
• NoSQL databases are designed to take advantage of multiple,
low-cost servers.
• When one server fails or is taken out of service for maintenance,
the other servers in the cluster can take on the entire workload.
59
http://dataminingtrend.com http://facebook.com/datacube.th
Variety of NoSQL Databases
• There are 4 major types of NoSQL databases
• Key-Value databases
• Document databases
• Column-oriented databases
• Graph databases
60
http://dataminingtrend.com http://facebook.com/datacube.th
Key-Value databases
• Key-value databases are the simplest form of NoSQL
databases.
• These databases are modelled on two components: keys and values
• Data is stored as key-value pairs, where the attribute is the key and the content is the value
• Data can be queried and retrieved using the key only.
61
http://dataminingtrend.com http://facebook.com/datacube.th
Key-Value databases
• use cases
• caching data from
relational databases to
improve performance
• storing data from
sensors (IoT)
• software
• redis
• Amazon DynamoDB
62
Keys              Values
1.accountNumber   387694
1.Name            Jane Washington
1.numItems        3
1.custType        Loyalty Member
http://dataminingtrend.com http://facebook.com/datacube.th
Key-Value databases
• Redis example (http://try.redis.io)
• Set or update value against a key:
• SET university "DPU" // set string
• GET university // get string
• HSET student firstName "Manee" // Hash – set field value
• HGET student firstName // Hash – get field value
• LPUSH "alice:sales" "10" "20" // List create/append
• LSET "alice:sales" "0" "4" // List update
• LRANGE "alice:sales" 0 1 // view list
63
http://dataminingtrend.com http://facebook.com/datacube.th
Key-Value databases
• Set or update value against a key:
• SET quantities 1
• INCR quantities
• SADD "alice:friends" "f1" "f2" //Set – create/
update
• SADD "bob:friends" "f2" "f1" //Set – create/update
• Set operations:
• intersection
• SINTER "alice:friends" "bob:friends"
• union
• SUNION "alice:friends" “bob:friends"
64
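The same operations can be scripted from Python with the redis-py client. The sketch below is a minimal example, assuming a Redis server is reachable on localhost:6379; host, port, and the key names mirror the redis-cli commands above and are otherwise illustrative.

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

r.set("university", "DPU")                    # string set
print(r.get("university"))                    # -> DPU

r.hset("student", "firstName", "Manee")       # hash field set
print(r.hget("student", "firstName"))         # -> Manee

r.lpush("alice:sales", "10", "20")            # list create/append
print(r.lrange("alice:sales", 0, 1))          # view list

r.sadd("alice:friends", "f1", "f2")           # sets
r.sadd("bob:friends", "f2", "f1")
print(r.sinter("alice:friends", "bob:friends"))   # intersection
print(r.sunion("alice:friends", "bob:friends"))   # union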
http://dataminingtrend.com http://facebook.com/datacube.th
Variety of NoSQL Databases
• There are 4 major types of NoSQL databases
• Key-Value databases
• Document databases
• Column-oriented databases
• Graph databases
65
http://dataminingtrend.com http://facebook.com/datacube.th
Document Databases
• A document store allows inserting, retrieving, and manipulating semi-structured data.
• Compared to an RDBMS, the documents themselves act as the records (or rows), but they are semi-structured rather than bound to a rigid schema.
• A document store can hold documents that have different sets of fields (columns).
• Most databases in this category store documents as XML or JSON.
66
http://dataminingtrend.com http://facebook.com/datacube.th
Document Databases
• Document examples
67
{
  "EmployeeID" : "SM1",
  "FirstName" : "Anuj",
  "LastName" : "Sharma",
  "Age" : 45,
  "Salary" : 10000000
}
{
  "EmployeeID" : "MM2",
  "FirstName" : "Anand",
  "Age" : 34,
  "Salary" : 5000000,
  "Address" : {
    "Line1" : "123, 4th Street",
    "City" : "Bangalore",
    "State" : "Karnataka"
  },
  "Projects" : [
    "nosql-migration",
    "top-secret-007"
  ]
}
http://dataminingtrend.com http://facebook.com/datacube.th
Document Databases
• Use cases
• back-end support for websites with high volumes of reads and
writes
• applications that use JSON data structures such as twitter data
• Software
• MongoDB
• Couchbase
• IBM Cloudant
68
http://dataminingtrend.com http://facebook.com/datacube.th
Document Databases
• MongoDB examples
• Download MongoDB from https://www.mongodb.com/download-
center?jmp=nav#community
• MongoDB’s default data directory path is the absolute path \data\db on the drive from which you start MongoDB
• You can specify an alternate path for data files using the --dbpath option to mongod.exe
• Import example data
69
"C:Program FilesMongoDBServer3.4binmongod.exe"
--dbpath d:testmongodbdata
mongoimport --db test --collection restaurants --drop --file
downloads/primer-dataset.json
http://dataminingtrend.com http://facebook.com/datacube.th
Document Databases
• MongoDB examples
• Download and install Robomongo (https://robomongo.org/
download)
70
http://dataminingtrend.com http://facebook.com/datacube.th
Document Databases
• MongoDB examples
• Find bakery shops
• Find restaurants on “Morris Park Ave” street
• Find restaurants whose zip code starts with 100
• Find bakery shops on “Morris Park Ave” street (queries sketched below)
71
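The query text is not included on these slides; a minimal pymongo sketch of the same lookups is given below. It assumes the restaurants collection imported into the test database earlier and the field names of MongoDB's primer dataset (cuisine, address.street, address.zipcode, name), so treat it as illustrative rather than exact.

from pymongo import MongoClient

db = MongoClient("localhost", 27017).test   # the "test" database used by mongoimport above

bakeries = db.restaurants.find({"cuisine": "Bakery"})
on_morris_park = db.restaurants.find({"address.street": "Morris Park Ave"})
zip_starts_100 = db.restaurants.find({"address.zipcode": {"$regex": "^100"}})
bakeries_on_street = db.restaurants.find({"cuisine": "Bakery",
                                          "address.street": "Morris Park Ave"})

for doc in bakeries_on_street.limit(5):
    print(doc["name"])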
http://dataminingtrend.com http://facebook.com/datacube.th
Document Databases
• MongoDB examples
• Find bakery shops and show their grades
• Find bakery shops and show their cuisine and grades (projection sketched below)
• More examples, please visit https://docs.mongodb.com/getting-
started/shell/query/
72
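Continuing the pymongo sketch above, projections restrict which fields are returned; the field names are again those of the primer dataset and are assumptions.

# second argument to find() is the projection: 1 = include a field, 0 = exclude it
bakery_grades = db.restaurants.find({"cuisine": "Bakery"},
                                    {"name": 1, "grades": 1, "_id": 0})
bakery_cuisine_grades = db.restaurants.find({"cuisine": "Bakery"},
                                            {"name": 1, "cuisine": 1, "grades": 1, "_id": 0})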
http://dataminingtrend.com http://facebook.com/datacube.th
Variety of NoSQL Databases
• There are 4 major types of NoSQL databases
• Key-Value databases
• Document databases
• Column-oriented databases
• Graph databases
73
http://dataminingtrend.com http://facebook.com/datacube.th
Column-oriented databases
• Stores data as columns, as opposed to the row-by-row storage typical of an RDBMS
• A relational database presents data as two-dimensional tables comprising rows and columns, but it stores, retrieves, and processes the data one row at a time
• A column-oriented database stores each column contiguously, i.e. on disk or in memory each column is kept in sequential blocks (see the toy sketch below)
74
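As a toy illustration of the difference (the sample records are invented, and this only shows the access pattern, not how a real engine lays out pages): a row store keeps whole records together, while a column store keeps each column in its own contiguous array, so an aggregate such as sum(price) only needs to touch that one array.

# Row-oriented: one tuple per record; summing prices touches every whole row
rows = [
    ("A01", "Coat",  450),
    ("A02", "Hat",   120),
    ("A03", "Shirt", 300),
]
total_row_store = sum(price for (_id, _name, price) in rows)

# Column-oriented: one array per column; summing prices reads only the price column
ids    = ["A01", "A02", "A03"]
names  = ["Coat", "Hat", "Shirt"]
prices = [450, 120, 300]
total_column_store = sum(prices)

print(total_row_store, total_column_store)   # 870 870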
http://dataminingtrend.com http://facebook.com/datacube.th
Column-oriented databases
• Example table
75image source: http://saphanatutorial.com/column-data-storage-and-row-data-storage-sap-hana/
http://dataminingtrend.com http://facebook.com/datacube.th
Column-oriented databases
• Advantages of column-based tables:
• Faster Data Access:
• Only affected columns have to be read during the selection
process of a query. Any of the columns can serve as an index.
• Better Compression:
• Columnar data storage allows highly efficient compression because most columns contain only a few distinct values (compared to the number of rows).
76
http://dataminingtrend.com http://facebook.com/datacube.th
Column-oriented databases
• Advantages of column-based tables:
• Better parallel Processing:
• In a column store, data is already vertically partitioned. This
means that operations on different columns can easily be
processed in parallel.
• If multiple columns need to be searched or aggregated, each of
these operations can be assigned to a different processor core.
77
http://dataminingtrend.com http://facebook.com/datacube.th
Column-oriented databases
• For analytic applications, where aggregations and fast search and processing are required, row-based storage is not a good fit.
• In row-based tables, all the data stored in a row has to be read even when only a few columns are needed.
• Hence, such queries over huge amounts of data take a long time.
• In columnar tables, the values of a column are stored physically next to each other, which significantly speeds up certain data queries.
78
http://dataminingtrend.com http://facebook.com/datacube.th
Column-oriented databases
• Column storage is most useful for OLAP queries (queries using SQL aggregate functions), because these queries read just a few attributes from every data entry.
• For traditional OLTP queries (queries not using aggregate functions), it is more advantageous to store all attributes side by side in row tables.
79
http://dataminingtrend.com http://facebook.com/datacube.th
Column-oriented databases
80image source: http://saphanatutorial.com/column-data-storage-and-row-data-storage-sap-hana/
http://dataminingtrend.com http://facebook.com/datacube.th
Column-oriented databases
81image source: http://saphanatutorial.com/column-data-storage-and-row-data-storage-sap-hana/
http://dataminingtrend.com http://facebook.com/datacube.th
Column-oriented databases
82
Operation                                                    Column-oriented   Row-oriented
Aggregate calculation of a single column, e.g. sum(price)    Fast              Slow
Compression                                                  Higher            -
Retrieval of a few columns from a table with many columns    Fast              Slow
Insertion/updating of a single new record                    Slow              Fast
Retrieval of a single record                                 Slow              Fast
http://dataminingtrend.com http://facebook.com/datacube.th
Column-oriented databases
• Use cases
• OLAP
• Data Analytics
• Software
• Cassandra
• Hbase (Hadoop)
• Google BigTable
• SAP HANA
83
http://dataminingtrend.com http://facebook.com/datacube.th
Variety of NoSQL Databases
• There are 4 major types of NoSQL databases
• Key-Value databases
• Document databases
• Column-oriented databases
• Graph databases
84
http://dataminingtrend.com http://facebook.com/datacube.th
Graph databases
• Graph databases are the most specialized of the 4 NoSQL databases.
• Instead of modelling data using columns and rows, a graph database uses
structures called nodes and relationships.
• In more formal discussions, they are called vertices and edges.
• A node is an object that has an identifier and a set of attributes.
• A relationship is a link between two nodes that contains attributes about that relation.
• Graph databases are designed to model adjacency between objects. Every
node in the database contains pointers to adjacent objects in the database.
• This allows for fast operations that require following paths through a graph (sketched below).
85
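A minimal Python sketch of this idea, reusing the small friends network that appears later in the MapReduce examples: every node keeps pointers (an adjacency set) to its neighbours, so following a path is a chain of direct lookups rather than joins. The breadth-first search below only illustrates path following; it is not a graph database API.

from collections import deque

# adjacency list: node -> set of directly connected nodes ("relationships")
graph = {
    "Allen": {"Betty", "Chris", "David"},
    "Betty": {"Allen", "Chris", "David", "Ellen"},
    "Chris": {"Allen", "Betty", "David", "Ellen"},
    "David": {"Allen", "Betty", "Chris", "Ellen"},
    "Ellen": {"Betty", "Chris", "David"},
}

def shortest_path(start, goal):
    # breadth-first search: repeatedly follow each node's pointers to its neighbours
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in graph[path[-1]] - seen:
            seen.add(neighbour)
            queue.append(path + [neighbour])

print(shortest_path("Allen", "Ellen"))   # e.g. ['Allen', 'Betty', 'Ellen']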
http://dataminingtrend.com http://facebook.com/datacube.th
Graph databases
• Example
86image source: NoSQL for Mere Mortals, Dan Sullivan, 2015
http://dataminingtrend.com http://facebook.com/datacube.th
Graph databases
• Example
87image source: NoSQL for Mere Mortals, Dan Sullivan, 2015
http://dataminingtrend.com http://facebook.com/datacube.th
Outline
• Part 1: Introduction to Big Data
• Part 2: Introduction to NoSQL
• Part 3: Introduction to MapReduce and Hadoop
• Part 4: Introduction to Hive, HBase and Sqoop
88
http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop architecture
• Hadoop is composed of two primary components that
implement the basic concepts of distributed storage and
computation: HDFS and YARN
• HDFS (sometimes shortened to DFS) is the Hadoop Distributed
File System, responsible for managing data stored on disks
across the cluster.
• YARN acts as a cluster resource manager, allocating
computational assets (processing availability and memory on
worker nodes) to applications that wish to perform a distributed
computation.
89
http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop architecture
90
Image source: "Data Analytics with Hadoop: An Introduction for Data Scientists", Benjamin Bengfort and Jenny Kim, 2016
http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop architecture
• HDFS and YARN work in concert to minimize the amount of
network traffic in the cluster primarily by ensuring that data is
local to the required computation.
• A set of machines that is running HDFS and YARN is known as a
cluster, and the individual machines are called nodes.
• A cluster can have a single node, or many thousands of nodes,
but all clusters scale horizontally, meaning as you add more
nodes, the cluster increases in both capacity and performance
in a linear fashion.
91
http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop architecture
• Each node in the cluster is identified by the type of process that
it runs:
• Master nodes
• These nodes run coordinating services for Hadoop workers and
are usually the entry points for user access to the cluster.
• Worker nodes
• Worker nodes run services that accept tasks from master nodes
either to store or retrieve data or to run a particular application.
• A distributed computation is run by parallelizing the analysis
across worker nodes.
92
http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop architecture
• For HDFS, the master and worker services are as follows:
• NameNode (Master)
• Stores the directory tree of the file system, file metadata, and the
location of each file in the cluster.
• Clients wanting to access HDFS must first locate the appropriate
storage nodes by requesting information from the NameNode.
• DataNode (Worker)
• Stores and manages HDFS blocks on the local disk.
• Reports health and status of individual data stores back to the
NameNode
93
http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop architecture
• An HDFS cluster with a replication factor of two; the NameNode
contains the mapping of files to blocks, and the DataNodes
store the blocks and their replicas
94
Image source: "Hadoop with Python", Zachary Radtka and Donald Miner, 2016
http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop architecture
• When data is accessed from HDFS
• a client application must first make a request to the NameNode to
locate the data on disk.
• The NameNode will reply with a list of DataNodes that store the
data.
• the client must then directly request each block of data from the
DataNode.
95
http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop architecture
• YARN has multiple master services and a worker service as
follows:
• ResourceManager (Master)
• Allocates and monitors available cluster resources (e.g.,
physical assets like memory and processor cores)
• handling scheduling of jobs on the cluster
• ApplicationMaster (Master)
• Coordinates a particular application being run on the cluster as
scheduled by the ResourceManager
96
http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop architecture
• YARN has multiple master services and a worker service as
follows:
• NodeManager (Worker)
• Runs and manages processing tasks on an individual node as
well as reports the health and status of tasks as they’re running
97
http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop architecture
• A small Hadoop cluster with two master nodes and four worker nodes that implements all six primary Hadoop services
98
Image source: "Data Analytics with Hadoop: An Introduction for Data Scientists", Benjamin Bengfort and Jenny Kim, 2016
http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop architecture
• Clients that wish to execute a job
• must first request resources from the ResourceManager, which
assigns an application-specific ApplicationMaster for the duration
of the job.
• the ApplicationMaster tracks the execution of the job.
• the ResourceManager tracks the status of the nodes
• each individual NodeManager creates containers and executes
tasks within them
99
http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop architecture
• Finally, one other type of cluster is important to note: a single node
cluster.
• In “pseudo-distributed mode” a single machine runs all Hadoop
daemons as though it were part of a cluster, but network traffic occurs
through the local loopback network interface.
• Hadoop developers typically work in a pseudo-distributed environment,
usually inside of a virtual machine to which they connect via SSH.
• Cloudera, Hortonworks, and other popular distributions of Hadoop
provide pre-built virtual machine images that you can download and
get started with right away.
100
http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop Distributed File System (HDFS)
• HDFS provides redundant storage for big data by storing that
data across a cluster of cheap, unreliable computers, thus
extending the amount of available storage capacity that a single
machine alone might have.
• HDFS performs best with a modest number of very large files
• millions of large files (100 MB or more) rather than billions of smaller files that might occupy the same volume.
• It is not a good fit as a data backend for applications that require
updates in real-time, interactive data analysis, or record-based
transactional support.
101
http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop Distributed File System (HDFS)
• HDFS files are split into blocks, usually of either 64MB or
128MB.
• Blocks allow very large files to be split across and distributed to
many machines at run time.
• Additionally, blocks are replicated across the DataNodes.
• By default, the replication is threefold.
• Therefore, each block exists on three different machines and three different disks, and even if two nodes fail, the data will not be lost.
102
http://dataminingtrend.com http://facebook.com/datacube.th
Interacting with HDFS
• Interacting with HDFS is primarily performed from the command line using the hadoop fs command, which has the following usage:
• The -option argument is the name of a specific option for the specified command, and <arg> is one or more arguments specified for this option.
• For example, show help
103
$ hadoop fs [-option <arg>]
$ hadoop fs -help
http://dataminingtrend.com http://facebook.com/datacube.th
Interacting with HDFS
• List directory contents
• use -ls command:
• Running the -ls command on a new cluster will not return any
results. This is because the -ls command, without any
arguments, will attempt to display the contents of the user’s
home directory on HDFS.
• Providing -ls with the forward slash (/) as an argument displays the
contents of the root of HDFS:
104
$ hadoop fs -ls
$ hadoop fs -ls /
http://dataminingtrend.com http://facebook.com/datacube.th
Interacting with HDFS
• Creating a directory
• To create the books directory within HDFS, use the -mkdir
command:
• For example, create books directory in home directory
• Use the -ls command to verify that the previous directories were
created:
105
$ hadoop fs -mkdir [directory name]
$ hadoop fs -mkdir books
$ hadoop fs -ls
http://dataminingtrend.com http://facebook.com/datacube.th
Interacting with HDFS
• Copy Data onto HDFS
• After a directory has been created for the current user, data can
be uploaded to the user’s HDFS home directory with the -put
command:
• For example, copy book file from local to HDFS
• Use the -ls command to verify that pg20417.txt was moved to
HDFS:
106
$ hadoop fs -put [source file] [destination file]
$ hadoop fs -put pg20417.txt books/pg20417.txt
$ hadoop fs -ls books
http://dataminingtrend.com http://facebook.com/datacube.th
Interacting with HDFS
• Retrieve (view) Data from HDFS
• Multiple commands allow data to be retrieved from HDFS.
• To simply view the contents of a file, use the -cat command. -cat
reads a file on HDFS and displays its contents to stdout.
• The following command uses -cat to display the contents of
pg20417.txt
•
107
$ hadoop fs -cat books/pg20417.txt
http://dataminingtrend.com http://facebook.com/datacube.th
Interacting with HDFS
• Retrieve (view) Data from HDFS
• Data can also be copied from HDFS to the local filesystem using
the -get command. The -get command is the opposite of the -put
command:
• For example, This command copies pg20417.txt from HDFS to the
local filesystem.
108
$ hadoop fs -get [source file] [destination file]
$ hadoop fs -get pg20417.txt .
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce
• MapReduce is a programming model that enables large volumes of data
to be processed and generated by dividing work into independent tasks
and executing the tasks in parallel across a cluster of machines.
• At a high level, every MapReduce program transforms a list of input data
elements into a list of output data elements twice, once in the map phase
and once in the reduce phase.
• The MapReduce framework is composed of three major phases: map,
shuffle and sort, and reduce.
109
Image source: "Data Analytics with Hadoop: An Introduction for Data Scientists", Benjamin Bengfort and Jenny Kim, 2016
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce
• Map
• The first phase of a MapReduce application is the map phase.
Within the map phase, a function (called the mapper) processes a
series of key-value pairs.
• The mapper sequentially processes each key-value pair
individually, producing zero or more output key-value pairs
• As an example, consider a mapper whose purpose is to transform
sentences into words.
110
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce
• Map
• The input to this mapper would be strings that contain sentences,
and the mapper’s function would be to split the sentences into
words and output the words
111
Image source: "Hadoop with Python", Zachary Radtka and Donald Miner, 2016
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce
• Shuffle and Sort
• As the mappers begin completing, the intermediate outputs from
the map phase are moved to the reducers. This process of moving
output from the mappers to the reducers is known as shuffling.
• Shuffling is handled by a partition function, known as the
partitioner. The partitioner ensures that all of the values for the
same key are sent to the same reducer.
• The intermediate keys and values for each partition are sorted by
the Hadoop framework before being presented to the reducer.
112
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce
• Reduce
• Within the reducer phase, an iterator of values is provided to a
function known as the reducer. The iterator of values is a nonunique
set of values for each unique key from the output of the map phase.
• The reducer aggregates the values for each unique key and
produces zero or more output key-value pairs
• As an example, consider a reducer whose purpose is to sum all of
the values for a key. The input to this reducer is an iterator of all of
the values for a key, and the reducer sums all of the values.
113
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce
• Reduce
• The reducer then outputs a key-value pair that contains the input
key and the sum of the input key values
114
Image source: "Hadoop with Python", Zachary Radtka and Donald Miner, 2016
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce
• Data flow of a MapReduce job being executed on a cluster of a
few nodes
115
Image source: "Data Analytics with Hadoop: An Introduction for Data Scientists", Benjamin Bengfort and Jenny Kim, 2016
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples
• In order to demonstrate how data flows through a map and
reduce computational pipeline, we will present 3 examples
• word counting
• IoT data
• shared friendship
116
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• The word-counting application takes as input one or more text
files and produces a list of word and their frequencies as output.
117
Image source: "Data Analytics with Hadoop: An Introduction for Data Scientists", Benjamin Bengfort and Jenny Kim, 2016
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• Because Hadoop utilizes key/value pairs, the input key is a file ID and line number, and the input value is a string, while the output key is a word and the output value is an integer.
• The following Python pseudocode shows how this algorithm is
implemented:
118
# emit is a function that performs Hadoop I/O
def map(dockey, line):
    for word in line.split():
        emit(word, 1)

def reduce(word, values):
    count = sum(value for value in values)
    emit(word, count)
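Before walking through the data flow slide by slide, the same pipeline can be simulated end to end in plain Python on the two sample sentences used below; the in-memory dictionary grouping stands in for Hadoop's shuffle and sort across the cluster.

import re
from collections import defaultdict

inputs = [(27183, "The fast cat wears no hat."),
          (31416, "The cat in the hat ran fast.")]

# Map: each (doc key, line) produces (token, 1) pairs;
# the final "." is kept as its own token, as in the slides that follow
mapped = [(token, 1)
          for _, line in inputs
          for token in re.findall(r"[A-Za-z]+|\.", line)]

# Shuffle & sort: group all values that share a key
groups = defaultdict(list)
for token, one in mapped:
    groups[token].append(one)

# Reduce: sum the grouped values for each key
counts = {token: sum(values) for token, values in sorted(groups.items())}
print(counts)
# {'.': 2, 'The': 2, 'cat': 2, 'fast': 2, 'hat': 2,
#  'in': 1, 'no': 1, 'ran': 1, 'the': 1, 'wears': 1}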
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• Example

(Map)
119
(27183, “The fast cat wears no hat.”)
(31416, “The cat in the hat ran fast.”)
input
Mapper 1 Mapper 2
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• Example

(Map)
120
(27183, “The fast cat wears no hat.”)
(31416, “The cat in the hat ran fast.”)
input
Mapper 1 Mapper 2
(27183, “The fast cat wears no hat.”) (31416, “The cat in the hat ran fast.”)
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• Example

(Map)
121
(“The”,1) (“The”,1)
input
Mapper 1 Mapper 2
(27183, “The fast cat wears no hat.”)
(31416, “The cat in the hat ran fast.”)
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• Example

(Map)
122
(27183, “The fast cat wears no hat.”)
(31416, “The cat in the hat ran fast.”)
(“The”,1) (“The”,1)(“fast”,1) (“cat”,1)
input
Mapper 1 Mapper 2
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• Example

(Map)
123
(27183, “The fast cat wears no hat.”)
(31416, “The cat in the hat ran fast.”)
(“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1)
input
Mapper 1 Mapper 2
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• Example

(Map)
124
(27183, “The fast cat wears no hat.”)
(31416, “The cat in the hat ran fast.”)
(“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1)
(“wears”,1) (“the”,1)
input
Mapper 1 Mapper 2
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• Example

(Map)
125
(27183, “The fast cat wears no hat.”)
(31416, “The cat in the hat ran fast.”)
(“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1)
(“wears”,1) (“the”,1)(“no”,1) (“hat”,1)
input
Mapper 1 Mapper 2
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• Example

(Map)
126
(27183, “The fast cat wears no hat.”)
(31416, “The cat in the hat ran fast.”)
(“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1)
(“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1) (“ran”,1)
input
Mapper 1 Mapper 2
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• Example

(Map)
127
(27183, “The fast cat wears no hat.”)
(31416, “The cat in the hat ran fast.”)
(“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1)
(“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1)
(“.”,1)
(“ran”,1)
(“fast”,1)
input
Mapper 1 Mapper 2
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• Example 

(Map)
128
(27183, “The fast cat wears no hat.”)
(31416, “The cat in the hat ran fast.”)
(“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1)
(“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1)
(“.”,1)
(“ran”,1)
(“fast”,1) (“.”,1)
input
Mapper 1 Mapper 2
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• Example (Shuffle & Sort)
129
(“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1)
(“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1)
(“.”,1)
(“ran”,1)
(“fast”,1) (“.”,1)
(“.”,1) (“.”,1)
Mapper 1 Mapper 2
Shuffle & Sort
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• Example (Shuffle & Sort)
130
(“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1)
(“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1)
(“.”,1)
(“ran”,1)
(“fast”,1) (“.”,1)
(“.”,1) (“.”,1)
(“cat”,1) (“cat”,1)
Mapper 1 Mapper 2
Shuffle & Sort
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• Example (Shuffle & Sort)
131
(“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1)
(“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1)
(“.”,1)
(“ran”,1)
(“fast”,1) (“.”,1)
(“.”,1) (“.”,1)
(“cat”,1) (“cat”,1)
(“fast”,1) (“fast”,1)
Mapper 1 Mapper 2
Shuffle & Sort
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• Example (Shuffle & Sort)
132
(“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1)
(“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1)
(“.”,1)
(“ran”,1)
(“fast”,1) (“.”,1)
(“.”,1) (“.”,1)
(“cat”,1) (“cat”,1)
(“fast”,1) (“fast”,1)
(“hat”,1) (“hat”,1)
Mapper 1 Mapper 2
Shuffle & Sort
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• Example (Shuffle & Sort)
133
(“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1)
(“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1)
(“.”,1)
(“ran”,1)
(“fast”,1) (“.”,1)
(“.”,1) (“.”,1)
(“cat”,1) (“cat”,1)
(“fast”,1) (“fast”,1)
(“hat”,1) (“hat”,1)
(“in”,1)
Mapper 1 Mapper 2
Shuffle & Sort
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• Example (Shuffle & Sort)
134
(“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1)
(“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1)
(“.”,1)
(“ran”,1)
(“fast”,1) (“.”,1)
(“.”,1) (“.”,1)
(“cat”,1) (“cat”,1)
(“fast”,1) (“fast”,1)
(“hat”,1) (“hat”,1)
(“in”,1)
(“no”,1)
Mapper 1 Mapper 2
Shuffle & Sort
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• Example (Shuffle & Sort)
135
(“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1)
(“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1)
(“.”,1)
(“ran”,1)
(“fast”,1) (“.”,1)
(“.”,1) (“.”,1)
(“cat”,1) (“cat”,1)
(“fast”,1) (“fast”,1)
(“hat”,1) (“hat”,1)
(“in”,1)
(“no”,1)
(“ran”,1)
Mapper 1 Mapper 2
Shuffle & Sort
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• Example (Shuffle & Sort)
136
(“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1)
(“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1)
(“.”,1)
(“ran”,1)
(“fast”,1) (“.”,1)
(“.”,1) (“.”,1)
(“cat”,1) (“cat”,1)
(“fast”,1) (“fast”,1)
(“hat”,1) (“hat”,1)
(“in”,1)
(“no”,1)
(“ran”,1)
(“the”,1)
Mapper 1 Mapper 2
Shuffle & Sort
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• Example (Shuffle & Sort)
137
(“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1)
(“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1)
(“.”,1)
(“ran”,1)
(“fast”,1) (“.”,1)
(“.”,1) (“.”,1)
(“cat”,1) (“cat”,1)
(“fast”,1) (“fast”,1)
(“hat”,1) (“hat”,1)
(“in”,1)
(“no”,1)
(“ran”,1)
(“the”,1)
(“wears”,1)
Mapper 1 Mapper 2
Shuffle & Sort
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• Example (Shuffle & Sort)
138
Mapper 1
(“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1)
(“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1)
(“.”,1)
(“ran”,1)
(“fast”,1) (“.”,1)
(“.”,1) (“.”,1)
(“cat”,1) (“cat”,1)
(“fast”,1) (“fast”,1)
(“hat”,1) (“hat”,1)
(“in”,1)
(“no”,1)
(“ran”,1)
(“the”,1)
(“wears”,1)
(“The”,1) (“The”,1)
Mapper 2
Shuffle & Sort
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• Example (Reduce)
139
(“.”,1) (“.”,1)
(“cat”,1) (“cat”,1)
(“fast”,1) (“fast”,1)
(“hat”,1) (“hat”,1)
(“in”,1)
(“no”,1)
(“ran”,1)
(“the”,1)
(“wears”,1)
(“The”,1) (“The”,1)
Shuffle & Sort
Reduce
(“.”,2)
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• Example (Reduce)
140
(“.”,1) (“.”,1)
(“cat”,1) (“cat”,1)
(“fast”,1) (“fast”,1)
(“hat”,1) (“hat”,1)
(“in”,1)
(“no”,1)
(“ran”,1)
(“the”,1)
(“wears”,1)
(“The”,1) (“The”,1)
Shuffle & Sort
Reduce
(“.”,2)
(“cat”,2)
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• Example (Reduce)
141
(“.”,1) (“.”,1)
(“cat”,1) (“cat”,1)
(“fast”,1) (“fast”,1)
(“hat”,1) (“hat”,1)
(“in”,1)
(“no”,1)
(“ran”,1)
(“the”,1)
(“wears”,1)
(“The”,1) (“The”,1)
Shuffle & Sort
Reduce
(“.”,2)
(“cat”,2)
(“fast”,2)
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• Example (Reduce)
142
(“.”,1) (“.”,1)
(“cat”,1) (“cat”,1)
(“fast”,1) (“fast”,1)
(“hat”,1) (“hat”,1)
(“in”,1)
(“no”,1)
(“ran”,1)
(“the”,1)
(“wears”,1)
(“The”,1) (“The”,1)
Shuffle & Sort
Reduce
(“.”,2)
(“cat”,2)
(“fast”,2)
(“hat”,2)
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• Example (Reduce)
143
(“.”,1) (“.”,1)
(“cat”,1) (“cat”,1)
(“fast”,1) (“fast”,1)
(“hat”,1) (“hat”,1)
(“in”,1)
(“no”,1)
(“ran”,1)
(“the”,1)
(“wears”,1)
(“The”,1) (“The”,1)
Shuffle & Sort
Reduce
(“.”,2)
(“cat”,2)
(“fast”,2)
(“hat”,2)
(“in”,1)
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• Example (Reduce)
144
(“.”,1) (“.”,1)
(“cat”,1) (“cat”,1)
(“fast”,1) (“fast”,1)
(“hat”,1) (“hat”,1)
(“in”,1)
(“no”,1)
(“ran”,1)
(“the”,1)
(“wears”,1)
(“The”,1) (“The”,1)
Shuffle & Sort
Reduce
(“.”,2)
(“cat”,2)
(“fast”,2)
(“hat”,2)
(“in”,1)
(“no”,1)
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• Example (Reduce)
145
(“.”,1) (“.”,1)
(“cat”,1) (“cat”,1)
(“fast”,1) (“fast”,1)
(“hat”,1) (“hat”,1)
(“in”,1)
(“no”,1)
(“ran”,1)
(“the”,1)
(“wears”,1)
(“The”,1) (“The”,1)
Shuffle & Sort
Reduce
(“.”,2)
(“cat”,2)
(“fast”,2)
(“hat”,2)
(“in”,1)
(“no”,1)
(“ran”,1)
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• Example (Reduce)
146
(“.”,1) (“.”,1)
(“cat”,1) (“cat”,1)
(“fast”,1) (“fast”,1)
(“hat”,1) (“hat”,1)
(“in”,1)
(“no”,1)
(“ran”,1)
(“the”,1)
(“wears”,1)
(“The”,1) (“The”,1)
Shuffle & Sort
Reduce
(“.”,2)
(“cat”,2)
(“fast”,2)
(“hat”,2)
(“in”,1)
(“no”,1)
(“ran”,1)
(“the”,1)
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• Example (Reduce)
147
(“.”,1) (“.”,1)
(“cat”,1) (“cat”,1)
(“fast”,1) (“fast”,1)
(“hat”,1) (“hat”,1)
(“in”,1)
(“no”,1)
(“ran”,1)
(“the”,1)
(“wears”,1)
(“The”,1) (“The”,1)
Shuffle & Sort
Reduce
(“.”,2)
(“cat”,2)
(“fast”,2)
(“hat”,2)
(“in”,1)
(“no”,1)
(“ran”,1)
(“the”,1)
(“wears”,1)
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• Example (Reduce)
148
(“.”,1) (“.”,1)
(“cat”,1) (“cat”,1)
(“fast”,1) (“fast”,1)
(“hat”,1) (“hat”,1)
(“in”,1)
(“no”,1)
(“ran”,1)
(“the”,1)
(“wears”,1)
(“The”,1) (“The”,1)
Shuffle & Sort
Reduce
(“.”,2)
(“cat”,2)
(“fast”,2)
(“hat”,2)
(“in”,1)
(“no”,1)
(“ran”,1)
(“the”,1)
(“wears”,1)
(“The”,2)
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples
• In order to demonstrate how data flows through a map and
reduce computational pipeline, we will present 3 examples
• word counting
• IoT data
• shared friendship
149
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: IoT
• IoT applications create an enormous amount of data that has to be processed. This data is generated by physical sensors that take measurements, such as the room temperature at 08:00.
• Every measurement consists of
• a key (the timestamp when the measurement has been taken) and
• a value (the actual value measured by the sensor).
• for example, (2016-05-01 01:02:03, 1).
• The goal of this exercise is to create average daily values of that
sensor’s data.
150
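In the same pseudocode style as the word-count example, the map and reduce functions for this exercise could look like the sketch below; emit is again the assumed Hadoop I/O helper, and the first 10 characters of the timestamp give the day.

# emit is a function that performs Hadoop I/O
def map(timestamp, reading):
    day = timestamp[:10]        # "2016-05-01 01:02:03" -> "2016-05-01"
    emit(day, reading)

def reduce(day, values):
    values = list(values)
    emit(day, sum(values) / len(values))   # average of that day's readings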
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: IoT
• Example(Map)
151
input
Mapper 1 Mapper 2 Mapper 3
(“2016-05-01 01:02:03”,1)
(“2016-05-02 12:09:04”,2)
(“2016-05-03 09:21:07”,3)
(“2016-05-03 09:21:45”,4)
(“2016-05-01 01:02:04”,5)
(“2016-05-02 12:09:01”,6)
(“2016-05-02 12:09:30”,7)
(“2016-05-03 09:21:31”,8)
(“2016-05-01 01:02:05”,9)
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: IoT
• Example(Map)
152
input
Mapper 1 Mapper 2 Mapper 3
(“2016-05-01 01:02:03”,1)
(“2016-05-02 12:09:04”,2)
(“2016-05-03 09:21:07”,3)
(“2016-05-03 09:21:45”,4)
(“2016-05-01 01:02:04”,5)
(“2016-05-02 12:09:01”,6)
(“2016-05-02 12:09:30”,7)
(“2016-05-03 09:21:31”,8)
(“2016-05-01 01:02:05”,9)
(“2016-05-01 01:02:03”,1)
(“2016-05-02 12:09:04”,2)
(“2016-05-03 09:21:07”,3)
(“2016-05-03 09:21:45”,4)
(“2016-05-01 01:02:04”,5)
(“2016-05-02 12:09:01”,6)
(“2016-05-02 12:09:30”,7)
(“2016-05-03 09:21:31”,8)
(“2016-05-01 01:02:05”,9)
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: IoT
• Example(Map)
153
input
Mapper 1 Mapper 2 Mapper 3
(“2016-05-01 01:02:03”,1)
(“2016-05-02 12:09:04”,2)
(“2016-05-03 09:21:07”,3)
(“2016-05-03 09:21:45”,4)
(“2016-05-01 01:02:04”,5)
(“2016-05-02 12:09:01”,6)
(“2016-05-02 12:09:30”,7)
(“2016-05-03 09:21:31”,8)
(“2016-05-01 01:02:05”,9)
(“2016-05-01”,1)
(“2016-05-02”,2)
(“2016-05-03”,3)
(“2016-05-03”,4)
(“2016-05-01”,5)
(“2016-05-02”,6)
(“2016-05-02”,7)
(“2016-05-03”,8)
(“2016-05-01”,9)
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: IoT
• Example(Shuffle & Sort)
154
Mapper 1 Mapper 2 Mapper 3
(“2016-05-01”,1)
(“2016-05-02”,2)
(“2016-05-03”,3)
(“2016-05-03”,4)
(“2016-05-01”,5)
(“2016-05-02”,6)
(“2016-05-02”,7)
(“2016-05-03”,8)
(“2016-05-01”,9)
(“2016-05-01”,1) (“2016-05-01”,5) (“2016-05-01”,9)
Shuffle & Sort
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: IoT
• Example(Shuffle & Sort)
155
Mapper 1 Mapper 2 Mapper 3
(“2016-05-01”,1)
(“2016-05-02”,2)
(“2016-05-03”,3)
(“2016-05-03”,4)
(“2016-05-01”,5)
(“2016-05-02”,6)
(“2016-05-02”,7)
(“2016-05-03”,8)
(“2016-05-01”,9)
(“2016-05-01”,1) (“2016-05-01”,5) (“2016-05-01”,9)
(“2016-05-02”,2) (“2016-05-02”,6) (“2016-05-02”,7)
Shuffle & Sort
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: IoT
• Example(Shuffle & Sort)
156
Mapper 1 Mapper 2 Mapper 3
(“2016-05-01”,1)
(“2016-05-02”,2)
(“2016-05-03”,3)
(“2016-05-03”,4)
(“2016-05-01”,5)
(“2016-05-02”,6)
(“2016-05-02”,7)
(“2016-05-03”,8)
(“2016-05-01”,9)
(“2016-05-01”,1) (“2016-05-01”,5) (“2016-05-01”,9)
(“2016-05-02”,2) (“2016-05-02”,6) (“2016-05-02”,7)
(“2016-05-03”,3) (“2016-05-03”,4) (“2016-05-03”,8)
Shuffle & Sort
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: IoT
• Example(Reduce)
157
(“2016-05-01”,1) (“2016-05-01”,5) (“2016-05-01”,9)
(“2016-05-02”,2) (“2016-05-02”,6) (“2016-05-02”,7)
(“2016-05-03”,3) (“2016-05-03”,4) (“2016-05-03”,8)
Shuffle & Sort
(“2016-05-01”,5)value = (1+5+9)/3
Reduce
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: IoT
• Example(Reduce)
158
(“2016-05-01”,1) (“2016-05-01”,5) (“2016-05-01”,9)
(“2016-05-02”,2) (“2016-05-02”,6) (“2016-05-02”,7)
(“2016-05-03”,3) (“2016-05-03”,4) (“2016-05-03”,8)
Shuffle & Sort
Reduce
(“2016-05-01”,5)
value = (2+6+7)/3 (“2016-05-02”,5)
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: IoT
• Example(Reduce)
159
(“2016-05-01”,1) (“2016-05-01”,5) (“2016-05-01”,9)
(“2016-05-02”,2) (“2016-05-02”,6) (“2016-05-02”,7)
(“2016-05-03”,3) (“2016-05-03”,4) (“2016-05-03”,8)
Shuffle & Sort
(“2016-05-01”,5)
value = (3+4+8)/3
(“2016-05-02”,5)
(“2016-05-03”,5)
Reduce
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples
• In order to demonstrate how data flows through a map and
reduce computational pipeline, we will present 3 examples
• word counting
• IoT data
• shared friendship
160
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: shared friendship
• In the shared friendship task, the goal is to analyze a social
network to see which friend relationships users have in
common.
• Given an input data source where the key is the name of a user
and the value is a comma-separated list of friends.
161
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: shared friendship
• The following Python pseudocode demonstrates how to perform
this computation:
162
def map(person, friends):
    # friends is a comma-separated string of this person's friends
    for friend in friends.split(","):
        pair = sorted([person, friend])
        emit(pair, friends)

def reduce(pair, friends):
    # friends holds the two friend lists emitted for this pair
    shared = set(friends[0].split(","))
    shared = shared.intersection(friends[1].split(","))
    emit(pair, shared)
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: shared friendship
• The mapper creates an intermediate keyspace of all of the possible (friend, friend) tuples that exist in the initial dataset.
• This allows us to analyze the dataset on a per-relationship basis as the
value is the list of associated friends.
• The pair is sorted, which ensures that the input (“Mike”,“Linda”)
and (“Linda”,“Mike”) end up being the same key during
aggregation in the reducer.
163
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: shared friendship
• Example(Map)
164
input
(“Allen”,”Betty, Chris, David”)
(“Betty”,”Allen, Chris, David, Ellen”)
(“Chris”,”Allen, Betty, David,Ellen”)
(“David”,”Allen, Betty, Chris, Ellen”)
(“Ellen”,”Betty, Chris, David”)
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: shared friendship
• Example(Map)
165
input
Mapper 1 Mapper 2
(“Allen”,”Betty, Chris, David”)
(“Betty”,”Allen, Chris, David, Ellen”)
(“Chris”,”Allen, Betty, David,Ellen”)
(“David”,”Allen, Betty, Chris, Ellen”)
(“Ellen”,”Betty, Chris, David”)
(“Allen, Betty”,”Betty, Chris, David”)
(“Allen, Chris”,”Betty, Chris, David”)
(“Allen, David”,”Betty, Chris, David”)
(“Betty, Ellen”,”Betty, Chris, David”)
(“Chris, Ellen”,”Betty, Chris, David”)
(“David, Ellen”,”Betty, Chris, David”)
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: shared friendship
• Example(Map)
166
input
Mapper 3 Mapper 4
(“Allen”,”Betty, Chris, David”)
(“Betty”,”Allen, Chris, David, Ellen”)
(“Chris”,”Allen, Betty, David,Ellen”)
(“David”,”Allen, Betty, Chris, Ellen”)
(“Ellen”,”Betty, Chris, David”)
(“Allen, Betty”,”Allen, Chris, David, Ellen”)
(“Betty, Chris”,”Allen, Chris, David, Ellen”)
(“Betty, David”,”Allen, Chris, David, Ellen”)
(“Betty, Ellen”,”Allen, Chris, David, Ellen”)
(“Allen, Chris”,”Allen, Betty, David,Ellen”)
(“Betty, Chris”,”Allen, Betty, David,Ellen”)
(“Chris, David”,”Allen, Betty, David,Ellen”)
(“Chris, Ellen”,”Allen, Betty, David,Ellen”)
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: shared friendship
• Example(Map)
167
input
Mapper 5
(“Allen”,”Betty, Chris, David”)
(“Betty”,”Allen, Chris, David, Ellen”)
(“Chris”,”Allen, Betty, David,Ellen”)
(“David”,”Allen, Betty, Chris, Ellen”)
(“Ellen”,”Betty, Chris, David”)
(“Allen, David”,”Allen, Betty, Chris, Ellen”)
(“Betty, David”,”Allen, Betty, Chris, Ellen”)
(“Chris, David”,”Allen, Betty, Chris, Ellen”)
(“David, Ellen”,”Allen, Betty, Chris, Ellen”)
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: shared friendship
• Example (Shuffle & Sort)
168
Shuffle & Sort
(“Allen, Betty”,”Betty, Chris, David”)(“Allen, Betty”,”Allen, Chris, David, Ellen”)
(“Allen, Chris”,”Allen, Betty, David,Ellen”) (“Allen, Chris”,”Betty, Chris, David”)
(“Allen, David”,”Allen, Betty, Chris, Ellen”) (“Allen, David”,”Betty, Chris, David”)
(“Betty, Chris”,”Allen, Betty, David,Ellen”) (“Betty, Chris”,”Allen, Chris, David, Ellen”)
(“Betty, David”,”Allen, Chris, David, Ellen”)(“Betty, David”,”Allen, Betty, Chris, Ellen”)
(“Betty, Ellen”,”Allen, Chris, David, Ellen”) (“Betty, Ellen”,”Betty, Chris, David”)
(“Chris, David”,”Allen, Betty, David,Ellen”)(“Chris, David”,”Allen, Betty, Chris, Ellen”)
(“Chris, Ellen”,”Betty, Chris, David”)(“Chris, Ellen”,”Allen, Betty, David,Ellen”)
(“David, Ellen”,”Allen, Betty, Chris, Ellen”) (“David, Ellen”,”Betty, Chris, David”)
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: shared friendship
• Example (Shuffle & Sort)
169
Shuffle & Sort
(“Allen, Betty”,”Betty, Chris, David”)(“Allen, Betty”,”Allen, Chris, David, Ellen”)
(“Allen, Chris”,”Allen, Betty, David,Ellen”) (“Allen, Chris”,”Betty, Chris, David”)
(“Allen, David”,”Allen, Betty, Chris, Ellen”) (“Allen, David”,”Betty, Chris, David”)
(“Betty, Chris”,”Allen, Betty, David, Ellen”) (“Betty, Chris”,”Allen, Chris, David, Ellen”)
(“Betty, David”,”Allen, Chris, David, Ellen”)(“Betty, David”,”Allen, Betty, Chris, Ellen”)
(“Betty, Ellen”,”Allen, Chris, David, Ellen”) (“Betty, Ellen”,”Betty, Chris, David”)
(“Chris, David”,”Allen, Betty, David,Ellen”)(“Chris, David”,”Allen, Betty, Chris, Ellen”)
(“Chris, Ellen”,”Betty, Chris, David”)(“Chris, Ellen”,”Allen, Betty, David,Ellen”)
(“David, Ellen”,”Allen, Betty, Chris, Ellen”) (“David, Ellen”,”Betty, Chris, David”)
http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: shared friendship
• Example (Reduce)
170
(“Allen, Betty”, “Chris, David”)
(“Allen, Chris”, “Betty, David”)
(“Allen, David”, “Betty, Chris”)
(“Betty, Chris”, “Allen, David, Ellen”)
(“Betty, David”, “Allen, Chris, Ellen”)
(“Betty, Ellen”, “Chris, David”)
(“Chris, David”, “Allen, Betty, Ellen”)
(“Chris, Ellen”, “Betty, David”)
(“David, Ellen”, “Betty, Chris”)
http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop Streaming
• Hadoop streaming is a utility that comes packaged with the
Hadoop distribution and allows MapReduce jobs to be created
with any executable as the mapper and/or the reducer.
• The Hadoop streaming utility enables Python, shell scripts, or any
other language to be used as a mapper, reducer, or both.
• The mapper and reducer are both executables that
• read input, line by line, from the standard input (stdin),
• and write output to the standard output (stdout).
• The Hadoop streaming utility creates a MapReduce job, submits the job
to the cluster, and monitors its progress until it is complete.
171
http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop Streaming
• When the mapper is initialized, each map task launches the
specified executable as a separate process.
• The mapper reads the input file and presents each line to the
executable via stdin. After the executable processes each line
of input, the mapper collects the output from stdout and
converts each line to a key-value pair.
• The key consists of the part of the line before the first tab
character, and the value consists of the part of the line after the
first tab character.
172
http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop Streaming
• When the reducer is initialized, each reduce task launches the
specified executable as a separate process.
• The reducer converts the input key-value pair to lines that are
presented to the executable via stdin.
• The reducer collects the executable's results from stdout and converts each line to a key-value pair.
• Similar to the mapper, the executable specifies key-value pairs
by separating the key and value by a tab character.
173
http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop Streaming
• Data flow in Hadoop Streaming via Python mapper.py and
reducer.py scripts
174
Image source: "Data Analytics with Hadoop: An Introduction for Data Scientists", Benjamin Bengfort and Jenny Kim, 2016
http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop Streaming example
• The WordCount application can be implemented as two Python
programs: mapper.py and reducer.py.
• mapper.py is the Python program that implements the logic in
the map phase of WordCount.
• It reads data from stdin, splits the lines into words, and outputs
each word with its intermediate count to stdout.
175
http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop Streaming example
• mapper.py
176
#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)
http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop Streaming example
• reducer.py is the Python program that implements the logic in
the reduce phase of WordCount.
• It reads the results of mapper.py from stdin, sums the
occurrences of each word, and writes the result to stdout.
• reducer.py
177
#!/usr/bin/env python
from operator import itemgetter
import sys
current_word = None
current_count = 0
word = None
http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop Streaming example
• reducer.py (cont’)
178
# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop Streaming example
• reducer.py (cont’)
179
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop Streaming example
• Before attempting to execute the code, ensure that the
mapper.py and reducer.py files have execution permission.
• The following command will enable this for both files:
• Also ensure that the first line of each file contains the proper
path to Python. This line enables mapper.py and reducer.py to
execute as standalone executables.
• It is highly recommended to test all programs locally before
running them across a Hadoop cluster.
180
$ chmod +x mapper.py reducer.py
$ echo 'The fast cat wears no hat' | ./mapper.py | sort -t 1 | ./reducer.py
http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop Streaming example
• Download 3 ebooks from Project Gutenberg
• The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson (659 KB)
• The Notebooks of Leonardo Da Vinci (1.4 MB)
• Ulysses by James Joyce (1.5 MB)
• Before we run the actual MapReduce job, we must first copy the
files from our local file system to Hadoop’s HDFS.
181
$ hadoop fs -put pg20417.txt books/pg20417.txt
$ hadoop fs -put 5000-8.txt books/5000-8.txt
$ hadoop fs -put 4300-0.txt books/4300-0.txt
$ hadoop fs -ls books
http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop Streaming example
• The mapper and reducer programs can be run as a
MapReduce application using the Hadoop streaming utility.
• The command to run the Python programs mapper.py and
reducer.py on a Hadoop cluster is as follows:
182
$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh*.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /user/hduser/books/* \
    -output /user/hduser/books/output
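• After the job finishes, the word counts can be read back from HDFS; for example (part-* is the standard naming for MapReduce output files):
$ hadoop fs -cat /user/hduser/books/output/part-*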
http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop Streaming example
• Options for Hadoop streaming
183
Option Description
-files A comma-separated list of files to be copied to the
MapReduce cluster
-mapper The command to be run as the mapper
-reducer The command to be run as the reducer
-input The DFS input path for the Map step
-output The DFS output directory for the Reduce step
http://dataminingtrend.com http://facebook.com/datacube.th
Python MapReduce library: mrjob
• mrjob is a Python MapReduce library, created by Yelp, that
wraps Hadoop streaming, allowing MapReduce applications to
be written in a more Pythonic manner.
• mrjob enables multistep MapReduce jobs to be written in pure
Python.
• MapReduce jobs written with mrjob can be tested locally, run on
a Hadoop cluster, or run in the cloud using Amazon Elastic
MapReduce (EMR).
184
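• As noted above, mrjob also supports multistep jobs written in pure Python. The following is only an illustrative sketch of that feature (the class and method names, MRMostUsedWord, mapper_get_words, and so on, are hypothetical and follow the MRStep pattern from mrjob's documentation); it chains two steps to find the most frequently used word:
from mrjob.job import MRJob
from mrjob.step import MRStep

class MRMostUsedWord(MRJob):

    def steps(self):
        # step 1: count words; step 2: pick the word with the highest count
        return [
            MRStep(mapper=self.mapper_get_words,
                   reducer=self.reducer_count_words),
            MRStep(reducer=self.reducer_find_max_word)
        ]

    def mapper_get_words(self, _, line):
        # emit (word, 1) for every word in the line
        for word in line.split():
            yield word, 1

    def reducer_count_words(self, word, counts):
        # send (count, word) pairs to a single key so the next step can compare them
        yield None, (sum(counts), word)

    def reducer_find_max_word(self, _, count_word_pairs):
        # emit the (count, word) pair with the largest count
        yield max(count_word_pairs)

if __name__ == '__main__':
    MRMostUsedWord.run()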
http://dataminingtrend.com http://facebook.com/datacube.th
Python MapReduce library: mrjob
• Installation
• First, install python pip on CDH VM
• The installation of mrjob is simple; it can be installed with pip by
using the following command:
185
$ yum -y install python-pip
$ pip install mrjob
http://dataminingtrend.com http://facebook.com/datacube.th
mrjob example
• word_count.py
• To run the job locally and count the frequency of words within a
file named pg20417.txt, use the following command:
186
from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        for word in line.split():
            yield(word, 1)

    def reducer(self, word, counts):
        yield(word, sum(counts))

if __name__ == '__main__':
    MRWordCount.run()
$ python word_count.py books/pg20417.txt
http://dataminingtrend.com http://facebook.com/datacube.th
mrjob example
• The MapReduce job is defined as the class, MRWordCount. Within the
mrjob library, the class that inherits from MRJob contains the methods
that define the steps of the MapReduce job.
• The steps within an mrjob application are mapper, combiner, and
reducer. The class inheriting MRJob only needs to define one of these
steps.
• The mapper() method defines the mapper for the MapReduce job. It
takes key and value as arguments and yields tuples of (output_key,
output_value).
• In the WordCount example, the mapper ignored the input key and split
the input value to produce words and counts.
187
http://dataminingtrend.com http://facebook.com/datacube.th
mrjob example
• The combiner is a process that runs after the mapper and before
the reducer.
• It receives, as input, all of the data emitted by the mapper, and the
output of the combiner is sent to the reducer. The combiner yields
tuples of (output_key, output_value) as output.
• The reducer() method defines the reducer for the MapReduce job.
• It takes a key and an iterator of values as arguments and yields
tuples of (output_key, output_value).
• In this example, the reducer sums the values for each key, which
represent the frequency of each word in the input.
188
http://dataminingtrend.com http://facebook.com/datacube.th
mrjob example
• The final component of a MapReduce job written with the mrjob
library is the two lines at the end of the file:
if __name__ == '__main__':
    MRWordCount.run()
• These lines enable the execution of mrjob; without them, the
application will not work.
• Executing a MapReduce application with mrjob is similar to
executing any other Python program. The command line must
contain the name of the mrjob application and the input file:
189
$ python mr_job.py input.txt
http://dataminingtrend.com http://facebook.com/datacube.th
mrjob example
• By default, mrjob runs locally, allowing code to be developed
and debugged before being submitted to a Hadoop cluster.
• To change how the job is run, specify the -r/--runner option.
190
$ python word_count.py -r hadoop hdfs:books/pg20417.txt
http://dataminingtrend.com http://facebook.com/datacube.th
Outline
• Part 1: Introduction to Big Data
• Part 2: Introduction to NoSQL
• Part 3: Introduction to MapReduce and Hadoop
• Part 4: Introduction to Hive, HBase and Sqoop
191
http://dataminingtrend.com http://facebook.com/datacube.th
Introduction
• The Hadoop ecosystem emerged as a cost effective way of working
with large datasets
• It imposes a particular programming model, called MapReduce, for
breaking up computation tasks into units that can be distributed around
a cluster of commodity servers
• Underneath this computation model is a distributed file system called
Hadoop Distributed Filesystem (HDFS)
• However, a challenge remains; how do you move an existing data
infrastructure to Hadoop, when that infrastructure is based on traditional
relational databases and the Structured Query Language (SQL)?
192
http://dataminingtrend.com http://facebook.com/datacube.th
Introduction
• This is where Hive comes in. Hive provides an SQL dialect, called
Hive Query Language (abbreviated HiveQL or just HQL) for querying
data stored in a Hadoop cluster.
• SQL knowledge is widespread for a reason; it’s an effective,
reasonably intuitive model for organizing and using data.
• Mapping these familiar data operations to the low-level MapReduce
Java API can be daunting, even for experienced Java developers.
• Hive does this dirty work for you, so you can focus on the query itself.
Hive translates most queries to MapReduce jobs, thereby exploiting
the scalability of Hadoop, while presenting a familiar SQL abstraction.
193
http://dataminingtrend.com http://facebook.com/datacube.th
Introduction
• Hive is most suited for data warehouse applications, where relatively
static data is analyzed, fast response times are not required, and when
the data is not changing rapidly.
• Apache Hive is a “data warehousing” framework built on top of
Hadoop.
• Hive provides data analysts with a familiar SQL-based interface to
Hadoop, which allows them to attach structured schemas to data in
HDFS and access and analyze that data using SQL queries.
• Hive has made it possible for developers who are fluent in SQL to
leverage the scalability and resilience of Hadoop without requiring them
to learn Java or the native MapReduce API.
194
http://dataminingtrend.com http://facebook.com/datacube.th
Hive in the Hadoop Ecosystem
• Hive modules
195
Image source: “Programming Hive: Data Warehouse and Query Language for Hadoop”, Edward Capriolo, Dean Wampler
and Jason Rutherglen, 2012
http://dataminingtrend.com http://facebook.com/datacube.th
Hive in the Hadoop Ecosystem
• There are several ways to interact with Hive
• CLI: command-line interface
• GUI: Graphic User Interface
• Karmasphere (http://karmasphere.com)
• Cloudera’s open source Hue (https://github.com/cloudera/hue)
• All commands and queries go to the Driver, which compiles the
input, optimizes the computation required, and executes the
required steps, usually with MapReduce jobs.
196
http://dataminingtrend.com http://facebook.com/datacube.th
Hive in the Hadoop Ecosystem
• Hive communicates with the JobTracker to initiate the MapReduce job.
• Hive does not have to be running on the same master node with the
JobTracker. In larger clusters, it’s common to have edge nodes where
tools like Hive run.
• They communicate remotely with the JobTracker on the master node
to execute jobs. Usually, the data files to be processed are in HDFS,
which is managed by the NameNode.
• The Metastore is a separate relational database (usually a MySQL
instance) where Hive persists table schemas and other system
metadata.
197
http://dataminingtrend.com http://facebook.com/datacube.th
Structured Data Queries with Hive
• Hive provides its own dialect of SQL called the Hive Query Language,
or HQL.
• HQL supports many commonly used SQL statements, including data
definition statements (DDLs) (e.g., CREATE DATABASE/ SCHEMA/ TABLE),
data manipulation statements (DMSs) (e.g., INSERT, UPDATE, LOAD),
and data retrieval queries (e.g., SELECT).
• Hive commands and HQL queries are compiled into an execution plan
or a series of HDFS operations and/ or MapReduce jobs, which are
then executed on a Hadoop cluster.
198
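• To see the plan Hive generates for a query, HQL provides an EXPLAIN statement. A small illustrative example (reusing the web_logs sample table that appears in the HUE examples later in this part) might look like this:
hive> EXPLAIN SELECT country_name, count(1) FROM web_logs GROUP BY country_name;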
http://dataminingtrend.com http://facebook.com/datacube.th
Structured Data Queries with Hive
• Additionally, Hive queries entail higher latency due to the overhead
required to generate and launch the compiled MapReduce jobs on the
cluster; even small queries that would complete within a few seconds
on a traditional RDBMS may take several minutes to finish in Hive.
• On the plus side, Hive provides the high-scalability and high-
throughput that you would expect from any Hadoop-based
application.
• It is very well suited to batch-level workloads for online analytical
processing (OLAP) of very large datasets at the terabyte and petabyte
scale.
199
http://dataminingtrend.com http://facebook.com/datacube.th
The Hive Command-Line Interface (CLI)
• Hive’s installation comes packaged with a handy command-line
interface (CLI), which we will use to interact with Hive and run
our HQL statements.
• This will initiate the CLI and bootstrap the logger (if configured)
and Hive history file, and finally display a Hive CLI prompt:
• You can view the full list of Hive options for the CLI by using the
-H flag:
200
$ hive
hive>
$ hive -H
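• The CLI can also run HQL non-interactively; both of these flags are standard Hive CLI options, and the script path shown is only a placeholder:
$ hive -e 'SHOW DATABASES;'       # run a single HQL statement and exit
$ hive -f /path/to/queries.hql    # run all statements in an HQL script file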
http://dataminingtrend.com http://facebook.com/datacube.th
HUE: Apache Hadoop UI
• HUE (Hadoop User Experience) is a Web interface for analyzing
data with Apache Hadoop.
• Go to quickstart.cloudera:8888/about
• username: cloudera
• password: cloudera
201
http://dataminingtrend.com http://facebook.com/datacube.th
Query Editors
• Click Query Editors then Hive
202
http://dataminingtrend.com http://facebook.com/datacube.th
Example: web logs database
• Choose default database
• HQL: SELECT * FROM web_logs
203
http://dataminingtrend.com http://facebook.com/datacube.th
Example: web logs database
• HQL: SELECT web_logs.country_name, count(1) AS count
       FROM web_logs
       GROUP BY country_name
204
http://dataminingtrend.com http://facebook.com/datacube.th
Creating a database
• Creating a database in Hive is very similar to creating a
database in a SQL-based RDBMS, by using the CREATE
DATABASE or CREATE SCHEMA statement:
• When Hive creates a new database, the schema definition data
is stored in the Hive metastore.
• Hive will raise an error if the database already exists in the
metastore; we can check for the existence of the database by
using IF NOT EXISTS:
• HQL: CREATE DATABASE IF NOT EXISTS flight_data;
205
http://dataminingtrend.com http://facebook.com/datacube.th
Creating a database
• We can then run SHOW DATABASES to verify that our database has
been created. Hive will return all databases found in the
metastore, along with the default Hive database:
• HQL: SHOW DATABASES;
206
http://dataminingtrend.com http://facebook.com/datacube.th
Creating tables
• Hive provides a SQL-like CREATE TABLE statement, which in its
simplest form takes a table name and column definitions:
• HQL: CREATE TABLE airlines (code INT, description STRING)
       ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
       STORED AS TEXTFILE;
• However, because Hive data is stored in the file system, usually
in HDFS or the local file system,
• the CREATE TABLE command also takes optional clauses, such as the
ROW FORMAT clause, which tells Hive how to read each row in the file
and map it to our columns.
207
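• The LOAD DATA and aggregation examples that follow use a flights table whose CREATE TABLE statement is not shown on these slides. A hypothetical definition, with columns inferred only from the later queries (the real dataset likely contains more columns), could look like this:
CREATE TABLE IF NOT EXISTS flights (
  flight_date   STRING,
  airline_code  INT,
  origin        STRING,
  destination   STRING,
  depart_delay  INT,
  arrive_delay  INT,
  is_cancelled  BOOLEAN
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;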
http://dataminingtrend.com http://facebook.com/datacube.th
Loading data
• It’s important to note one important distinction between Hive and
traditional RDBMSs with regards to schema enforcement:
• Traditional relational databases enforce the schema on writes
by rejecting any data that does not conform to the schema as
defined;
• Hive can only enforce the schema on reads. If, when reading the
data file, the file structure does not match the defined
schema, Hive will generally return null values for missing fields
or type mismatches
208
http://dataminingtrend.com http://facebook.com/datacube.th
Loading data
• Data loading in Hive is done in batch-oriented fashion using a bulk LOAD
DATA command or by inserting results from another query with the
INSERT command.
• LOAD DATA is Hive’s bulk loading command. INPATH takes an argument
to a path on the default file system (in this case, HDFS).
• We can also specify a path on the local file system by using LOCAL
INPATH instead. Hive proceeds to move the file into the warehouse
location.
• If the OVERWRITE keyword is used, then any existing data in the target
table will be deleted and replaced by the data file input; otherwise, the
new data is added to the table.
209
http://dataminingtrend.com http://facebook.com/datacube.th
Loading data
• Examples
• HQL: LOAD DATA LOCAL INPATH '/home/cloudera/Downloads/flight_data/ontime_flights.tsv'
       OVERWRITE INTO TABLE flights;
• HQL: LOAD DATA LOCAL INPATH '/home/cloudera/Downloads/flight_data/airlines.tsv'
       OVERWRITE INTO TABLE airlines;
• HQL: LOAD DATA LOCAL INPATH '/home/cloudera/Downloads/flight_data/carriers.tsv'
       OVERWRITE INTO TABLE carriers;
• HQL: LOAD DATA LOCAL INPATH '/home/cloudera/Downloads/flight_data/cancellation_reasons.tsv'
       OVERWRITE INTO TABLE cancellation_reasons;
210
http://dataminingtrend.com http://facebook.com/datacube.th
Data Analysis with Hive
• Grouping
• HQL: SELECT airline_code, COUNT(1) AS num_flights
       FROM flights
       GROUP BY airline_code
       ORDER BY num_flights DESC;
211
http://dataminingtrend.com http://facebook.com/datacube.th
Data Analysis with Hive
• Aggregations
• HQL:
  SELECT airline_code,
         COUNT(1) AS num_flights,
         SUM(IF(depart_delay > 0, 1, 0)) AS num_depart_delays,
         SUM(IF(arrive_delay > 0, 1, 0)) AS num_arrive_delays,
         SUM(IF(is_cancelled, 1, 0)) AS num_cancelled
  FROM flights
  GROUP BY airline_code;
212
http://dataminingtrend.com http://facebook.com/datacube.th
Data Analysis with Hive
• Aggregations
• HQL:
  SELECT airline_code,
         COUNT(1) AS num_flights,
         SUM(IF(depart_delay > 0, 1, 0)) AS num_depart_delays,
         ROUND(SUM(IF(depart_delay > 0, 1, 0))/COUNT(1), 2) AS depart_delay_rate,
         SUM(IF(arrive_delay > 0, 1, 0)) AS num_arrive_delays,
         ROUND(SUM(IF(arrive_delay > 0, 1, 0))/COUNT(1), 2) AS arrive_delay_rate,
         SUM(IF(is_cancelled, 1, 0)) AS num_cancelled,
         ROUND(SUM(IF(is_cancelled, 1, 0))/COUNT(1), 2) AS cancellation_rate
  FROM flights
  GROUP BY airline_code
  ORDER BY cancellation_rate DESC, arrive_delay_rate DESC, depart_delay_rate DESC;
213
http://dataminingtrend.com http://facebook.com/datacube.th
Introduction to HBase
• While Hive provides a familiar data manipulation paradigm within
Hadoop, it doesn’t change the storage and processing paradigm,
which still utilizes HDFS and MapReduce in a batch-oriented fashion.
• Thus, for use cases that require random, real-time read/write access
to data, we need to look outside of standard MapReduce and Hive for
our data persistence and processing layer.
• The real-time applications need to record high volumes of time-based
events that tend to have many possible structural variations.
• The data may be keyed on a certain value, like User, but the value is
often represented as a collection of arbitrary metadata.
214
http://dataminingtrend.com http://facebook.com/datacube.th
Introduction to HBase
• For example, consider two events, “Like” and “Share”, which require different column
values, as shown in the table.
• In a relational model, rows are sparse but columns are not. That is, upon
inserting a new row to a table, the database allocates storage for every column
regardless of whether a value exists for that field or not.
• However, in applications where data is represented as a collection of arbitrary
fields or sparse columns, each row may use only a subset of available columns,
which can make a standard relational schema both a wasteful and awkward fit.
215
http://dataminingtrend.com http://facebook.com/datacube.th
Column-Oriented Databases
• NoSQL is a broad term that generally refers to non-relational
databases and encompasses a wide collection of data storage
models, including
• graph databases
• document databases
• key/ value data stores
• column-family databases.
• HBase is classified as a column-family or column-oriented database,
modelled on Google’s Big Table architecture.
216
http://dataminingtrend.com http://facebook.com/datacube.th
Column-Oriented Databases
• HBase organizes data into tables that contain rows. Within a
table, rows are identified by their unique row key, which does not
have a data type.
• Row keys are similar to the concept of primary keys in relational
databases, in that they are automatically indexed.
217
http://dataminingtrend.com http://facebook.com/datacube.th
Column-Oriented Databases
• In HBase, table rows are sorted by their row key and because
row keys are byte arrays, almost anything can serve as a row
key from strings to binary representations of longs or even
serialized data structures.
• HBase stores its data as key/value pairs, where all table lookups
are performed via the table’s row key, the unique identifier of the
stored record data.
• Data within a row is grouped into column families, which consist
of related columns.
218
http://dataminingtrend.com http://facebook.com/datacube.th
Column-Oriented Databases
• Census data as an HBase schema
219
http://dataminingtrend.com http://facebook.com/datacube.th
Column-Oriented Databases
• Storing data in columns rather than rows has particular benefits for
data warehouses and analytical databases where aggregates are
computed over large sets of data with potentially sparse values, where
not all column values are present.
• Another interesting feature of HBase and BigTable-based column-
oriented databases is that the table cells, or the intersection of row and
column coordinates, are versioned by timestamp.
• HBase is thus also described as being a multidimensional map where
time provides the third dimension
220
http://dataminingtrend.com http://facebook.com/datacube.th
Column-Oriented Databases
• The time dimension is indexed in decreasing order, so that
when reading from an HBase store, the most recent values are
found first.
• The contents of a cell can be referenced by a {rowkey, column, timestamp} tuple,
or we can scan for a range of cell values by time range.
221
http://dataminingtrend.com http://facebook.com/datacube.th
Real-Time Analytics with HBase
• For the purposes of this HBase overview, we use the HBase shell to
design a schema for a linkshare tracker that tracks the number of times
a link has been shared.
• Generating a schema
• When designing schemas in HBase, it’s important to think in terms
of the column-family structure of the data model and how it affects
data access patterns.
• Furthermore, because HBase doesn’t support joins and provides
only a single indexed rowkey, we must be careful to ensure that the
schema can fully support all use cases.
222
http://dataminingtrend.com http://facebook.com/datacube.th
Real-Time Analytics with HBase
• First, we need to declare the table name, and at least one
column-family name at the time of table definition.
• If no namespace is declared, HBase will use the default
namespace
• We just created a single table called linkshare in the default
namespace with one column-family, named link
• To alter the table after creation, such as changing or adding column
families, we need to first disable the table so that clients will not be able
to access the table during the alter operation:
223
hbase> create 'linkshare', 'link'
http://dataminingtrend.com http://facebook.com/datacube.th
Real-Time Analytics with HBase
• Good row key design affects not only how we query the table, but the
performance and complexity of data access.
• By default, HBase stores rows in sorted order by row key, so that
similar keys are stored to the same RegionServer.
• Thus, in addition to enabling our data access use cases, we also need
to be mindful to account for row key distribution across regions.
• For the current example, let’s assume that we will use the unique
reversed link URL for the row key.
224
hbase> disable 'linkshare'
hbase> alter 'linkshare', 'statistics'
hbase> enable 'linkshare'
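• After re-enabling the table, the standard describe command can be used to verify that both column families are now present:
hbase> describe 'linkshare'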
http://dataminingtrend.com http://facebook.com/datacube.th
Real-Time Analytics with HBase
• In our linkshare application, we want to store descriptive data about
the link, such as its title, while maintaining a frequency counter that
tracks the number of times the link has been shared.
• We can insert, or put, a value in a cell at the specified table/row/
column and optionally timestamp coordinates.
• To put a cell value into table linkshare at row with row key
org.hbase.www under column-family link and column title marked with
the current timestamp
225
hbase> put 'linkshare', 'org.hbase.www', 'link:title', 'Apache HBase'
hbase> put 'linkshare', 'org.hadoop.www', 'link:title', 'Apache Hadoop'
hbase> put 'linkshare', 'com.oreilly.www', 'link:title', "O'Reilly.com"
http://dataminingtrend.com http://facebook.com/datacube.th
Real-Time Analytics with HBase
• The put operation works great for inserting a value for a single cell, but for
incrementing frequency counters, HBase provides a special mechanism
to treat columns as counters.
• To increment a counter, we use the command incr instead of put.
• The last option passed is the increment value, which in this case is 1.
• Incrementing a counter will return the updated counter value, but you can
also access a counter’s current value any time using the get_counter
command, specifying the table name, row key, and column:
226
hbase> incr 'linkshare', 'org.hbase.www', 'statistics:share', 1
hbase> incr 'linkshare', 'org.hbase.www', 'statistics:like', 1
http://dataminingtrend.com http://facebook.com/datacube.th
Real-Time Analytics with HBase
• HBase provides two general methods to retrieve data from a table:
• the get command performs lookups by row key to retrieve attributes
for a specific row,
• and the scan command, which takes a set of filter specifications and
iterates over multiple rows based on the indicated specifications.
• In its simplest form, the get command accepts the table name
followed by the row key, and returns the most recent version timestamp
and cell value for columns in the row.
227
hbase> incr 'linkshare', 'org.hbase.www', 'statistics:share', 1
hbase> get_counter 'linkshare', 'org.hbase.www', 'statistics:share'
hbase> get 'linkshare', 'org.hbase.www'
http://dataminingtrend.com http://facebook.com/datacube.th
Real-Time Analytics with HBase
• The get command also accepts an optional dictionary of parameters to
specify the column(s), timestamp, timerange, and version of the cell
values we want to retrieve. For example, we can specify the column(s) of
interest
• A scan operation is akin to database cursors or iterators, and takes
advantage of the underlying sequentially sorted storage mechanism,
iterating through row data to match against the scanner specifications.
• With scan, we can scan an entire HBase table or specify a range of rows
to scan.
228
hbase> get 'linkshare', 'org.hbase.www', 'link:title'
hbase> get 'linkshare', 'org.hbase.www', 'link:title', 'statistics:share'
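• For example, hypothetical lookups restricted to a number of versions or to a time range could be written as follows (COLUMN, VERSIONS, and TIMERANGE are standard get parameters; the timestamps shown are placeholders):
hbase> get 'linkshare', 'org.hbase.www', {COLUMN => 'link:title', VERSIONS => 3}
hbase> get 'linkshare', 'org.hbase.www', {COLUMN => 'statistics:share', TIMERANGE => [1514764800000, 1546300800000]}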
http://dataminingtrend.com http://facebook.com/datacube.th
Real-Time Analytics with HBase
• You can specify an optional STARTROW and/or STOPROW
parameter, which can be used to limit the scan to a specific
range of rows.
• If neither STARTROW nor STOPROW are provided, the scan
operation will scan through the entire table.
• You can, in fact, call scan with the table name to display all the
contents of a table.
229
hbase> scan 'linkshare'
hbase> scan 'linkshare', {COLUMNS => ['link:title'], STARTROW => 'org.hbase.www'}
http://dataminingtrend.com http://facebook.com/datacube.th
Introduction to Sqoop
• In cases where the input data is already structured because it resides
in a relational database, it would be convenient to leverage this known
schema to import the data into Hadoop in a more efficient manner than
uploading CSVs to HDFS and parsing them manually.
• Sqoop (SQL-to-Hadoop) is designed to transfer data between
relational database management systems (RDBMS) and Hadoop.
• It automates most of the data transfer process by reading the schema
information directly from the RDBMS.
• Sqoop then uses MapReduce to import and export the data to and
from Hadoop.
230
http://dataminingtrend.com http://facebook.com/datacube.th
Introduction to Sqoop
• Sqoop gives us the flexibility to maintain our data in its production
state while copying it into Hadoop to make it available for further
analysis without modifying the production database.
• We’ll walk through a few ways to use Sqoop to import data from a
MySQL database into various Hadoop data stores, including HDFS,
Hive, and HBase.
• We will use MySQL as the source and target RDBMS for these examples,
so we also assume that a MySQL database resides on the same host as
your Hadoop/Sqoop services and is accessible via localhost and the
default port, 3306.
231
http://dataminingtrend.com http://facebook.com/datacube.th
Importing from MySQL to HDFS
• When importing data from relational databases like MySQL, Sqoop
reads the source database to gather the necessary metadata for the
data being imported.
• Sqoop then submits a map-only Hadoop job to transfer the actual table
data based on the metadata that was captured in the previous step.
• This job produces a set of serialized files, which may be delimited text
files, binary format, or SequenceFiles containing a copy of the imported
table or datasets.
• By default, the files are saved as comma-separated files to a directory
on HDFS with a name that corresponds to the source table name.
232
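• A minimal sketch of such an import (the database name flight_data, the table airlines, and the account sqoop_user are illustrative, not taken from the slides):
$ sqoop import \
    --connect jdbc:mysql://localhost:3306/flight_data \
    --username sqoop_user -P \
    --table airlines \
    --num-mappers 1
• By default, this writes comma-separated files to an HDFS directory named after the source table (here, airlines), matching the behavior described above.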
Introduction to Big Data Technologies
Introduction to Big Data Technologies
Introduction to Big Data Technologies
Introduction to Big Data Technologies
Introduction to Big Data Technologies
Introduction to Big Data Technologies
Introduction to Big Data Technologies
Introduction to Big Data Technologies
Introduction to Big Data Technologies
Introduction to Big Data Technologies
Introduction to Big Data Technologies

Weitere ähnliche Inhalte

Was ist angesagt?

Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Simplilearn
 

Was ist angesagt? (20)

Data science presentation 2nd CI day
Data science presentation 2nd CI dayData science presentation 2nd CI day
Data science presentation 2nd CI day
 
Big Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsBig Data Science: Intro and Benefits
Big Data Science: Intro and Benefits
 
Big Data Analytics - Introduction
Big Data Analytics - IntroductionBig Data Analytics - Introduction
Big Data Analytics - Introduction
 
Introduction to Python for Data Science
Introduction to Python for Data ScienceIntroduction to Python for Data Science
Introduction to Python for Data Science
 
Analytics and Data Mining Industry Overview
Analytics and Data Mining Industry OverviewAnalytics and Data Mining Industry Overview
Analytics and Data Mining Industry Overview
 
Data Science: Past, Present, and Future
Data Science: Past, Present, and FutureData Science: Past, Present, and Future
Data Science: Past, Present, and Future
 
Data Science: Not Just For Big Data
Data Science: Not Just For Big DataData Science: Not Just For Big Data
Data Science: Not Just For Big Data
 
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
 
Life of a data scientist (pub)
Life of a data scientist (pub)Life of a data scientist (pub)
Life of a data scientist (pub)
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Intro to Python for Data Science
Intro to Python for Data ScienceIntro to Python for Data Science
Intro to Python for Data Science
 
Training in Analytics and Data Science
Training in Analytics and Data ScienceTraining in Analytics and Data Science
Training in Analytics and Data Science
 
Introduction To Data Science
Introduction To Data ScienceIntroduction To Data Science
Introduction To Data Science
 
Rating Prediction using Deep Learning and Spark
Rating Prediction using Deep Learning and SparkRating Prediction using Deep Learning and Spark
Rating Prediction using Deep Learning and Spark
 
The Importance of Open Innovation in AI era
The Importance of Open Innovation in AI eraThe Importance of Open Innovation in AI era
The Importance of Open Innovation in AI era
 
A Hacking Toolset for Big Tabular Files (3)
A Hacking Toolset for Big Tabular Files (3)A Hacking Toolset for Big Tabular Files (3)
A Hacking Toolset for Big Tabular Files (3)
 
Massive scale analytics with Stratosphere using R
Massive scale analytics with Stratosphere using RMassive scale analytics with Stratosphere using R
Massive scale analytics with Stratosphere using R
 
Myths of Data Science
Myths of Data ScienceMyths of Data Science
Myths of Data Science
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Predictive Analysis of Financial Fraud Detection using Azure and Spark ML
Predictive Analysis of Financial Fraud Detection using Azure and Spark MLPredictive Analysis of Financial Fraud Detection using Azure and Spark ML
Predictive Analysis of Financial Fraud Detection using Azure and Spark ML
 

Ähnlich wie Introduction to Big Data Technologies

Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
ALTER WAY
 
Semantic Web: In Quest for the Next Generation Killer Apps
Semantic Web: In Quest for the Next Generation Killer AppsSemantic Web: In Quest for the Next Generation Killer Apps
Semantic Web: In Quest for the Next Generation Killer Apps
Jie Bao
 
Data Strategy Best Practices
Data Strategy Best PracticesData Strategy Best Practices
Data Strategy Best Practices
DATAVERSITY
 

Ähnlich wie Introduction to Big Data Technologies (20)

Intro to Data Science by DatalentTeam at Data Science Clinic#11
Intro to Data Science by DatalentTeam at Data Science Clinic#11Intro to Data Science by DatalentTeam at Data Science Clinic#11
Intro to Data Science by DatalentTeam at Data Science Clinic#11
 
Web-based Information Visualisation
Web-based Information VisualisationWeb-based Information Visualisation
Web-based Information Visualisation
 
Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack
 
Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)
 
Fixing data science & Accelerating Artificial Super Intelligence Development
 Fixing data science & Accelerating Artificial Super Intelligence Development Fixing data science & Accelerating Artificial Super Intelligence Development
Fixing data science & Accelerating Artificial Super Intelligence Development
 
2_Image Classification.pdf
2_Image Classification.pdf2_Image Classification.pdf
2_Image Classification.pdf
 
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
 
Data analytics & its Trends
Data analytics & its TrendsData analytics & its Trends
Data analytics & its Trends
 
Big Data Management: What's New, What's Different, and What You Need To Know
Big Data Management: What's New, What's Different, and What You Need To KnowBig Data Management: What's New, What's Different, and What You Need To Know
Big Data Management: What's New, What's Different, and What You Need To Know
 
NASA and PHP
NASA and PHPNASA and PHP
NASA and PHP
 
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
 
Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely heading
 
Semantic Web: In Quest for the Next Generation Killer Apps
Semantic Web: In Quest for the Next Generation Killer AppsSemantic Web: In Quest for the Next Generation Killer Apps
Semantic Web: In Quest for the Next Generation Killer Apps
 
Hadoop : The Pile of Big Data
Hadoop : The Pile of Big DataHadoop : The Pile of Big Data
Hadoop : The Pile of Big Data
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overview
 
Introduction Data Warehouse With BigQuery
Introduction Data Warehouse With BigQueryIntroduction Data Warehouse With BigQuery
Introduction Data Warehouse With BigQuery
 
Data Strategy Best Practices
Data Strategy Best PracticesData Strategy Best Practices
Data Strategy Best Practices
 
TSE_Pres12.pptx
TSE_Pres12.pptxTSE_Pres12.pptx
TSE_Pres12.pptx
 
Big Data
Big DataBig Data
Big Data
 
Large scale computing
Large scale computing Large scale computing
Large scale computing
 

Mehr von Big Data Engineering, Faculty of Engineering, Dhurakij Pundit University

Introduction to Data Analytics with RapidMiner Studio 6 (ภาษาไทย)
Introduction to Data Analytics with RapidMiner Studio 6 (ภาษาไทย)Introduction to Data Analytics with RapidMiner Studio 6 (ภาษาไทย)
Introduction to Data Analytics with RapidMiner Studio 6 (ภาษาไทย)
Big Data Engineering, Faculty of Engineering, Dhurakij Pundit University
 

Mehr von Big Data Engineering, Faculty of Engineering, Dhurakij Pundit University (20)

Practical Data Science 
Use-cases in Retail & eCommerce
Practical Data Science 
Use-cases in Retail & eCommercePractical Data Science 
Use-cases in Retail & eCommerce
Practical Data Science 
Use-cases in Retail & eCommerce
 
First Step to Big Data
First Step to Big DataFirst Step to Big Data
First Step to Big Data
 
Introduction to Data Mining and Big Data Analytics
Introduction to Data Mining and Big Data AnalyticsIntroduction to Data Mining and Big Data Analytics
Introduction to Data Mining and Big Data Analytics
 
Apply (Big) Data Analytics & Predictive Analytics to Business Application
Apply (Big) Data Analytics & Predictive Analytics to Business ApplicationApply (Big) Data Analytics & Predictive Analytics to Business Application
Apply (Big) Data Analytics & Predictive Analytics to Business Application
 
Introduction to Predictive Analytics with case studies
Introduction to Predictive Analytics with case studiesIntroduction to Predictive Analytics with case studies
Introduction to Predictive Analytics with case studies
 
Introduction to Data Mining and Big Data Analytics
Introduction to Data Mining and Big Data AnalyticsIntroduction to Data Mining and Big Data Analytics
Introduction to Data Mining and Big Data Analytics
 
Advanced Predictive Modeling with R and RapidMiner Studio 7
Advanced Predictive Modeling with R and RapidMiner Studio 7Advanced Predictive Modeling with R and RapidMiner Studio 7
Advanced Predictive Modeling with R and RapidMiner Studio 7
 
Predictive analytic-for-retail-business
Predictive analytic-for-retail-businessPredictive analytic-for-retail-business
Predictive analytic-for-retail-business
 
Building Decision Tree model with numerical attributes
Building Decision Tree model with numerical attributesBuilding Decision Tree model with numerical attributes
Building Decision Tree model with numerical attributes
 
Data manipulation with RapidMiner Studio 7
Data manipulation with RapidMiner Studio 7Data manipulation with RapidMiner Studio 7
Data manipulation with RapidMiner Studio 7
 
Preprocessing with RapidMiner Studio 6
Preprocessing with RapidMiner Studio 6Preprocessing with RapidMiner Studio 6
Preprocessing with RapidMiner Studio 6
 
Evaluation metrics: Precision, Recall, F-Measure, ROC
Evaluation metrics: Precision, Recall, F-Measure, ROCEvaluation metrics: Precision, Recall, F-Measure, ROC
Evaluation metrics: Precision, Recall, F-Measure, ROC
 
Introduction to Text Classification with RapidMiner Studio 7
Introduction to Text Classification with RapidMiner Studio 7Introduction to Text Classification with RapidMiner Studio 7
Introduction to Text Classification with RapidMiner Studio 7
 
Introduction to Feature (Attribute) Selection with RapidMiner Studio 6
Introduction to Feature (Attribute) Selection with RapidMiner Studio 6Introduction to Feature (Attribute) Selection with RapidMiner Studio 6
Introduction to Feature (Attribute) Selection with RapidMiner Studio 6
 
Search Twitter with RapidMiner Studio 6
Search Twitter with RapidMiner Studio 6Search Twitter with RapidMiner Studio 6
Search Twitter with RapidMiner Studio 6
 
Data mining and_big_data_web
Data mining and_big_data_webData mining and_big_data_web
Data mining and_big_data_web
 
Introduction to Data Analytics with RapidMiner Studio 6 (ภาษาไทย)
Introduction to Data Analytics with RapidMiner Studio 6 (ภาษาไทย)Introduction to Data Analytics with RapidMiner Studio 6 (ภาษาไทย)
Introduction to Data Analytics with RapidMiner Studio 6 (ภาษาไทย)
 
Practical Data Mining with RapidMiner Studio 7 : A Basic and Intermediate
Practical Data Mining with RapidMiner Studio 7 : A Basic and IntermediatePractical Data Mining with RapidMiner Studio 7 : A Basic and Intermediate
Practical Data Mining with RapidMiner Studio 7 : A Basic and Intermediate
 
Practical Data Mining: FP-Growth
Practical Data Mining: FP-GrowthPractical Data Mining: FP-Growth
Practical Data Mining: FP-Growth
 
Install weka extension_rapidminer
Install weka extension_rapidminerInstall weka extension_rapidminer
Install weka extension_rapidminer
 

Kürzlich hochgeladen

Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
shambhavirathore45
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
shivangimorya083
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
AroojKhan71
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
MarinCaroMartnezBerg
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
shivangimorya083
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
JohnnyPlasten
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 

Kürzlich hochgeladen (20)

Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 

Introduction to Big Data Technologies

  • 1. Introduction to 
 Big Data Technologies Eakasit Pacharawongsakda, Ph.D. eakasit@datacubeth.ai Data Cube / Quandatics
  • 2. http://dataminingtrend.com http://facebook.com/datacube.th Outline • Part 1: Introduction to Big Data • Part 2: Introduction to NoSQL • Part 3: Introduction to MapReduce and Hadoop • Part 4: Introduction to Hive, HBase and Sqoop 2
  • 6. เวลา 08:00 น. เจ้านายโทรศัพท์เข้ามาถามงาน source: https://d1ai9qtk9p41kl.cloudfront.net/assets/mc/psuderman/2011_07/text-drive.png
  • 7. เวลา 08:05 น. ขับรถไปชนกับคันอื่น
  • 8. เวลา 10:00 น. ถึงที่ทำงานและทำงานต่อไป source: http://stuffpoint.com/anime-and-manga/image/285181-anime-and-manga-girl-working-in-the-computer.jpg
  • 9. เวลา 18:00 น. แวะซื้อของกลับบ้าน
  • 10. เวลา 20:00 น. กลับถึงบ้านและอยู่คนเดียว
  • 20. http://dataminingtrend.com http://facebook.com/datacube.th Big Data & Analytics • Big Bang 20 source:http://www.thetechy.com/science/exploring-universe-curiosity
  • 21. http://dataminingtrend.com http://facebook.com/datacube.th Big Data & Analytics • Big Architecture (Great wall of China) 21 source: http://www.history.com/topics/great-wall-of-china
  • 22. http://dataminingtrend.com http://facebook.com/datacube.th Big Data & Analytics • Big Data 22source: http://www.plmjim.com/?p=583
  • 23. http://dataminingtrend.com http://facebook.com/datacube.th Data Evolutions 23 source:Data Science and Big Data Analytics: Discovering, analyzing, visualizing and presenting data
  • 24. http://dataminingtrend.com http://facebook.com/datacube.th What is Big Data? 24 source: https://www.youtube.com/watch?v=TzxmjbL-i4Y
  • 25. http://dataminingtrend.com http://facebook.com/datacube.th What is Big Data? 25 source: http://www.intel.com/content/www/us/en/big-data/big-data-101-animation.html#
  • 26. http://dataminingtrend.com http://facebook.com/datacube.th What is Big Data? • Big Data ประกอบด้วย 3 V • Volume • ข้อมูลมีจำนวนเพิ่มขึ้นอย่างมหาศาล • Velocity • ข้อมูลเพิ่มขึ้นอย่างรวดเร็ว • Variety • ข้อมูลมีความหลากหลายมากขึ้น 26 source: https://upxacademy.com/beginners-guide-to-big-data/
  • 27. http://dataminingtrend.com http://facebook.com/datacube.th What is Big Data? • Huge volume of data • ข้อมูลมีขนาดใหญ่มากๆ เช่น มีจำนวนเป็นพันล้านแถว (billion row) หรือ เป็นล้านคอลัมน์ (million columns) 27
  • 28. http://dataminingtrend.com http://facebook.com/datacube.th Big Data: Volume 28 source:https://datafloq.com/read/infographic/226
  • 30. http://dataminingtrend.com http://facebook.com/datacube.th What is Big Data? • Huge volume of data • ข้อมูลมีขนาดใหญ่มากๆ เช่น มีจำนวนเป็นพันล้านแถว (billion row) หรือ เป็นล้านคอลัมน์ (million columns) • Speed of new data creation and growth • ข้อมูลเกิดขึ้นอย่างรวดเร็วมากๆ 30
  • 31. http://dataminingtrend.com http://facebook.com/datacube.th Big Data: Velocity 31 source: https://upxacademy.com/beginners-guide-to-big-data/
  • 32. http://dataminingtrend.com http://facebook.com/datacube.th What is Big Data? • Huge volume of data • ข้อมูลมีขนาดใหญ่มากๆ เช่น มีจำนวนเป็นพันล้านแถว (billion row) หรือ เป็นล้านคอลัมน์ (million columns) • Speed of new data creation and growth • ข้อมูลเกิดขึ้นอย่างรวดเร็วมากๆ • Complexity of data types and structures • ข้อมูลมีความหลากหลาย ไม่ได้อยู่ในรูปแบบของตารางเท่านั้น อาจจะเป็น รูปแบบของข้อความ (text) รูปภาพ (images) หรือ วิดีโอ (video clip) 32
  • 33. http://dataminingtrend.com http://facebook.com/datacube.th Big Data: Variety 33 source: https://upxacademy.com/beginners-guide-to-big-data/
  • 34. http://dataminingtrend.com http://facebook.com/datacube.th Big Data: Variety 34 source: https://upxacademy.com/beginners-guide-to-big-data/
  • 35. http://dataminingtrend.com http://facebook.com/datacube.th What is Big Data? 35 source: http://dataconomy.com/2014/08/infographic-how-to-explain-big-data-to-your-grandmother/
  • 36. http://dataminingtrend.com http://facebook.com/datacube.th Internet of Things 36source: http://www.postscapes.com/what-exactly-is-the-internet-of-things-infographic/
  • 39. http://dataminingtrend.com http://facebook.com/datacube.th IoT applications • Disney’s Magic Band 39 source:https://disneyworld.disney.go.com/plan/my-disney-experience/bands-cards/#?CMP=SEC-WDWShareEmailNGE-MDX-MagicBand-video&video=0/0/0/0
  • 40. http://dataminingtrend.com http://facebook.com/datacube.th IoT applications • GlowCaps 40 source:http://www.vitality.net/glowcaps.html
  • 41. http://dataminingtrend.com http://facebook.com/datacube.th IoT applications • Connected Toothbrush 41 source:https://www.youtube.com/watch?v=gLpUxDdh9iQ
  • 44. http://dataminingtrend.com http://facebook.com/datacube.th IoT applications • iBeacon 44 source: https://www.mallmaverick.com/system/site_images/photos/000/001/700/original/blog_ibeacon1.jpg?1391033561
  • 45. http://dataminingtrend.com http://facebook.com/datacube.th Outline • Part 1: Introduction to Big Data • Part 2: Introduction to NoSQL • Part 3: Introduction to MapReduce and Hadoop • Part 4: Introduction to Hive, HBase and Sqoop 45
  • 46. http://dataminingtrend.com http://facebook.com/datacube.th Relational database & SQL • Databases are made up of tables and each table is made up of rows and columns • SQL is a database interaction language that allows you to add, retrieve, edit and delete information stored in databases 46 ID Mark Code Title S103 72 DBS Database Systems S103 58 IAI Intro to AI S104 68 PR1 Programming 1 S104 65 IAI Intro to AI S106 43 PR2 Programming 2 S107 76 PR1 Programming 1 S107 60 PR2 Programming 2 S107 35 IAI Intro to AI
  • 47. http://dataminingtrend.com http://facebook.com/datacube.th Relational database & SQL • SQL primarily works with two types of operations to query data • Read consists of the SELECT command, which has three common clauses • SELECT • FROM • WHERE 47image source:https://justbablog.files.wordpress.com/2017/03/sql_beginners.jpg
  • 48. http://dataminingtrend.com http://facebook.com/datacube.th Relational database & SQL 48image source:https://justbablog.files.wordpress.com/2017/03/sql_beginners.jpg
  • 49. http://dataminingtrend.com http://facebook.com/datacube.th Why NoSQL? • Relational databases have been the dominate type of database used for application for decades. • With the advent of the Web, however, the limitations of relational databases became increasingly problematic. • Companies such as Google, LinkedIn, Yahoo! and Amazon found that supporting large numbers of users on the Web was different from supporting much smaller numbers of business users. 49
  • 50. http://dataminingtrend.com http://facebook.com/datacube.th Why NoSQL? 50image source:https://www.slideshare.net/up1/introduction-to-nosql-61023856?qid=8519a104-f1d8-4955-a58b-a1eb61f61a8c
  • 51. http://dataminingtrend.com http://facebook.com/datacube.th Why NoSQL? • Web applications needed to support • Large volumes of read and write operations • Low-latency response times • High availability • These requirements were difficult to realise using relational databases. • There are limits to how many CPUs and how much memory can be supported in a single server. • Another option is to use multiple servers with a relational database. • operating a single RDBMS over multiple servers is a complex operation 51
  • 52. http://dataminingtrend.com http://facebook.com/datacube.th Why NoSQL? • NoSQL is “Not Only SQL” • Four key characteristics of large-scale data management are • Scalability • Cost • Flexibility • Availability 52
  • 53. http://dataminingtrend.com http://facebook.com/datacube.th Why NoSQL?: Scalability • Scalability is the ability to efficiently meet the needs for varying workloads. • For example, if there is a spike in traffic to a website, additional servers can be brought online to handle the additional load. • When the spike subsides and traffic returns to normal, some of those additional servers can be shut down. • Adding servers as needed is called scaling out. 53
  • 54. http://dataminingtrend.com http://facebook.com/datacube.th Why NoSQL?: Scalability • Scaling Up • Scaling Out 54
  • 55. http://dataminingtrend.com http://facebook.com/datacube.th Why NoSQL?: Scalability • Scaling out is more flexible than scaling up. • Servers can be added or removed as needed when scaling out. • NoSQL databases are designed to utilise the servers available in a cluster with minimal intervention by database administrators. 55
  • 56. http://dataminingtrend.com http://facebook.com/datacube.th Why NoSQL?: Cost • Commercial software vendors employ a variety of licensing models that include charging by • the size of the server running the RDBMS • the number of concurrent users on the database • the number of named users allowed to use the software • The major NoSQL databases are available as open source; they are free to use on as many servers, of whatever size, as needed 56
  • 57. http://dataminingtrend.com http://facebook.com/datacube.th Why NoSQL?: Cost 57image source:https://www.slideshare.net/up1/introduction-to-nosql-61023856?qid=8519a104-f1d8-4955-a58b-a1eb61f61a8c
  • 58. http://dataminingtrend.com http://facebook.com/datacube.th Why NoSQL?: Flexibility • Database designers expect to know at the start of a project all the tables and columns that will be needed to support an application. • It is also commonly assumed that most of the columns in a table will be needed by most of the rows. • Unlike relational databases, some NoSQL databases do not require a fixed table structure. • For example, in a document database, a program could dynamically add new attributes as needed without having to have a database designer alter the database design. 58
  • 59. http://dataminingtrend.com http://facebook.com/datacube.th Why NoSQL?: Availability • Many of us have come to expect websites and web applications to be available whenever we want to use them. • NoSQL databases are designed to take advantage of multiple, low-cost servers. • When one server fails or is taken out of service for maintenance, the other servers in the cluster can take on the entire workload. 59
  • 60. http://dataminingtrend.com http://facebook.com/datacube.th Variety of NoSQL Databases • There are 4 major types of key NoSQL databases • Key-Value databases • Document databases • Column-oriented databases • Graph databases 60
  • 61. http://dataminingtrend.com http://facebook.com/datacube.th Key-Value databases • Key-value databases are the simplest form of NoSQL databases. • These databases are modelled on two components: keys and values • Data is stored as key-value pairs, where the attribute is the Key and the content is the Value • Data can only be queried and retrieved using the key. 61
  • 62. http://dataminingtrend.com http://facebook.com/datacube.th Key-Value databases • use cases • caching data from relational databases to improve performance • storing data from sensors (IoT) • software • Redis • Amazon DynamoDB 62 (example key-value pairs: Keys 1.accountNumber, 1.name, 1.numItems, 1.custType with Values 387694, Jane Washington, 3, Loyalty Member)
  • 63. http://dataminingtrend.com http://facebook.com/datacube.th Key-Value databases • Redis example (http://try.redis.io) • Set or update value against a key: • SET university "DPU" // set string • GET university // get string • HSET student firstName "Manee" // Hash – set field value • HGET student firstName // Hash – get field value • LPUSH "alice:sales" "10" "20" // List create/append • LSET "alice:sales" "0" "4" // List update • LRANGE "alice:sales" 0 1 // view list 63
  • 64. http://dataminingtrend.com http://facebook.com/datacube.th Key-Value databases • Set or update value against a key: • SET quantities 1 • INCR quantities • SADD "alice:friends" "f1" "f2" //Set – create/update • SADD "bob:friends" "f2" "f1" //Set – create/update • Set operations: • intersection • SINTER "alice:friends" "bob:friends" • union • SUNION "alice:friends" "bob:friends" 64
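The same commands can be issued from application code. Below is a minimal sketch using the redis-py client; it assumes a Redis server on localhost:6379 and the redis package installed (pip install redis), neither of which is covered in the slides.

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

r.set("university", "DPU")                 # SET university "DPU"
print(r.get("university"))                 # GET university

r.hset("student", "firstName", "Manee")    # HSET student firstName "Manee"
print(r.hget("student", "firstName"))      # HGET student firstName

r.lpush("alice:sales", "10", "20")         # LPUSH creates/appends a list
print(r.lrange("alice:sales", 0, 1))       # LRANGE views the list

r.sadd("alice:friends", "f1", "f2")        # SADD creates/updates a set
r.sadd("bob:friends", "f2", "f1")
print(r.sinter("alice:friends", "bob:friends"))   # set intersection
print(r.sunion("alice:friends", "bob:friends"))   # set union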
  • 65. http://dataminingtrend.com http://facebook.com/datacube.th Variety of NoSQL Databases • There are 4 major types of key NoSQL databases • Key-Value databases • Document databases • Column-oriented databases • Graph databases 65
  • 66. http://dataminingtrend.com http://facebook.com/datacube.th Document Databases • A document store allows the inserting, retrieving, and manipulating of semi-structured data. • Compared to an RDBMS, the documents themselves act as records (or rows); however, they are semi-structured rather than rigidly fixed in schema. • A collection can store documents that have different sets of data fields (columns) • Most of the databases in this category use XML or JSON 66
  • 67. http://dataminingtrend.com http://facebook.com/datacube.th Document Databases • Document examples 67
{ "EmployeeID" : "SM1", "FirstName" : "Anuj", "LastName" : "Sharma", "Age" : 45, "Salary" : 10000000 }
{ "EmployeeID" : "MM2", "FirstName" : "Anand", "Age" : 34, "Salary" : 5000000, "Address" : { "Line1" : "123, 4th Street", "City" : "Bangalore", "State" : "Karnataka" }, "Projects" : [ "nosql-migration", "top-secret-007" ] }
  • 68. http://dataminingtrend.com http://facebook.com/datacube.th Document Databases • Use cases • back-end support for websites with high volumes of reads and writes • applications that use JSON data structures such as twitter data • Software • MongoDB • Couchbase • IBM Cloudant 68
  • 69. http://dataminingtrend.com http://facebook.com/datacube.th Document Databases • MongoDB examples • Download MongoDB from https://www.mongodb.com/download-center?jmp=nav#community • MongoDB’s default data directory path is the absolute path \data\db on the drive from which you start MongoDB • You can specify an alternate path for data files using the --dbpath option to mongod.exe • Import example data 69
"C:\Program Files\MongoDB\Server\3.4\bin\mongod.exe" --dbpath d:\test\mongodb\data
mongoimport --db test --collection restaurants --drop --file downloads/primer-dataset.json
  • 70. http://dataminingtrend.com http://facebook.com/datacube.th Document Databases • MongoDB examples • Download and install Robomongo (https://robomongo.org/ download) 70
  • 71. http://dataminingtrend.com http://facebook.com/datacube.th Document Databases • MongoDB examples • Find bakery shops • Find restaurants on “Morris Park Ave” street • Find restaurants whose zip code starts with 100 • Find bakery shops on “Morris Park Ave” street 71
  • 72. http://dataminingtrend.com http://facebook.com/datacube.th Document Databases • MongoDB examples • Find bakery shops and show their grades • Find bakery shops and show their cuisine and grades • For more examples, please visit https://docs.mongodb.com/getting-started/shell/query/ 72
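For readers who prefer to issue these queries from code, here is a minimal sketch using the pymongo driver against the restaurants collection imported earlier; the field names (cuisine, address.street, address.zipcode, grades) follow MongoDB's sample primer dataset and should be treated as assumptions here.

from pymongo import MongoClient

# assumes a local mongod and the restaurants collection imported with mongoimport
client = MongoClient("localhost", 27017)
db = client.test

# find bakery shops
bakeries = db.restaurants.find({"cuisine": "Bakery"})

# find restaurants on "Morris Park Ave"
on_street = db.restaurants.find({"address.street": "Morris Park Ave"})

# find restaurants whose zip code starts with 100 (regular-expression match)
zip_100 = db.restaurants.find({"address.zipcode": {"$regex": "^100"}})

# find bakery shops and show only their cuisine and grades (projection)
graded = db.restaurants.find({"cuisine": "Bakery"}, {"cuisine": 1, "grades": 1})

for doc in graded.limit(3):
    print(doc)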
  • 73. http://dataminingtrend.com http://facebook.com/datacube.th Variety of NoSQL Databases • There are 4 major types of key NoSQL databases • Key-Value databases • Document databases • Column-oriented databases • Graph databases 73
  • 74. http://dataminingtrend.com http://facebook.com/datacube.th Column-oriented databases • Store data as columns, as opposed to the row-based storage that is prominent in RDBMS • A relational database presents data as two-dimensional tables comprising rows and columns, but stores, retrieves, and processes it one row at a time • A column-oriented database stores each column contiguously, i.e. on disk or in memory each column is stored in sequential blocks. 74
  • 75. http://dataminingtrend.com http://facebook.com/datacube.th Column-oriented databases • Example table 75image source: http://saphanatutorial.com/column-data-storage-and-row-data-storage-sap-hana/
  • 76. http://dataminingtrend.com http://facebook.com/datacube.th Column-oriented databases • Advantages of column-based tables: • Faster Data Access: • Only the affected columns have to be read during the selection process of a query. Any of the columns can serve as an index. • Better Compression: • Columnar data storage allows highly efficient compression because the majority of the columns contain only a few distinct values (compared to the number of rows). 76
  • 77. http://dataminingtrend.com http://facebook.com/datacube.th Column-oriented databases • Advantages of column-based tables: • Better parallel Processing: • In a column store, data is already vertically partitioned. This means that operations on different columns can easily be processed in parallel. • If multiple columns need to be searched or aggregated, each of these operations can be assigned to a different processor core. 77
  • 78. http://dataminingtrend.com http://facebook.com/datacube.th Column-oriented databases • For analytic applications, where aggregations are used and fast search and processing are required, row-based storage is a poor fit. • In row-based tables, all data stored in a row has to be read even when only a few columns are needed. • Hence, such queries over huge amounts of data take a lot of time. • In columnar tables, the values of each column are stored physically next to each other, which significantly increases the speed of such queries. 78
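The difference can be made tangible with a small, purely illustrative Python sketch (not from the slides) that keeps the same three records in a row layout and a column layout; the aggregate over price only touches one contiguous list in the column layout.

# the same records stored row-wise and column-wise (illustrative only)
rows = [
    {"order_id": 1, "product": "A", "price": 10},
    {"order_id": 2, "product": "B", "price": 25},
    {"order_id": 3, "product": "A", "price": 10},
]
columns = {
    "order_id": [1, 2, 3],
    "product":  ["A", "B", "A"],
    "price":    [10, 25, 10],
}

# row store: an aggregate over one column still walks every field of every row
total_row_store = sum(r["price"] for r in rows)

# column store: only the contiguous "price" column is read
total_column_store = sum(columns["price"])

# few distinct values per column (e.g. product) also compress well
print(total_row_store, total_column_store, set(columns["product"]))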
  • 79. http://dataminingtrend.com http://facebook.com/datacube.th Column-oriented databases • Column storage is most useful for OLAP queries (queries using SQL aggregate functions), because these queries read just a few attributes from every data entry. • For traditional OLTP queries (queries not using SQL aggregate functions), it is more advantageous to store all attributes side by side in row tables 79
  • 80. http://dataminingtrend.com http://facebook.com/datacube.th Column-oriented databases 80image source: http://saphanatutorial.com/column-data-storage-and-row-data-storage-sap-hana/
  • 81. http://dataminingtrend.com http://facebook.com/datacube.th Column-oriented databases 81image source: http://saphanatutorial.com/column-data-storage-and-row-data-storage-sap-hana/
  • 82. http://dataminingtrend.com http://facebook.com/datacube.th Column-oriented databases 82 Operation (Column-oriented vs Row-oriented): • Aggregate calculation on a single column, e.g. sum(price): Fast vs Slow • Compression: Higher vs Lower • Retrieval of a few columns from a table with many columns: Fast vs Slow • Insertion/updating of a single new record: Slow vs Fast • Retrieval of a single record: Slow vs Fast
  • 83. http://dataminingtrend.com http://facebook.com/datacube.th Column-oriented databases • Use cases • OLAP • Data Analytics • Software • Cassandra • HBase (Hadoop) • Google BigTable • SAP HANA 83
  • 84. http://dataminingtrend.com http://facebook.com/datacube.th Variety of NoSQL Databases • There are 4 major types of key NoSQL databases • Key-Value databases • Document databases • Column-oriented databases • Graph databases 84
  • 85. http://dataminingtrend.com http://facebook.com/datacube.th Graph databases • Graph databases are the most specialized of the 4 NoSQL databases. • Instead of modelling data using columns and rows, a graph database uses structures called nodes and relationships. • In more formal discussions, they are called vertices and edges • A node is an object that has an identifier and a set of attributes • A relationship is a link between two nodes that contains attributes about that relation. • Graph databases are designed to model adjacency between objects. Every node in the database contains pointers to adjacent objects in the database. • This allows for fast operations that require following paths through a graph. 85
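A minimal sketch of the idea in plain Python (not a real graph database API): nodes carry attributes, relationships are adjacency lists, and following a path is just hopping from node to adjacent node.

# nodes with attributes and relationships stored as adjacency lists
nodes = {
    "alice": {"label": "Person", "age": 30},
    "bob":   {"label": "Person", "age": 32},
    "acme":  {"label": "Company"},
}
relationships = {
    "alice": [("KNOWS", "bob")],
    "bob":   [("WORKS_AT", "acme")],
}

def neighbours(node):
    # adjacent nodes reachable in one hop
    return [target for _, target in relationships.get(node, [])]

print(neighbours("alice"))                    # ['bob']
print(neighbours(neighbours("alice")[0]))     # ['acme'], two hops from alice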
  • 86. http://dataminingtrend.com http://facebook.com/datacube.th Graph databases • Example 86image source: NoSQL for Mere Mortals, Dan Sullivan, 2015
  • 87. http://dataminingtrend.com http://facebook.com/datacube.th Graph databases • Example 87image source: NoSQL for Mere Mortals, Dan Sullivan, 2015
  • 88. http://dataminingtrend.com http://facebook.com/datacube.th Outline • Part 1: Introduction to Big Data • Part 2: Introduction to NoSQL • Part 3: Introduction to MapReduce and Hadoop • Part 4: Introduction to Hive, HBase and Sqoop 88
  • 89. http://dataminingtrend.com http://facebook.com/datacube.th Hadoop architecture • Hadoop is composed of two primary components that implement the basic concepts of distributed storage and computation: HDFS and YARN • HDFS (sometimes shortened to DFS) is the Hadoop Distributed File System, responsible for managing data stored on disks across the cluster. • YARN acts as a cluster resource manager, allocating computational assets (processing availability and memory on worker nodes) to applications that wish to perform a distributed computation. 89
  • 90. http://dataminingtrend.com http://facebook.com/datacube.th Hadoop architecture 90 Image source: "Data Analytics with Hadoop: An Introduction for Data Scientists", Benjamin Bengfort and Jenny Kim, 2016
  • 91. http://dataminingtrend.com http://facebook.com/datacube.th Hadoop architecture • HDFS and YARN work in concert to minimize the amount of network traffic in the cluster primarily by ensuring that data is local to the required computation. • A set of machines that is running HDFS and YARN is known as a cluster, and the individual machines are called nodes. • A cluster can have a single node, or many thousands of nodes, but all clusters scale horizontally, meaning as you add more nodes, the cluster increases in both capacity and performance in a linear fashion. 91
  • 92. http://dataminingtrend.com http://facebook.com/datacube.th Hadoop architecture • Each node in the cluster is identified by the type of process that it runs: • Master nodes • These nodes run coordinating services for Hadoop workers and are usually the entry points for user access to the cluster. • Worker nodes • Worker nodes run services that accept tasks from master nodes either to store or retrieve data or to run a particular application. • A distributed computation is run by parallelizing the analysis across worker nodes. 92
  • 93. http://dataminingtrend.com http://facebook.com/datacube.th Hadoop architecture • For HDFS, the master and worker services are as follows: • NameNode (Master) • Stores the directory tree of the file system, file metadata, and the location of each file in the cluster. • Clients wanting to access HDFS must first locate the appropriate storage nodes by requesting information from the NameNode. • DataNode (Worker) • Stores and manages HDFS blocks on the local disk. • Reports health and status of individual data stores back to the NameNode 93
  • 94. http://dataminingtrend.com http://facebook.com/datacube.th Hadoop architecture • An HDFS cluster with a replication factor of two; the NameNode contains the mapping of files to blocks, and the DataNodes store the blocks and their replicas 94 Image source: "Hadoop with Python", Zachary Radtka and Donald Miner, 2016
  • 95. http://dataminingtrend.com http://facebook.com/datacube.th Hadoop architecture • When data is accessed from HDFS • a client application must first make a request to the NameNode to locate the data on disk. • The NameNode will reply with a list of DataNodes that store the data. • the client must then directly request each block of data from the DataNode. 95
  • 96. http://dataminingtrend.com http://facebook.com/datacube.th Hadoop architecture • YARN has multiple master services and a worker service as follows: • ResourceManager (Master) • Allocates and monitors available cluster resources (e.g., physical assets like memory and processor cores) • handling scheduling of jobs on the cluster • ApplicationMaster (Master) • Coordinates a particular application being run on the cluster as scheduled by the ResourceManager 96
  • 97. http://dataminingtrend.com http://facebook.com/datacube.th Hadoop architecture • YARN has multiple master services and a worker service as follows: • NodeManager (Worker) • Runs and manages processing tasks on an individual node as well as reports the health and status of tasks as they’re running 97
  • 98. http://dataminingtrend.com http://facebook.com/datacube.th Hadoop architecture • A small Hadoop cluster with two master nodes and four workers nodes that implements all six primary Hadoop services 98 Image source: "Data Analytics with Hadoop: An Introduction for Data Scientists", Benjamin Bengfort and Jenny Kim, 2016
  • 99. http://dataminingtrend.com http://facebook.com/datacube.th Hadoop architecture • Clients that wish to execute a job • must first request resources from the ResourceManager, which assigns an application-specific ApplicationMaster for the duration of the job. • the ApplicationMaster tracks the execution of the job. • the ResourceManager tracks the status of the nodes • each individual NodeManager creates containers and executes tasks within them 99
  • 100. http://dataminingtrend.com http://facebook.com/datacube.th Hadoop architecture • Finally, one other type of cluster is important to note: a single node cluster. • In “pseudo-distributed mode” a single machine runs all Hadoop daemons as though it were part of a cluster, but network traffic occurs through the local loopback network interface. • Hadoop developers typically work in a pseudo-distributed environment, usually inside of a virtual machine to which they connect via SSH. • Cloudera, Hortonworks, and other popular distributions of Hadoop provide pre-built virtual machine images that you can download and get started with right away. 100
  • 101. http://dataminingtrend.com http://facebook.com/datacube.th Hadoop Distributed File System (HDFS) • HDFS provides redundant storage for big data by storing that data across a cluster of cheap, unreliable computers, thus extending the amount of available storage capacity that a single machine alone might have. • HDFS performs best with a modest number of very large files • millions of large files (100 MB or more) rather than billions of smaller files that might occupy the same volume. • It is not a good fit as a data backend for applications that require updates in real-time, interactive data analysis, or record-based transactional support. 101
  • 102. http://dataminingtrend.com http://facebook.com/datacube.th Hadoop Distributed File System (HDFS) • HDFS files are split into blocks, usually of either 64 MB or 128 MB. • Blocks allow very large files to be split across and distributed to many machines at run time. • Additionally, blocks are replicated across the DataNodes. • by default, the replication is threefold • Therefore, each block exists on three different machines and three different disks, and even if two nodes fail, the data will not be lost. 102
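As a quick back-of-the-envelope check of these defaults (128 MB blocks, replication factor 3), the small sketch below is illustrative only; the 1 GB file size is a made-up example.

import math

file_size_mb = 1024       # a hypothetical 1 GB file
block_size_mb = 128       # default block size assumed above
replication = 3           # default replication factor

blocks = math.ceil(file_size_mb / block_size_mb)    # 8 blocks
block_copies = blocks * replication                 # 24 block replicas in the cluster
raw_storage_mb = file_size_mb * replication         # 3072 MB of raw disk used

print(blocks, block_copies, raw_storage_mb)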
  • 103. http://dataminingtrend.com http://facebook.com/datacube.th Interacting with HDFS • Interacting with HDFS is primarily performed from the command line using the hadoop fs command, which has the following usage: • The -option argument is the name of a specific option for the specified command, and <arg> is one or more arguments that are specified for this option. • For example, show help 103 $ hadoop fs [-option <arg>] $ hadoop fs -help
  • 104. http://dataminingtrend.com http://facebook.com/datacube.th Interacting with HDFS • List directory contents • use -ls command: • Running the -ls command on a new cluster will not return any results. This is because the -ls command, without any arguments, will attempt to display the contents of the user’s home directory on HDFS. • Providing -ls with the forward slash (/) as an argument displays the contents of the root of HDFS: 104 $ hadoop fs -ls $ hadoop fs -ls /
  • 105. http://dataminingtrend.com http://facebook.com/datacube.th Interacting with HDFS • Creating a directory • To create the books directory within HDFS, use the -mkdir command: • For example, create books directory in home directory • Use the -ls command to verify that the previous directories were created: 105 $ hadoop fs -mkdir [directory name] $ hadoop fs -mkdir books $ hadoop fs -ls
  • 106. http://dataminingtrend.com http://facebook.com/datacube.th Interacting with HDFS • Copy Data onto HDFS • After a directory has been created for the current user, data can be uploaded to the user’s HDFS home directory with the -put command: • For example, copy book file from local to HDFS • Use the -ls command to verify that pg20417.txt was moved to HDFS: 106 $ hadoop fs -put [source file] [destination file] $ hadoop fs -put pg20417.txt books/pg20417.txt $ hadoop fs -ls books
  • 107. http://dataminingtrend.com http://facebook.com/datacube.th Interacting with HDFS • Retrieve (view) Data from HDFS • Multiple commands allow data to be retrieved from HDFS. • To simply view the contents of a file, use the -cat command. -cat reads a file on HDFS and displays its contents to stdout. • The following command uses -cat to display the contents of pg20417.txt • 107 $ hadoop fs -cat books/pg20417.txt
  • 108. http://dataminingtrend.com http://facebook.com/datacube.th Interacting with HDFS • Retrieve (view) Data from HDFS • Data can also be copied from HDFS to the local filesystem using the -get command. The -get command is the opposite of the -put command: • For example, This command copies pg20417.txt from HDFS to the local filesystem. 108 $ hadoop fs -get [source file] [destination file] $ hadoop fs -get pg20417.txt .
  • 109. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce • MapReduce is a programming model that enables large volumes of data to be processed and generated by dividing work into independent tasks and executing the tasks in parallel across a cluster of machines. • At a high level, every MapReduce program transforms a list of input data elements into a list of output data elements twice, once in the map phase and once in the reduce phase. • The MapReduce framework is composed of three major phases: map, shuffle and sort, and reduce. 109 Image source: "Data Analytics with Hadoop: An Introduction for Data Scientists", Benjamin Bengfort and Jenny Kim, 2016
  • 110. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce • Map • The first phase of a MapReduce application is the map phase. Within the map phase, a function (called the mapper) processes a series of key-value pairs. • The mapper sequentially processes each key-value pair individually, producing zero or more output key-value pairs • As an example, consider a mapper whose purpose is to transform sentences into words. 110
  • 111. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce • Map • The input to this mapper would be strings that contain sentences, and the mapper’s function would be to split the sentences into words and output the words 111 Image source: "Hadoop with Python", Zachary Radtka and Donald Miner, 2016
  • 112. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce • Shuffle and Sort • As the mappers begin completing, the intermediate outputs from the map phase are moved to the reducers. This process of moving output from the mappers to the reducers is known as shuffling. • Shuffling is handled by a partition function, known as the partitioner. The partitioner ensures that all of the values for the same key are sent to the same reducer. • The intermediate keys and values for each partition are sorted by the Hadoop framework before being presented to the reducer. 112
  • 113. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce • Reduce • Within the reducer phase, an iterator of values is provided to a function known as the reducer. The iterator of values is a nonunique set of values for each unique key from the output of the map phase. • The reducer aggregates the values for each unique key and produces zero or more output key-value pairs • As an example, consider a reducer whose purpose is to sum all of the values for a key. The input to this reducer is an iterator of all of the values for a key, and the reducer sums all of the values. 113
  • 114. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce • Reduce • The reducer then outputs a key-value pair that contains the input key and the sum of the input key values 114 Image source: "Hadoop with Python", Zachary Radtka and Donald Miner, 2016
  • 115. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce • Data flow of a MapReduce job being executed on a cluster of a few nodes 115 Image source: "Data Analytics with Hadoop: An Introduction for Data Scientists", Benjamin Bengfort and Jenny Kim, 2016
  • 116. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples • In order to demonstrate how data flows through a map and reduce computational pipeline, we will present 3 examples • word counting • IoT data • shared friendship 116
  • 117. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: word count • The word-counting application takes as input one or more text files and produces a list of words and their frequencies as output. 117 Image source: "Data Analytics with Hadoop: An Introduction for Data Scientists", Benjamin Bengfort and Jenny Kim, 2016
  • 118. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: word count • Because Hadoop utilizes key/value pairs, the input key is a file ID and line number and the input value is a string, while the output key is a word and the output value is an integer. • The following Python pseudocode shows how this algorithm is implemented: 118
# emit is a function that performs hadoop I/O
def map(dockey, line):
    for word in line.split():
        emit(word, 1)

def reduce(word, values):
    count = sum(value for value in values)
    emit(word, count)
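To see the data flow that the next slides walk through step by step, the following self-contained sketch simulates the three phases locally on the two example sentences; it only illustrates the programming model, not how Hadoop actually executes a job.

from collections import defaultdict

def map_phase(doc_key, line):
    # one (word, 1) pair per word, mirroring the mapper pseudocode above
    return [(word, 1) for word in line.split()]

def shuffle_and_sort(pairs):
    # group all values for the same key, as the framework does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(word, values):
    return (word, sum(values))

inputs = [(27183, "The fast cat wears no hat."),
          (31416, "The cat in the hat ran fast.")]

intermediate = []
for key, line in inputs:
    intermediate.extend(map_phase(key, line))

for word, values in shuffle_and_sort(intermediate):
    print(reduce_phase(word, values))    # e.g. ('The', 2), ('cat', 2), ...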
  • 119. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: word count • Example
 (Map) 119 (27183, “The fast cat wears no hat.”) (31416, “The cat in the hat ran fast.”) input Mapper 1 Mapper 2
  • 120. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: word count • Example
 (Map) 120 (27183, “The fast cat wears no hat.”) (31416, “The cat in the hat ran fast.”) input Mapper 1 Mapper 2 (27183, “The fast cat wears no hat.”) (31416, “The cat in the hat ran fast.”)
  • 121. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: word count • Example
 (Map) 121 (“The”,1) (“The”,1) input Mapper 1 Mapper 2 (27183, “The fast cat wears no hat.”) (31416, “The cat in the hat ran fast.”)
  • 122. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: word count • Example
 (Map) 122 (27183, “The fast cat wears no hat.”) (31416, “The cat in the hat ran fast.”) (“The”,1) (“The”,1)(“fast”,1) (“cat”,1) input Mapper 1 Mapper 2
  • 123. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: word count • Example
 (Map) 123 (27183, “The fast cat wears no hat.”) (31416, “The cat in the hat ran fast.”) (“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1) input Mapper 1 Mapper 2
  • 124. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: word count • Example
 (Map) 124 (27183, “The fast cat wears no hat.”) (31416, “The cat in the hat ran fast.”) (“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1) (“wears”,1) (“the”,1) input Mapper 1 Mapper 2
  • 125. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: word count • Example
 (Map) 125 (27183, “The fast cat wears no hat.”) (31416, “The cat in the hat ran fast.”) (“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1) (“wears”,1) (“the”,1)(“no”,1) (“hat”,1) input Mapper 1 Mapper 2
  • 126. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: word count • Example
 (Map) 126 (27183, “The fast cat wears no hat.”) (31416, “The cat in the hat ran fast.”) (“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1) (“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1) (“ran”,1) input Mapper 1 Mapper 2
  • 127. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: word count • Example
 (Map) 127 (27183, “The fast cat wears no hat.”) (31416, “The cat in the hat ran fast.”) (“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1) (“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1) (“.”,1) (“ran”,1) (“fast”,1) input Mapper 1 Mapper 2
  • 128. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: word count • Example 
 (Map) 128 (27183, “The fast cat wears no hat.”) (31416, “The cat in the hat ran fast.”) (“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1) (“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1) (“.”,1) (“ran”,1) (“fast”,1) (“.”,1) input Mapper 1 Mapper 2
  • 129. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: word count • Example (Shuffle & Sort) 129 (“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1) (“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1) (“.”,1) (“ran”,1) (“fast”,1) (“.”,1) (“.”,1) (“.”,1) Mapper 1 Mapper 2 Shuffle & Sort
  • 130. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: word count • Example (Shuffle & Sort) 130 (“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1) (“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1) (“.”,1) (“ran”,1) (“fast”,1) (“.”,1) (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) Mapper 1 Mapper 2 Shuffle & Sort
  • 131. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: word count • Example (Shuffle & Sort) 131 (“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1) (“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1) (“.”,1) (“ran”,1) (“fast”,1) (“.”,1) (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) Mapper 1 Mapper 2 Shuffle & Sort
  • 132. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: word count • Example (Shuffle & Sort) 132 (“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1) (“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1) (“.”,1) (“ran”,1) (“fast”,1) (“.”,1) (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) (“hat”,1) (“hat”,1) Mapper 1 Mapper 2 Shuffle & Sort
  • 133. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: word count • Example (Shuffle & Sort) 133 (“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1) (“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1) (“.”,1) (“ran”,1) (“fast”,1) (“.”,1) (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) (“hat”,1) (“hat”,1) (“in”,1) Mapper 1 Mapper 2 Shuffle & Sort
  • 134. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: word count • Example (Shuffle & Sort) 134 (“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1) (“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1) (“.”,1) (“ran”,1) (“fast”,1) (“.”,1) (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) (“hat”,1) (“hat”,1) (“in”,1) (“no”,1) Mapper 1 Mapper 2 Shuffle & Sort
  • 135. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: word count • Example (Shuffle & Sort) 135 (“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1) (“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1) (“.”,1) (“ran”,1) (“fast”,1) (“.”,1) (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) (“hat”,1) (“hat”,1) (“in”,1) (“no”,1) (“ran”,1) Mapper 1 Mapper 2 Shuffle & Sort
  • 136. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: word count • Example (Shuffle & Sort) 136 (“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1) (“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1) (“.”,1) (“ran”,1) (“fast”,1) (“.”,1) (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) (“hat”,1) (“hat”,1) (“in”,1) (“no”,1) (“ran”,1) (“the”,1) Mapper 1 Mapper 2 Shuffle & Sort
  • 137. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: word count • Example (Shuffle & Sort) 137 (“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1) (“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1) (“.”,1) (“ran”,1) (“fast”,1) (“.”,1) (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) (“hat”,1) (“hat”,1) (“in”,1) (“no”,1) (“ran”,1) (“the”,1) (“wears”,1) Mapper 1 Mapper 2 Shuffle & Sort
  • 138. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: word count • Example (Shuffle & Sort) 138 Mapper 1 (“The”,1) (“The”,1)(“fast”,1) (“cat”,1)(“cat”,1) (“in”,1) (“wears”,1) (“the”,1)(“no”,1) (“hat”,1)(“hat”,1) (“.”,1) (“ran”,1) (“fast”,1) (“.”,1) (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) (“hat”,1) (“hat”,1) (“in”,1) (“no”,1) (“ran”,1) (“the”,1) (“wears”,1) (“The”,1) (“The”,1) Mapper 2 Shuffle & Sort
  • 139. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: word count • Example (Reduce) 139 (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) (“hat”,1) (“hat”,1) (“in”,1) (“no”,1) (“ran”,1) (“the”,1) (“wears”,1) (“The”,1) (“The”,1) Shuffle & Sort Reduce (“.”,2)
  • 140. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: word count • Example (Reduce) 140 (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) (“hat”,1) (“hat”,1) (“in”,1) (“no”,1) (“ran”,1) (“the”,1) (“wears”,1) (“The”,1) (“The”,1) Shuffle & Sort Reduce (“.”,2) (“cat”,2)
  • 141. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: word count • Example (Reduce) 141 (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) (“hat”,1) (“hat”,1) (“in”,1) (“no”,1) (“ran”,1) (“the”,1) (“wears”,1) (“The”,1) (“The”,1) Shuffle & Sort Reduce (“.”,2) (“cat”,2) (“fast”,2)
  • 142. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: word count • Example (Reduce) 142 (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) (“hat”,1) (“hat”,1) (“in”,1) (“no”,1) (“ran”,1) (“the”,1) (“wears”,1) (“The”,1) (“The”,1) Shuffle & Sort Reduce (“.”,2) (“cat”,2) (“fast”,2) (“hat”,2)
  • 143. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: word count • Example (Reduce) 143 (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) (“hat”,1) (“hat”,1) (“in”,1) (“no”,1) (“ran”,1) (“the”,1) (“wears”,1) (“The”,1) (“The”,1) Shuffle & Sort Reduce (“.”,2) (“cat”,2) (“fast”,2) (“hat”,2) (“in”,1)
  • 144. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: word count • Example (Reduce) 144 (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) (“hat”,1) (“hat”,1) (“in”,1) (“no”,1) (“ran”,1) (“the”,1) (“wears”,1) (“The”,1) (“The”,1) Shuffle & Sort Reduce (“.”,2) (“cat”,2) (“fast”,2) (“hat”,2) (“in”,1) (“no”,1)
  • 145. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: word count • Example (Reduce) 145 (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) (“hat”,1) (“hat”,1) (“in”,1) (“no”,1) (“ran”,1) (“the”,1) (“wears”,1) (“The”,1) (“The”,1) Shuffle & Sort Reduce (“.”,2) (“cat”,2) (“fast”,2) (“hat”,2) (“in”,1) (“no”,1) (“ran”,1)
  • 146. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: word count • Example (Reduce) 146 (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) (“hat”,1) (“hat”,1) (“in”,1) (“no”,1) (“ran”,1) (“the”,1) (“wears”,1) (“The”,1) (“The”,1) Shuffle & Sort Reduce (“.”,2) (“cat”,2) (“fast”,2) (“hat”,2) (“in”,1) (“no”,1) (“ran”,1) (“the”,1)
  • 147. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: word count • Example (Reduce) 147 (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) (“hat”,1) (“hat”,1) (“in”,1) (“no”,1) (“ran”,1) (“the”,1) (“wears”,1) (“The”,1) (“The”,1) Shuffle & Sort Reduce (“.”,2) (“cat”,2) (“fast”,2) (“hat”,2) (“in”,1) (“no”,1) (“ran”,1) (“the”,1) (“wears”,1)
  • 148. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: word count • Example (Reduce) 148 (“.”,1) (“.”,1) (“cat”,1) (“cat”,1) (“fast”,1) (“fast”,1) (“hat”,1) (“hat”,1) (“in”,1) (“no”,1) (“ran”,1) (“the”,1) (“wears”,1) (“The”,1) (“The”,1) Shuffle & Sort Reduce (“.”,2) (“cat”,2) (“fast”,2) (“hat”,2) (“in”,1) (“no”,1) (“ran”,1) (“the”,1) (“wears”,1) (“The”,2)
  • 149. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples • In order to demonstrate how data flows through a map and reduce computational pipeline, we will present 3 examples • word counting • IoT data • shared friendship 149
  • 150. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: IoT • IoT applications create an enormous amount of data that has to be processed. This data is generated by physical sensors that take measurements, such as the room temperature at 8:00. • Every measurement consists of • a key (the timestamp when the measurement has been taken) and • a value (the actual value measured by the sensor). • for example, (2016-05-01 01:02:03, 1). • The goal of this exercise is to create average daily values of that sensor’s data. 150
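In the style of the word-count example, a mapper and reducer for this task can be sketched as follows (runnable locally, for illustration only): the mapper truncates each timestamp key to its date, and the reducer averages all readings that share the same date.

from collections import defaultdict

def mapper(timestamp, value):
    # "2016-05-01 01:02:03" -> key "2016-05-01"
    return (timestamp.split(" ")[0], value)

def reducer(day, values):
    # average of all readings collected for that day
    return (day, sum(values) / len(values))

readings = [("2016-05-01 01:02:03", 1), ("2016-05-01 01:02:04", 5), ("2016-05-01 01:02:05", 9),
            ("2016-05-02 12:09:04", 2), ("2016-05-02 12:09:01", 6), ("2016-05-02 12:09:30", 7)]

# a stand-in for the shuffle-and-sort step: group values by day
groups = defaultdict(list)
for ts, v in readings:
    day, value = mapper(ts, v)
    groups[day].append(value)

for day in sorted(groups):
    print(reducer(day, groups[day]))    # ('2016-05-01', 5.0), ('2016-05-02', 5.0)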
  • 151. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: IoT • Example(Map) 151 input Mapper 1 Mapper 2 Mapper 3 (“2016-05-01 01:02:03”,1) (“2016-05-02 12:09:04”,2) (“2016-05-03 09:21:07”,3) (“2016-05-03 09:21:45”,4) (“2016-05-01 01:02:04”,5) (“2016-05-02 12:09:01”,6) (“2016-05-02 12:09:30”,7) (“2016-05-03 09:21:31”,8) (“2016-05-01 01:02:05”,9)
  • 152. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: IoT • Example(Map) 152 input Mapper 1 Mapper 2 Mapper 3 (“2016-05-01 01:02:03”,1) (“2016-05-02 12:09:04”,2) (“2016-05-03 09:21:07”,3) (“2016-05-03 09:21:45”,4) (“2016-05-01 01:02:04”,5) (“2016-05-02 12:09:01”,6) (“2016-05-02 12:09:30”,7) (“2016-05-03 09:21:31”,8) (“2016-05-01 01:02:05”,9) (“2016-05-01 01:02:03”,1) (“2016-05-02 12:09:04”,2) (“2016-05-03 09:21:07”,3) (“2016-05-03 09:21:45”,4) (“2016-05-01 01:02:04”,5) (“2016-05-02 12:09:01”,6) (“2016-05-02 12:09:30”,7) (“2016-05-03 09:21:31”,8) (“2016-05-01 01:02:05”,9)
  • 153. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: IoT • Example(Map) 153 input Mapper 1 Mapper 2 Mapper 3 (“2016-05-01 01:02:03”,1) (“2016-05-02 12:09:04”,2) (“2016-05-03 09:21:07”,3) (“2016-05-03 09:21:45”,4) (“2016-05-01 01:02:04”,5) (“2016-05-02 12:09:01”,6) (“2016-05-02 12:09:30”,7) (“2016-05-03 09:21:31”,8) (“2016-05-01 01:02:05”,9) (“2016-05-01”,1) (“2016-05-02”,2) (“2016-05-03”,3) (“2016-05-03”,4) (“2016-05-01”,5) (“2016-05-02”,6) (“2016-05-02”,7) (“2016-05-03”,8) (“2016-05-01”,9)
  • 154. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: IoT • Example(Shuffle & Sort) 154 Mapper 1 Mapper 2 Mapper 3 (“2016-05-01”,1) (“2016-05-02”,2) (“2016-05-03”,3) (“2016-05-03”,4) (“2016-05-01”,5) (“2016-05-02”,6) (“2016-05-02”,7) (“2016-05-03”,8) (“2016-05-01”,9) (“2016-05-01”,1) (“2016-05-01”,5) (“2016-05-01”,9) Shuffle & Sort
  • 155. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: IoT • Example(Shuffle & Sort) 155 Mapper 1 Mapper 2 Mapper 3 (“2016-05-01”,1) (“2016-05-02”,2) (“2016-05-03”,3) (“2016-05-03”,4) (“2016-05-01”,5) (“2016-05-02”,6) (“2016-05-02”,7) (“2016-05-03”,8) (“2016-05-01”,9) (“2016-05-01”,1) (“2016-05-01”,5) (“2016-05-01”,9) (“2016-05-02”,2) (“2016-05-02”,6) (“2016-05-02”,7) Shuffle & Sort
  • 156. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: IoT • Example(Shuffle & Sort) 156 Mapper 1 Mapper 2 Mapper 3 (“2016-05-01”,1) (“2016-05-02”,2) (“2016-05-03”,3) (“2016-05-03”,4) (“2016-05-01”,5) (“2016-05-02”,6) (“2016-05-02”,7) (“2016-05-03”,8) (“2016-05-01”,9) (“2016-05-01”,1) (“2016-05-01”,5) (“2016-05-01”,9) (“2016-05-02”,2) (“2016-05-02”,6) (“2016-05-02”,7) (“2016-05-03”,3) (“2016-05-03”,4) (“2016-05-03”,8) Shuffle & Sort
  • 157. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: IoT • Example(Reduce) 157 (“2016-05-01”,1) (“2016-05-01”,5) (“2016-05-01”,9) (“2016-05-02”,2) (“2016-05-02”,6) (“2016-05-02”,7) (“2016-05-03”,3) (“2016-05-03”,4) (“2016-05-03”,8) Shuffle & Sort (“2016-05-01”,5)value = (1+5+9)/3 Reduce
  • 158. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: IoT • Example(Reduce) 158 (“2016-05-01”,1) (“2016-05-01”,5) (“2016-05-01”,9) (“2016-05-02”,2) (“2016-05-02”,6) (“2016-05-02”,7) (“2016-05-03”,3) (“2016-05-03”,4) (“2016-05-03”,8) Shuffle & Sort Reduce (“2016-05-01”,5) value = (2+6+7)/3 (“2016-05-02”,5)
  • 159. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: IoT • Example(Reduce) 159 (“2016-05-01”,1) (“2016-05-01”,5) (“2016-05-01”,9) (“2016-05-02”,2) (“2016-05-02”,6) (“2016-05-02”,7) (“2016-05-03”,3) (“2016-05-03”,4) (“2016-05-03”,8) Shuffle & Sort (“2016-05-01”,5) value = (3+4+8)/3 (“2016-05-02”,5) (“2016-05-03”,5) Reduce
  • 160. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples • In order to demonstrate how data flows through a map and reduce computational pipeline, we will present 3 examples • word counting • IoT data • shared friendship 160
  • 161. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: shared friendship • In the shared friendship task, the goal is to analyze a social network to see which friend relationships users have in common. • Given an input data source where the key is the name of a user and the value is a comma-separated list of friends. 161
  • 162. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: shared friendship • The following Python pseudocode demonstrates how to perform this computation: 162
def map(person, friends):
    for friend in friends.split(","):
        pair = sorted([person, friend])
        emit(pair, friends)

def reduce(pair, friends):
    # friends is the list of friend lists emitted for this pair
    shared = set(friends[0])
    shared = shared.intersection(friends[1])
    emit(pair, shared)
  • 163. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: shared friendship • The mapper creates an intermediate key space of all of the possible (friend, friend) tuples that exist in the initial dataset. • This allows us to analyze the dataset on a per-relationship basis, as the value is the list of associated friends. • The pair is sorted, which ensures that the inputs (“Mike”,“Linda”) and (“Linda”,“Mike”) end up being the same key during aggregation in the reducer. 163
  • 164. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: shared friendship • Example(Map) 164 input (“Allen”,”Betty, Chris, David”) (“Betty”,”Allen, Chris, David, Ellen”) (“Chris”,”Allen, Betty, David,Ellen”) (“David”,”Allen, Betty, Chris, Ellen”) (“Ellen”,”Betty, Chris, David”)
  • 165. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: shared friendship • Example(Map) 165 input Mapper 1 Mapper 2 (“Allen”,”Betty, Chris, David”) (“Betty”,”Allen, Chris, David, Ellen”) (“Chris”,”Allen, Betty, David,Ellen”) (“David”,”Allen, Betty, Chris, Ellen”) (“Ellen”,”Betty, Chris, David”) (“Allen, Betty”,”Betty, Chris, David”) (“Allen, Chris”,”Betty, Chris, David”) (“Allen, David”,”Betty, Chris, David”) (“Betty, Ellen”,”Betty, Chris, David”) (“Chris, Ellen”,”Betty, Chris, David”) (“David, Ellen”,”Betty, Chris, David”)
  • 166. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: shared friendship • Example(Map) 166 input Mapper 3 Mapper 4 (“Allen”,”Betty, Chris, David”) (“Betty”,”Allen, Chris, David, Ellen”) (“Chris”,”Allen, Betty, David,Ellen”) (“David”,”Allen, Betty, Chris, Ellen”) (“Ellen”,”Betty, Chris, David”) (“Allen, Betty”,”Allen, Chris, David, Ellen”) (“Betty, Chris”,”Allen, Chris, David, Ellen”) (“Betty, David”,”Allen, Chris, David, Ellen”) (“Betty, Ellen”,”Allen, Chris, David, Ellen”) (“Allen, Chris”,”Allen, Betty, David,Ellen”) (“Betty, Chris”,”Allen, Betty, David,Ellen”) (“Chris, David”,”Allen, Betty, David,Ellen”) (“Chris, Ellen”,”Allen, Betty, David,Ellen”)
  • 167. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: shared friendship • Example(Map) 167 input Mapper 5 (“Allen”,”Betty, Chris, David”) (“Betty”,”Allen, Chris, David, Ellen”) (“Chris”,”Allen, Betty, David,Ellen”) (“David”,”Allen, Betty, Chris, Ellen”) (“Ellen”,”Betty, Chris, David”) (“Allen, David”,”Allen, Betty, Chris, Ellen”) (“Betty, David”,”Allen, Betty, Chris, Ellen”) (“Chris, David”,”Allen, Betty, Chris, Ellen”) (“David, Ellen”,”Allen, Betty, Chris, Ellen”)
  • 168. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: shared friendship • Example (Shuffle & Sort) 168 Shuffle & Sort (“Allen, Betty”,”Betty, Chris, David”)(“Allen, Betty”,”Allen, Chris, David, Ellen”) (“Allen, Chris”,”Allen, Betty, David,Ellen”) (“Allen, Chris”,”Betty, Chris, David”) (“Allen, David”,”Allen, Betty, Chris, Ellen”) (“Allen, David”,”Betty, Chris, David”) (“Betty, Chris”,”Allen, Betty, David,Ellen”) (“Betty, Chris”,”Allen, Chris, David, Ellen”) (“Betty, David”,”Allen, Chris, David, Ellen”)(“Betty, David”,”Allen, Betty, Chris, Ellen”) (“Betty, Ellen”,”Allen, Chris, David, Ellen”) (“Betty, Ellen”,”Betty, Chris, David”) (“Chris, David”,”Allen, Betty, David,Ellen”)(“Chris, David”,”Allen, Betty, Chris, Ellen”) (“Chris, Ellen”,”Betty, Chris, David”)(“Chris, Ellen”,”Allen, Betty, David,Ellen”) (“David, Ellen”,”Allen, Betty, Chris, Ellen”) (“David, Ellen”,”Betty, Chris, David”)
  • 169. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: shared friendship • Example (Shuffle & Sort) 169 Shuffle & Sort (“Allen, Betty”,”Betty, Chris, David”)(“Allen, Betty”,”Allen, Chris, David, Ellen”) (“Allen, Chris”,”Allen, Betty, David,Ellen”) (“Allen, Chris”,”Betty, Chris, David”) (“Allen, David”,”Allen, Betty, Chris, Ellen”) (“Allen, David”,”Betty, Chris, David”) (“Betty, Chris”,”Allen, Betty, David, Ellen”) (“Betty, Chris”,”Allen, Chris, David, Ellen”) (“Betty, David”,”Allen, Chris, David, Ellen”)(“Betty, David”,”Allen, Betty, Chris, Ellen”) (“Betty, Ellen”,”Allen, Chris, David, Ellen”) (“Betty, Ellen”,”Betty, Chris, David”) (“Chris, David”,”Allen, Betty, David,Ellen”)(“Chris, David”,”Allen, Betty, Chris, Ellen”) (“Chris, Ellen”,”Betty, Chris, David”)(“Chris, Ellen”,”Allen, Betty, David,Ellen”) (“David, Ellen”,”Allen, Betty, Chris, Ellen”) (“David, Ellen”,”Betty, Chris, David”)
  • 170. http://dataminingtrend.com http://facebook.com/datacube.th MapReduce examples: shared friendship • Example (Reduce) 170 (“Allen, Betty”, “Chris, David”) (“Allen, Chris”, “Betty, David”) (“Allen, David”, “Betty, Chris”) (“Betty, Chris”, “Allen, David, Ellen”) (“Betty, David”, “Allen, Chris, Ellen”) (“Betty, Ellen”, “Chris, David”) (“Chris, David”, “Allen, Betty, Ellen”) (“Chris, Ellen”, “Betty, David”) (“David, Ellen”, “Betty, Chris”)
  • 171. http://dataminingtrend.com http://facebook.com/datacube.th Hadoop Streaming • Hadoop streaming is a utility that comes packaged with the Hadoop distribution and allows MapReduce jobs to be created with any executable as the mapper and/or the reducer. • The Hadoop streaming utility enables Python, shell scripts, or any other language to be used as a mapper, reducer, or both. • The mapper and reducer are both executables that • read input, line by line, from the standard input (stdin), • and write output to the standard output (stdout). • The Hadoop streaming utility creates a MapReduce job, submits the job to the cluster, and monitors its progress until it is complete. 171
  • 172. http://dataminingtrend.com http://facebook.com/datacube.th Hadoop Streaming • When the mapper is initialized, each map task launches the specified executable as a separate process. • The mapper reads the input file and presents each line to the executable via stdin. After the executable processes each line of input, the mapper collects the output from stdout and converts each line to a key-value pair. • The key consists of the part of the line before the first tab character, and the value consists of the part of the line after the first tab character. 172
  • 173. http://dataminingtrend.com http://facebook.com/datacube.th Hadoop Streaming • When the reducer is initialized, each reduce task launches the specified executable as a separate process. • The reducer converts the input key-value pairs to lines that are presented to the executable via stdin. • The reducer collects the executable's results from stdout and converts each line to a key-value pair. • Similar to the mapper, the executable specifies key-value pairs by separating the key and value by a tab character. 173
  • 174. http://dataminingtrend.com http://facebook.com/datacube.th Hadoop Streaming • Data flow in Hadoop Streaming via Python mapper.py and reducer.py scripts 174 Image source: "Data Analytics with Hadoop: An Introduction for Data Scientists", Benjamin Bengfort and Jenny Kim, 2016
  • 175. http://dataminingtrend.com http://facebook.com/datacube.th Hadoop Streaming example • The WordCount application can be implemented as two Python programs: mapper.py and reducer.py. • mapper.py is the Python program that implements the logic in the map phase of WordCount. • It reads data from stdin, splits the lines into words, and outputs each word with its intermediate count to stdout. 175
  • 176. http://dataminingtrend.com http://facebook.com/datacube.th Hadoop Streaming example • mapper.py 176
#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)
  • 177. http://dataminingtrend.com http://facebook.com/datacube.th Hadoop Streaming example • reducer.py is the Python program that implements the logic in the reduce phase of WordCount. • It reads the results of mapper.py from stdin, sums the occurrences of each word, and writes the result to stdout. • reducer.py 177
#!/usr/bin/env python
from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None
  • 178. http://dataminingtrend.com http://facebook.com/datacube.th Hadoop Streaming example • reducer.py (cont’) 178
# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
  • 179. http://dataminingtrend.com http://facebook.com/datacube.th Hadoop Streaming example • reducer.py (cont’) 179
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
  • 180. http://dataminingtrend.com http://facebook.com/datacube.th Hadoop Streaming example • Before attempting to execute the code, ensure that the mapper.py and reducer.py files have execution permission. • The following command will enable this for both files: • Also ensure that the first line of each file contains the proper path to Python. This line enables mapper.py and reducer.py to execute as standalone executables. • It is highly recommended to test all programs locally before running them across a Hadoop cluster. 180
$ chmod +x mapper.py reducer.py
$ echo 'The fast cat wears no hat' | ./mapper.py | sort -t 1 | ./reducer.py
  • 181. http://dataminingtrend.com http://facebook.com/datacube.th Hadoop Streaming example • Download 3 ebooks from Project Gutenberg • The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson (659 KB) • The Notebooks of Leonardo Da Vinci (1.4 MB) • Ulysses by James Joyce (1.5 MB) • Before we run the actual MapReduce job, we must first copy the files from our local file system to Hadoop’s HDFS. 181 
 $ hadoop fs -put pg20417.txt books/pg20417.txt $ hadoop fs -put 5000-8.txt books/5000-8.txt $ hadoop fs -put 4300-0.txt books/4300-0.txt $ hadoop fs -ls books
  • 182. http://dataminingtrend.com http://facebook.com/datacube.th Hadoop Streaming example • The mapper and reducer programs can be run as a MapReduce application using the Hadoop streaming utility. • The command to run the Python programs mapper.py and reducer.py on a Hadoop cluster is as follows: 182 
 $ hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/ 
hadoop-streaming-2.0.0-mr1-cdh*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /user/hduser/books/* -output /user/hduser/books/output
  • 183. http://dataminingtrend.com http://facebook.com/datacube.th Hadoop Streaming example • Options for Hadoop streaming 183 • -files: A comma-separated list of files to be copied to the MapReduce cluster • -mapper: The command to be run as the mapper • -reducer: The command to be run as the reducer • -input: The DFS input path for the Map step • -output: The DFS output directory for the Reduce step
  • 184. http://dataminingtrend.com http://facebook.com/datacube.th Python MapReduce library: mrjob • mrjob is a Python MapReduce library, created by Yelp, that wraps Hadoop streaming, allowing MapReduce applications to be written in a more Pythonic manner. • mrjob enables multistep MapReduce jobs to be written in pure Python. • MapReduce jobs written with mrjob can be tested locally, run on a Hadoop cluster, or run in the cloud using Amazon Elastic MapReduce (EMR). 184
  • 185. http://dataminingtrend.com http://facebook.com/datacube.th Python MapReduce library: mrjob • Installation • First, install python pip on CDH VM • The installation of mrjob is simple; it can be installed with pip by using the following command: 185 $ yum -y install python-pip $ pip install mrjob
  • 186. http://dataminingtrend.com http://facebook.com/datacube.th mrjob example • word_count.py • To run the job locally and count the frequency of words within a file named pg20417.txt, use the following command: 186
from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        for word in line.split():
            yield(word, 1)

    def reducer(self, word, counts):
        yield(word, sum(counts))

if __name__ == '__main__':
    MRWordCount.run()

$ python word_count.py books/pg20417.txt
  • 187. http://dataminingtrend.com http://facebook.com/datacube.th mrjob example • The MapReduce job is defined as the class, MRWordCount. Within the mrjob library, the class that inherits from MRJob contains the methods that define the steps of the MapReduce job. • The steps within an mrjob application are mapper, combiner, and reducer. The class inheriting MRJob only needs to define one of these steps. • The mapper() method defines the mapper for the MapReduce job. It takes key and value as arguments and yields tuples of (output_key, output_value). • In the WordCount example, the mapper ignored the input key and split the input value to produce words and counts. 187
  • 188. http://dataminingtrend.com http://facebook.com/datacube.th mrjob example • The combiner is a process that runs after the mapper and before the reducer. • It receives, as input, all of the data emitted by the mapper, and the output of the combiner is sent to the reducer. The combiner yields tuples of (output_key, output_value) as output. • The reducer() method defines the reducer for the MapReduce job. • It takes a key and an iterator of values as arguments and yields tuples of (output_key, output_value). • In example, the reducer sums the value for each key, which represents the frequency of words in the input. 188
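For example, the word-count job above can pre-aggregate counts on each mapper node by adding a combiner method; the following is a sketch of the same MRWordCount class with that one extra step.

from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        for word in line.split():
            yield (word, 1)

    def combiner(self, word, counts):
        # runs after the mapper on the same node, shrinking what is shuffled
        yield (word, sum(counts))

    def reducer(self, word, counts):
        yield (word, sum(counts))

if __name__ == '__main__':
    MRWordCount.run()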
  • 189. http://dataminingtrend.com http://facebook.com/datacube.th mrjob example • The final component of a MapReduce job written with the mrjob library is the two lines at the end of the file: if __name__ == '__main__': MRWordCount.run() • These lines enable the execution of mrjob; without them, the application will not work. • Executing a MapReduce application with mrjob is similar to executing any other Python program. The command line must contain the name of the mrjob application and the input file: 189 $ python mr_job.py input.txt
 • 190. http://dataminingtrend.com http://facebook.com/datacube.th mrjob example • By default, mrjob runs locally, allowing code to be developed and debugged before being submitted to a Hadoop cluster. • To change how the job is run, specify the -r/--runner option. 190
$ python word_count.py -r hadoop hdfs:books/pg20417.txt
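Since mrjob also supports multistep jobs written in pure Python (as noted earlier), a short sketch of a two-step job follows: the first step counts words and the second finds the most frequently used one. The class and method names are illustrative and follow the pattern used in the mrjob documentation:
from mrjob.job import MRJob
from mrjob.step import MRStep

class MRMostUsedWord(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words,
                   reducer=self.reducer_count_words),
            MRStep(reducer=self.reducer_find_max_word)
        ]

    def mapper_get_words(self, _, line):
        for word in line.split():
            yield (word.lower(), 1)

    def reducer_count_words(self, word, counts):
        # emit everything under a single key so one reducer sees all totals
        yield None, (sum(counts), word)

    def reducer_find_max_word(self, _, count_word_pairs):
        # the (count, word) pair with the largest count wins
        yield max(count_word_pairs)

if __name__ == '__main__':
    MRMostUsedWord.run()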
  • 191. http://dataminingtrend.com http://facebook.com/datacube.th Outline • Part 1: Introduction to Big Data • Part 2: Introduction to NoSQL • Part 3: Introduction to MapReduce and Hadoop • Part 4: Introduction to Hive, HBase and Sqoop 191
 • 192. http://dataminingtrend.com http://facebook.com/datacube.th Introduction • The Hadoop ecosystem emerged as a cost-effective way of working with large datasets • It imposes a particular programming model, called MapReduce, for breaking up computation tasks into units that can be distributed around a cluster of commodity hardware • Underneath this computation model is a distributed file system called the Hadoop Distributed Filesystem (HDFS) • However, a challenge remains: how do you move an existing data infrastructure to Hadoop, when that infrastructure is based on traditional relational databases and the Structured Query Language (SQL)? 192
  • 193. http://dataminingtrend.com http://facebook.com/datacube.th Introduction • This is where Hive comes in. Hive provides an SQL dialect, called Hive Query Language (abbreviated HiveQL or just HQL) for querying data stored in a Hadoop cluster. • SQL knowledge is widespread for a reason; it’s an effective, reasonably intuitive model for organizing and using data. • Mapping these familiar data operations to the low-level MapReduce Java API can be daunting, even for experienced Java developers. • Hive does this dirty work for you, so you can focus on the query itself. Hive translates most queries to MapReduce jobs, thereby exploiting the scalability of Hadoop, while presenting a familiar SQL abstraction. 193
  • 194. http://dataminingtrend.com http://facebook.com/datacube.th Introduction • Hive is most suited for data warehouse applications, where relatively static data is analyzed, fast response times are not required, and when the data is not changing rapidly. • Apache Hive is a “data warehousing” framework built on top of Hadoop. • Hive provides data analysts with a familiar SQL-based interface to Hadoop, which allows them to attach structured schemas to data in HDFS and access and analyze that data using SQL queries. • Hive has made it possible for developers who are fluent in SQL to leverage the scalability and resilience of Hadoop without requiring them to learn Java or the native MapReduce API. 194
  • 195. http://dataminingtrend.com http://facebook.com/datacube.th Hive in the Hadoop Ecosystem • Hive modules 195 Image source: “Programming Hive: Data Warehouse and Query Language for Hadoop”, Edward Capriolo, Dean Wampler and Jason Rutherglen, 2012
 • 196. http://dataminingtrend.com http://facebook.com/datacube.th Hive in the Hadoop Ecosystem • There are several ways to interact with Hive • CLI: command-line interface • GUI: graphical user interface • Karmasphere (http://karmasphere.com) • Cloudera’s open source Hue (https://github.com/cloudera/hue) • All commands and queries go to the Driver, which compiles the input, optimizes the computation required, and executes the required steps, usually with MapReduce jobs. 196
 • 197. http://dataminingtrend.com http://facebook.com/datacube.th Hive in the Hadoop Ecosystem • Hive communicates with the JobTracker to initiate the MapReduce job. • Hive does not have to be running on the same master node as the JobTracker. In larger clusters, it’s common to have edge nodes where tools like Hive run. • They communicate remotely with the JobTracker on the master node to execute jobs. Usually, the data files to be processed are in HDFS, which is managed by the NameNode. • The Metastore is a separate relational database (usually a MySQL instance) where Hive persists table schemas and other system metadata. 197
 • 198. http://dataminingtrend.com http://facebook.com/datacube.th Structured Data Queries with Hive • Hive provides its own dialect of SQL called the Hive Query Language, or HQL. • HQL supports many commonly used SQL statements, including data definition language (DDL) statements (e.g., CREATE DATABASE/SCHEMA/TABLE), data manipulation language (DML) statements (e.g., INSERT, UPDATE, LOAD), and data retrieval queries (e.g., SELECT). • Hive commands and HQL queries are compiled into an execution plan or a series of HDFS operations and/or MapReduce jobs, which are then executed on a Hadoop cluster. 198
 • 199. http://dataminingtrend.com http://facebook.com/datacube.th Structured Data Queries with Hive • Additionally, Hive queries entail higher latency due to the overhead required to generate and launch the compiled MapReduce jobs on the cluster; even small queries that would complete within a few seconds on a traditional RDBMS may take several minutes to finish in Hive. • On the plus side, Hive provides the high scalability and high throughput that you would expect from any Hadoop-based application. • It is very well suited to batch-level workloads for online analytical processing (OLAP) of very large datasets at the terabyte and petabyte scale. 199
 • 200. http://dataminingtrend.com http://facebook.com/datacube.th The Hive Command-Line Interface (CLI) • Hive’s installation comes packaged with a handy command-line interface (CLI), which we will use to interact with Hive and run our HQL statements. • This will initiate the CLI and bootstrap the logger (if configured) and Hive history file, and finally display a Hive CLI prompt: • You can view the full list of Hive options for the CLI by using the -H flag: 200
$ hive
hive>
$ hive -H
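The CLI can also run statements non-interactively with the standard -e (inline query) and -f (script file) options; a brief sketch, where the script path is a hypothetical example:
$ hive -e 'SHOW DATABASES;'
$ hive -f /home/cloudera/queries/flight_report.hql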
 • 201. http://dataminingtrend.com http://facebook.com/datacube.th HUE: Apache Hadoop UI • HUE (Hadoop User Experience) is a Web interface for analyzing data with Apache Hadoop. • Go to quickstart.cloudera:8888/about • username: cloudera • password: cloudera 201
  • 203. http://dataminingtrend.com http://facebook.com/datacube.th Example: web logs database • Choose default database • HQL: SELECT * FROM web_logs 203
  • 204. http://dataminingtrend.com http://facebook.com/datacube.th Example: web logs database • HQL: SELECT web_logs.country_name, count(1) AS count
 FROM web_logs 
 GROUP BY country_name 204
 • 205. http://dataminingtrend.com http://facebook.com/datacube.th Creating a database • Creating a database in Hive is very similar to creating a database in a SQL-based RDBMS, by using the CREATE DATABASE or CREATE SCHEMA statement: • When Hive creates a new database, the schema definition data is stored in the Hive metastore. • Hive will raise an error if the database already exists in the metastore; we can avoid this by adding IF NOT EXISTS: • HQL: CREATE DATABASE IF NOT EXISTS flight_data; 205
  • 206. http://dataminingtrend.com http://facebook.com/datacube.th Creating a database • We can then run SHOW DATABASES to verify that our database has been created. Hive will return all databases found in the metastore, along with the default Hive database: • HQL: SHOW DATABASES; 206
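Once the database has been created and verified, subsequent table statements can target it by switching the current database; this small step is implied between the slides:
HQL: USE flight_data;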
 • 207. http://dataminingtrend.com http://facebook.com/datacube.th Creating tables • Hive provides a SQL-like CREATE TABLE statement, which in its simplest form takes a table name and column definitions: • HQL: CREATE TABLE airlines (code INT,
description STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE; • However, because Hive data is stored in the file system (usually HDFS or the local file system), the CREATE TABLE command also takes optional clauses, such as the ROW FORMAT clause, which tells Hive how to read each row in the file and map its fields to our columns. 207
 • 208. http://dataminingtrend.com http://facebook.com/datacube.th Loading data • It’s important to note one distinction between Hive and traditional RDBMSs with regard to schema enforcement: • Traditional relational databases enforce the schema on write, rejecting any data that does not conform to the schema as defined; • Hive enforces the schema only on read (schema-on-read). If, when reading the data file, the file structure does not match the defined schema, Hive will generally return null values for missing fields or type mismatches 208
  • 209. http://dataminingtrend.com http://facebook.com/datacube.th Loading data • Data loading in Hive is done in batch-oriented fashion using a bulk LOAD DATA command or by inserting results from another query with the INSERT command. • LOAD DATA is Hive’s bulk loading command. INPATH takes an argument to a path on the default file system (in this case, HDFS). • We can also specify a path on the local file system by using LOCAL INPATH instead. Hive proceeds to move the file into the warehouse location. • If the OVERWRITE keyword is used, then any existing data in the target table will be deleted and replaced by the data file input; otherwise, the new data is added to the table. 209
 • 210. http://dataminingtrend.com http://facebook.com/datacube.th Loading data • Examples
• HQL: LOAD DATA LOCAL INPATH '/home/cloudera/Downloads/flight_data/ontime_flights.tsv'
OVERWRITE INTO TABLE flights;
• HQL: LOAD DATA LOCAL INPATH '/home/cloudera/Downloads/flight_data/airlines.tsv'
OVERWRITE INTO TABLE airlines;
• HQL: LOAD DATA LOCAL INPATH '/home/cloudera/Downloads/flight_data/carriers.tsv'
OVERWRITE INTO TABLE carriers;
• HQL: LOAD DATA LOCAL INPATH '/home/cloudera/Downloads/flight_data/cancellation_reasons.tsv'
OVERWRITE INTO TABLE cancellation_reasons; 210
  • 211. http://dataminingtrend.com http://facebook.com/datacube.th Data Analysis with Hive • Grouping • HQL: SELECT airline_code, COUNT(1) AS num_flights 
 FROM flights 
 GROUP BY airline_code 
 ORDER BY num_flights DESC; 211
 • 212. http://dataminingtrend.com http://facebook.com/datacube.th Data Analysis with Hive • Aggregations • HQL:
SELECT airline_code,
  COUNT(1) AS num_flights,
  SUM(IF(depart_delay > 0, 1, 0)) AS num_depart_delays,
  SUM(IF(arrive_delay > 0, 1, 0)) AS num_arrive_delays,
  SUM(IF(is_cancelled, 1, 0)) AS num_cancelled
FROM flights
GROUP BY airline_code; 212
 • 213. http://dataminingtrend.com http://facebook.com/datacube.th Data Analysis with Hive • Aggregations • HQL:
SELECT airline_code,
  COUNT(1) AS num_flights,
  SUM(IF(depart_delay > 0, 1, 0)) AS num_depart_delays,
  ROUND(SUM(IF(depart_delay > 0, 1, 0))/COUNT(1), 2) AS depart_delay_rate,
  SUM(IF(arrive_delay > 0, 1, 0)) AS num_arrive_delays,
  ROUND(SUM(IF(arrive_delay > 0, 1, 0))/COUNT(1), 2) AS arrive_delay_rate,
  SUM(IF(is_cancelled, 1, 0)) AS num_cancelled,
  ROUND(SUM(IF(is_cancelled, 1, 0))/COUNT(1), 2) AS cancellation_rate
FROM flights
GROUP BY airline_code
ORDER BY cancellation_rate DESC, arrive_delay_rate DESC, depart_delay_rate DESC; 213
 • 214. http://dataminingtrend.com http://facebook.com/datacube.th Introduction to HBase • While Hive provides a familiar data manipulation paradigm within Hadoop, it doesn’t change the storage and processing paradigm, which still utilizes HDFS and MapReduce in a batch-oriented fashion. • Thus, for use cases that require random, real-time read/write access to data, we need to look outside of standard MapReduce and Hive for our data persistence and processing layer. • Real-time applications need to record high volumes of time-based events that tend to have many possible structural variations. • The data may be keyed on a certain value, like User, but the value is often represented as a collection of arbitrary metadata. 214
 • 215. http://dataminingtrend.com http://facebook.com/datacube.th Introduction to HBase • For example, consider two events, “Like” and “Share”, which require different column values, as shown in the table. • In a relational model, rows are sparse but columns are not. That is, upon inserting a new row to a table, the database allocates storage for every column regardless of whether a value exists for that field or not. • However, in applications where data is represented as a collection of arbitrary fields or sparse columns, each row may use only a subset of available columns, which can make a standard relational schema both a wasteful and awkward fit. 215
 • 216. http://dataminingtrend.com http://facebook.com/datacube.th Column-Oriented Databases • NoSQL is a broad term that generally refers to non-relational databases and encompasses a wide collection of data storage models, including • graph databases • document databases • key/value data stores • column-family databases. • HBase is classified as a column-family or column-oriented database, modeled on Google’s BigTable architecture. 216
 • 217. http://dataminingtrend.com http://facebook.com/datacube.th Column-Oriented Databases • HBase organizes data into tables that contain rows. Within a table, rows are identified by their unique row key, which does not have a data type. • Row keys are similar to the concept of primary keys in relational databases, in that they are automatically indexed. 217
 • 218. http://dataminingtrend.com http://facebook.com/datacube.th Column-Oriented Databases • In HBase, table rows are sorted by their row key, and because row keys are byte arrays, almost anything can serve as a row key, from strings to binary representations of longs or even serialized data structures. • HBase stores its data as key/value pairs, where all table lookups are performed via the table’s row key, the unique identifier of the stored record data. • Data within a row is grouped into column families, which consist of related columns. 218
 • 220. http://dataminingtrend.com http://facebook.com/datacube.th Column-Oriented Databases • Storing data in columns rather than rows has particular benefits for data warehouses and analytical databases where aggregates are computed over large sets of data with potentially sparse values, where not all column values are present. • Another interesting feature of HBase and BigTable-based column-oriented databases is that the table cells, or the intersection of row and column coordinates, are versioned by timestamp. • HBase is thus also described as being a multidimensional map where time provides the third dimension 220
 • 221. http://dataminingtrend.com http://facebook.com/datacube.th Column-Oriented Databases • The time dimension is indexed in decreasing order, so that when reading from an HBase store, the most recent values are found first. • The contents of a cell can be referenced by a {rowkey, column, timestamp} tuple, or we can scan for a range of cell values by time range. 221
 • 222. http://dataminingtrend.com http://facebook.com/datacube.th Real-Time Analytics with HBase • For the purposes of this HBase overview, we use the HBase shell to design a schema for a linkshare tracker that counts the number of times a link has been shared. • Generating a schema • When designing schemas in HBase, it’s important to think in terms of the column-family structure of the data model and how it affects data access patterns. • Furthermore, because HBase doesn’t support joins and provides only a single indexed rowkey, we must be careful to ensure that the schema can fully support all use cases. 222
 • 223. http://dataminingtrend.com http://facebook.com/datacube.th Real-Time Analytics with HBase • First, we need to declare the table name and at least one column-family name at the time of table definition. • If no namespace is declared, HBase will use the default namespace • We just created a single table called linkshare in the default namespace with one column-family, named link • To alter the table after creation, such as changing or adding column families, we need to first disable the table so that clients will not be able to access the table during the alter operation: 223
hbase> create 'linkshare', 'link'
 • 224. http://dataminingtrend.com http://facebook.com/datacube.th Real-Time Analytics with HBase • Good row key design affects not only how we query the table, but the performance and complexity of data access. • By default, HBase stores rows in sorted order by row key, so that similar keys are stored to the same RegionServer. • Thus, in addition to enabling our data access use cases, we also need to be mindful to account for row key distribution across regions. • For the current example, let’s assume that we will use the unique reversed link URL for the row key. 224
hbase> disable 'linkshare'
hbase> alter 'linkshare', 'statistics'
hbase> enable 'linkshare'
 • 225. http://dataminingtrend.com http://facebook.com/datacube.th Real-Time Analytics with HBase • In our linkshare application, we want to store descriptive data about the link, such as its title, while maintaining a frequency counter that tracks the number of times the link has been shared. • We can insert, or put, a value in a cell at the specified table/row/column and optionally timestamp coordinates. • To put a cell value into table linkshare at the row with row key org.hbase.www under column-family link and column title, marked with the current timestamp: 225
hbase> put 'linkshare', 'org.hbase.www', 'link:title', 'Apache HBase'
hbase> put 'linkshare', 'org.hadoop.www', 'link:title', 'Apache Hadoop'
hbase> put 'linkshare', 'com.oreilly.www', 'link:title', "O'Reilly.com"
 • 226. http://dataminingtrend.com http://facebook.com/datacube.th Real-Time Analytics with HBase • The put operation works great for inserting a value for a single cell, but for incrementing frequency counters, HBase provides a special mechanism to treat columns as counters. • To increment a counter, we use the command incr instead of put. • The last option passed is the increment value, which in this case is 1. • Incrementing a counter will return the updated counter value, but you can also access a counter’s current value any time using the get_counter command, specifying the table name, row key, and column: 226
hbase> incr 'linkshare', 'org.hbase.www', 'statistics:share', 1
hbase> incr 'linkshare', 'org.hbase.www', 'statistics:like', 1
 • 227. http://dataminingtrend.com http://facebook.com/datacube.th Real-Time Analytics with HBase • HBase provides two general methods to retrieve data from a table: • the get command performs lookups by row key to retrieve attributes for a specific row, • and the scan command, which takes a set of filter specifications and iterates over multiple rows based on the indicated specifications. • In its simplest form, the get command accepts the table name followed by the row key, and returns the most recent version timestamp and cell value for columns in the row. 227
hbase> incr 'linkshare', 'org.hbase.www', 'statistics:share', 1
hbase> get_counter 'linkshare', 'org.hbase.www', 'statistics:share'
hbase> get 'linkshare', 'org.hbase.www'
 • 228. http://dataminingtrend.com http://facebook.com/datacube.th Real-Time Analytics with HBase • The get command also accepts an optional dictionary of parameters to specify the column(s), timestamp, timerange, and version of the cell values we want to retrieve. For example, we can specify the column(s) of interest: • A scan operation is akin to database cursors or iterators, and takes advantage of the underlying sequentially sorted storage mechanism, iterating through row data to match against the scanner specifications. • With scan, we can scan an entire HBase table or specify a range of rows to scan. 228
hbase> get 'linkshare', 'org.hbase.www', 'link:title'
hbase> get 'linkshare', 'org.hbase.www', 'link:title', 'statistics:share'
 • 229. http://dataminingtrend.com http://facebook.com/datacube.th Real-Time Analytics with HBase • You can specify an optional STARTROW and/or STOPROW parameter, which can be used to limit the scan to a specific range of rows. • If neither STARTROW nor STOPROW are provided, the scan operation will scan through the entire table. • You can, in fact, call scan with the table name to display all the contents of a table. 229
hbase> scan 'linkshare'
hbase> scan 'linkshare', {COLUMNS => ['link:title'], STARTROW => 'org.hbase.www'}
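STOPROW (exclusive) and LIMIT can be combined in the same specification; a short illustrative variant, where the stop key is an arbitrary upper bound rather than a value from the original slides:
hbase> scan 'linkshare', {STARTROW => 'org.hadoop.www', STOPROW => 'org.hbase.wwx'}
hbase> scan 'linkshare', {COLUMNS => ['statistics:share'], LIMIT => 10}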
 • 230. http://dataminingtrend.com http://facebook.com/datacube.th Introduction to Sqoop • In many cases, the input data is already structured because it resides in a relational database; it would then be convenient to leverage this known schema to import the data into Hadoop in a more efficient manner than uploading CSVs to HDFS and parsing them manually. • Sqoop (SQL-to-Hadoop) is designed to transfer data between relational database management systems (RDBMS) and Hadoop. • It automates most of the data transfer process by reading the schema information directly from the RDBMS. • Sqoop then uses MapReduce to import and export the data to and from Hadoop. 230
 • 231. http://dataminingtrend.com http://facebook.com/datacube.th Introduction to Sqoop • Sqoop gives us the flexibility to maintain our data in its production state while copying it into Hadoop to make it available for further analysis without modifying the production database. • We’ll walk through a few ways to use Sqoop to import data from a MySQL database into various Hadoop data stores, including HDFS, Hive, and HBase. • We will use MySQL as the source and target RDBMS for these examples, so we also assume that a MySQL database resides on the same host as your Hadoop/Sqoop services and is accessible via localhost and the default port, 3306. 231
  • 232. http://dataminingtrend.com http://facebook.com/datacube.th Importing from MySQL to HDFS • When importing data from relational databases like MySQL, Sqoop reads the source database to gather the necessary metadata for the data being imported. • Sqoop then submits a map-only Hadoop job to transfer the actual table data based on the metadata that was captured in the previous step. • This job produces a set of serialized files, which may be delimited text files, binary format, or SequenceFiles containing a copy of the imported table or datasets. • By default, the files are saved as comma-separated files to a directory on HDFS with a name that corresponds to the source table name. 232
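As a preview of what this looks like in practice, a minimal import command might resemble the sketch below; the database name (flight_data), table (flights), credentials, and target directory are placeholders rather than values from the original slides:
$ sqoop import \
    --connect jdbc:mysql://localhost:3306/flight_data \
    --username cloudera --password cloudera \
    --table flights \
    --target-dir /user/cloudera/flights \
    -m 1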