This document provides an introduction and overview of big data technologies. It begins with an outline that covers introductions to big data, NoSQL databases, MapReduce and Hadoop, and Hive, HBase and Sqoop. It then discusses relational databases and SQL before introducing NoSQL databases. Key reasons for using NoSQL databases are explained, including improved scalability, lower costs, flexibility in data structures, and high availability. Examples of big data applications and the internet of things are also presented.
http://dataminingtrend.com http://facebook.com/datacube.th
What is Big Data?
• Big Data is characterised by 3 Vs
• Volume
• The amount of data is growing enormously
• Velocity
• Data is being generated ever more rapidly
• Variety
• Data is becoming more varied in form
source: https://upxacademy.com/beginners-guide-to-big-data/
What is Big Data?
• Huge volume of data
• The data is extremely large, e.g., billions of rows or
millions of columns
• Speed of new data creation and growth
• Data is generated very quickly
• Complexity of data types and structures
• Data is highly varied and not limited to tables; it may take the
form of text, images, or video clips
Relational database & SQL
• Databases are made up of tables and each table is made up of
rows and columns
• SQL is a database interaction language that allows you to add,
retrieve, edit and delete information stored in databases
ID Mark Code Title
S103 72 DBS Database Systems
S103 58 IAI Intro to AI
S104 68 PR1 Programming 1
S104 65 IAI Intro to AI
S106 43 PR2 Programming 2
S107 76 PR1 Programming 1
S107 60 PR2 Programming 2
S107 35 IAI Intro to AI
Relational database & SQL
• SQL works with two main types of operations: reads and writes
• A read consists of the SELECT command, which has three
common clauses
• SELECT
• FROM
• WHERE
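To make the three clauses concrete, here is a minimal sketch using Python's built-in sqlite3 module against the example marks table shown above (the table and column names are assumptions for illustration):

```python
import sqlite3

# In-memory database populated with the example table from the slide.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE marks (id TEXT, mark INTEGER, code TEXT, title TEXT)")
conn.executemany("INSERT INTO marks VALUES (?, ?, ?, ?)", [
    ("S103", 72, "DBS", "Database Systems"),
    ("S103", 58, "IAI", "Intro to AI"),
    ("S104", 68, "PR1", "Programming 1"),
    ("S104", 65, "IAI", "Intro to AI"),
    ("S106", 43, "PR2", "Programming 2"),
    ("S107", 76, "PR1", "Programming 1"),
    ("S107", 60, "PR2", "Programming 2"),
    ("S107", 35, "IAI", "Intro to AI"),
])

# SELECT ... FROM ... WHERE: students who scored over 60 in any module.
rows = conn.execute(
    "SELECT id, mark, title FROM marks WHERE mark > 60"
).fetchall()
print(rows)
```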
image source: https://justbablog.files.wordpress.com/2017/03/sql_beginners.jpg
Why NoSQL?
• Relational databases have been the dominant type of database used
for applications for decades.
• With the advent of the Web, however, the limitations of relational
databases became increasingly problematic.
• Companies such as Google, LinkedIn, Yahoo! and Amazon found that
supporting large numbers of users on the Web was different from
supporting much smaller numbers of business users.
Why NoSQL?
• Web applications needed to support
• Large volumes of read and write operations
• Low latency response times
• High availability
• These requirements were difficult to realise using relational databases.
• There are limits to how many CPUs and how much memory can be supported in a
single server.
• Another option is to use multiple servers with a relational database.
• Operating a single RDBMS over multiple servers, however, is a complex
undertaking.
Why NoSQL?: Scalability
• Scalability is the ability to efficiently meet the needs for varying
workloads.
• For example, if there is a spike in traffic to a website, additional
servers can be brought online to handle the additional load.
• When the spike subsides and traffic returns to normal, some of
those additional servers can be shut down.
• Adding servers as needed is called scaling out.
Why NoSQL?: Scalability
• Scaling out is more flexible than scaling up.
• Servers can be added or removed as needed when scaling out.
• NoSQL databases are designed to utilise the servers available in a cluster with
minimal intervention by database administrators.
Why NoSQL?: Cost
• Commercial software vendors employ a variety of licensing
models that include charging by
• the size of the server running the RDBMS
• the number of concurrent users on the database
• the number of named users allowed to use the software
• The major NoSQL databases are available as open source, free to
use on as many servers, of whatever size, as needed.
Why NoSQL?: Flexibility
• Database designers expect to know at the start of a project all
the tables and columns that will be needed to support an
application.
• It is also commonly assumed that most of the columns in a table
will be needed by most of the rows.
• Unlike relational databases, some NoSQL databases do not
require a fixed table structure.
• For example, in a document database, a program could
dynamically add new attributes as needed without having to have a
database designer alter the database design.
Why NoSQL?: Availability
• Many of us have come to expect websites and web applications
to be available whenever we want to use them.
• NoSQL databases are designed to take advantage of multiple,
low-cost servers.
• When one server fails or is taken out of service for maintenance,
the other servers in the cluster can take on the entire workload.
Key-Value databases
• Key-value databases are the simplest form of NoSQL
databases.
• These databases are modelled on two components:
keys and values
• Data is stored as key-value pairs, where the attribute is the key
and the content is the value
• Data can be queried and retrieved using the key only.
Key-Value databases
• use cases
• caching data from
relational databases to
improve performance
• storing data from
sensors (IoT)
• software
• redis
• Amazon DynamoDB
Keys → Values
accountNumber → 3876941
name → Jane Washington
numItems → 3
custType → Loyalty Member
Key-Value databases
• Redis example (http://try.redis.io)
• Set or update value against a key:
• SET university "DPU" // set string
• GET university // get string
• HSET student firstName "Manee" // Hash – set field value
• HGET student firstName // Hash – get field value
• LPUSH "alice:sales" "10" "20" // List – create/append
• LSET "alice:sales" "0" "4" // List – update
• LRANGE "alice:sales" 0 1 // view list
Key-Value databases
• Set or update value against a key:
• SET quantities 1
• INCR quantities
• SADD "alice:friends" "f1" "f2" // Set – create/update
• SADD "bob:friends" "f2" "f1" // Set – create/update
• Set operations:
• intersection
• SINTER "alice:friends" "bob:friends"
• union
• SUNION "alice:friends" "bob:friends"
Document Databases
• A document store allows the inserting, retrieving, and
manipulating of semi-structured data.
• Compared to an RDBMS, the documents themselves act as
records (or rows); however, they are semi-structured rather than
rigidly schematised.
• A document store can hold records that have different sets of data fields
(columns)
• Most of the databases available under this category use XML or
JSON
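As a sketch of this flexibility, the following Python snippet uses plain dicts standing in for JSON documents in one collection; the field names are hypothetical:

```python
import json

# Two "documents" in the same collection with different sets of fields,
# something a rigid relational schema would need NULL columns to express.
docs = [
    {"name": "Manee", "type": "student", "courses": ["DBS", "IAI"]},
    {"name": "DPU", "type": "university", "city": "Bangkok"},
]

# A query simply inspects whichever fields each document happens to have.
students = [d for d in docs if d.get("type") == "student"]
print(json.dumps(students))
```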
Document Databases
• MongoDB examples
• Download MongoDB from https://www.mongodb.com/download-
center?jmp=nav#community
• MongoDB’s default data directory path is the absolute path \data\db
on the drive from which you start MongoDB
• You can specify an alternate path for data files using the --dbpath
option to mongod.exe
• Import example data
"C:Program FilesMongoDBServer3.4binmongod.exe"
--dbpath d:testmongodbdata
mongoimport --db test --collection restaurants --drop --file
downloads/primer-dataset.json
Column-oriented databases
• Store data as columns, as opposed to the rows that are prominent in an
RDBMS
• A relational database presents data as two-dimensional tables
comprising rows and columns, but stores, retrieves, and
processes it one row at a time
• A column-oriented database stores each column contiguously,
i.e. on disk or in memory each column is stored
in sequential blocks.
Column-oriented databases
• Advantages of column-based tables:
• Faster Data Access:
• Only affected columns have to be read during the selection
process of a query. Any of the columns can serve as an index.
• Better Compression:
• Columnar data storage allows highly efficient compression
because most columns contain only a few distinct
values (compared to the number of rows).
Column-oriented databases
• Advantages of column-based tables:
• Better parallel Processing:
• In a column store, data is already vertically partitioned. This
means that operations on different columns can easily be
processed in parallel.
• If multiple columns need to be searched or aggregated, each of
these operations can be assigned to a different processor core.
Column-oriented databases
• For analytic applications, where aggregations are used and
fast search and processing are required, row-based storage is a poor
fit.
• In row-based tables, all data stored in a row has to be read even
when only a few columns are required.
• Hence, such queries over huge amounts of data can take a long
time.
• In columnar tables, this information is stored physically next to each
other, which significantly speeds up certain data queries.
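A minimal Python sketch of the difference, with made-up data: summing one column in a row layout touches every record, while a columnar layout reads a single contiguous array.

```python
# Row-oriented: each record stored together; summing one column
# still has to walk every full row.
rows = [
    ("S103", 72, "DBS"),
    ("S104", 68, "PR1"),
    ("S107", 76, "PR1"),
]
row_total = sum(r[1] for r in rows)   # reads all fields of every row

# Column-oriented: each column stored contiguously; the aggregate
# touches only the one array it needs.
columns = {
    "id":   ["S103", "S104", "S107"],
    "mark": [72, 68, 76],
    "code": ["DBS", "PR1", "PR1"],
}
col_total = sum(columns["mark"])      # only the "mark" column is read

print(row_total, col_total)
```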
Column-oriented databases
• Column storage is most useful for OLAP queries (queries using
SQL aggregate functions), because these queries read just
a few attributes from every data entry.
• For traditional OLTP queries (queries not using SQL
aggregate functions), it is more advantageous to store all
attributes side-by-side in row tables.
Graph databases
• Graph databases are the most specialised of the four NoSQL database types.
• Instead of modelling data using columns and rows, a graph database uses
structures called nodes and relationships.
• In more formal discussions, they are called vertices and edges
• A node is an object that has an identifier and a set of attributes
• A relationship is a link between two nodes that contains attributes about that
relation.
• Graph databases are designed to model adjacency between objects. Every
node in the database contains pointers to adjacent objects in the database.
• This allows for fast operations that require following paths through a graph.
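A minimal sketch of path-following over adjacency, using a plain Python dict as a stand-in for a graph database (the node names are hypothetical):

```python
from collections import deque

# Hypothetical social graph: each node keeps direct pointers to its neighbours.
graph = {
    "Alice": ["Bob", "Carol"],
    "Bob":   ["Alice", "Dave"],
    "Carol": ["Alice"],
    "Dave":  ["Bob"],
}

def shortest_path(start, goal):
    """Follow adjacency pointers breadth-first; no table joins are needed."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in graph[path[-1]]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None

print(shortest_path("Carol", "Dave"))
```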
Hadoop architecture
• Hadoop is composed of two primary components that
implement the basic concepts of distributed storage and
computation: HDFS and YARN
• HDFS (sometimes shortened to DFS) is the Hadoop Distributed
File System, responsible for managing data stored on disks
across the cluster.
• YARN acts as a cluster resource manager, allocating
computational assets (processing availability and memory on
worker nodes) to applications that wish to perform a distributed
computation.
Hadoop architecture
• HDFS and YARN work in concert to minimize the amount of
network traffic in the cluster primarily by ensuring that data is
local to the required computation.
• A set of machines that is running HDFS and YARN is known as a
cluster, and the individual machines are called nodes.
• A cluster can have a single node, or many thousands of nodes,
but all clusters scale horizontally, meaning as you add more
nodes, the cluster increases in both capacity and performance
in a linear fashion.
Hadoop architecture
• Each node in the cluster is identified by the type of process that
it runs:
• Master nodes
• These nodes run coordinating services for Hadoop workers and
are usually the entry points for user access to the cluster.
• Worker nodes
• Worker nodes run services that accept tasks from master nodes
either to store or retrieve data or to run a particular application.
• A distributed computation is run by parallelizing the analysis
across worker nodes.
Hadoop architecture
• For HDFS, the master and worker services are as follows:
• NameNode (Master)
• Stores the directory tree of the file system, file metadata, and the
location of each file in the cluster.
• Clients wanting to access HDFS must first locate the appropriate
storage nodes by requesting information from the NameNode.
• DataNode (Worker)
• Stores and manages HDFS blocks on the local disk.
• Reports health and status of individual data stores back to the
NameNode
Hadoop architecture
• When data is accessed from HDFS
• a client application must first make a request to the NameNode to
locate the data on disk.
• The NameNode will reply with a list of DataNodes that store the
data.
• the client must then directly request each block of data from the
DataNode.
Hadoop architecture
• YARN has multiple master services and a worker service as
follows:
• ResourceManager (Master)
• Allocates and monitors available cluster resources (e.g.,
physical assets like memory and processor cores)
• handling scheduling of jobs on the cluster
• ApplicationMaster (Master)
• Coordinates a particular application being run on the cluster as
scheduled by the ResourceManager
Hadoop architecture
• Clients that wish to execute a job
• must first request resources from the ResourceManager, which
assigns an application-specific ApplicationMaster for the duration
of the job.
• the ApplicationMaster tracks the execution of the job.
• the ResourceManager tracks the status of the nodes
• each individual NodeManager creates containers and executes
tasks within them
Hadoop architecture
• Finally, one other type of cluster is important to note: a single node
cluster.
• In “pseudo-distributed mode” a single machine runs all Hadoop
daemons as though it were part of a cluster, but network traffic occurs
through the local loopback network interface.
• Hadoop developers typically work in a pseudo-distributed environment,
usually inside of a virtual machine to which they connect via SSH.
• Cloudera, Hortonworks, and other popular distributions of Hadoop
provide pre-built virtual machine images that you can download and
get started with right away.
Hadoop Distributed File System (HDFS)
• HDFS provides redundant storage for big data by storing that
data across a cluster of cheap, unreliable computers, thus
extending the amount of available storage capacity that a single
machine alone might have.
• HDFS performs best with a modest number of very large files
• millions of large files (100 MB or more) rather than billions of smaller
files that might occupy the same volume.
• It is not a good fit as a data backend for applications that require
updates in real-time, interactive data analysis, or record-based
transactional support.
Hadoop Distributed File System (HDFS)
• HDFS files are split into blocks, usually of either 64MB or
128MB.
• Blocks allow very large files to be split across and distributed to
many machines at run time.
• Additionally, blocks are replicated across the DataNodes.
• By default, the replication is three-fold
• Therefore, each block exists on three different machines and three
different disks, and even if two nodes fail, the data will not be lost.
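The block arithmetic can be sketched as follows (a hypothetical 1 GB file with the 128 MB block size and three-fold replication mentioned above):

```python
# Back-of-the-envelope sketch of how a file maps to HDFS blocks.
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB block size
REPLICATION = 3                  # default three-fold replication

file_size = 1024 * 1024 * 1024   # a hypothetical 1 GB file

blocks = -(-file_size // BLOCK_SIZE)   # ceiling division -> number of blocks
replicas = blocks * REPLICATION        # block copies stored across the cluster

print(blocks, replicas)   # 8 blocks, 24 stored block replicas
```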
Interacting with HDFS
• Interacting with HDFS is primarily performed from the command
line using the hadoop fs script, which has the following
usage:
• The -option argument is the name of a specific option for the
specified command, and <arg> is one or more arguments
specified for this option.
• For example, show help
$ hadoop fs [-option <arg>]
$ hadoop fs -help
Interacting with HDFS
• List directory contents
• use -ls command:
• Running the -ls command on a new cluster will not return any
results. This is because the -ls command, without any
arguments, will attempt to display the contents of the user’s
home directory on HDFS.
• Providing -ls with the forward slash (/) as an argument displays the
contents of the root of HDFS:
$ hadoop fs -ls
$ hadoop fs -ls /
Interacting with HDFS
• Creating a directory
• To create a directory within HDFS, use the -mkdir
command:
• For example, create a books directory in the home directory
• Use the -ls command to verify that the previous directories were
created:
$ hadoop fs -mkdir [directory name]
$ hadoop fs -mkdir books
$ hadoop fs -ls
Interacting with HDFS
• Copy Data onto HDFS
• After a directory has been created for the current user, data can
be uploaded to the user’s HDFS home directory with the -put
command:
• For example, copy book file from local to HDFS
• Use the -ls command to verify that pg20417.txt was copied to
HDFS:
$ hadoop fs -put [source file] [destination file]
$ hadoop fs -put pg20417.txt books/pg20417.txt
$ hadoop fs -ls books
Interacting with HDFS
• Retrieve (view) Data from HDFS
• Multiple commands allow data to be retrieved from HDFS.
• To simply view the contents of a file, use the -cat command. -cat
reads a file on HDFS and displays its contents to stdout.
• The following command uses -cat to display the contents of
pg20417.txt
$ hadoop fs -cat books/pg20417.txt
Interacting with HDFS
• Retrieve (view) Data from HDFS
• Data can also be copied from HDFS to the local filesystem using
the -get command. The -get command is the opposite of the -put
command:
• For example, this command copies pg20417.txt from HDFS to the
local filesystem:
$ hadoop fs -get [source file] [destination file]
$ hadoop fs -get books/pg20417.txt .
MapReduce
• MapReduce is a programming model that enables large volumes of data
to be processed and generated by dividing work into independent tasks
and executing the tasks in parallel across a cluster of machines.
• At a high level, every MapReduce program transforms a list of input data
elements into a list of output data elements twice, once in the map phase
and once in the reduce phase.
• The MapReduce framework is composed of three major phases: map,
shuffle and sort, and reduce.
Image source: "Data Analytics with Hadoop: An Introduction for Data Scientists", Benjamin Bengfort and Jenny Kim, 2016
MapReduce
• Map
• The first phase of a MapReduce application is the map phase.
Within the map phase, a function (called the mapper) processes a
series of key-value pairs.
• The mapper sequentially processes each key-value pair
individually, producing zero or more output key-value pairs
• As an example, consider a mapper whose purpose is to transform
sentences into words.
MapReduce
• Map
• The input to this mapper would be strings that contain sentences,
and the mapper’s function would be to split the sentences into
words and output the words
Image source: "Hadoop with Python", Zachary Radtka and Donald Miner, 2016
MapReduce
• Shuffle and Sort
• As the mappers begin completing, the intermediate outputs from
the map phase are moved to the reducers. This process of moving
output from the mappers to the reducers is known as shuffling.
• Shuffling is handled by a partition function, known as the
partitioner. The partitioner ensures that all of the values for the
same key are sent to the same reducer.
• The intermediate keys and values for each partition are sorted by
the Hadoop framework before being presented to the reducer.
MapReduce
• Reduce
• Within the reduce phase, an iterator of values is provided to a
function known as the reducer. The iterator of values is a non-unique
set of values for each unique key from the output of the map phase.
• The reducer aggregates the values for each unique key and
produces zero or more output key-value pairs
• As an example, consider a reducer whose purpose is to sum all of
the values for a key. The input to this reducer is an iterator of all of
the values for a key, and the reducer sums all of the values.
MapReduce examples: word count
• The word-counting application takes as input one or more text
files and produces a list of words and their frequencies as output.
Image source: "Data Analytics with Hadoop: An Introduction for Data Scientists", Benjamin Bengfort and Jenny Kim, 2016
MapReduce examples: word count
• Because Hadoop utilizes key/value pairs, the input key is a file
ID and line number and the input value is a string, while the
output key is a word and the output value is an integer.
• The following Python pseudocode shows how this algorithm is
implemented:
# emit is a function that performs Hadoop I/O
def map(dockey, line):
    for word in line.split():
        emit(word, 1)

def reduce(word, values):
    count = sum(values)
    emit(word, count)
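The three phases can be simulated end-to-end in plain Python (a local sketch, not the Hadoop framework itself), with a dict standing in for the shuffle-and-sort step:

```python
from collections import defaultdict

def mapper(dockey, line):
    """Map phase: emit (word, 1) for every word in the line."""
    for word in line.split():
        yield (word, 1)

def reducer(word, values):
    """Reduce phase: sum the counts collected for one word."""
    yield (word, sum(values))

# Shuffle and sort: group all intermediate values by key.
docs = [(27183, "the fast cat wears no hat"),
        (31416, "the cat in the hat ran fast")]
groups = defaultdict(list)
for dockey, line in docs:
    for word, one in mapper(dockey, line):
        groups[word].append(one)

counts = dict(kv for word in sorted(groups) for kv in reducer(word, groups[word]))
print(counts)
```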
MapReduce examples: word count
• Example (Map)
• Input:
• Mapper 1: (27183, “The fast cat wears no hat.”)
• Mapper 2: (31416, “The cat in the hat ran fast.”)
• Mapper 1 output: (“The”,1) (“fast”,1) (“cat”,1) (“wears”,1) (“no”,1) (“hat”,1) (“.”,1)
• Mapper 2 output: (“The”,1) (“cat”,1) (“in”,1) (“the”,1) (“hat”,1) (“ran”,1) (“fast”,1) (“.”,1)
MapReduce examples: IoT
• IoT applications create an enormous amount of data that has to
be processed. This data is generated by physical sensors that
take measurements, such as the room temperature at 8:00.
• Every measurement consists of
• a key (the timestamp when the measurement was taken) and
• a value (the actual value measured by the sensor).
• For example, (2016-05-01 01:02:03, 1).
• The goal of this exercise is to compute average daily values of that
sensor’s data.
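A minimal Python sketch of the map and reduce steps for this exercise, using a few made-up readings:

```python
from collections import defaultdict

# Hypothetical sensor readings: (timestamp, value) pairs as in the slide.
readings = [
    ("2016-05-01 01:02:03", 1),
    ("2016-05-01 13:00:00", 3),
    ("2016-05-02 08:00:00", 4),
]

# Map: re-key each measurement by its day.
by_day = defaultdict(list)
for timestamp, value in readings:
    day = timestamp.split()[0]
    by_day[day].append(value)

# Reduce: average the values collected for each day.
daily_avg = {day: sum(vals) / len(vals) for day, vals in by_day.items()}
print(daily_avg)
```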
MapReduce examples: shared friendship
• In the shared friendship task, the goal is to analyze a social
network to see which friend relationships users have in
common.
• Given an input data source where the key is the name of a user
and the value is a comma-separated list of friends.
MapReduce examples: shared friendship
• The following Python pseudocode demonstrates how to perform
this computation:
def map(person, friends):
    for friend in friends.split(","):
        pair = sort([person, friend])
        emit(pair, friends)

def reduce(pair, friends):
    # friends holds the two friend lists emitted for this pair
    shared = set(friends[0])
    shared = shared.intersection(friends[1])
    emit(pair, shared)
MapReduce examples: shared friendship
• The mapper creates an intermediate key for each of the possible
(friend, friend) tuples that exist in the initial dataset.
• This allows us to analyze the dataset on a per-relationship basis, as the
value is the list of associated friends.
• The pair is sorted, which ensures that the inputs (“Mike”,“Linda”)
and (“Linda”,“Mike”) end up being the same key during
aggregation in the reducer.
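The same computation can be sketched in runnable Python (a local simulation with a hypothetical friend list, not actual Hadoop):

```python
from collections import defaultdict

# Hypothetical friend lists keyed by user name (friendship is symmetric).
friend_lists = {
    "Mike":  ["Linda", "Sam"],
    "Linda": ["Mike", "Sam"],
    "Sam":   ["Mike", "Linda"],
}

# Map: emit one record per sorted pair so ("Mike","Linda") and
# ("Linda","Mike") land under the same key during shuffling.
groups = defaultdict(list)
for person, friends in friend_lists.items():
    for friend in friends:
        pair = tuple(sorted([person, friend]))
        groups[pair].append(set(friends))

# Reduce: intersect the two friend lists collected for each pair.
shared = {pair: fl[0] & fl[1] for pair, fl in groups.items() if len(fl) == 2}
print(shared[("Linda", "Mike")])
```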
Hadoop Streaming
• Hadoop streaming is a utility that comes packaged with the
Hadoop distribution and allows MapReduce jobs to be created
with any executable as the mapper and/or the reducer.
• The Hadoop streaming utility enables Python, shell scripts, or any
other language to be used as a mapper, reducer, or both.
• The mapper and reducer are both executables that
• read input, line by line, from the standard input (stdin),
• and write output to the standard output (stdout).
• The Hadoop streaming utility creates a MapReduce job, submits the job
to the cluster, and monitors its progress until it is complete.
Hadoop Streaming
• When the mapper is initialized, each map task launches the
specified executable as a separate process.
• The mapper reads the input file and presents each line to the
executable via stdin. After the executable processes each line
of input, the mapper collects the output from stdout and
converts each line to a key-value pair.
• The key consists of the part of the line before the first tab
character, and the value consists of the part of the line after the
first tab character.
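That split rule can be sketched in one line of Python:

```python
# Streaming splits each output line at the FIRST tab character:
# key = text before the tab, value = text after it.
line = "hat\t1"
key, _, value = line.partition("\t")
print(key, value)
```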
Hadoop Streaming
• When the reducer is initialized, each reduce task launches the
specified executable as a separate process.
• The reducer converts the input key-value pair to lines that are
presented to the executable via stdin.
• The reducer collects the executable's results from stdout and
converts each line to a key-value pair.
• Similar to the mapper, the executable specifies key-value pairs
by separating the key and value by a tab character.
Hadoop Streaming example
• The WordCount application can be implemented as two Python
programs: mapper.py and reducer.py.
• mapper.py is the Python program that implements the logic in
the map phase of WordCount.
• It reads data from stdin, splits the lines into words, and outputs
each word with its intermediate count to stdout.
Hadoop Streaming example
• mapper.py
#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)
Hadoop Streaming example
• reducer.py is the Python program that implements the logic in
the reduce phase of WordCount.
• It reads the results of mapper.py from stdin, sums the
occurrences of each word, and writes the result to stdout.
• reducer.py
#!/usr/bin/env python
from operator import itemgetter
import sys
current_word = None
current_count = 0
word = None
Hadoop Streaming example
• reducer.py (cont’)
# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
Hadoop Streaming example
• reducer.py (cont’)
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
Hadoop Streaming example
• Before attempting to execute the code, ensure that the
mapper.py and reducer.py files have execution permission.
• The following command will enable this for both files:
• Also ensure that the first line of each file contains the proper
path to Python. This line enables mapper.py and reducer.py to
execute as standalone executables.
• It is highly recommended to test all programs locally before
running them across a Hadoop cluster.
$ chmod +x mapper.py reducer.py
$ echo 'The fast cat wears no hat' | ./mapper.py | sort -k1,1 | ./reducer.py
Hadoop Streaming example
• Download 3 ebooks from Project Gutenberg
• The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson (659 KB)
• The Notebooks of Leonardo Da Vinci (1.4 MB)
• Ulysses by James Joyce (1.5 MB)
• Before we run the actual MapReduce job, we must first copy the
files from our local file system to Hadoop’s HDFS.
$ hadoop fs -put pg20417.txt books/pg20417.txt
$ hadoop fs -put 5000-8.txt books/5000-8.txt
$ hadoop fs -put 4300-0.txt books/4300-0.txt
$ hadoop fs -ls books
Hadoop Streaming example
• The mapper and reducer programs can be run as a
MapReduce application using the Hadoop streaming utility.
• The command to run the Python programs mapper.py and
reducer.py on a Hadoop cluster is as follows:
$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh*.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /user/hduser/books/* \
    -output /user/hduser/books/output
Hadoop Streaming example
• Options for Hadoop streaming
Option Description
-files A comma-separated list of files to be copied to the
MapReduce cluster
-mapper The command to be run as the mapper
-reducer The command to be run as the reducer
-input The DFS input path for the Map step
-output The DFS output directory for the Reduce step
Python MapReduce library: mrjob
• mrjob is a Python MapReduce library, created by Yelp, that
wraps Hadoop streaming, allowing MapReduce applications to
be written in a more Pythonic manner.
• mrjob enables multistep MapReduce jobs to be written in pure
Python.
• MapReduce jobs written with mrjob can be tested locally, run on
a Hadoop cluster, or run in the cloud using Amazon Elastic
MapReduce (EMR).
mrjob example
• word_count.py
• To run the job locally and count the frequency of words within a
file named pg20417.txt, use the following command:
from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, _, line):
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()
$ python word_count.py books/pg20417.txt
mrjob example
• The MapReduce job is defined as the class, MRWordCount. Within the
mrjob library, the class that inherits from MRJob contains the methods
that define the steps of the MapReduce job.
• The steps within an mrjob application are mapper, combiner, and
reducer. The class inheriting MRJob only needs to define one of these
steps.
• The mapper() method defines the mapper for the MapReduce job. It
takes key and value as arguments and yields tuples of (output_key,
output_value).
• In the WordCount example, the mapper ignored the input key and split
the input value to produce words and counts.
mrjob example
• The combiner is a process that runs after the mapper and before
the reducer.
• It receives, as input, all of the data emitted by the mapper, and the
output of the combiner is sent to the reducer. The combiner yields
tuples of (output_key, output_value) as output.
• The reducer() method defines the reducer for the MapReduce job.
• It takes a key and an iterator of values as arguments and yields
tuples of (output_key, output_value).
• In the WordCount example, the reducer sums the values for each key,
which represent the frequency of each word in the input.
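The mapper → combiner → reducer flow described above can be simulated in plain Python to see what a combiner buys you. This illustrates the semantics only, not mrjob's internals:

```python
from collections import Counter

def mapper(line):
    # emit one (word, 1) pair per word, as a mapper() method would
    return [(word, 1) for word in line.split()]

def combine(pairs):
    # per-mapper pre-aggregation: fewer pairs cross the network
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())

def reduce_all(pairs):
    # final aggregation across all mappers
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

mapped = mapper("the cat the hat")   # 4 pairs emitted by the mapper
combined = combine(mapped)           # 3 pairs after the local combine
assert reduce_all(combined) == reduce_all(mapped)  # same final result
```

The combiner shrinks the mapper's output before the shuffle without changing the final counts, which is exactly why it is safe for associative operations like summing.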
mrjob example
• The final component of a MapReduce job written with the mrjob
library is the two lines at the end of the file:
if __name__ == '__main__':
MRWordCount.run()
• These lines enable the execution of mrjob; without them, the
application will not work.
• Executing a MapReduce application with mrjob is similar to
executing any other Python program. The command line must
contain the name of the mrjob application and the input file:
$ python mr_job.py input.txt
mrjob example
• By default, mrjob runs locally, allowing code to be developed
and debugged before being submitted to a Hadoop cluster.
• To change how the job is run, specify the -r/--runner option.
$ python word_count.py -r hadoop hdfs:///user/hduser/books/pg20417.txt
Introduction
• The Hadoop ecosystem emerged as a cost-effective way of working
with large datasets
• It imposes a particular programming model, called MapReduce, for
breaking up computation tasks into units that can be distributed around
a cluster of commodity hardware
• Underneath this computation model is a distributed file system called
Hadoop Distributed Filesystem (HDFS)
• However, a challenge remains; how do you move an existing data
infrastructure to Hadoop, when that infrastructure is based on traditional
relational databases and the Structured Query Language (SQL)?
Introduction
• This is where Hive comes in. Hive provides an SQL dialect, called
Hive Query Language (abbreviated HiveQL or just HQL) for querying
data stored in a Hadoop cluster.
• SQL knowledge is widespread for a reason; it’s an effective,
reasonably intuitive model for organizing and using data.
• Mapping these familiar data operations to the low-level MapReduce
Java API can be daunting, even for experienced Java developers.
• Hive does this dirty work for you, so you can focus on the query itself.
Hive translates most queries to MapReduce jobs, thereby exploiting
the scalability of Hadoop, while presenting a familiar SQL abstraction.
Introduction
• Hive is most suited for data warehouse applications, where relatively
static data is analyzed, fast response times are not required, and when
the data is not changing rapidly.
• Apache Hive is a “data warehousing” framework built on top of
Hadoop.
• Hive provides data analysts with a familiar SQL-based interface to
Hadoop, which allows them to attach structured schemas to data in
HDFS and access and analyze that data using SQL queries.
• Hive has made it possible for developers who are fluent in SQL to
leverage the scalability and resilience of Hadoop without requiring them
to learn Java or the native MapReduce API.
Hive in the Hadoop Ecosystem
• There are several ways to interact with Hive
• CLI: command-line interface
• GUI: Graphic User Interface
• Karmasphere (http://karmasphere.com)
• Cloudera’s open source Hue (https://github.com/cloudera/hue)
• All commands and queries go to the Driver, which compiles the
input, optimizes the computation required, and executes the
required steps, usually with MapReduce jobs.
Hive in the Hadoop Ecosystem
• Hive communicates with the JobTracker to initiate the MapReduce job.
• Hive does not have to be running on the same master node with the
JobTracker. In larger clusters, it’s common to have edge nodes where
tools like Hive run.
• They communicate remotely with the JobTracker on the master node
to execute jobs. Usually, the data files to be processed are in HDFS,
which is managed by the NameNode.
• The Metastore is a separate relational database (usually a MySQL
instance) where Hive persists table schemas and other system
metadata.
Structured Data Queries with Hive
• Hive provides its own dialect of SQL called the Hive Query Language,
or HQL.
• HQL supports many commonly used SQL statements, including data
definition statements (DDL) (e.g., CREATE DATABASE/SCHEMA/TABLE),
data manipulation statements (DML) (e.g., INSERT, UPDATE, LOAD),
and data retrieval queries (e.g., SELECT).
• Hive commands and HQL queries are compiled into an execution plan
or a series of HDFS operations and/or MapReduce jobs, which are
then executed on a Hadoop cluster.
Structured Data Queries with Hive
• Additionally, Hive queries entail higher latency due to the overhead
required to generate and launch the compiled MapReduce jobs on the
cluster; even small queries that would complete within a few seconds
on a traditional RDBMS may take several minutes to finish in Hive.
• On the plus side, Hive provides the high scalability and high
throughput that you would expect from any Hadoop-based
application.
• It is very well suited to batch-level workloads for online analytical
processing (OLAP) of very large datasets at the terabyte and petabyte
scale.
The Hive Command-Line Interface (CLI)
• Hive’s installation comes packaged with a handy command-line
interface (CLI), which we will use to interact with Hive and run
our HQL statements.
• This will initiate the CLI and bootstrap the logger (if configured)
and Hive history file, and finally display a Hive CLI prompt:
• You can view the full list of Hive options for the CLI by using the
-H flag:
$ hive
hive>
$ hive -H
Creating a database
• Creating a database in Hive is very similar to creating a
database in a SQL-based RDBMS, by using the CREATE
DATABASE or CREATE SCHEMA statement:
• When Hive creates a new database, the schema definition data
is stored in the Hive metastore.
• Hive will raise an error if the database already exists in the
metastore; we can check for the existence of the database by
using IF NOT EXISTS:
• HQL: CREATE DATABASE IF NOT EXISTS flight_data;
Creating a database
• We can then run SHOW DATABASES to verify that our database has
been created. Hive will return all databases found in the
metastore, along with the default Hive database:
• HQL: SHOW DATABASES;
Creating tables
• Hive provides a SQL-like CREATE TABLE statement, which in its
simplest form takes a table name and column definitions:
• HQL: CREATE TABLE airlines (code INT,
description STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
• However, because Hive data is stored in the file system, usually
in HDFS or the local file system,
• the CREATE TABLE command also takes an optional ROW FORMAT
clause that tells Hive how to read each row in the file and map it
to our columns.
Loading data
• It’s important to note a key distinction between Hive and
traditional RDBMSs with regard to schema enforcement:
• Traditional relational databases enforce the schema on write,
rejecting any data that does not conform to the schema as
defined;
• Hive can only enforce the schema on read. If, when reading the
data file, the file structure does not match the defined
schema, Hive will generally return null values for missing fields
or type mismatches
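Schema-on-read can be mimicked in a few lines of Python. This is an analogy for the behavior described above, not Hive's actual implementation: the schema is applied only when a line is read, and missing or mismatched fields come back as None (null):

```python
def read_row(line, schema):
    """Apply a [(name, type), ...] schema to one tab-delimited line at read time."""
    fields = line.rstrip("\n").split("\t")
    row = {}
    for i, (name, cast) in enumerate(schema):
        try:
            row[name] = cast(fields[i])
        except (IndexError, ValueError):
            # missing field or type mismatch -> null, as Hive would return
            row[name] = None
    return row
```

A well-formed line yields every field; a malformed line simply yields None for the fields that cannot be read, rather than being rejected at load time.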
Loading data
• Data loading in Hive is done in batch-oriented fashion using a bulk LOAD
DATA command or by inserting results from another query with the
INSERT command.
• LOAD DATA is Hive’s bulk loading command. INPATH takes an argument
to a path on the default file system (in this case, HDFS).
• We can also specify a path on the local file system by using LOCAL
INPATH instead. Hive proceeds to move the file into the warehouse
location.
• If the OVERWRITE keyword is used, then any existing data in the target
table will be deleted and replaced by the data file input; otherwise, the
new data is added to the table.
Loading data
• Examples
• HQL: LOAD DATA LOCAL INPATH '/home/cloudera/
Downloads/flight_data/ontime_flights.tsv'
OVERWRITE INTO TABLE flights;
• HQL: LOAD DATA LOCAL INPATH '/home/cloudera/
Downloads/flight_data/airlines.tsv'
OVERWRITE INTO TABLE airlines;
• HQL: LOAD DATA LOCAL INPATH '/home/cloudera/
Downloads/flight_data/carriers.tsv'
OVERWRITE INTO TABLE carriers;
• HQL: LOAD DATA LOCAL INPATH '/home/cloudera/
Downloads/flight_data/cancellation_reasons.tsv'
OVERWRITE INTO TABLE cancellation_reasons;
Data Analysis with Hive
• Aggregations
• HQL:
SELECT airline_code,
       COUNT(1) AS num_flights,
       SUM(IF(depart_delay > 0, 1, 0)) AS num_depart_delays,
       SUM(IF(arrive_delay > 0, 1, 0)) AS num_arrive_delays,
       SUM(IF(is_cancelled, 1, 0)) AS num_cancelled
FROM flights
GROUP BY airline_code;
Data Analysis with Hive
• Aggregations
• HQL:
SELECT airline_code,
COUNT(1) AS num_flights,
SUM(IF(depart_delay > 0, 1, 0)) AS num_depart_delays,
ROUND(SUM(IF(depart_delay > 0, 1, 0))/COUNT(1), 2)
AS depart_delay_rate,
SUM(IF(arrive_delay > 0, 1, 0)) AS num_arrive_delays,
ROUND(SUM(IF(arrive_delay > 0, 1, 0))/COUNT(1), 2)
AS arrive_delay_rate,
SUM(IF(is_cancelled, 1, 0)) AS num_cancelled,
ROUND(SUM(IF(is_cancelled, 1, 0))/COUNT(1), 2)
AS cancellation_rate
FROM flights
GROUP BY airline_code
ORDER BY cancellation_rate DESC, arrive_delay_rate DESC,
         depart_delay_rate DESC;
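The SUM(IF(condition, 1, 0)) idiom above is simply a conditional count per group. The same aggregation can be sanity-checked in plain Python over a made-up sample; the record layout below is an assumption that mirrors the query's column names:

```python
from collections import defaultdict

# Tiny made-up sample mirroring the columns used in the HQL query.
flights = [
    {"airline_code": "AA", "depart_delay": 10, "arrive_delay": 5,  "is_cancelled": False},
    {"airline_code": "AA", "depart_delay": 0,  "arrive_delay": -3, "is_cancelled": True},
    {"airline_code": "DL", "depart_delay": -2, "arrive_delay": 0,  "is_cancelled": False},
]

stats = defaultdict(lambda: {"num_flights": 0, "num_depart_delays": 0,
                             "num_arrive_delays": 0, "num_cancelled": 0})
for f in flights:
    s = stats[f["airline_code"]]                    # GROUP BY airline_code
    s["num_flights"] += 1                           # COUNT(1)
    s["num_depart_delays"] += 1 if f["depart_delay"] > 0 else 0  # SUM(IF(depart_delay > 0, 1, 0))
    s["num_arrive_delays"] += 1 if f["arrive_delay"] > 0 else 0  # SUM(IF(arrive_delay > 0, 1, 0))
    s["num_cancelled"] += 1 if f["is_cancelled"] else 0          # SUM(IF(is_cancelled, 1, 0))

for s in stats.values():
    # ROUND(SUM(IF(is_cancelled, 1, 0)) / COUNT(1), 2)
    s["cancellation_rate"] = round(s["num_cancelled"] / s["num_flights"], 2)
```

Hive evaluates the same logic, but distributed across the cluster as a MapReduce job rather than in a single loop.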
Introduction to HBase
• While Hive provides a familiar data manipulation paradigm within
Hadoop, it doesn’t change the storage and processing paradigm,
which still utilizes HDFS and MapReduce in a batch-oriented fashion.
• Thus, for use cases that require random, real-time read/write access
to data, we need to look outside of standard MapReduce and Hive for
our data persistence and processing layer.
• The real-time applications need to record high volumes of time-based
events that tend to have many possible structural variations.
• The data may be keyed on a certain value, like User, but the value is
often represented as a collection of arbitrary metadata.
Introduction to HBase
• Consider, for example, two events, “Like” and “Share”, which require
different column values, as shown in the table.
• In a relational model, rows are sparse but columns are not. That is, upon
inserting a new row to a table, the database allocates storage for every column
regardless of whether a value exists for that field or not.
• However, in applications where data is represented as a collection of arbitrary
fields or sparse columns, each row may use only a subset of available columns,
which can make a standard relational schema both a wasteful and awkward fit.
Column-Oriented Databases
• NoSQL is a broad term that generally refers to non-relational
databases and encompasses a wide collection of data storage
models, including
• graph databases
• document databases
• key/value data stores
• column-family databases.
• HBase is classified as a column-family or column-oriented database,
modelled on Google’s Big Table architecture.
Column-Oriented Databases
• HBase organizes data into tables that contain rows. Within a
table, rows are identified by their unique row key, which does not
have a data type.
• Row keys are similar to primary keys in relational databases, in
that they are automatically indexed.
Column-Oriented Databases
• In HBase, table rows are sorted by their row key and because
row keys are byte arrays, almost anything can serve as a row
key from strings to binary representations of longs or even
serialized data structures.
• HBase stores its data as key/value pairs, where all table lookups
are performed via the table’s row key, the unique identifier of the
stored record data.
• Data within a row is grouped into column families, which consist
of related columns.
Column-Oriented Databases
• Storing data in columns rather than rows has particular benefits for
data warehouses and analytical databases where aggregates are
computed over large sets of data with potentially sparse values, where
not all column values are present.
• Another interesting feature of HBase and BigTable-based column-
oriented databases is that the table cells, or the intersection of row and
column coordinates, are versioned by timestamp.
• HBase is thus also described as being a multidimensional map where
time provides the third dimension
Column-Oriented Databases
• The time dimension is indexed in decreasing order, so that
when reading from an HBase store, the most recent values are
found first.
• The contents of a cell can be referenced by a
{rowkey, column, timestamp} tuple, or we can scan for a range of
cell values by time range.
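The {rowkey, column, timestamp} addressing can be modeled as a small Python class. This is a toy model of versioned cells with newest-first ordering, not how HBase actually stores data on disk:

```python
class VersionedTable:
    """Toy model: cells addressed by (rowkey, column), versioned by timestamp."""

    def __init__(self):
        # (rowkey, column) -> [(timestamp, value), ...] kept newest first
        self.cells = {}

    def put(self, rowkey, column, value, ts):
        versions = self.cells.setdefault((rowkey, column), [])
        versions.append((ts, value))
        # the time dimension is indexed in decreasing order
        versions.sort(key=lambda tv: tv[0], reverse=True)

    def get(self, rowkey, column, ts=None):
        # newest-first scan: the most recent value (<= ts, if given) wins
        for stamp, value in self.cells.get((rowkey, column), []):
            if ts is None or stamp <= ts:
                return value
        return None
```

Because versions are kept newest first, a plain get finds the most recent value immediately, and passing a timestamp reads the cell as of that point in time.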
Real-Time Analytics with HBase
• For the purposes of this HBase overview, we use the HBase shell to
design a schema for a linkshare tracker that tracks the number of
times a link has been shared.
• Generating a schema
• When designing schemas in HBase, it’s important to think in terms
of the column-family structure of the data model and how it affects
data access patterns.
• Furthermore, because HBase doesn’t support joins and provides
only a single indexed rowkey, we must be careful to ensure that the
schema can fully support all use cases.
Real-Time Analytics with HBase
• First, we need to declare the table name, and at least one
column-family name at the time of table definition.
• If no namespace is declared, HBase will use the default
namespace
• We just created a single table called linkshare in the default
namespace with one column-family, named link
• To alter the table after creation, such as changing or adding column
families, we need to first disable the table so that clients will not be able
to access the table during the alter operation:
hbase> create 'linkshare', 'link'
Real-Time Analytics with HBase
• Good row key design affects not only how we query the table, but the
performance and complexity of data access.
• By default, HBase stores rows in sorted order by row key, so that
similar keys are stored to the same RegionServer.
• Thus, in addition to enabling our data access use cases, we also need
to be mindful to account for row key distribution across regions.
• For the current example, let’s assume that we will use the unique
reversed link URL for the row key.
hbase> disable 'linkshare'
hbase> alter 'linkshare', 'statistics'
hbase> enable 'linkshare'
Real-Time Analytics with HBase
• In our linkshare application, we want to store descriptive data about
the link, such as its title, while maintaining a frequency counter that
tracks the number of times the link has been shared.
• We can insert, or put, a value in a cell at the specified table/row/
column and optionally timestamp coordinates.
• To put a cell value into table linkshare at row with row key
org.hbase.www under column-family link and column title marked with
the current timestamp
hbase> put 'linkshare', 'org.hbase.www', 'link:title', 'Apache HBase'
hbase> put 'linkshare', 'org.hadoop.www', 'link:title', 'Apache Hadoop'
hbase> put 'linkshare', 'com.oreilly.www', 'link:title', "O'Reilly.com"
Real-Time Analytics with HBase
• The put operation works great for inserting a value for a single cell, but for
incrementing frequency counters, HBase provides a special mechanism
to treat columns as counters.
• To increment a counter, we use the command incr instead of put.
• The last option passed is the increment value, which in this case is 1.
• Incrementing a counter will return the updated counter value, but you can
also access a counter’s current value any time using the get_counter
command, specifying the table name, row key, and column:
hbase> incr 'linkshare', 'org.hbase.www', 'statistics:share', 1
hbase> incr 'linkshare', 'org.hbase.www', 'statistics:like', 1
Real-Time Analytics with HBase
• HBase provides two general methods to retrieve data from a table:
• the get command performs lookups by row key to retrieve attributes
for a specific row,
• and the scan command, which takes a set of filter specifications and
iterates over multiple rows based on the indicated specifications.
• In its simplest form, the get command accepts the table name
followed by the row key, and returns the most recent version timestamp
and cell value for columns in the row.
hbase> incr 'linkshare', 'org.hbase.www', 'statistics:share', 1
hbase> get_counter 'linkshare', 'org.hbase.www', 'statistics:share'
hbase> get 'linkshare', 'org.hbase.www'
Real-Time Analytics with HBase
• The get command also accepts an optional dictionary of parameters to
specify the column(s), timestamp, timerange, and version of the cell
values we want to retrieve. For example, we can specify the column(s)
of interest
• A scan operation is akin to database cursors or iterators, and takes
advantage of the underlying sequentially sorted storage mechanism,
iterating through row data to match against the scanner specifications.
• With scan, we can scan an entire HBase table or specify a range of rows
to scan.
hbase> get 'linkshare', 'org.hbase.www', 'link:title'
hbase> get 'linkshare', 'org.hbase.www', 'link:title', 'statistics:share'
Real-Time Analytics with HBase
• You can specify an optional STARTROW and/or STOPROW parameter,
which can be used to limit the scan to a specific range of rows.
• If neither STARTROW nor STOPROW is provided, the scan operation
will scan through the entire table.
• You can, in fact, call scan with the table name to display all the
contents of a table.
hbase> scan 'linkshare'
hbase> scan 'linkshare', {COLUMNS => ['link:title'],
STARTROW => 'org.hbase.www'}
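Because rows are kept sorted by row key, a range scan is just an in-order traversal between two keys. A sketch of the idea in Python; the data and the scan function are illustrative, not HBase's API:

```python
def scan(rows, startrow=None, stoprow=None):
    """Yield (rowkey, row) pairs in sorted key order within [startrow, stoprow)."""
    for key in sorted(rows):
        if startrow is not None and key < startrow:
            continue
        if stoprow is not None and key >= stoprow:
            break  # sorted order lets us stop early, like a real scanner
        yield key, rows[key]

# Toy table keyed by reversed-domain row keys, as in the linkshare example.
rows = {
    "org.hbase.www":   {"link:title": "Apache HBase"},
    "org.hadoop.www":  {"link:title": "Apache Hadoop"},
    "com.oreilly.www": {"link:title": "O'Reilly"},
}
```

A scan with STARTROW 'org.' returns only the org.* rows, and omitting both bounds walks the whole table, mirroring the shell behavior described above.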
Introduction to Sqoop
• In cases where the input data is already structured because it
resides in a relational database, it is convenient to leverage this
known schema to import the data into Hadoop more efficiently than
uploading CSVs to HDFS and parsing them manually.
• Sqoop (SQL-to-Hadoop) is designed to transfer data between
relational database management systems (RDBMS) and Hadoop.
• It automates most of the data transfer process by reading the schema
information directly from the RDBMS.
• Sqoop then uses MapReduce to import and export the data to and
from Hadoop.
Introduction to Sqoop
• Sqoop gives us the flexibility to maintain our data in its production
state while copying it into Hadoop to make it available for further
analysis without modifying the production database.
• We’ll walk through a few ways to use Sqoop to import data from a
MySQL database into various Hadoop data stores, including HDFS,
Hive, and HBase.
• We will use MySQL as the source and target RDBMS for the examples
in this chapter, so we also assume that a MySQL database resides on
the same host as your Hadoop/Sqoop services and is accessible via
localhost and the default port, 3306.
Importing from MySQL to HDFS
• When importing data from relational databases like MySQL, Sqoop
reads the source database to gather the necessary metadata for the
data being imported.
• Sqoop then submits a map-only Hadoop job to transfer the actual table
data based on the metadata that was captured in the previous step.
• This job produces a set of serialized files, which may be delimited text
files, binary format, or SequenceFiles containing a copy of the imported
table or datasets.
• By default, the files are saved as comma-separated files to a directory
on HDFS with a name that corresponds to the source table name.
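A typical import invocation looks like the following; the database name, table, username, and target directory here are placeholders chosen to match the flight-data examples, not values from the slides:

```shell
# Import the MySQL table `flights` from the flight_data database into HDFS.
# Sqoop reads the table's schema via JDBC, then runs a map-only job whose
# output lands as comma-separated files under the target directory.
# -P prompts for the database password instead of putting it on the command line.
$ sqoop import \
    --connect jdbc:mysql://localhost:3306/flight_data \
    --username sqoop_user -P \
    --table flights \
    --target-dir /user/hduser/flights
```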