An introduction to MapReduce (MR) and Hadoop, and a look at the opportunities to use MR with databases, i.e., SQL-MapReduce by Teradata and In-database MR by Oracle.
The presentation was used in a class on Datenbanken Implementierungstechniken (Database Implementation Techniques) in 2013.
4. 2003-2004: Google publishes the GFS and MapReduce papers.
2006: Apache releases Hadoop. It is the first open-source implementation of GFS and MapReduce.
Now: Amazon, AOL, eBay, Facebook, HP, IBM, Last.fm, LinkedIn, Microsoft, Spotify, Twitter and more are using Hadoop.
A Brief History
Introduction MapReduce Hadoop MR&Databases
5. • Data is getting really big: more than 10 TB.
E.g.: Large Synoptic Survey Telescope (30 TB / night)
• The best idea is to scale out (not scale up) the system, but . . .
  - How do we scale to more than 1000 machines?
  - How do we handle machine failures?
  - How can we facilitate communication between nodes?
  - If we change the system, do we lose all our optimisation work?
• Google needed to recreate the index of the web.
Motivations
6. "MapReduce is a programming model and an associated implementation for processing and generating large data sets." (Google, Inc., MapReduce paper, 2004)
It is a really simple API that has just two serial functions, map() and reduce(), and is language independent (Java, Python, Perl, ...).
What is MapReduce?
7. MapReduce hides messy details in the runtime library:
• Parallelization and distribution
• Load balancing
• Network and disk transfer optimization
• Handling of machine failures
• Fault tolerance
• Monitoring & status updates
All users benefit from improvements to the core library.
Why is MapReduce useful?
8. 1. Read a lot of data
2. Map: extract something we care about from each record
3. Shuffle and Sort
4. Reduce: aggregate, summarize, filter, or transform
5. Write the results
Seen from the outside, the outline is always the same (read, process, write); only map and reduce change to fit the problem.
Typical problem solved by MapReduce
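The five steps above can be sketched in plain Python on a single machine. This is only an illustration under stated assumptions: `run_job`, `max_temp_map`, `max_temp_reduce` and the sample records are hypothetical names and data, not part of Hadoop or of Google's library, and a real runtime would spread each step across many machines.

```python
from itertools import groupby
from operator import itemgetter


def run_job(records, map_fn, reduce_fn):
    """Single-process sketch of the read / map / shuffle-sort / reduce / write outline."""
    # 1. Read a lot of data (here: an in-memory list of (key, value) records).
    # 2. Map: extract something we care about from each record.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    # 3. Shuffle and sort: bring all values with the same key together.
    intermediate.sort(key=itemgetter(0))
    # 4. Reduce: aggregate the values of each key.
    results = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        results.extend(reduce_fn(key, [value for _, value in group]))
    # 5. Write the results (here: return them to the caller).
    return results


# Hypothetical problem: highest temperature per year.
def max_temp_map(line_no, line):
    year, temp = line.split(",")
    yield year, int(temp)


def max_temp_reduce(year, temps):
    yield year, max(temps)


records = [(0, "2011,31"), (1, "2012,28"), (2, "2011,35"), (3, "2012,30")]
result = run_job(records, max_temp_map, max_temp_reduce)
print(result)  # [('2011', 35), ('2012', 30)]
```

Only `max_temp_map` and `max_temp_reduce` are specific to the problem; `run_job` would stay unchanged for a different one.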
9. • A single master controls job execution on multiple slaves.
• Mappers are preferentially placed on the same node or the same rack as their input block, which minimizes network usage.
• Mappers save their outputs to local disk before serving them to reducers.
• If a map or reduce task crashes: re-execute it!
• The framework allows having more mappers and reducers than nodes.
Some Execution Details
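Re-execution works as a recovery strategy because a map or reduce task is a deterministic function of its input, so running it again yields the same output. A minimal sketch of the idea, assuming a hypothetical `run_with_retries` helper and a simulated failure (neither is Hadoop API):

```python
def run_with_retries(task, task_input, max_attempts=3):
    """Re-execute a task until it succeeds or the attempts are exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(task_input), attempt
        except RuntimeError:
            # The worker "crashed": discard the partial work and start over.
            continue
    raise RuntimeError(f"task failed {max_attempts} times")


class FlakyMapper:
    """Hypothetical map task whose first run simulates a machine failure."""

    def __init__(self):
        self.calls = 0

    def __call__(self, line):
        self.calls += 1
        if self.calls == 1:
            raise RuntimeError("simulated worker crash")
        return [(word, 1) for word in line.split()]


output, attempts = run_with_retries(FlakyMapper(), "to be or not to be")
print(attempts, output)
```

In this sketch the second attempt succeeds and returns the same pairs a crash-free run would have produced.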
11. The programmer has to write two primary methods:
map (k1, v1) → list(k2, v2)
reduce (k2, list(v2)) → list(k2, v2)
• All v2 with the same k2 are reduced together, in order.
• The input keys and values are drawn from a different domain than the output keys and values.
MapReduce Programming Model
12. map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
Example: Word Frequency
Input: "documentx", "To be or not to be"
Output:
"be", 2
"not", 1
"or", 1
"to", 2
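A runnable Python counterpart of the pseudocode above, as a sketch rather than Hadoop code. The `shuffle` helper stands in for the framework's grouping step, and lower-casing the words is an assumption made here so that "To" and "to" are counted together, as in the sample output; the pseudocode itself does not say how case is handled.

```python
from collections import defaultdict


def map_fn(key, value):
    # key: document name, value: document contents
    for word in value.lower().split():  # lower-casing is an assumption
        yield word, "1"


def reduce_fn(key, values):
    # key: a word, values: a list of counts (as strings, like the pseudocode)
    return key, sum(int(v) for v in values)


def shuffle(pairs):
    """Group intermediate values by key, visiting keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


documents = [("documentx", "To be or not to be")]
intermediate = [pair for name, text in documents for pair in map_fn(name, text)]
counts = [reduce_fn(word, values) for word, values in shuffle(intermediate)]
print(counts)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```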
14. • Inverted index
  - Find which documents contain a specific word.
  - Map: parse document, emit <word, document-ID> pairs.
  - Reduce: for each word, sort the corresponding document IDs.
    Emit <word, list(document-ID)>
• Reverse web-link graph
  - Find where page links come from.
  - Map: output <target, source> for each link to target in a page source.
  - Reduce: concatenate the list of all source URLs associated with a target.
    Emit <target, list(source)>
Other examples
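Both examples fit the same map/reduce shape, which a short single-process Python sketch can make concrete. The `run` helper is a hypothetical stand-in for the MapReduce runtime, and the documents and pages are made-up sample data.

```python
from collections import defaultdict


def run(records, map_fn, reduce_fn):
    """Tiny single-process stand-in for the MapReduce runtime."""
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)
    return dict(reduce_fn(k2, values) for k2, values in sorted(groups.items()))


# Inverted index: <document-ID, text> -> <word, list(document-ID)>
def index_map(doc_id, text):
    for word in set(text.lower().split()):  # emit each word once per document
        yield word, doc_id


def index_reduce(word, doc_ids):
    return word, sorted(doc_ids)


# Reverse web-link graph: <source, list(target)> -> <target, list(source)>
def links_map(source, targets):
    for target in targets:
        yield target, source


def links_reduce(target, sources):
    return target, list(sources)


docs = [("d1", "hadoop runs mapreduce"), ("d2", "mapreduce scales")]
pages = [("a.html", ["b.html", "c.html"]), ("b.html", ["c.html"])]

index = run(docs, index_map, index_reduce)
backlinks = run(pages, links_map, links_reduce)
print(index["mapreduce"])   # ['d1', 'd2']
print(backlinks["c.html"])  # ['a.html', 'b.html']
```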
15. • Proven to be a useful abstraction
• Really simplifies large-scale computations
• Fun to use:
  - Focus on the problem
  - Let the library deal with messy details
Conclusions on MapReduce
17. • It is a framework for distributed processing
• It is open source (Apache License 2.0)
• It is a top-level Apache project
• Written in Java
• Batch-processing centric
• Runs on commodity hardware
What is Hadoop?
18. Hadoop Distributed File System
• For very large files: TBs, PBs.
• Each file is partitioned into chunks (blocks) of 64 MB.
• Each chunk is replicated several times (>= 3), on different racks, for fault tolerance.
• It is an abstract FS on top of the local one: disks are formatted with ext3, ext4 or XFS.
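To get a feel for these numbers, the following sketch computes how many chunks a file is split into and how much raw disk space its replicas take. It only does the arithmetic for the 64 MB chunk size and the replication factor of 3 quoted above; `hdfs_footprint` is a hypothetical helper, not an HDFS API, and the 1 TB file is a made-up example.

```python
BLOCK_SIZE = 64 * 1024**2   # 64 MB, the chunk size quoted above
REPLICATION = 3             # minimum replication quoted above


def hdfs_footprint(file_size, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (number of blocks, raw bytes stored across the cluster)."""
    blocks = (file_size + block_size - 1) // block_size  # ceiling division
    # The last block only occupies as much space as the data it holds,
    # so raw usage is the file size times the replication factor.
    return blocks, file_size * replication


blocks, raw = hdfs_footprint(1024**4)  # a hypothetical 1 TB file
print(blocks, raw // 1024**4)  # 16384 blocks, 3 TB of raw storage
```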
19. Hadoop Architecture
• TaskTracker is the MapReduce server (processing part)
• DataNode is the HDFS server (data part)
[Diagram: a machine running a TaskTracker and a DataNode]
26. @Google
• Index construction for Google Search
• Article clustering for Google News
• Statistical machine translation
@Yahoo! (4100 nodes)
• "Web map" powering Yahoo! Search
• Spam detection for Yahoo! Mail
@Facebook (>100 PB of storage)
• Data mining
• Ad optimization
• Spam detection
What is MapReduce/Hadoop used for?
27. MapReduce's use of input files and lack of schema support prevents the performance improvements enabled by features like B-trees and hash partitioning . . .
. . . and most of the data in companies is stored in databases!
but . . .
28. • SQL-MapReduce by Teradata Aster
• In-Database Map-Reduce by Oracle
• Connectors that allow external Hadoop programs to access data from databases and to store Hadoop output in databases
Solutions
29. It is a framework that allows developers to write SQL-MapReduce functions in languages such as Java, C#, Python and C++ and push them into the database for advanced in-database analytics.
SQL-MapReduce
30. MR functions can be used like custom SQL operators and
can implement any algorithm or transformation.
SQL-MapReduce - Syntax
http://www.asterdata.com/resources/mapreduce.php
31. Demo #1: Map (Tokenization) and Reduce (WordCount) in SQL/MR
SELECT key AS word, value AS wordcount
FROM WordCountReduce (
    ON Tokenize ( ON blogs )
    PARTITION BY key
)
ORDER BY wordcount DESC
LIMIT 20;
Example: Word Frequency
32. Demo #2: Why do Reduce when we have SQL?
SELECT word, count(*) AS wordcount
FROM Tokenize ( ON blogs )
GROUP BY word
ORDER BY wordcount DESC
LIMIT 20;
Example: Word Frequency
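The point of Demo #2 is that the reduce step of a word count is just an aggregation, which SQL already expresses with GROUP BY. A minimal sketch of that idea using Python's built-in sqlite3 module instead of Aster: the `tokens` table and the `tokenize` helper are hypothetical stand-ins for the output of `Tokenize ( ON blogs )`, and the sample rows are made up.

```python
import sqlite3


def tokenize(text):
    """Map step: split a blog post into lower-case words."""
    return text.lower().split()


blogs = ["To be or not to be", "to do or not to do"]  # made-up sample rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tokens (word TEXT)")
conn.executemany(
    "INSERT INTO tokens (word) VALUES (?)",
    [(word,) for post in blogs for word in tokenize(post)],
)

# The reduce step is replaced by plain SQL aggregation.
rows = conn.execute(
    """
    SELECT word, count(*) AS wordcount
    FROM tokens
    GROUP BY word
    ORDER BY wordcount DESC, word
    LIMIT 20
    """
).fetchall()
conn.close()
print(rows)  # [('to', 4), ('be', 2), ('do', 2), ('not', 2), ('or', 2)]
```

The extra `word` sort key is added here only to make the order of ties deterministic.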
33. • Uses Table Functions to implement Map-Reduce within the database.
• Parallelization is provided by the Oracle Parallel Execution framework.
Using this in combination with SQL, Oracle provides a simple mechanism for database developers to develop Map-Reduce functionality using languages they know.
In-Database Map-Reduce by Oracle
34. SELECT *
FROM table(oracle_map_reduce.reducer(
    cursor(
        SELECT value(map_result).word word
        FROM table(oracle_map_reduce.mapper(
            cursor(
                SELECT a FROM documents), ' '
            )
        )
        map_result
    )
));
Example: Word Frequency
35. However, these solutions are not source-compatible with Hadoop.
Native Hadoop programs need to be rewritten before they can be used in databases.
Still not perfect!