8. Big Data Challenges
“Fat” servers imply high cost
– use cheap commodity nodes instead
A large number of cheap nodes implies frequent failures
– leverage automatic fault-tolerance
9. Big Data Challenges
We need a new data-parallel programming
model for clusters of commodity machines
11. MapReduce
Published in 2004 by Google
– MapReduce: Simplified Data Processing on Large Clusters
Popularized by Apache Hadoop project
– used by Yahoo!, Facebook, Twitter, Amazon, …
13. Word Count Example
Input → Map → Shuffle & Sort → Reduce → Output
[Diagram: three input splits ("the quick brown fox", "the fox ate the mouse", "how now brown cow") flow through Map and Reduce tasks to the final counts: the, 3; brown, 2; fox, 2; how, 1; now, 1; quick, 1; ate, 1; mouse, 1; cow, 1]
14. Word Count Example
Input → Map → Shuffle & Sort → Reduce → Output
[Diagram: each Map task emits a (word, 1) pair for every word in its split, e.g. the, 1; quick, 1; brown, 1; fox, 1 for the first split; the shuffle then routes all pairs with the same word to the same Reduce task]
15. Word Count Example
Input → Map → Shuffle & Sort → Reduce → Output
[Diagram: the shuffle groups values by key (the, [1,1,1]; brown, [1,1]; fox, [1,1]; quick, [1]; ate, [1]; mouse, [1]; how, [1]; now, [1]; cow, [1]); each Reduce task sums its list to emit the final counts]
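The three slides above trace the same job through Map, Shuffle & Sort, and Reduce. A minimal in-memory sketch of those three phases (plain JavaScript for illustration; this is not the Hadoop API) looks like:

```javascript
// Illustrative word count mirroring the three MapReduce phases above.
var inputs = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"];

// Map: emit a (word, 1) pair for every word (slide 14).
var pairs = [];
inputs.forEach(function (line) {
  line.split(" ").forEach(function (word) {
    pairs.push([word, 1]);
  });
});

// Shuffle & Sort: group values by key (slide 15).
var groups = {};
pairs.forEach(function (pair) {
  (groups[pair[0]] = groups[pair[0]] || []).push(pair[1]);
});

// Reduce: sum each group's values to get the final counts (slide 13).
var counts = {};
Object.keys(groups).forEach(function (key) {
  counts[key] = groups[key].reduce(function (a, b) { return a + b; }, 0);
});
// counts.the === 3, counts.brown === 2, counts.fox === 2
```

In a real cluster the three stages run on different machines; the only coordination point is the shuffle, which is exactly why the model scales.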
18. Hadoop Overview
Open source implementation of
– Google MapReduce paper
– Google File System (GFS) paper
First release in 2008 by Yahoo!
– wide adoption by Facebook, Twitter, Amazon, etc.
20. Hadoop Core (HDFS)
MapReduce (Job Scheduling / Execution System)
Hadoop Distributed File System (HDFS)
• Name Node stores file metadata
• files split into 64 MB blocks
• blocks replicated across 3 Data Nodes
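The two defaults above fix how much cluster storage a file consumes. A small illustrative helper (a hypothetical function, not an HDFS API) makes the arithmetic concrete:

```javascript
// How a file maps onto HDFS blocks and replicas, using the defaults
// from the slide: 64 MB block size, replication factor 3.
var BLOCK_MB = 64;
var REPLICATION = 3;

function hdfsFootprint(fileSizeMB) {
  var blocks = Math.ceil(fileSizeMB / BLOCK_MB);  // last block may be partial
  return { blocks: blocks, replicas: blocks * REPLICATION };
}

// A 200 MB file splits into 4 blocks, stored as 12 block replicas in total.
var f = hdfsFootprint(200);
```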
21. Hadoop Core (HDFS)
MapReduce (Job Scheduling / Execution System)
Hadoop Distributed File System (HDFS)
[Diagram: one Name Node and multiple Data Nodes make up the HDFS layer]
22. Hadoop Core (MapReduce)
• Job Tracker distributes tasks and handles failures
• tasks are assigned based on data locality
• Task Trackers can execute multiple tasks
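Data-local scheduling can be sketched in a few lines. This is a hypothetical simplification of what the Job Tracker does, not Hadoop code: prefer an idle Task Tracker whose Data Node already holds a replica of the task's input block.

```javascript
// Locality-aware task assignment (simplified sketch, not the Hadoop scheduler).
// blockReplicaNodes: nodes holding a replica of the task's input block.
// idleTrackers: nodes with a free task slot.
function assignTask(blockReplicaNodes, idleTrackers) {
  for (var i = 0; i < idleTrackers.length; i++) {
    if (blockReplicaNodes.indexOf(idleTrackers[i]) !== -1) {
      return idleTrackers[i];  // data-local: input needs no network copy
    }
  }
  return idleTrackers[0];      // fall back to any idle tracker
}

var chosen = assignTask(["nodeB", "nodeD", "nodeF"], ["nodeA", "nodeD"]);
// chosen === "nodeD": the tracker co-located with a replica wins
```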
MapReduce (Job Scheduling / Execution System)
Hadoop Distributed File System (HDFS)
[Diagram: Name Node and Data Nodes beneath the MapReduce layer]
23. Hadoop Core (MapReduce)
MapReduce (Job Scheduling / Execution System)
Hadoop Distributed File System (HDFS)
[Diagram: the Job Tracker and Task Trackers in the MapReduce layer, paired with the Name Node and Data Nodes in the HDFS layer]
24. Hadoop Core (Job submission)
[Diagram: the Client submits a job to the Job Tracker, which schedules tasks on Task Trackers near the Data Nodes holding the input blocks, with block locations obtained from the Name Node]
26. JavaScript MapReduce
var map = function (key, value, context) {
    // split on any non-letter character and emit (word, 1) per word
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    // sum all the 1s emitted for this word
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};
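To check the pair outside a cluster, a tiny local harness can stand in for the framework. The `runLocal` function below is an assumption for illustration; in Hadoop the real `context` object and `values` iterator are supplied by the runtime.

```javascript
// The map/reduce pair from the slide, unchanged.
var map = function (key, value, context) {
  var words = value.split(/[^a-zA-Z]/);
  for (var i = 0; i < words.length; i++) {
    if (words[i] !== "") {
      context.write(words[i].toLowerCase(), 1);
    }
  }
};
var reduce = function (key, values, context) {
  var sum = 0;
  while (values.hasNext()) {
    sum += parseInt(values.next(), 10);
  }
  context.write(key, sum);
};

// Hypothetical local harness: fakes context.write and the values iterator.
function runLocal(lines) {
  var groups = {};
  var mapCtx = { write: function (k, v) { (groups[k] = groups[k] || []).push(v); } };
  lines.forEach(function (line, i) { map(i, line, mapCtx); });  // map phase

  var output = {};
  var redCtx = { write: function (k, v) { output[k] = v; } };
  Object.keys(groups).forEach(function (k) {                    // reduce phase
    var vals = groups[k], pos = 0;
    var iter = {
      hasNext: function () { return pos < vals.length; },
      next: function () { return vals[pos++]; }
    };
    reduce(k, iter, redCtx);
  });
  return output;
}

var result = runLocal(["The quick brown fox", "the fox ate the mouse"]);
// result.the === 3, result.fox === 2
```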
27. Pig
words = LOAD '/example/count' AS (
    word: chararray,
    count: int
);
popular_words = ORDER words BY count DESC;
top_popular_words = LIMIT popular_words 10;
DUMP top_popular_words;
28. Hive
CREATE EXTERNAL TABLE WordCount (
word string,
count int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION "/example/count";
SELECT * FROM WordCount ORDER BY count DESC LIMIT 10;
Volume – exceeds physical limits of vertical scalability
Velocity – decision window small compared to data change rate
Variety – many different formats make integration expensive
Variability – many options or variable interpretations confound analysis
-- run MapReduce
runJs("/example/mr/WordCount.js", "/example/data/davinci.txt", "/example/count");

-- create Hive table for the existing data
CREATE EXTERNAL TABLE WordCount (
    word string,
    count int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION "/example/count";

-- select top words
SELECT * FROM WordCount ORDER BY count DESC LIMIT 10;

-- execute Hive select
hive.exec("select * from WordCount order by count desc limit 10;");

-- execute LINQ-style Hive query
hive.from("WordCount").orderBy("count DESC").take(10).run();

-- execute Pig script
words = LOAD '/example/count' AS (
    word: chararray,
    count: int
);
popular_words = ORDER words BY count DESC;
top_popular_words = LIMIT popular_words 10;
DUMP top_popular_words;

-- execute LINQ-style Pig script
pig.from("/example/count", "word: chararray, count: int").orderBy("count DESC").take(10).run();