3. Problem
● Too much data
○ 90% of all the data in the world has been generated in the last two years
○ Large Hadron Collider: 25 petabytes per year
○ Walmart: 1M transactions per hour
● Hard disks
○ Cheap!
○ But access times are still slow
○ Writes are even slower
4. Solutions
● Multiple Hard Disks
○ Work in parallel
○ We can reduce access time!
● How to deal with hardware failure?
● What if we need to combine data?
12. HDFS
● Distributed storage
○ Managed across a network of commodity machines
● Blocks
○ About 128 MB per block
○ Large data sets
● Tolerance to node failure
○ Data replication
● Streaming data access
○ Read many times
○ Write once (batch)
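The block and replication numbers are worth making concrete. A quick sketch (the 1 GB file size is a made-up example; 128 MB blocks and 3× replication are the HDFS defaults):

```python
import math

# Hypothetical example: storing a 1 GB file in HDFS.
# 128 MB is the default block size; 3 is the default replication factor.
file_size_mb = 1024
block_size_mb = 128
replication = 3

blocks = math.ceil(file_size_mb / block_size_mb)   # the file is split into blocks
replicas = blocks * replication                    # each block is copied 3 times

print(blocks, replicas)  # 8 blocks, 24 block replicas spread over the DataNodes
```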
13. HDFS
● DataNodes (Workers)
○ Store blocks
● NameNode (Master)
○ Maintains metadata
○ Knows where the blocks are located
○ Makes DataNodes fault tolerant
○ Single point of failure
○ Secondary NameNode
17. MapReduce
● Distributed processing paradigm
○ Moving computation is cheaper than moving data
● Map
○ Map(k1,v1) -> list(k2,v2)
● Reduce
○ Reduce(k2,list(v2)) -> list(v3)
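The two signatures above can be modeled in a few lines. This is a toy, single-process sketch of the paradigm (nothing Hadoop-specific; the names are ours), showing where the implicit group-and-sort step sits between map and reduce:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """map(k1, v1) -> list(k2, v2); group and sort by k2;
    reduce(k2, list(v2)) -> list(v3)."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):          # map phase
            groups[k2].append(v2)              # group by intermediate key
    out = []
    for k2 in sorted(groups):                  # sort keys, as the framework does
        out.extend(reduce_fn(k2, groups[k2]))  # reduce phase
    return out

result = run_mapreduce(
    [(1, "to be or not to be")],
    lambda k, line: [(w, 1) for w in line.split()],
    lambda w, ones: [(w, sum(ones))],
)
print(result)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```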
18. Word Counter
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));
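The pseudocode above translates almost line for line into Python. A minimal stand-alone run (a sketch, not Hadoop code: the grouping loop plays the role of the framework's shuffle, and the case-insensitive sort mirrors the ordering used on the following slides):

```python
from collections import defaultdict

def word_count_map(key, value):
    # map(Long key, String value): emit (word, 1) for each word in the line
    for word in value.split():
        yield word, 1

def word_count_reduce(word, values):
    # reduce(String word, List values): emit (word, sum(values))
    yield word, sum(values)

lines = [(1, "Java is great"), (2, "Hadoop is also great")]

groups = defaultdict(list)                   # the framework's group step
for key, value in lines:
    for word, one in word_count_map(key, value):
        groups[word].append(one)

counts = {}
for word in sorted(groups, key=str.lower):   # the framework's sort step
    for w, total in word_count_reduce(word, groups[word]):
        counts[w] = total

print(counts)  # {'also': 1, 'great': 2, 'Hadoop': 1, 'is': 2, 'Java': 1}
```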
19. Word Counter
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));

Input:
Java is great
Hadoop is also great
20. Word Counter - Map
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));

Input:
Key  Value
1    Java is great
2    Hadoop is also great
21. Word Counter - Map
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));

map(1, “Java is great”)

Input:
Key  Value
1    Java is great
2    Hadoop is also great
22. Word Counter - Map
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));

map(1, “Java is great”)

Input:
Key  Value
1    Java is great
2    Hadoop is also great

Map output:
Key    Value
Java   1
23. Word Counter - Map
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));

map(1, “Java is great”)

Input:
Key  Value
1    Java is great
2    Hadoop is also great

Map output:
Key    Value
Java   1
is     1
24. Word Counter - Map
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));

map(1, “Java is great”)

Input:
Key  Value
1    Java is great
2    Hadoop is also great

Map output:
Key    Value
Java   1
is     1
great  1
25. Word Counter - Map
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));

map(2, “Hadoop is also great”)

Input:
Key  Value
1    Java is great
2    Hadoop is also great

Map output:
Key    Value
Java   1
is     1
great  1
26. Word Counter - Map
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));

map(2, “Hadoop is also great”)

Input:
Key  Value
1    Java is great
2    Hadoop is also great

Map output:
Key    Value
Java   1
is     1
great  1
Hadoop 1
is     1
also   1
great  1
27. Word Count - Group & Sort
map(k1,v1) -> list(k2,v2)
reduce(k2,list(v2)) -> list(v3)

Map output:
Key    Value
Java   1
is     1
great  1
Hadoop 1
is     1
also   1
great  1
28. Word Count - Group & Sort
map(k1,v1) -> list(k2,v2)
reduce(k2,list(v2)) -> list(v3)

Map output:
Key    Value
Java   1
is     1
great  1
Hadoop 1
is     1
also   1
great  1

group →
Key    Value
Java   [1]
is     [1, 1]
great  [1, 1]
Hadoop [1]
also   [1]
29. Word Count - Group & Sort
map(k1,v1) -> list(k2,v2)
reduce(k2,list(v2)) -> list(v3)

Map output:
Key    Value
Java   1
is     1
great  1
Hadoop 1
is     1
also   1
great  1

group →
Key    Value
Java   [1]
is     [1, 1]
great  [1, 1]
Hadoop [1]
also   [1]

sort →
Key    Value
also   [1]
great  [1, 1]
Hadoop [1]
is     [1, 1]
Java   [1]
30. Word Count - Reduce
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));

Grouped and sorted:
Key    Value
also   [1]
great  [1, 1]
Hadoop [1]
is     [1, 1]
Java   [1]
31. Word Count - Reduce
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));

reduce(“also”, [1])

Grouped and sorted:
Key    Value
also   [1]
great  [1, 1]
Hadoop [1]
is     [1, 1]
Java   [1]
32. Word Count - Reduce
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));

reduce(“also”, [1])

Grouped and sorted:
Key    Value
also   [1]
great  [1, 1]
Hadoop [1]
is     [1, 1]
Java   [1]

Reduce output:
Key    Value
also   1
33. Word Count - Reduce
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));

reduce(“great”, [1, 1])

Grouped and sorted:
Key    Value
also   [1]
great  [1, 1]
Hadoop [1]
is     [1, 1]
Java   [1]

Reduce output:
Key    Value
also   1
34. Word Count - Reduce
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));

reduce(“great”, [1, 1])

Grouped and sorted:
Key    Value
also   [1]
great  [1, 1]
Hadoop [1]
is     [1, 1]
Java   [1]

Reduce output:
Key    Value
also   1
great  2
35. Word Count - Reduce
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));

reduce(“Hadoop”, [1])

Grouped and sorted:
Key    Value
also   [1]
great  [1, 1]
Hadoop [1]
is     [1, 1]
Java   [1]

Reduce output:
Key    Value
also   1
great  2
Hadoop 1
36. Word Count - Reduce
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));

reduce(“is”, [1, 1])

Grouped and sorted:
Key    Value
also   [1]
great  [1, 1]
Hadoop [1]
is     [1, 1]
Java   [1]

Reduce output:
Key    Value
also   1
great  2
Hadoop 1
is     2
37. Word Count - Reduce
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));

reduce(“Java”, [1])

Grouped and sorted:
Key    Value
also   [1]
great  [1, 1]
Hadoop [1]
is     [1, 1]
Java   [1]

Reduce output:
Key    Value
also   1
great  2
Hadoop 1
is     2
Java   1
38. Distributed?
● Map tasks
○ Each input block is processed by its own map task
● Reduce tasks
○ Partitioning when grouping
39. Word Count - Partition
num partitions = 1

Map output:
Key    Value
Java   1
is     1
great  1
Hadoop 1
is     1
also   1
great  1

group →
Key    Value
Java   [1]
is     [1, 1]
great  [1, 1]
Hadoop [1]
also   [1]

sort →
Key    Value
also   [1]
great  [1, 1]
Hadoop [1]
is     [1, 1]
Java   [1]
40. Word Count - Partition
num partitions = 2

Map output:
Key    Value
Java   1
is     1
great  1
Hadoop 1
is     1
also   1
great  1

group →
Partition 1:
Key    Value
Java   [1]
is     [1, 1]

Partition 2:
Key    Value
great  [1, 1]
Hadoop [1]
also   [1]

sort →
Partition 1:
Key    Value
is     [1, 1]
Java   [1]

Partition 2:
Key    Value
also   [1]
great  [1, 1]
Hadoop [1]
41. Distributed?
● Map tasks
○ Each input block is processed by its own map task
● Reduce tasks
○ Partitioning when grouping
○ Each partition executes a reduce task
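The slides do not show how a key is assigned to a partition. A common scheme, and roughly what Hadoop's default HashPartitioner does, is hash(key) mod numPartitions. A small sketch (Python's built-in hash() is only a stand-in for Java's hashCode()):

```python
def partition(key, num_partitions):
    # Like Hadoop's default HashPartitioner: hash(key) % numReduceTasks
    return hash(key) % num_partitions

pairs = [("Java", 1), ("is", 1), ("great", 1),
         ("Hadoop", 1), ("is", 1), ("also", 1), ("great", 1)]

num_partitions = 2
buckets = {p: [] for p in range(num_partitions)}
for word, one in pairs:
    buckets[partition(word, num_partitions)].append((word, one))

# Key property: every occurrence of a word lands in the same partition,
# so each reduce task sees the complete value list for its keys.
```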
49. Hive
● Data Warehouse
● SQL-Like analysis system
SELECT word, COUNT(*)
FROM table LATERAL VIEW EXPLODE(SPLIT(line, “ ”)) t AS word
GROUP BY word
ORDER BY word ASC;
● Executes MapReduce underneath!
51. Hadoop Ecosystem
● ZooKeeper:
○ Centralized coordination system
● Pig
○ Data-flow language to analyze large data sets
● Kafka:
○ Distributed messaging system
● Sqoop:
○ Transfer between RDBMS - HDFS
● ...
53. Trovit
● What is it:
○ Vertical search engine.
○ Real estate, cars, jobs, products, vacations.
● Challenges:
○ Millions of documents to index
○ Traffic generates a huge amount of log files
54. Trovit
● Legacy:
○ Used MySQL to support document indexing
○ It didn’t scale!
● Batch processing:
○ Hadoop with a pipeline workflow
○ Problem solved!
● Real time processing:
○ Storm to improve freshness
● More challenges:
○ Content analysis
○ Traffic analysis