Introduction to Hadoop
1. What is MapReduce?
• Originated at Google (OSDI '04, Operating Systems Design and Implementation)
• A simple programming model for data processing
• Designed for processing large datasets
6. Triple example
• Let map(k, v) =
    if (isPrime(v)) then emit(k, v)
• (“foo”, 7) -> (“foo”, 7)
• (“test”, 10) -> (nothing)
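As a rough illustration, the map function above could be sketched in Python as follows. The names `is_prime` and `map_fn` are hypothetical helpers introduced for this sketch and are not part of any Hadoop API; a generator's `yield` stands in for `emit`.

```python
from typing import Iterator, Tuple


def is_prime(n: int) -> bool:
    """Trial-division primality check (hypothetical helper for this sketch)."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def map_fn(key: str, value: int) -> Iterator[Tuple[str, int]]:
    """Emit the pair unchanged when the value is prime; otherwise emit nothing."""
    if is_prime(value):
        yield (key, value)


print(list(map_fn("foo", 7)))    # [('foo', 7)]
print(list(map_fn("test", 10)))  # []
```

A map function may emit zero, one, or many pairs per input record, which is why the sketch returns an iterator rather than a single value.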
7. Reduce example
let reduce(k, vals) =
    sum = 0
    foreach int v in vals:
        sum += v
    emit(k, sum)
(“A”, [42, 100, 312]) -> (“A”, 454)
(“B”, [12, 6, -2]) -> (“B”, 16)
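The reduce step above can be sketched the same way in Python. `reduce_fn` is a hypothetical name used only for illustration, and `yield` again plays the role of `emit`.

```python
from typing import Iterable, Iterator, Tuple


def reduce_fn(key: str, values: Iterable[int]) -> Iterator[Tuple[str, int]]:
    """Sum all values that share a key and emit a single (key, sum) pair."""
    total = 0
    for v in values:
        total += v
    yield (key, total)


print(list(reduce_fn("A", [42, 100, 312])))  # [('A', 454)]
print(list(reduce_fn("B", [12, 6, -2])))     # [('B', 16)]
```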
9. Calculating the number of map tasks we need
• goalSize = totalSize / mapred.map.tasks
• mapred.map.tasks is set in the job configuration and is only a hint
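To see why the configured value is only a hint, here is a minimal Python sketch of the arithmetic. It assumes, based on the behaviour of the old `mapred` API's FileInputFormat as I understand it, that the actual split size is the goal size clamped between a minimum split size and the block size. The function names, the 64 MiB block size, and the minimum size of 1 byte are assumptions made for this example.

```python
def goal_size(total_size: int, map_tasks_hint: int) -> int:
    """goalSize = totalSize / mapred.map.tasks (a hint of 0 is treated as 1)."""
    return total_size // max(map_tasks_hint, 1)


def split_size(goal: int, block_size: int, min_size: int = 1) -> int:
    """Assumed clamp: never below min_size, never above one block."""
    return max(min_size, min(goal, block_size))


def estimated_map_tasks(total_size: int, map_tasks_hint: int, block_size: int) -> int:
    """Rough count: one map task per split (ceiling division)."""
    size = split_size(goal_size(total_size, map_tasks_hint), block_size)
    return -(-total_size // size)


GIB = 1024 ** 3
MIB = 1024 ** 2
# Asking for only 2 maps over 10 GiB with 64 MiB blocks still yields 160 splits.
print(estimated_map_tasks(10 * GIB, 2, 64 * MIB))  # 160
```

Under these assumptions a hint smaller than the number of blocks has no effect, because the split size is capped at one block.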
12. Number of reduces
• Rule of thumb: 0.95 or 1.75 × (number of nodes × maximum reduce slots per node)
• With 0.95, all of the reduces can launch immediately and start transferring map outputs as the maps finish.
• With 1.75, the faster nodes finish their first round of reduces and launch a second round, which does a much better job of load balancing.
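A small Python sketch of this rule of thumb follows. It assumes the factor is multiplied by the cluster's total reduce capacity (number of nodes times reduce slots per node), as the Hadoop MapReduce tutorial describes it. The function name and the example cluster of 10 nodes with 2 reduce slots each are hypothetical.

```python
from fractions import Fraction


def suggested_reduces(nodes: int, reduce_slots_per_node: int, factor: str) -> int:
    """Scale total reduce capacity by the factor, rounding down.

    The factor is passed as a string and parsed as an exact fraction so the
    result is not affected by floating-point rounding.
    """
    total_slots = nodes * reduce_slots_per_node
    return int(Fraction(factor) * total_slots)


# Hypothetical cluster: 10 nodes, 2 reduce slots per node (20 slots in total).
print(suggested_reduces(10, 2, "0.95"))  # 19, fits in a single wave
print(suggested_reduces(10, 2, "1.75"))  # 35, needs a second wave
```

With 19 reduces every task gets a slot at once. With 35, the 15 tasks left over after the first wave go to whichever nodes finish first.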
15. What is HDFS?
• Also originated at Google (SOSP '03, Symposium on Operating Systems Principles)
• Redundant storage of massive amounts of
data on cheap and unreliable computers
16. HDFS features
• Files stored as blocks
• Reliability through replication
• Single master (NameNode, NN) coordinates access and metadata
• No data caching
• Familiar interface
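To make the block and replication points concrete, the following Python sketch estimates how many blocks a file occupies and how much raw disk space its replicas consume. The 64 MiB block size and replication factor of 3 are assumptions chosen for illustration (both are configurable in HDFS), and the function names are hypothetical.

```python
def block_count(file_size: int, block_size: int) -> int:
    """Number of blocks needed to hold the file (the last block may be partial)."""
    return -(-file_size // block_size)  # ceiling division


def raw_storage(file_size: int, replication: int) -> int:
    """Total bytes stored across the cluster when every block is replicated."""
    return file_size * replication


MIB = 1024 ** 2
GIB = 1024 ** 3
BLOCK_SIZE = 64 * MIB   # assumed block size
REPLICATION = 3         # assumed replication factor

print(block_count(1 * GIB, BLOCK_SIZE))             # 16 blocks
print(block_count(100 * MIB, BLOCK_SIZE))           # 2 blocks (64 MiB + 36 MiB)
print(raw_storage(1 * GIB, REPLICATION) // GIB)     # 3 GiB of raw storage
```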
17. NN SPOF (single point of failure) and failure resistance
• Store metadata in multiple places (local disk / shared storage)
• Secondary NN: merges the edit log with the fsimage, reducing recovery time
• NN HA (high availability)
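The idea of merging the edit log into the fsimage can be illustrated with a toy Python model. This is not how HDFS represents its metadata. The set of paths, the `(operation, path)` edit format, and the `checkpoint` function are all hypothetical simplifications made for this sketch.

```python
from typing import List, Set, Tuple

Edit = Tuple[str, str]  # (operation, path); hypothetical edit-log record


def checkpoint(fsimage: Set[str], edit_log: List[Edit]) -> Tuple[Set[str], List[Edit]]:
    """Replay the edit log onto a copy of the image; return the new image and an empty log."""
    merged = set(fsimage)
    for op, path in edit_log:
        if op == "create":
            merged.add(path)
        elif op == "delete":
            merged.discard(path)
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged, []


image = {"/data/a"}
log = [("create", "/data/b"), ("delete", "/data/a")]
image, log = checkpoint(image, log)
print(sorted(image))  # ['/data/b']
print(log)            # []
```

In this model a restart only has to replay the edits written since the last checkpoint, which is the sense in which periodic merging reduces recovery time.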