What is MapReduce?
• Originated at Google (OSDI ’04, Operating Systems Design and Implementation)
• A simple programming model for data processing
• Designed for processing large datasets
MapReduce features
• Parallel
• Runs on commodity hardware
• Fault tolerant
Three phases of MapReduce
• Map
• Shuffle
• Reduce
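The three phases can be sketched in a few lines of plain Python. This is a single-process illustration under simplifying assumptions, not how Hadoop implements them; `run_mapreduce`, `word_map` and `count_reduce` are hypothetical names chosen for this sketch.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, shuffle and reduce in one process (illustration only)."""
    # Map: each input record may emit zero or more (key, value) pairs.
    mapped = []
    for key, value in records:
        mapped.extend(map_fn(key, value))

    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce: collapse each key's list of values into output pairs.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

def word_map(_, line):
    # Emit (word, 1) for every word in the line.
    for word in line.split():
        yield (word, 1)

def count_reduce(word, counts):
    # Sum the ones collected for this word.
    yield (word, sum(counts))

result = run_mapreduce([(0, "a b a"), (1, "b a")], word_map, count_reduce)
print(result)
```

In a real cluster the map and reduce calls run on different machines, and the shuffle moves data over the network between them.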
Example for map
• Let map(k, v) =
    foreach char c in v:
      emit(k, c)
• (“A”, “cats”) -> (“A”, “c”), (“A”, “a”), (“A”, “t”), (“A”, “s”)
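A minimal Python rendering of this map function, assuming `emit` can be modelled as `yield`; the name `map_chars` is made up for the sketch.

```python
def map_chars(k, v):
    """Emit one (key, character) pair per character of the value."""
    for c in v:
        yield (k, c)

pairs = list(map_chars("A", "cats"))
print(pairs)
```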
Second map example
• Let map(k, v) =
    emit(k.toUpper(), v.toUpper())
• (“foo”, “bar”) -> (“FOO”, “BAR”)
• (“Foo”, “other”) -> (“FOO”, “OTHER”)
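The same transformation as a Python sketch (`map_upper` is a hypothetical name). Note that the two inputs end up under the same key, so a later shuffle would group them together.

```python
def map_upper(k, v):
    """Emit the pair with both key and value upper-cased."""
    yield (k.upper(), v.upper())

out = list(map_upper("foo", "bar")) + list(map_upper("Foo", "other"))
# Both records now share the key "FOO".
keys = {k for k, _ in out}
print(out, keys)
```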
Third map example
• Let map(k, v) =
    if (isPrime(v)) then emit(k, v)
• (“foo”, 7) -> (“foo”, 7)
• (“test”, 10) -> (nothing)
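A map function may emit nothing at all, which makes it a filter. The sketch below assumes a simple trial-division `is_prime`; both function names are illustrative.

```python
def is_prime(n):
    """Trial division; good enough for small illustrative inputs."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def map_primes(k, v):
    """Emit the pair only when the value is prime."""
    if is_prime(v):
        yield (k, v)

kept = list(map_primes("foo", 7))
dropped = list(map_primes("test", 10))
print(kept, dropped)
```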
Reduce example
• Let reduce(k, vals) =
    sum = 0
    foreach int v in vals:
      sum += v
    emit(k, sum)
• (“A”, [42, 100, 312]) -> (“A”, 454)
• (“B”, [12, 6, -2]) -> (“B”, 16)
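The summing reduce can be written in Python as follows; `reduce_sum` is a name chosen for this sketch, and `yield` again stands in for `emit`.

```python
def reduce_sum(k, vals):
    """Add up all values collected for one key and emit a single pair."""
    total = 0
    for v in vals:
        total += v
    yield (k, total)

a = list(reduce_sum("A", [42, 100, 312]))
b = list(reduce_sum("B", [12, 6, -2]))
print(a, b)
```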
Interface InputFormat
• Two methods
• getSplits
  How to split the input data
• getRecordReader
  How to read the input data
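To make the two-method contract concrete, here is a toy Python analogue. It is not the Hadoop API: the class names, the snake_case method names and the in-memory list of lines are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod

class InputFormat(ABC):
    """Toy analogue of the two-method contract (not the Hadoop API)."""

    @abstractmethod
    def get_splits(self, num_splits):
        """Return a list of splits, each describing a slice of the input."""

    @abstractmethod
    def get_record_reader(self, split):
        """Return an iterator of (key, value) records for one split."""

class LineListInputFormat(InputFormat):
    """Hypothetical format over an in-memory list of lines."""

    def __init__(self, lines):
        self.lines = lines

    def get_splits(self, num_splits):
        # Ceiling division so every line belongs to exactly one split.
        size = max(1, -(-len(self.lines) // num_splits))
        return [(start, min(start + size, len(self.lines)))
                for start in range(0, len(self.lines), size)]

    def get_record_reader(self, split):
        start, end = split
        # Key is the line index, value is the line text.
        return ((i, self.lines[i]) for i in range(start, end))

fmt = LineListInputFormat(["a", "b", "c", "d", "e"])
splits = fmt.get_splits(2)
records = [list(fmt.get_record_reader(s)) for s in splits]
print(splits, records)
```

Each split would be handed to one map task, which reads its records through the reader.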
Calculating the number of map tasks we need
• goalSize = totalSize / mapred.map.tasks
• mapred.map.tasks (defined in the job configuration; only a hint)
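Why the setting is only a hint can be shown with a small calculation. The sketch assumes the split-size rule of the classic `mapred` `FileInputFormat`, namely max(minSize, min(goalSize, blockSize)), and ignores details such as the allowance for a slightly oversized last split. The function names, the 1 GiB input and the 64 MiB block size are assumptions for illustration.

```python
def compute_split_size(total_size, requested_maps, block_size, min_size=1):
    """Assumed rule: max(minSize, min(goalSize, blockSize))."""
    goal_size = total_size // max(requested_maps, 1)
    return max(min_size, min(goal_size, block_size))

def estimate_map_tasks(total_size, requested_maps, block_size, min_size=1):
    """Rough number of splits (and so map tasks) for a single input file."""
    split_size = compute_split_size(total_size, requested_maps,
                                    block_size, min_size)
    # Ceiling division: a final partial split still needs a task.
    return -(-total_size // split_size)

MB = 1024 * 1024
# Asking for 2 maps: goalSize exceeds the block size, so blocks win.
few = estimate_map_tasks(1024 * MB, 2, 64 * MB)
# Asking for 32 maps: goalSize is below the block size, so the hint wins.
many = estimate_map_tasks(1024 * MB, 32, 64 * MB)
print(few, many)
```

Under these assumptions a request for fewer maps than there are blocks is effectively ignored, while a request for more maps produces smaller splits.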
Number of reduces
• Which factor, 0.95 or 1.75? (Multiply it by the number of nodes × the maximum reduce tasks per node.)
• At 0.95, all of the reduces can launch immediately and start transferring map outputs as the maps finish.
• At 1.75, the faster nodes finish their first round of reduces and launch a second round, which does a much better job of load balancing.
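As a worked example, the two factors can be applied to a cluster's total reduce capacity. The cluster size (10 nodes, 2 reduce slots each), the function name and the choice to round to the nearest integer are assumptions of this sketch.

```python
def suggested_reduces(nodes, reduce_slots_per_node, factor):
    """Scale the cluster's total reduce slots by the chosen factor."""
    total_slots = nodes * reduce_slots_per_node
    return round(factor * total_slots)

# Hypothetical cluster: 10 nodes with 2 reduce slots each (20 slots).
one_wave = suggested_reduces(10, 2, 0.95)   # just under one full wave
two_waves = suggested_reduces(10, 2, 1.75)  # fewer than two full waves
print(one_wave, two_waves)
```

With 0.95 the job needs slightly fewer reduces than there are slots, so all start at once; with 1.75 some slots run a second reduce after finishing their first.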
What is HDFS?
• Again originated at Google: modeled on the Google File System paper (SOSP ’03, Symposium on Operating Systems Principles)
• Redundant storage of massive amounts of data on cheap, unreliable computers
HDFS features
• Files stored as blocks
• Reliability through replication
• Single master (NameNode, NN) coordinates access and metadata
• No data caching
• Familiar interface
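The first two features can be illustrated with a toy model: a file is cut into fixed-size blocks, and each block is copied to several machines. The round-robin placement below is a deliberate simplification and not HDFS's actual placement policy; the 64 MB block size, replication factor of 3, node names and function names are assumptions.

```python
def block_count(file_size, block_size):
    """Number of blocks needed; the last block may be partly filled."""
    return -(-file_size // block_size)

def place_replicas(num_blocks, datanodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(datanodes):
        raise ValueError("not enough nodes for the replication factor")
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

MB = 1024 * 1024
blocks = block_count(200 * MB, 64 * MB)
placement = place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4"], 3)
print(blocks, placement)
```

Losing any single node in this model still leaves at least two copies of every block, which is the sense in which replication provides reliability.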
NN as a single point of failure (SPOF) and failure resistance
• Store metadata in more than one place (local disk / shared storage)
• Secondary NN
  Merges the edit log with the fsimage
  Reduces recovery time
• NN HA (high availability)
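The checkpoint idea behind the Secondary NN can be sketched with a toy model in which the fsimage is a set of paths and the edit log is a list of operations. The data shapes, operation names and function names are assumptions made for this illustration and do not reflect the real on-disk formats.

```python
def apply_edits(fsimage, edit_log):
    """Replay logged operations on top of a namespace snapshot."""
    image = set(fsimage)
    for op, path in edit_log:
        if op == "create":
            image.add(path)
        elif op == "delete":
            image.discard(path)
    return image

def checkpoint(fsimage, edit_log):
    """Merge the log into a new image and start over with an empty log."""
    return apply_edits(fsimage, edit_log), []

fsimage = {"/a"}
edits = [("create", "/b"), ("delete", "/a"), ("create", "/c")]

new_image, new_log = checkpoint(fsimage, edits)
# Recovery after the checkpoint replays an empty log over the new image.
recovered = apply_edits(new_image, new_log)
print(new_image, new_log, recovered)
```

Without the checkpoint, a restart would have to replay every logged edit since the last image was written, so merging regularly keeps recovery short.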
Resources & events
• http://class10e.com/Cloudera/
• http://blog.cloudera.com/blog/
• Hadoop Summit
http://hadoopsummit.org/
• Hadoop World
http://www.hadoopworld.com/
hadoop introduce
