MAPREDUCE SUCCINCTLY
Data everywhere
Problem - We are drowning in data
Hadoop’s place
Effective storage and processing of large chunks of data
Google GFS and MapReduce
• Google was dealing with large amounts of data more than 10 years ago
• Documented experience in a series of papers
• The MapReduce programming model
• Google File System
• Scalable model that was implemented in Hadoop
Disk speeds
• Processing a 10 TB file on one machine
• Time – ~430 minutes
• Stored as 1 TB on each of 10 machines
• Time – ~43 minutes
To store data at scale you need to use multiple disks/machines
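The arithmetic behind these figures can be sketched as follows. The slide does not state a read rate, so the sketch derives the rate implied by "10 TB in ~430 minutes"; the decimal terabyte and the `read_minutes` helper are assumptions made for illustration.

```python
TB = 10**12  # decimal terabyte; an assumption, the slide does not say which unit it uses


def read_minutes(total_bytes, bytes_per_second, machines=1):
    """Minutes to read total_bytes split evenly across machines that read in parallel."""
    return total_bytes / machines / bytes_per_second / 60


# Sustained read rate implied by the slide's "10 TB in ~430 minutes"
implied_rate = 10 * TB / (430 * 60)

single = read_minutes(10 * TB, implied_rate)
spread = read_minutes(10 * TB, implied_rate, machines=10)

print(f"implied rate: {implied_rate / 10**6:.0f} MB/s")
print(f"1 machine: {single:.0f} min, 10 machines: {spread:.0f} min")
```

The tenfold reduction holds only if the ten machines read their shares at the same time and at the same rate.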
Processor trends
• CPU speeds are not growing exponentially
• Processors take less power
• Processors are able to do more in one cycle
Product Name | Intel® Core™ i7-920 Processor (8M Cache, 2.66 GHz, 4.80 GT/s Intel® QPI) | Intel® Core™ i7-6700K Processor (8M Cache, up to 4.20 GHz)
Code Name | Bloomfield | Skylake
Launch Date | Q4'08 | Q3'15
Lithography | 45 nm | 14 nm
Recommended Customer Price | BOX : $305.00 | BOX : $350.00
# of Cores | 4 | 4
# of Threads | 8 | 8
Processor Base Frequency | 2.66 GHz | 4 GHz
Max Turbo Frequency | 2.93 GHz | 4.2 GHz
TDP | 130 W | 91 W
Source - http://ark.intel.com/compare/88195,37147
To scale you need to use multiple
CPUs/machines
Network speeds
• Gigabit - Speed: 1000 Mbps
• Size: 1 TB
• Transfer time: ~2 hours
Don’t move data unless you have to
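As a rough check of the transfer time above, here is a small sketch. It assumes a decimal terabyte and a link that runs at its full nominal rate with no protocol overhead; `transfer_hours` is a name invented for this example.

```python
def transfer_hours(size_bytes, link_bits_per_second):
    """Hours to push size_bytes over a link, ignoring protocol overhead and contention."""
    return size_bytes * 8 / link_bits_per_second / 3600


# 1 TB (decimal) over a 1000 Mbps link
hours = transfer_hours(10**12, 1000 * 10**6)
print(f"{hours:.1f} hours")
```

Real transfers would take longer, which only strengthens the case for moving the computation to the data.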
Example scenario
• Example that we will use to understand the problem
• Data on favorite beverage
• Calculate average cups consumed per day for each beverage
Input:
Brianna, coffee, 3
Cameron, milk, 5
Thomas, milk, 4
Wyatt, coffee, 5
Output:
coffee, 4
milk, 4.5
Example – Single Threaded
Average cups consumed by tea drinkers is 3.33
Transform
Group by beverage
Summarize and display results
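The three steps above (transform, group by beverage, summarize) might look like this in a single thread. This is a minimal sketch, not the code shown in the talk; it uses the four sample records from the example scenario, and the function names are invented here.

```python
from collections import defaultdict

RECORDS = [
    "Brianna, coffee, 3",
    "Cameron, milk, 5",
    "Thomas, milk, 4",
    "Wyatt, coffee, 5",
]


def transform(line):
    """Parse 'name, beverage, cups' into a (beverage, cups) pair."""
    _name, beverage, cups = (field.strip() for field in line.split(","))
    return beverage, int(cups)


def group_by_beverage(pairs):
    """Collect the cup counts reported for each beverage."""
    groups = defaultdict(list)
    for beverage, cups in pairs:
        groups[beverage].append(cups)
    return groups


def summarize(groups):
    """Average the cup counts of each beverage."""
    return {beverage: sum(cups) / len(cups) for beverage, cups in groups.items()}


averages = summarize(group_by_beverage(transform(line) for line in RECORDS))
for beverage, average in sorted(averages.items()):
    print(f"Average cups consumed by {beverage} drinkers is {average:.2f}")
```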
The problem of shared state
Can we avoid shared state?
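To make the problem concrete, here is one way a multi-threaded version with shared state could look. The slide's own code is not part of this transcript, so this is an assumed illustration: every thread updates the same `totals` dictionary and must take the same lock to do so.

```python
import threading
from collections import defaultdict

# beverage -> [sum of cups, number of records]; shared by every thread
totals = defaultdict(lambda: [0, 0])
totals_lock = threading.Lock()


def process_block(block):
    for beverage, cups in block:
        with totals_lock:  # every single update contends for this one lock
            entry = totals[beverage]
            entry[0] += cups
            entry[1] += 1


blocks = [
    [("coffee", 1), ("milk", 5), ("milk", 4), ("tea", 1)],
    [("coffee", 3), ("coffee", 4), ("tea", 4)],
]
threads = [threading.Thread(target=process_block, args=(block,)) for block in blocks]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

shared_averages = {beverage: total / count for beverage, (total, count) in totals.items()}
print(shared_averages)
```

The result is correct, but the lock serializes the threads, and forgetting it anywhere would introduce a race.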
Key idea – cooperating units
• Organize program into independent but cooperating units
• Programs need to be broken into a structure that will minimize the need for any shared state
• Cooperating units can work in parallel without sharing resources and cooperate as needed
Key idea – avoid shared state
Sum large list
• Add list 1
• Add list 2
• Add list 3
Add and display sum
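The diagram above can be sketched as follows: the large list is cut into sublists, each sublist is summed by its own worker, and only the partial sums are combined at the end. `parallel_sum` and its `parts` parameter are names chosen for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_sum(numbers, parts=3):
    """Sum numbers by adding `parts` sublists independently, then adding the partial sums."""
    chunk = max(1, -(-len(numbers) // parts))  # ceiling division, at least 1
    sublists = [numbers[i:i + chunk] for i in range(0, len(numbers), chunk)]
    with ThreadPoolExecutor(max_workers=parts) as pool:
        # Each worker sees only its own sublist; no state is shared between them.
        partial_sums = list(pool.map(sum, sublists))
    return sum(partial_sums)


total = parallel_sum(list(range(1, 1001)))
print(total)
```

In standard CPython, threads illustrate the structure rather than deliver a speedup for CPU-bound work; the point is that no worker needs a lock.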
How can we apply this to our problem?
• Data can be split into blocks
• Each block of data can be processed by a thread
Stage 1 - input | Stage 1 - output | Stage 2 - output | Stage 3 - output
Brianna, coffee, 1 | coffee, 1 | coffee, {1, 3, 4} | coffee, 2.67
Cameron, milk, 5 | milk, 5 | milk, {5, 4} | milk, 4.5
Thomas, milk, 4 | milk, 4 | tea, {1, 4} | tea, 2.5
Wyatt, tea, 1 | tea, 1 | |
Victoria, coffee, 3 | coffee, 3 | |
Grace, coffee, 4 | coffee, 4 | |
David, tea, 4 | tea, 4 | |
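The three stages above can be sketched with one worker per block. This is an illustrative sketch rather than the talk's implementation; the split into two blocks and the `stage1`, `stage2`, `stage3` names are assumptions.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

LINES = [
    "Brianna, coffee, 1",
    "Cameron, milk, 5",
    "Thomas, milk, 4",
    "Wyatt, tea, 1",
    "Victoria, coffee, 3",
    "Grace, coffee, 4",
    "David, tea, 4",
]


def stage1(block):
    """Runs on one block only: turn each record into a (beverage, cups) pair."""
    pairs = []
    for line in block:
        _name, beverage, cups = (field.strip() for field in line.split(","))
        pairs.append((beverage, int(cups)))
    return pairs


def stage2(block_outputs):
    """Group the pairs produced by all blocks by beverage."""
    groups = defaultdict(list)
    for pairs in block_outputs:
        for beverage, cups in pairs:
            groups[beverage].append(cups)
    return dict(groups)


def stage3(groups):
    """Average each group, rounded to two decimals for display."""
    return {beverage: round(sum(cups) / len(cups), 2) for beverage, cups in groups.items()}


blocks = [LINES[:4], LINES[4:]]  # two blocks, each handled by its own thread
with ThreadPoolExecutor() as pool:
    stage1_outputs = list(pool.map(stage1, blocks))

staged_groups = stage2(stage1_outputs)
staged_averages = stage3(staged_groups)
print(staged_averages)
```

Stage 1 needs no coordination at all; only the grouping step has to see the output of every block.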
The Akka Actor model
• Units can send and receive messages
• Mailbox
Implementation structured to avoid shared state
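The mailbox idea can be imitated in plain Python with a queue and a thread per unit. This is a stand-in for illustration only, not Akka's API: the `Actor` and `Averager` classes and the `None` stop sentinel are assumptions of this sketch.

```python
import queue
import threading


class Actor:
    """A unit with a private mailbox; its state is touched only by its own thread."""

    def __init__(self):
        self.mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def send(self, message):
        self.mailbox.put(message)

    def stop(self):
        self.mailbox.put(None)  # sentinel: stop after the messages already queued
        self._thread.join()

    def _run(self):
        while True:
            message = self.mailbox.get()
            if message is None:
                break
            self.receive(message)

    def receive(self, message):
        raise NotImplementedError


class Averager(Actor):
    def __init__(self):
        self.totals = {}  # private state, set before the actor thread starts
        super().__init__()

    def receive(self, message):
        beverage, cups = message
        total, count = self.totals.get(beverage, (0, 0))
        self.totals[beverage] = (total + cups, count + 1)


averager = Averager()
for message in [("coffee", 1), ("milk", 5), ("milk", 4), ("tea", 1),
                ("coffee", 3), ("coffee", 4), ("tea", 4)]:
    averager.send(message)
averager.stop()

actor_averages = {b: total / count for b, (total, count) in averager.totals.items()}
print(actor_averages)
```

Senders never touch `totals` directly; they only put messages in the mailbox, so the actor needs no lock around its own state.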
Implementation – Take 2
Implementation – Take 3
MapReduce framework: sorts, groups and sends data by key [Sort/Shuffle step]
The MapReduce framework
Preparation
• Break files into blocks that can be processed independently
• Locate and use code to read each record

Map - input | Map - output | Sort/shuffle - output | Reduce output
Brianna, coffee, 1 | coffee, 1 | coffee, {1, 3, 4} | coffee, 2.67
Cameron, milk, 5 | milk, 5 | milk, {5, 4} | milk, 4.5
Thomas, milk, 4 | milk, 4 | tea, {1, 4} | tea, 2.5
Wyatt, tea, 1 | tea, 1 | |
Victoria, coffee, 3 | coffee, 3 | |
Grace, coffee, 4 | coffee, 4 | |
David, tea, 4 | tea, 4 | |
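The division of labor above can be sketched as a tiny in-memory framework: the caller supplies only a map function and a reduce function, and the framework does the sort/shuffle in between. This is a toy model, not Hadoop's API; `run_mapreduce`, `cup_mapper` and `cup_reducer` are names invented for the sketch.

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(records, mapper, reducer):
    """Toy framework: map every record, sort and group by key, then reduce each group."""
    mapped = [pair for record in records for pair in mapper(record)]
    mapped.sort(key=itemgetter(0))  # sort/shuffle: bring equal keys together
    output = {}
    for key, group in groupby(mapped, key=itemgetter(0)):
        output[key] = reducer(key, [value for _key, value in group])
    return output


# The only two functions a user of the framework writes:
def cup_mapper(line):
    _name, beverage, cups = (field.strip() for field in line.split(","))
    yield beverage, int(cups)


def cup_reducer(beverage, cups):
    return round(sum(cups) / len(cups), 2)


records = [
    "Brianna, coffee, 1", "Cameron, milk, 5", "Thomas, milk, 4", "Wyatt, tea, 1",
    "Victoria, coffee, 3", "Grace, coffee, 4", "David, tea, 4",
]
mr_averages = run_mapreduce(records, cup_mapper, cup_reducer)
print(mr_averages)
```

A real framework runs the map and reduce calls on many machines, but the contract seen by the user is the same pair of functions.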
Hadoop Distributed File System
• Files are split into large blocks
• Each block is stored on multiple nodes
• Namenode tracks block location
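A simplified model of the three points above: a file is cut into fixed-size blocks, each block is assigned to several nodes, and a lookup table plays the role of the Namenode. The round-robin placement, the node names, and the helper functions are assumptions of this sketch; real HDFS placement is more involved. The 128 MB block size and replication factor of 3 are commonly cited HDFS defaults, not figures from the slide.

```python
def split_into_blocks(file_size, block_size):
    """Return (offset, length) pairs that cover the whole file."""
    return [(offset, min(block_size, file_size - offset))
            for offset in range(0, file_size, block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Round-robin placement: block index -> nodes holding a copy (simplified policy)."""
    return {block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
            for block in range(num_blocks)}


MB = 1024 * 1024
file_blocks = split_into_blocks(300 * MB, 128 * MB)  # a 300 MB file, 128 MB blocks
block_locations = place_replicas(len(file_blocks), ["n1", "n2", "n3", "n4"], replication=3)

for index, (offset, length) in enumerate(file_blocks):
    print(f"block {index}: offset={offset}, length={length}, nodes={block_locations[index]}")
```

Because every block lives on more than one node, losing a machine does not lose data, and a task can be scheduled on whichever node already holds its block.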
Other aspects
• Framework does a lot of the heavy lifting
• Machines can fail
• Tasks can fail
• Stragglers
• Users just write the Map and Reduce functions
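Because tasks share no state, a failed task can simply be run again on the same block. The sketch below illustrates that idea only; it is not how Hadoop schedules retries, and `run_with_retries`, `max_attempts` and `FlakyTask` are hypothetical names.

```python
def run_with_retries(task, block, max_attempts=3):
    """Re-run an independent task on its block until it succeeds or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(block)
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the last failure


class FlakyTask:
    """Simulated task that fails a fixed number of times before succeeding."""

    def __init__(self, failures):
        self.remaining_failures = failures

    def __call__(self, block):
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise RuntimeError("simulated task failure")
        return sum(block)


retried_result = run_with_retries(FlakyTask(failures=2), [1, 2, 3])
print(retried_result)
```

Re-running is safe only because the task reads its own block and writes nothing that other tasks depend on.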
Cup count demo – Apache Hadoop
• Demo
• Program is almost identical to what we wrote
Next steps
• Check out sample files on GitHub - https://github.com/danjebaraj/hadoopmr
• Read Google’s papers on MapReduce and GFS (HDFS)
  • http://research.google.com/archive/mapreduce.html
  • http://research.google.com/archive/gfs.html
• Get familiar with Hadoop and Apache Spark
• Become familiar with functional programming
  • Scala, F#, Clojure
• Check out Syncfusion’s free e-books on related topics
• If working with Windows, check out Syncfusion’s easy-to-use Big Data Platform - http://www.syncfusion.com/products/big-data
Related links
• http://www.syncfusion.com/products/big-data
• http://www.syncfusion.com/resources/techportal/ebooks
Thank you
Daniel Jebaraj
www.syncfusion.com
