MapReduce with
Hadoop at MyLife
June 6, 2013
Speaker: Jeff Meister
Topics of Talk
• What are MapReduce and Hadoop?
• When would you want to use them?
• How do they work?
• What does Hadoop do for you?
• How do you write MapReduce programs
to take advantage of that?
• What do we use them for at MyLife?
What are MapReduce
and Hadoop?
• MapReduce is a programming model for
parallel processing of large datasets
• An idea for how to write programs under
certain constraints
• Hadoop is an open-source implementation
of MapReduce
• Designed for clusters of commodity
machines
Motivation:
Why would you use
MapReduce?
Background:
Disk vs. Memory
• Memory
• Where the computer
keeps data it’s
currently working on
• Fast response time,
random access
supported
• Expensive: typical size
in tens of GB
• Hard disk
• More permanent
storage of data for
future tasks
• Slow response time,
random access far slower
than sequential reads
• Cheap: typical size in
hundreds or
thousands of GB
Example Task on
Small Datasets
ID      Public record
R1      Steve Jones, 36, 12 Main St, 10001
R2      John Brown, 72, 625 8th Ave, 90210
R3      James Davis, 23, 10 Broadway, 20202
R4      Tom Lewis, 45, 95 Park Pl, 90024
R5      Tim Harris, 33, PO Box 256, 33514
...     ...
R2000   Adam Parker, 59, 82 F St, 45454
Size: 8 MB

ID      Phone number
P1      Robert White, 45121, (654) 321-4702
P2      David Johnson, 07470, (973) 602-2519
P3      Scott Lee, 23910, (602) 412-2255
P4      Steve Jones, 10001, (212) 347-3380
P5      John Wayne, 13284, (312) 446-8878
...     ...
P1000   Tom Lewis, 90024, (650) 945-2319
Size: 3.5 MB
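
The slides do not spell out the algorithm for this task. A minimal sketch, assuming the goal is to pair public records with phone records that share a name and ZIP code (the field layout and the function name `match_in_memory` are illustrative assumptions), could build an in-memory index over one table and probe it with the other:

```python
# Sample rows copied from the slide; the tuple layout is an assumption.
public_records = {
    "R1": ("Steve Jones", 36, "12 Main St", "10001"),
    "R2": ("John Brown", 72, "625 8th Ave", "90210"),
    "R4": ("Tom Lewis", 45, "95 Park Pl", "90024"),
}
phone_records = {
    "P1": ("Robert White", "45121", "(654) 321-4702"),
    "P4": ("Steve Jones", "10001", "(212) 347-3380"),
    "P1000": ("Tom Lewis", "90024", "(650) 945-2319"),  # last phone row on the slide
}


def match_in_memory(public, phones):
    """Pair public-record IDs with phone IDs that share (name, ZIP)."""
    # Index the phone table by the assumed match key.
    index = {}
    for phone_id, (name, zip_code, _number) in phones.items():
        index.setdefault((name, zip_code), []).append(phone_id)

    # Probe the index once per public record.
    matches = []
    for record_id, (name, _age, _address, zip_code) in public.items():
        for phone_id in index.get((name, zip_code), []):
            matches.append((record_id, phone_id))
    return matches


print(match_in_memory(public_records, phone_records))
```

This works only while the index fits in memory, which is the limitation the next slide turns to.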
Real World:
Large Datasets
• 290 million public records = 380 GB
• 228 million phone records = 252 GB
• We could improve the previous algorithm, but...
• The machine doesn’t have enough memory
• Would spend lots of time moving pieces of data
between disk and memory
• Disk is so slow, the task is now impractical
• What to do? Use Hadoop MapReduce!
• Divide into smaller tasks, run them in parallel
Hadoop:
What does it do?
How do you work with it?
Components of the
Hadoop System
• Hadoop Distributed File System
(HDFS)
• Splits up files into blocks, stores
them on multiple computers
• Knows which blocks are on
each machine
• Transfers blocks between
machines over the network
• Replicates blocks, designed to
tolerate frequent machine
failures
• MapReduce engine
• Supports distributed
computation
• Programmer writes Map and
Reduce functions
• Engine takes care of
parallelization, so you can focus
on your work
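
To make the block and replica idea concrete, here is a toy sketch. It is not how HDFS actually chooses machines (real placement is rack-aware); the round-robin policy, the function names, and the sizes used are all assumptions for illustration.

```python
import math


def place_blocks(file_size_mb, block_size_mb, nodes, replication):
    """Toy round-robin placement: returns {block_index: [node, ...]}.

    Assumes replication <= len(nodes) so replicas land on distinct nodes.
    """
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(n_blocks)
    }


def readable_after_failure(placement, failed_node):
    """True if every block still has a replica on a surviving node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(300, 128, nodes, replication=3)  # illustrative sizes
print(placement)
print(all(readable_after_failure(placement, n) for n in nodes))
```

With three replicas per block, losing any single machine still leaves every block readable, which is the property the slide describes.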
The Map and
Reduce Functions
• map : (K1, V1) → List(K2, V2)
• Take an input record and produce (emit) a list of
intermediate (key, value) pairs
• reduce : (K2, List(V2)) → List(K3, V3)
• Examine the values for each intermediate key,
produce a list of output records
• Critical observation: output type of map ≠ input type
of reduce!
• What’s going on in between?
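
As a sketch of these two signatures, here is term frequency counting written as plain Python generators. This is not Hadoop's API; the names `map_fn` and `reduce_fn` and the choice of line offset as K1 are assumptions for illustration.

```python
from typing import Iterable, Iterator, Tuple


def map_fn(key: int, value: str) -> Iterator[Tuple[str, int]]:
    """(K1, V1) = (line offset, line text) -> list of (word, 1) pairs."""
    for word in value.split():
        yield (word.lower(), 1)


def reduce_fn(key: str, values: Iterable[int]) -> Iterator[Tuple[str, int]]:
    """(K2, List(V2)) = (word, [1, 1, ...]) -> list of (word, total) pairs."""
    yield (key, sum(values))


print(list(map_fn(0, "the cat saw the dog")))
print(list(reduce_fn("the", [1, 1])))
```

Note that `map_fn` emits two separate `("the", 1)` pairs, while `reduce_fn` expects one key with a list of values; something has to regroup the pairs in between.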
The “Magic”:
A Fast Parallel Sort
• The core of Hadoop MapReduce is a
distributed parallel sorting algorithm
• Hadoop guarantees that the input to each
reducer is sorted by key (K2)
• All the (K2, V2) pairs from the mappers
are grouped by key
• The reducer gets a list of values
corresponding to each key
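
The sort-and-group step can be imitated on a single machine with a sort followed by a group-by. This is only a sketch of the effect, not of Hadoop's distributed implementation; the function name `shuffle` is an assumption.

```python
from itertools import groupby
from operator import itemgetter


def shuffle(pairs):
    """Sort (K2, V2) pairs by key, then yield (K2, [V2, ...]) per key."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _key, value in group]


# Intermediate pairs as a word-count mapper might emit them.
mapped = [("the", 1), ("cat", 1), ("saw", 1), ("the", 1), ("dog", 1)]
grouped = list(shuffle(mapped))
print(grouped)
```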
Why Is It Fast?
• Imagine how you might sort a deck of cards
• The most intuitive procedure for humans is
very inefficient for computers
• Turns out an efficient algorithm, merge sort, is
less straightforward
• Split the data up into smaller pieces, sort
the pieces individually, then merge them
• Hadoop does a giant parallel merge sort over its
cluster: mappers sort their output, reducers merge
the sorted pieces
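
The split, sort, merge idea can be sketched on one machine. The chunk size and the function name `chunked_sort` are assumptions, and each piece is sorted with Python's built-in `sorted` rather than a hand-written merge sort; only the final step is a true k-way merge.

```python
import heapq


def chunked_sort(items, chunk_size):
    """Split into pieces, sort each piece, then merge the sorted pieces."""
    chunks = [
        sorted(items[i:i + chunk_size])
        for i in range(0, len(items), chunk_size)
    ]
    # heapq.merge lazily merges already-sorted inputs into one sorted stream.
    return list(heapq.merge(*chunks))


print(chunked_sort([7, 2, 9, 4, 1, 8, 3], chunk_size=3))
```

In a cluster, each piece could be sorted on a different machine at the same time, so only the merge needs to see more than one piece.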
Example Task
with MapReduce
• map : (source_id, record) → List(match_key, source_id)
• For each input record, select the fields to match by, make a
key out of them
• Use the record’s unique identifier as the value
• reduce : (match_key, List(source_id)) →
List(public_record_id, phone_id)
• For each match key, look through the list of unique IDs
• If we find both a public record ID and a phone ID in the
same list, match!
• The profiles with these IDs share all fields in the key
• Generate the output pair of matched IDs
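
A minimal single-process sketch of this job follows. The match key (name plus ZIP code), the "R"/"P" ID prefixes used to tell the two sources apart, and the tiny `run_mapreduce` driver are assumptions for illustration; a real job would run on Hadoop rather than in one Python process.

```python
from collections import defaultdict


def map_fn(source_id, record):
    """Emit (match_key, source_id); the key is assumed to be (name, ZIP)."""
    yield (record["name"], record["zip"]), source_id


def reduce_fn(match_key, source_ids):
    """Pair every public-record ID with every phone ID under the same key."""
    public_ids = [s for s in source_ids if s.startswith("R")]
    phone_ids = [s for s in source_ids if s.startswith("P")]
    for record_id in public_ids:
        for phone_id in phone_ids:
            yield record_id, phone_id


def run_mapreduce(inputs, mapper, reducer):
    """Toy driver: map, group by key, then reduce in sorted key order."""
    groups = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in mapper(key, value):
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output


records = [
    ("R1", {"name": "Steve Jones", "zip": "10001"}),
    ("R4", {"name": "Tom Lewis", "zip": "90024"}),
    ("P1", {"name": "Robert White", "zip": "45121"}),
    ("P4", {"name": "Steve Jones", "zip": "10001"}),
    ("P1000", {"name": "Tom Lewis", "zip": "90024"}),
]
matches = run_mapreduce(records, map_fn, reduce_fn)
print(matches)
```

Both sources go through the same mapper, so no machine ever needs to hold a whole table; each reducer call sees only the IDs that share one key.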
Example Task on
Small Datasets
ID      Public record
R1      Steve Jones, 36, 12 Main St, 10001
R2      John Brown, 72, 625 8th Ave, 90210
R3      James Davis, 23, 10 Broadway, 20202
R4      Tom Lewis, 45, 95 Park Pl, 90024
R5      Tim Harris, 33, PO Box 256, 33514
...     ...
R2000   Adam Parker, 59, 82 F St, 45454
Size: 8 MB

ID      Phone number
P1      Robert White, 45121, (654) 321-4702
P2      David Johnson, 07470, (973) 602-2519
P3      Scott Lee, 23910, (602) 412-2255
P4      Steve Jones, 10001, (212) 347-3380
P5      John Wayne, 13284, (312) 446-8878
...     ...
P1000   Tom Lewis, 90024, (650) 945-2319
Size: 3.5 MB
When is MapReduce
Appropriate?
• To benefit from using Hadoop:
• The data must be decomposable into many
(key, value) pairs
• Each mapper runs the same operation,
independently of other mappers
• Map output keys should sort values into groups
of similar size
• Sequential algorithms that are more straightforward
may need redesign for the MapReduce model
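
The point about similarly sized groups can be checked before running a job. A rough sketch, using made-up records and a hypothetical helper `largest_group_share`, measures how much of the data the biggest key group would receive:

```python
from collections import Counter


def largest_group_share(keys):
    """Fraction of all values that land in the single largest key group."""
    counts = Counter(keys)
    return max(counts.values()) / len(keys)


# Made-up (name, ZIP) records for illustration only.
records = [
    ("Steve Jones", "10001"),
    ("John Smith", "10001"),
    ("Ann Lee", "10001"),
    ("Tom Lewis", "90024"),
]
by_zip = [zip_code for _name, zip_code in records]
by_name_and_zip = list(records)

print(largest_group_share(by_zip))           # one ZIP dominates
print(largest_group_share(by_name_and_zip))  # evenly spread
```

A key that sends most values to one group leaves a single reducer doing most of the work while the others sit idle.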
Common Applications
of MapReduce
• Many common distributed tasks are easily
expressible with MapReduce. A few examples:
• Term frequency counting
• Pattern searching
• Of course, sorting
• Graph algorithms, such as reversal (Web links)
• Inverted index generation
• Data mining (clustering, statistics)
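
As one worked instance from this list, Web link reversal fits the model directly. The page names and the toy `run_mapreduce` driver below are assumptions for illustration, not Hadoop code.

```python
from collections import defaultdict


def map_fn(page, outlinks):
    """For each link page -> target, emit (target, page)."""
    for target in outlinks:
        yield target, page


def reduce_fn(target, sources):
    """Collect every page that links to target."""
    yield target, sorted(sources)


def run_mapreduce(inputs, mapper, reducer):
    """Toy driver: map, group by key, then reduce in sorted key order."""
    groups = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in mapper(key, value):
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output


web = [
    ("a.html", ["b.html", "c.html"]),
    ("b.html", ["c.html"]),
    ("c.html", ["a.html"]),
]
inbound = run_mapreduce(web, map_fn, reduce_fn)
print(inbound)
```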
MapReduce at MyLife
Applications of
MapReduce at MyLife
• We regularly run computations over large sets of
people data
• Who’s Searching For You
• Content-based aggregation pipeline (1.5 TB)
• Deltas of licensed data updates (300 GB)
• Generating search indexes for old platform
• Various ad hoc jobs involving matching, searching,
extraction, counting, de-duplication, and more
Hadoop Cluster
Specifications
• Currently 63 machines, each configured to run 4 or 6 map or
reduce tasks at once (total capacity 296)
• CPU:
• Each machine: 2x quad-core Opteron @ 2.2 GHz
• Memory:
• Each machine: 32 GB
• Cluster total: 2 TB
• Hard disk:
• Each machine: between 3 and 9 TB
• Total HDFS capacity: 345 TB
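
The figures on this slide can be cross-checked with a little arithmetic. The split between 4-slot and 6-slot machines is not stated in the talk; it is inferred here under the assumption that every one of the 63 machines runs exactly 4 or exactly 6 tasks.

```python
machines = 63
total_slots = 296
memory_per_machine_gb = 32

# Solve: four_slot + six_slot = machines
#        4 * four_slot + 6 * six_slot = total_slots
six_slot = (total_slots - 4 * machines) // 2
four_slot = machines - six_slot

total_memory_gb = machines * memory_per_machine_gb

print(four_slot, six_slot)   # inferred machine counts
print(total_memory_gb)       # close to the 2 TB quoted on the slide
```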
Other Companies
Using Hadoop
• Yahoo! - Index calculations for Web search
• Facebook - Analytics and machine learning
• World’s largest Hadoop cluster!
• Amazon - Supports Hadoop on EC2/S3 cloud services
• LinkedIn
• People You May Know
• Viewers of This Profile Also Viewed
• Apple - Used in iAd platform
• Twitter - Data warehousing and analytics
• Lots more... http://wiki.apache.org/hadoop/PoweredBy
Further Reading
• Google research papers
• Google File System, SOSP 2003
• MapReduce, OSDI 2004
• BigTable, OSDI 2006
• Hadoop manual: http://hadoop.apache.org/
• Other Hadoop-related projects from
Apache: Cassandra, HBase, Hive, Pig
