Apache Hadoop MapReduce
What next?


Arun C. Murthy
Founder & Architect
@acmurthy (@hortonworks)




Hello! I’m Arun
• Founder/Architect at Hortonworks Inc.
  – Lead, Map-Reduce
  – Formerly, Architect Hadoop MapReduce, Yahoo
  – Responsible for running Hadoop MR as a service for all of Yahoo
    (50k nodes footprint)

• Apache Hadoop, ASF
  – VP, Apache Hadoop, ASF (Chair of Apache Hadoop PMC)
  – Long-term Committer/PMC member (full time >6 years)
  – Release Manager for hadoop-2




Agenda

• Hadoop MapReduce, State of the Art
• Hadoop YARN
   – Overview
   – State of the art

• Art of the possible
   – YARN Runtime
   – MapReduce Framework

• Q&A



Hadoop MapReduce
State of the Art




Hadoop MapReduce Classic
• JobTracker
  – Manages cluster resources and job scheduling
• TaskTracker
  – Per-node agent
  – Manages tasks

Hadoop 1 – Enterprise Ready
• Hadoop 1.x is the most stable & reliable version of
  Hadoop MapReduce ever
   – Proven to be reliable at the most demanding Hadoop clusters
     in the world
• CapacityScheduler for multi-tenancy
   – Share clusters at scale
   – Resource & user limits for fine-grained control
   – Queue & job ACLs
   – Resilient to misbehaving/rogue applications and users,
     helping drive SLAs for applications and pipelines




Hadoop 1 – Availability for MR
• JobTracker Restart
  – Enhanced to restart all jobs after the rare JobTracker (JT) failure


• JobTracker Safemode
  – Admin driven for known issues
  – Auto-monitoring of HDFS for full-stack availability




Hadoop YARN
Overview & Status Quo




MapReduce - Areas for Improvement
 • Utilization
 • Scalability
    – Maximum Cluster size – 4,000 nodes
    – Maximum concurrent tasks – 40,000
 • Hard partition of resources into map and reduce slots
 • Lacks support for alternate paradigms
 • Lacks wire-compatible protocols




Requirements
• Reliability
• Availability
• Utilization
• Wire Compatibility
• Agility & Evolution – Ability for customers to control
  upgrades to the grid software stack.
• Scalability - Clusters of 6,000-10,000 machines
   – Each machine with 16 cores, 48G/96G RAM, 24TB/36TB
     disks
   – 100,000+ concurrent tasks
   – 10,000 concurrent jobs
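
A quick back-of-envelope check of the scalability target above. The machine, core, and task counts come from the slide; comparing tasks to cores is only an illustrative assumption, not a sizing rule from the talk.

```python
# Back-of-envelope sizing for the scalability requirement.
# Counts are taken from the slide; the helper names are hypothetical.
MACHINES_LOW, MACHINES_HIGH = 6_000, 10_000
CORES_PER_MACHINE = 16
TARGET_TASKS = 100_000


def total_cores(machines: int, cores: int = CORES_PER_MACHINE) -> int:
    """Total cores in a cluster of identical machines."""
    return machines * cores


def tasks_per_machine(machines: int, tasks: int = TARGET_TASKS) -> float:
    """Average concurrent tasks each machine must host to reach the target."""
    return tasks / machines


print(total_cores(MACHINES_LOW))          # 96000
print(total_cores(MACHINES_HIGH))         # 160000
print(tasks_per_machine(MACHINES_HIGH))   # 10.0
```

At the upper end of the range, 100,000 concurrent tasks works out to about 10 tasks per 16-core machine.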

Design Centre
• Split up the two major functions of JobTracker
   – Cluster resource management
   – Application life-cycle management
• MapReduce becomes a user-land library




Concepts
• Application
   – An application is a job submitted to the framework
   – Example: a MapReduce job
• Container
   – Basic unit of allocation
   – Example – container A = 2GB, 1CPU
   – Replaces the fixed map/reduce slots
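
A minimal sketch of why a generic container can use a node better than fixed map/reduce slots. The node sizes, slot counts, and class names are hypothetical; only the 2GB/1CPU container shape comes from the slide.

```python
from dataclasses import dataclass


@dataclass
class SlotNode:
    """Classic model: capacity is split up front into map and reduce slots."""
    map_slots: int
    reduce_slots: int

    def runnable(self, pending_maps: int, pending_reduces: int) -> int:
        # A map task can never borrow an idle reduce slot, and vice versa.
        return (min(pending_maps, self.map_slots)
                + min(pending_reduces, self.reduce_slots))


@dataclass
class ContainerNode:
    """Container model: any task fits wherever enough memory and CPU remain."""
    memory_gb: int
    cpus: int

    def runnable(self, pending_tasks: int, task_gb: int = 2, task_cpus: int = 1) -> int:
        fit = min(self.memory_gb // task_gb, self.cpus // task_cpus)
        return min(pending_tasks, fit)


# Same hypothetical hardware (8GB, 4 CPUs), viewed both ways.
slot_node = SlotNode(map_slots=2, reduce_slots=2)
container_node = ContainerNode(memory_gb=8, cpus=4)

# Map-heavy phase: four maps pending, no reduces yet.
print(slot_node.runnable(pending_maps=4, pending_reduces=0))  # 2
print(container_node.runnable(pending_tasks=4))               # 4
```

In the slot model the two reduce slots sit idle during the map phase; in the container model the same capacity runs all four tasks.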




Architecture
• Resource Manager
   – Global resource scheduler
   – Hierarchical queues
• Node Manager
   – Per-machine agent
   – Manages the container life-cycle
   – Container resource monitoring
• Application Master
   – Per-application
   – Manages application scheduling and task execution
   – E.g. MapReduce Application Master
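
The division of labour above can be sketched as a toy model. This is not the YARN API: the class and method names are hypothetical, scheduling is plain first-fit on memory, and the Resource Manager launches containers directly, which simplifies the real protocol.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Per-machine agent: tracks free memory and the containers it hosts."""
    name: str
    free_mb: int
    containers: list = field(default_factory=list)

    def launch(self, app_id: str, mb: int) -> bool:
        if mb > self.free_mb:
            return False
        self.free_mb -= mb
        self.containers.append((app_id, mb))
        return True


class ResourceManager:
    """Global scheduler: knows about resources, not about tasks or jobs."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def allocate(self, app_id: str, mb: int):
        """Return the name of the first node that fits the request, else None."""
        for node in self.nodes:
            if node.launch(app_id, mb):
                return node.name
        return None


class ApplicationMaster:
    """Per-application: decides what to run and asks the RM for containers."""

    def __init__(self, app_id: str, rm: ResourceManager):
        self.app_id = app_id
        self.rm = rm

    def run_tasks(self, n_tasks: int, mb_per_task: int):
        return [self.rm.allocate(self.app_id, mb_per_task) for _ in range(n_tasks)]


rm = ResourceManager([NodeManager("nm1", 4096), NodeManager("nm2", 4096)])
am = ApplicationMaster("app_1", rm)
placements = am.run_tasks(n_tasks=5, mb_per_task=2048)
print(placements)  # ['nm1', 'nm1', 'nm2', 'nm2', None]
```

The fifth request returns None because both nodes are full; the application-level decision about what to do next (wait, retry, fail) stays in the Application Master.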

Architecture

[Figure: YARN architecture. Components shown: Client, Resource Manager, Node Manager, App Mstr (Application Master), Container. Arrow legend: MapReduce Status, Job Submission, Node Status, Resource Request.]

How do I get it?

• Available in hadoop-2.0.0-alpha release




Performance

• 2x+ across the board (HDFS, YARN, MapReduce)
• MapReduce
  – Unlocks many improvements from the Terasort record (Owen/Arun, 2009)
      – Shuffle 30%+
      – Merge improvements
  – Small jobs: Uber AM
  – Re-use of task slots (containers)

   http://hortonworks.com/delivering-on-hadoop-next-benchmarking-performance/




Resources

hadoop-2.0.0 (alpha release):
http://hadoop.apache.org/common/releases.html

Release Documentation:
http://hadoop.apache.org/common/docs/r2.0.0-alpha/




Art of the possible
YARN Runtime
MapReduce Framework




Looking ahead

• YARN
  –Runtime Improvements
  –Alternate programming models
  –Long(er) running services

• MapReduce
  –Framework enhancements
  –Unpack!




YARN - Roadmap

• Scheduler
  –Multi-dimensional resource scheduling (MAPREDUCE-4327)
  –Preemption (MAPREDUCE-3938)
  –Gang scheduling



• Runtime improvements
  –Container Isolation (MAPREDUCE-4334)
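
To illustrate the multi-dimensional scheduling item above: a request fits a node only if every resource dimension fits. This is a sketch of the idea, not the MAPREDUCE-4327 implementation; the `Resource` type and its field names are assumptions.

```python
from typing import NamedTuple


class Resource(NamedTuple):
    """Hypothetical two-dimensional resource vector."""
    memory_mb: int
    vcores: int


def fits(request: Resource, available: Resource) -> bool:
    """True only if the request fits in every dimension."""
    return all(r <= a for r, a in zip(request, available))


def subtract(available: Resource, request: Resource) -> Resource:
    """Remaining capacity after granting the request."""
    return Resource(*(a - r for a, r in zip(available, request)))


node = Resource(memory_mb=8192, vcores=4)
cpu_heavy = Resource(memory_mb=1024, vcores=4)
mem_heavy = Resource(memory_mb=4096, vcores=1)

print(fits(cpu_heavy, node))   # True
node = subtract(node, cpu_heavy)
print(node)                    # Resource(memory_mb=7168, vcores=0)
print(fits(mem_heavy, node))   # False
```

A memory-only scheduler would have granted the second request, since 7GB is still free, and oversubscribed the CPUs.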




YARN - Data Processing Applications

• OpenMPI on Hadoop
• Spark (UC Berkeley)
  –Shark is Hive-on-Spark
• Real-time data processing
  – Storm (Twitter)
  – Apache S4
• Graph processing – Apache Giraph




YARN - Beyond Data Processing Apps

• Apache HBase
  –Deployment via YARN (HBASE-4329)
  –Co-processors via YARN (HBASE-4047)
• Simple deployment for cluster services




MapReduce – Way Forward

• MapReduce Framework Runtime
   –Monolithic software
• MR Runtime?
   –Sort, Merge, Shuffle et al
• Unpack into smaller building blocks!
   –Allow applications and Pig/Hive to ‘plug-n-play’
   –MR framework, as we know today, becomes a particular
    configuration of the building blocks




MapReduce – Pluggable Sort

• Pig & Hive benefit from hash-based aggregation
  –Several queries don’t need full-sort of map-outputs
  –Aggregation suffices
  –Allow for pluggable MapOutputBuffer in MapTask
  –Sort Avoidance - MAPREDUCE-4039
  –External sort plugin – MAPREDUCE-2454
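
A small illustration of the point above, in plain Python rather than Hadoop code: for an aggregation such as SUM ... GROUP BY, a hash table gives the same totals as sort-then-group without ordering the map output. The function names and sample data are hypothetical.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def sort_based_sum(pairs):
    """Sort the whole map output by key, then sum each run of equal keys."""
    ordered = sorted(pairs, key=itemgetter(0))
    return {key: sum(value for _, value in group)
            for key, group in groupby(ordered, key=itemgetter(0))}


def hash_based_sum(pairs):
    """Aggregate in one pass with a hash table; no sort is needed."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


map_output = [("b", 2), ("a", 1), ("b", 3), ("a", 4), ("c", 5)]
print(sort_based_sum(map_output))  # {'a': 5, 'b': 5, 'c': 5}
print(hash_based_sum(map_output))  # {'b': 5, 'a': 5, 'c': 5}
```

Both produce the same totals; only the key order differs, which an aggregation query does not care about.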




MapReduce – Pluggable Shuffle

• Push vs. pull shuffle
• Plug shuffle implementation (already in hadoop-2)
   –E.g. RDMA for shuffle
   –MAPREDUCE-4049
• Collation tasks
   –Sailfish - Yahoo Research (includes auto-tuning of reduces)




MapReduce – More ideas

• Allow for Map-Reduce-Reduce
  –Allow for reduce output to be sorted/shuffled
  – JOIN followed by ORDER BY
  – Really big deal for Pig/Hive
• DAG Management for Pig/Hive
  – Scheduling improvements
  – Restart semantics
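
A toy sketch of Map-Reduce-Reduce for the JOIN followed by ORDER BY case above. The first reduce joins on user id and re-keys its output by amount; that output is shuffled again straight into a second reduce, with no intermediate map stage. This is an in-memory illustration with hypothetical data and function names, not Hadoop code.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group (key, value) pairs by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def join_reduce(user_id, tagged_values):
    """First reduce: join users with orders, emitting (amount, name)."""
    names = [v for tag, v in tagged_values if tag == "user"]
    amounts = [v for tag, v in tagged_values if tag == "order"]
    for name in names:
        for amount in amounts:
            yield amount, name  # re-keyed by amount for the ORDER BY


def order_reduce(amount, names):
    """Second reduce: keys arrive sorted, so emitting in order is enough."""
    for name in names:
        yield name, amount


users = [(1, ("user", "ann")), (2, ("user", "bob"))]
orders = [(1, ("order", 30)), (2, ("order", 10)), (1, ("order", 20))]

stage1 = [out for key, vals in shuffle(users + orders)
          for out in join_reduce(key, vals)]
stage2 = [out for key, vals in shuffle(stage1)
          for out in order_reduce(key, vals)]
print(stage2)  # [('bob', 10), ('ann', 20), ('ann', 30)]
```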




MapReduce – How do we get there?

• Multiple, concurrent implementations of MapReduce
  –YARN is a really big deal…
  –Allows for safe experiments, much less risky!
  –Exposure surface is highly limited




Questions?




Thank You.
@acmurthy





Apache Hadoop MapReduce - What Next? Hadoop Summit 2012

  • 1. Apache Hadoop MapReduce What next? Arun C. Murthy Founder & Architect @acmurthy (@hortonworks)
  • 2. Hello! I’m Arun • Founder/Architect at Hortonworks Inc. – Lead, Map-Reduce – Formerly, Architect Hadoop MapReduce, Yahoo – Responsible for running Hadoop MR as a service for all of Yahoo (50k nodes footprint) • Apache Hadoop, ASF – VP, Apache Hadoop, ASF (Chair of Apache Hadoop PMC) – Long-term Committer/PMC member (full time >6 years) – Release Manager for hadoop-2
  • 3. Agenda • Hadoop MapReduce, State of the Art • Hadoop YARN – Overview – State of the art • Art of the possible – YARN Runtime – MapReduce Framework • Q&A
  • 4. Hadoop MapReduce State of the Art
  • 5. Hadoop MapReduce Classic • JobTracker – Manages cluster resources and job scheduling • TaskTracker – Per-node agent – Manages tasks
  • 6. Hadoop 1 – Enterprise Ready • Hadoop 1.x is the most stable & reliable version of Hadoop MapReduce ever – Proven to be reliable at the most demanding Hadoop clusters in the world • CapacityScheduler for Multi-tenancy – Share clusters at scale – Resource & User limits for fine-grained – Queue & Job ACLs – Resilient to misbehaving/rogue applications, users etc., helping drive SLA for applications, pipelines etc.
  • 7. Hadoop 1 – Availability for MR • JobTracker Restart – Enhanced to restart all jobs on rare JT failures • JobTracker Safemode – Admin driven for known issues – Auto-monitoring of HDFS for full-stack availability
  • 8. Hadoop YARN Overview & Status Quo
  • 9. MapReduce - Areas for Improvement • Utilization • Scalability – Maximum Cluster size – 4,000 nodes – Maximum concurrent tasks – 40,000 • Hard partition of resources into map and reduce slots • Lacks support for alternate paradigms • Lack of wire-compatible protocols
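The utilization cost of the hard map/reduce slot partition on slide 9 can be illustrated with a small sketch. The node sizes and task counts below are hypothetical, chosen only to show the effect; neither class corresponds to a real Hadoop type.

```python
from dataclasses import dataclass


@dataclass
class SlotNode:
    """Hadoop 1 style node: capacity is split into fixed map and reduce slots."""
    map_slots: int
    reduce_slots: int

    def runnable(self, pending_maps: int, pending_reduces: int) -> int:
        # Each task type may only use slots of its own kind.
        return min(pending_maps, self.map_slots) + min(pending_reduces, self.reduce_slots)


@dataclass
class PooledNode:
    """Node whose capacity is one shared pool, usable by any task type."""
    total_slots: int

    def runnable(self, pending_maps: int, pending_reduces: int) -> int:
        return min(pending_maps + pending_reduces, self.total_slots)


# Hypothetical node with the same total capacity in both models (8 + 4 = 12).
slot_node = SlotNode(map_slots=8, reduce_slots=4)
pooled_node = PooledNode(total_slots=12)

# Map-heavy phase: 20 maps waiting, no reduces yet.
print(slot_node.runnable(20, 0))    # 8: the 4 reduce slots sit idle
print(pooled_node.runnable(20, 0))  # 12: the whole node is used
```

With a fixed split, a third of this hypothetical node is idle during the map phase, which is the utilization problem the slide points at.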
  • 10. Requirements • Reliability • Availability • Utilization • Wire Compatibility • Agility & Evolution – Ability for customers to control upgrades to the grid software stack. • Scalability - Clusters of 6,000-10,000 machines – Each machine with 16 cores, 48G/96G RAM, 24TB/36TB disks – 100,000+ concurrent tasks – 10,000 concurrent jobs
  • 11. Design Centre • Split up the two major functions of JobTracker – Cluster resource management – Application life-cycle management • MapReduce becomes user-land library
  • 12. Concepts • Application – Application is a job submitted to the framework – Example – Map Reduce Job • Container – Basic unit of allocation – Example – container A = 2GB, 1CPU – Replaces the fixed map/reduce slots
  • 13. Architecture • Resource Manager – Global resource scheduler – Hierarchical queues • Node Manager – Per-machine agent – Manages the life-cycle of container – Container resource monitoring • Application Master – Per-application – Manages application scheduling and task execution – E.g. MapReduce Application Master
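As a rough illustration of the container concept on slide 12, the sketch below models a container as a (memory, CPU) request and counts how many fit on one machine. The node size reuses the 16-core, 48G machine from slide 10. The `Resource` type and `containers_that_fit` helper are hypothetical names for this example, and treating CPU as a scheduled dimension is an assumption, since slide 20 lists multi-dimensional scheduling as roadmap work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    memory_mb: int
    vcores: int


def containers_that_fit(node: Resource, container: Resource) -> int:
    """How many identical containers a node can host; the scarcer dimension limits."""
    return min(node.memory_mb // container.memory_mb, node.vcores // container.vcores)


node = Resource(memory_mb=48 * 1024, vcores=16)       # machine from slide 10
container_a = Resource(memory_mb=2 * 1024, vcores=1)  # "container A" from slide 12
container_b = Resource(memory_mb=8 * 1024, vcores=2)  # hypothetical larger request

print(containers_that_fit(node, container_a))  # 16, limited by cores
print(containers_that_fit(node, container_b))  # 6, limited by memory
```

Unlike fixed slots, the same node hosts a different number of containers depending on what each application asks for.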
  • 14. Architecture [Diagram: Clients, a Resource Manager, and Node Managers hosting App Mstr and Container processes; arrows labelled MapReduce Status, Job Submission, Node Status, Resource Request]
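The division of labour on slides 13 and 14 can be sketched as a toy, in-process simulation. This is not the YARN API: the class and method names are invented for illustration, placement is simple first-fit, and only memory is tracked.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Per-machine agent: launches containers and tracks free memory."""
    name: str
    free_mb: int
    containers: list = field(default_factory=list)

    def launch(self, owner: str, mb: int) -> str:
        self.free_mb -= mb
        container_id = f"{self.name}/c{len(self.containers)}"
        self.containers.append((container_id, owner, mb))
        return container_id


class ResourceManager:
    """Global scheduler: knows node capacity and hands out containers."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, owner: str, mb: int):
        # First-fit placement; a real scheduler would apply queues and limits.
        for node in self.nodes:
            if node.free_mb >= mb:
                return node.launch(owner, mb)
        return None  # no capacity right now

    def submit(self, app_name: str, am_mb: int = 1024):
        # Job submission: the first container runs the application's own master.
        am_container = self.allocate(f"{app_name}:AM", am_mb)
        if am_container is None:
            raise RuntimeError("no capacity for the application master")
        return ApplicationMaster(app_name, self, am_container)


class ApplicationMaster:
    """Per-application: requests containers from the RM and runs tasks in them."""

    def __init__(self, app_name, rm, container_id):
        self.app_name, self.rm, self.container_id = app_name, rm, container_id

    def run_tasks(self, count: int, mb: int):
        granted = [self.rm.allocate(f"{self.app_name}:task", mb) for _ in range(count)]
        return [c for c in granted if c is not None]


nodes = [NodeManager("node1", 4096), NodeManager("node2", 4096)]
rm = ResourceManager(nodes)
am = rm.submit("wordcount")              # client -> RM: job submission
tasks = am.run_tasks(count=4, mb=2048)   # AM -> RM: resource requests
print(am.container_id)  # node1/c0
print(tasks)            # ['node1/c1', 'node2/c0', 'node2/c1']
```

The fourth task request is not granted because both toy nodes are full, so the application master, not the global scheduler, has to decide what to do next. That split is the point of the design.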
  • 15. How do I get it? • Available in hadoop-2.0.0-alpha release
  • 16. Performance • 2x+ across the board (HDFS, YARN, MapReduce) • MapReduce – Unlock lots of improvements from Terasort record (Owen/Arun, 2009) – Shuffle 30%+ – Merge improvements – Small Jobs – Uber AM – Re-use task slots (containers) http://hortonworks.com/delivering-on-hadoop-next-benchmarking-performance/
  • 17. Resources hadoop-2.0.0 (alpha release): http://hadoop.apache.org/common/releases.html Release Documentation: http://hadoop.apache.org/common/docs/r2.0.0-alpha/
  • 18. Art of the possible YARN Runtime MapReduce Framework
  • 19. Looking ahead • YARN – Runtime Improvements – Alternate programming models – Long(er) running services • MapReduce – Framework enhancements – Unpack!
  • 20. YARN - Roadmap • Scheduler – Multi-dimensional resource scheduling (MAPREDUCE-4327) – Preemption (MAPREDUCE-3938) – Gang scheduling • Runtime improvements – Container Isolation (MAPREDUCE-4334)
  • 21. YARN - Data Processing Applications • OpenMPI on Hadoop • Spark (UC Berkeley) – Shark is Hive-on-Spark • Real-time data processing – Storm (Twitter) – Apache S4 • Graph processing – Apache Giraph
  • 22. YARN - Beyond Data Processing Apps • Apache HBase – Deployment via YARN (HBASE-4329) – Co-processors via YARN (HBASE-4047) • Simple deployment for cluster services
  • 23. MapReduce – Way Forward • MapReduce Framework Runtime – Monolithic software • MR Runtime? – Sort, Merge, Shuffle et al • Unpack into smaller building blocks! – Allow applications and Pig/Hive to ‘plug-n-play’ – MR framework, as we know today, becomes a particular configuration of the building blocks
  • 24. MapReduce – Pluggable Sort • Pig & Hive benefit from hash-based aggregation – Several queries don’t need full-sort of map-outputs – Aggregation suffices – Allow for pluggable MapOutputBuffer in MapTask – Sort Avoidance – MAPREDUCE-4039 – External sort plugin – MAPREDUCE-2454
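Slide 24's claim that aggregation can suffice without a full sort can be shown with a minimal sketch. Both functions below compute the same per-key sums over a small, made-up map output; the function names are illustrative and are not Hadoop classes.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

# Hypothetical map output: (key, value) pairs in arrival order.
map_output = [("b", 2), ("a", 1), ("b", 3), ("c", 5), ("a", 4)]


def sort_then_aggregate(records):
    """Classic path: fully sort by key, then sum each run of equal keys."""
    ordered = sorted(records, key=itemgetter(0))
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(ordered, key=itemgetter(0))
    }


def hash_aggregate(records):
    """Sort-free path: a single pass with one hash-table entry per distinct key."""
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return dict(totals)


print(sort_then_aggregate(map_output))
print(hash_aggregate(map_output))
```

The sorted path costs O(n log n) comparisons and orders data that the query never needs ordered, while the hash path does one pass. A query such as a GROUP BY with SUM only needs the second.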
  • 25. MapReduce – Pluggable Shuffle • Push vs. Pull shuffle • Plug shuffle implementation (already in hadoop-2) – E.g. RDMA for shuffle – MAPREDUCE-4049 • Collation tasks – Sailfish - Yahoo Research (includes auto-tuning of reduces)
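One way to picture the push versus pull distinction on slide 25 is a shuffle hidden behind a small interface with two interchangeable implementations. The `Shuffle` interface below is hypothetical and far simpler than the plugin point referenced in MAPREDUCE-4049; it only shows where the data sits before the reducer reads it.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class Shuffle(ABC):
    """Hypothetical plug point: how map output reaches each reducer."""

    @abstractmethod
    def map_finished(self, map_id, partitions):
        """partitions maps reduce_id -> list of (key, value) records."""

    @abstractmethod
    def fetch(self, reduce_id):
        """Return all records destined for one reducer."""


class PullShuffle(Shuffle):
    """Map output stays on the map side; reducers fetch it when they run."""

    def __init__(self):
        self.map_outputs = {}

    def map_finished(self, map_id, partitions):
        self.map_outputs[map_id] = partitions

    def fetch(self, reduce_id):
        records = []
        for map_id in sorted(self.map_outputs):
            records.extend(self.map_outputs[map_id].get(reduce_id, []))
        return records


class PushShuffle(Shuffle):
    """Each map pushes its partitions to the reducer side as soon as it finishes."""

    def __init__(self):
        self.inbox = defaultdict(list)

    def map_finished(self, map_id, partitions):
        for reduce_id, records in partitions.items():
            self.inbox[reduce_id].extend(records)

    def fetch(self, reduce_id):
        return list(self.inbox[reduce_id])


def run(shuffle: Shuffle):
    # The surrounding job logic is identical whichever implementation is plugged in.
    shuffle.map_finished("m0", {0: [("a", 1)], 1: [("b", 2)]})
    shuffle.map_finished("m1", {0: [("a", 3)]})
    return shuffle.fetch(0), shuffle.fetch(1)


print(run(PullShuffle()))
print(run(PushShuffle()))
```

Both variants deliver the same records to each reducer. What changes is when the data moves and who initiates the transfer, which is why the shuffle can be swapped without touching the job.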
  • 26. MapReduce – More ideas • Allow for Map-Reduce-Reduce – Allow for reduce output to be sorted/shuffled – JOIN followed by ORDER BY – Really big deal for Pig/Hive • DAG Management for Pig/Hive – Scheduling improvements – Restart semantics
  • 27. MapReduce – How do we get there? • Multiple, concurrent implementations of MapReduce – YARN is a really big deal… – Allows for safe experiments, much less risky! – Exposure surface is highly limited
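The Map-Reduce-Reduce idea on slide 26 (JOIN followed by ORDER BY) can be sketched as one map stage feeding two chained reduce stages, with a shuffle before each reduce. The tables, helper names, and in-memory `shuffle` are all made up for illustration; in classic MapReduce the second reduce would need a separate job with a pass-through map in front of it.

```python
from collections import defaultdict

# Hypothetical input tables.
users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(1, 30), (2, 10), (1, 5), (3, 20)]


def shuffle(pairs):
    """Group (key, value) pairs by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def map_tag(users, orders):
    """Map: key both tables by user id and tag each record with its source."""
    for user_id, name in users:
        yield user_id, ("user", name)
    for user_id, amount in orders:
        yield user_id, ("order", amount)


def reduce_join(user_id, values):
    """First reduce: join per user, re-keying the output by order amount."""
    names = [v for tag, v in values if tag == "user"]
    amounts = [v for tag, v in values if tag == "order"]
    for name in names:
        for amount in amounts:
            yield amount, name


def reduce_emit(amount, names):
    """Second reduce: the shuffle already ordered by amount, so just emit rows."""
    for name in names:
        yield name, amount


joined = [row for key, values in shuffle(map_tag(users, orders))
          for row in reduce_join(key, values)]
ordered = [row for key, values in shuffle(joined)
           for row in reduce_emit(key, values)]
print(ordered)  # [('ann', 5), ('bob', 10), ('cy', 20), ('ann', 30)]
```

The output of the first reduce goes straight into a second shuffle, which is the step the slide proposes to allow inside a single job.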