Page1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Enabling diverse workload scheduling in YARN
June, 2015
Wangda Tan, Hortonworks, (wangda@apache.com)
Craig Welch, Hortonworks, (cwelch@hortonworks.com)
About us
Wangda Tan
• 5+ years in the big data field: Hadoop, Open-MPI, etc.
• Past
– Pivotal (PHD team, brought OpenMPI/GraphLab to YARN)
– Alibaba (ODPS team, platform for distributed data mining)
• Now
– Apache Hadoop Committer @Hortonworks, all in on YARN.
– Spending most of my time on resource scheduling enhancements.
Craig Welch
• YARN contributor
Hadoop+YARN is the home of
big data processing.
Our workloads vary:
Service | Batch | Interactive/real-time
They have different CRAZY requirements
– I wanna be fast!
– When the cluster is busy, don’t take away MY RESOURCES
– A huge job needs to be scheduled at a special time
We want to make them
AS HAPPY AS POSSIBLE
to run together in YARN.
Let’s start…
Agenda today
• Overview
• Node Label
• Resource Preemption
• Reservation system
• Pluggable behavior for Scheduler
• Docker support
• Resource scheduling beyond memory
Overview
Background
• Resources are managed by a hierarchy of queues.
• One queue can have multiple applications.
• A container is the result of resource scheduling: a bundle of resources that can run one or more processes.
How to manage your workload by queues
• By organization:
– Marketing/Finance queue
• By workload:
– Interactive/Batch queue
• Hybrid:
– Finance-batch/Marketing-realtime queue
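One way to picture the hybrid layout is as a tree in which every queue takes a percentage of its parent. The following is a minimal sketch only: the queue names and percentages are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Queue:
    """A node in a hypothetical queue tree; capacity is a percentage of the parent."""
    name: str
    capacity: float
    children: List["Queue"] = field(default_factory=list)


def absolute_capacities(queue: Queue, parent_share: float = 100.0,
                        path: str = "") -> Dict[str, float]:
    """Return {queue path: percentage of the whole cluster} for every queue."""
    full_path = f"{path}.{queue.name}" if path else queue.name
    share = parent_share * queue.capacity / 100.0
    result = {full_path: share}
    for child in queue.children:
        result.update(absolute_capacities(child, share, full_path))
    return result


# Hybrid layout: split by organization first, then by workload (made-up numbers).
root = Queue("root", 100, [
    Queue("finance", 60, [Queue("batch", 100)]),
    Queue("marketing", 40, [Queue("realtime", 50), Queue("batch", 50)]),
])

for queue_path, share in absolute_capacities(root).items():
    print(f"{queue_path}: {share:.0f}% of cluster")
```

Nesting workload queues under organization queues keeps each organization's total share fixed while still letting it split that share by workload.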
Node Label
Node Label – Overview
• Types of node labels
– Node partition (since 2.6)
– Node constraints (WIP)
• Node partition (today’s focus)
– One node belongs to only one partition
– Related to resource planning
• Node constraints
– One node can be assigned multiple constraints
– Not related to resource planning
Node partition – Resource planning
• Nodes belong to the “default partition” if no partition is specified.
• Queues can be given different capacities on different partitions.
– For example, the sales queue can be given a different share of the GPU partition than of the default partition.
• A partition can be restricted to certain queues (ACL for the partition).
– For example, only the sales queue can access the “large memory” partition.
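The planning rules above can be sketched as a small lookup table. This is an illustration under stated assumptions, not the Capacity Scheduler's data model: the partition names, the "engineering" queue, and all percentages are made up.

```python
# Hypothetical plan: capacity (in % of each partition) per queue.
# A queue that is missing from a partition has no access to it (the partition ACL).
PARTITION_CAPACITY = {
    "default": {"sales": 40, "engineering": 60},
    "gpu": {"sales": 70, "engineering": 30},
    "large-memory": {"sales": 100},
}


def can_access(queue: str, partition: str) -> bool:
    """True if the queue is allowed to use nodes in the partition."""
    return queue in PARTITION_CAPACITY.get(partition, {})


def guaranteed_memory_gb(queue: str, partition: str, partition_memory_gb: float) -> float:
    """Memory guaranteed to a queue on one partition, given the partition's size."""
    if not can_access(queue, partition):
        return 0.0
    return partition_memory_gb * PARTITION_CAPACITY[partition][queue] / 100.0


print(can_access("engineering", "large-memory"))
print(guaranteed_memory_gb("sales", "gpu", 1000))
```

Here the sales queue gets a different share on the GPU partition than on the default partition, and only sales can reach the large-memory nodes.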
Node partition – Exclusive vs. Non-exclusive
[Figure: Snake partition, Bear partition, and Default partition, showing how a resource request is treated by an exclusive partition and by a non-exclusive partition (“Use it when they're not at home”).]
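The difference between the two partition types can be sketched as an eligibility check. This is a simplified model, assuming that the empty string stands for a request that names no partition; the function and parameter names are hypothetical, and normal capacity checks are not modeled.

```python
DEFAULT_PARTITION = ""  # assumption: a request that names no partition


def may_use_partition(requested_partition: str, node_partition: str,
                      partition_is_exclusive: bool,
                      partition_has_idle_resources: bool) -> bool:
    """Decide whether a request is eligible for a node's partition (simplified)."""
    if requested_partition == node_partition:
        # The request asked for this partition explicitly.
        return True
    if requested_partition == DEFAULT_PARTITION and not partition_is_exclusive:
        # A non-exclusive partition lends out resources only while they are idle
        # ("use it when they're not at home").
        return partition_has_idle_resources
    # Exclusive partitions never serve requests that did not ask for them.
    return False


print(may_use_partition("", "snake", True, True))
print(may_use_partition("", "bear", False, True))
```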
Node Partition – Use cases & best practice
• Dedicate nodes to run important services:
– E.g. running HBase region servers using Apache Slider
• Nodes with special hardware in the cluster can be dedicated to organizations.
– E.g. you may want a queue dedicated to the marketing department to use 80% of these memory-heavy nodes.
• Use non-exclusive node partitions to improve resource utilization.
• Be careful about user limits, capacity, etc. to make sure jobs can be launched.
I will cover more details about implementation and usage in Thursday morning’s session “YARN Node Labels” with Mayank Bansal from eBay.
Resource Preemption
Resource Preemption – Overview
• Each queue has a configured minimum resource.
• The preemption policy, which performs the preemption, uses this minimum to ensure that:
– When a queue is below its “minimum resource” and the cluster has no available resources, the preemption policy can reclaim resources from other queues that use more than their minimum resource.
[Figure: queues A, B, and C, labeled 20%, 30%, and 50%.]
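The rule above can be sketched in a few lines. This is a simplified illustration, not the actual preemption policy: it assumes the 20% / 30% / 50% labels are the minimum resources of queues A, B, and C, and it splits the reclaimed amount in proportion to how far each queue is above its minimum, which is a choice made for this sketch.

```python
def preemption_plan(guaranteed, used, starved_queue, demand):
    """Return {queue: amount to preempt} so starved_queue can reach its minimum.

    guaranteed / used: {queue: share of the cluster in percent}
    demand: what the starved queue is asking for, in percent of the cluster.
    """
    # The starved queue is only entitled to what is missing up to its minimum.
    shortfall = min(demand, max(0.0, guaranteed[starved_queue] - used[starved_queue]))
    free = max(0.0, 100.0 - sum(used.values()))
    to_reclaim = max(0.0, shortfall - free)

    # Only queues running above their minimum are candidates.
    overage = {q: used[q] - guaranteed[q] for q in used
               if q != starved_queue and used[q] > guaranteed[q]}
    total_overage = sum(overage.values())
    if to_reclaim == 0.0 or total_overage == 0.0:
        return {}
    to_reclaim = min(to_reclaim, total_overage)
    # Take from each over-minimum queue in proportion to its overage.
    return {q: to_reclaim * over / total_overage for q, over in overage.items()}


guaranteed = {"A": 20.0, "B": 30.0, "C": 50.0}
used = {"A": 0.0, "B": 50.0, "C": 50.0}  # B borrowed A's idle share
print(preemption_plan(guaranteed, used, "A", demand=20.0))
```

In this scenario only B is above its minimum, so everything A is entitled to comes from B; C is untouched because it is exactly at its minimum.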
Resource Preemption – Example
• When preemption is not enabled
• When preemption is enabled
Resource Preemption – best practice
• Configurations to control the pace of preemption:
– yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill
– yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round
– yarn.resourcemanager.monitor.capacity.preemption.natural_termination_factor
• Configurations to control when or if preemption happens:
– yarn.resourcemanager.monitor.capacity.preemption.max_ignored_over_capacity (deadzone)
– yarn.scheduler.capacity.<queue-path>.disable_preemption
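How three of these settings interact can be sketched roughly as follows. This is a simplified, single-queue model written for illustration; the function name and the example values are assumptions, and max_wait_before_kill (the grace period before a container is killed) is not modeled.

```python
def amount_to_preempt(used: float, guaranteed: float, cluster_total: float,
                      max_ignored_over_capacity: float,
                      natural_termination_factor: float,
                      total_preemption_per_round: float) -> float:
    """Resources to reclaim from one queue in a single round (simplified)."""
    # Deadzone: a queue only slightly over its guarantee is left alone.
    if used <= guaranteed * (1.0 + max_ignored_over_capacity):
        return 0.0
    delta = used - guaranteed
    # Reclaim only part of the delta; containers finishing on their own cover the rest.
    target = delta * natural_termination_factor
    # Never preempt more than this fraction of the cluster in one round.
    return min(target, cluster_total * total_preemption_per_round)


# Illustrative values only: a 1000-unit cluster, queue guaranteed 100 units.
for used in (105, 200, 600):
    print(used, amount_to_preempt(used, 100, 1000, 0.1, 0.5, 0.1))
```

A larger deadzone reduces thrashing around the target, while the per-round cap and the termination factor trade reclaim speed against disruption to running jobs.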
Reservation System
Reservation System – Overview
• Reserving resources ahead of time
– Just like reserving a table at a restaurant
– “I need a table for X people at Y time”
– “Wait a moment … Reservation confirmed, sir”
– (After some time) “Your table is ready”
• What the Reservation System does:
– Send a reservation request
– RM checks the time table
– Send back a reservation confirmation ID
– Notify when ready
• Enables more predictable start and run times for time-critical / resource-intensive applications
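The request, check, and confirm steps can be sketched as a toy time table. This is a hypothetical model for illustration only: the class, the slot granularity, and the ID format are assumptions and do not reflect the actual YARN reservation API.

```python
import itertools


class ReservationPlan:
    """Toy time table: containers already promised per time slot."""

    def __init__(self, total_containers: int):
        self.total = total_containers
        self.committed = {}  # time slot -> containers already promised
        self._ids = itertools.count(1)

    def reserve(self, containers: int, start: int, end: int):
        """Try to reserve `containers` for slots [start, end); return an ID or None."""
        slots = range(start, end)
        if any(self.committed.get(t, 0) + containers > self.total for t in slots):
            return None  # the time table is full somewhere in the window
        for t in slots:
            self.committed[t] = self.committed.get(t, 0) + containers
        return f"reservation_{next(self._ids)}"


plan = ReservationPlan(total_containers=100)
print(plan.reserve(60, start=1, end=3))  # accepted
print(plan.reserve(60, start=2, end=4))  # rejected: slot 2 would exceed 100
print(plan.reserve(40, start=2, end=4))  # accepted: fits next to the first one
```

Because the check happens at request time, a job that gets a confirmation ID knows in advance that its capacity will be there when its window starts.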
Reservation System – Use cases
• Gang scheduling
– Currently, YARN can do gang scheduling from the application side (holding resources until requirements are met)
– Resources can be wasted, and there is a risk of deadlock.
– RS lays the foundation for gang scheduling
• Workflow support
– I want to run jobs in stages
– Stage 1 at 1 AM tomorrow, needs 10k containers
– Stage 2 after stage 1, needs 5k containers
– Stage 3 after stage 2, needs 2k containers
– You can submit such requests to RS!
Reservation System – Result & References
• Before & after Reservation System (reports from MSR)
– It increased cluster utilization a lot!
• References
– Design / discussion / report: YARN-1051
– More detail about example: YARN-2609
Pluggable scheduler behavior
Why
• Problem
– It’s difficult to share functionality between schedulers
– Users cannot achieve the same behavior with all schedulers
– Fixes and enhancements tend to end up in one scheduler, not all, leading to fragmentation
– No simple mechanism exists to mix behaviors for a given feature in a single cluster
• Solution
– Move to sharable, pluggable scheduler behavior
How
• The goal
– Recast scheduler behavior as policies. Candidates include:
– Resource limits for apps, users...
– Ordering for allocation and preemption
• With this, we can:
– Maximize feature availability and reduce fragmentation
– Configure different queues for different workloads in a single cluster
Flexible scheduler configuration, as simple as building with Legos!
Ordering Policy of Capacity Scheduler
• Pluggable ordering policies for LeafQueues in Capacity Scheduler
– Enables the implementation of different policies for ordering assignment and preemption of containers for applications
– Initial implementations include FIFO (Capacity Scheduler original behavior) and Fair
– User limits and queue capacity limits are still respected
• Fair scheduling inside Capacity Scheduler
– Based on the Fair Sharing logic in FairScheduler
– Assigns containers to applications in order of least to greatest resource usage
– Allows many applications to make progress concurrently
– Lets short jobs finish in reasonable time while not starving long-running jobs
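The difference between the two orderings can be sketched as two sort keys. This is a minimal illustration, not the scheduler's code: the App fields, the memory-only usage measure, and the tie-break on submission order are assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    submit_order: int    # lower means submitted earlier
    used_memory_mb: int  # resources the app currently holds


def fifo_order(apps):
    """FIFO: earlier submissions are offered containers first."""
    return sorted(apps, key=lambda a: a.submit_order)


def fair_order(apps):
    """Fair: the app with the least resource usage is offered containers first."""
    return sorted(apps, key=lambda a: (a.used_memory_mb, a.submit_order))


apps = [
    App("long-etl", submit_order=1, used_memory_mb=8192),
    App("adhoc-query", submit_order=2, used_memory_mb=1024),
    App("report", submit_order=3, used_memory_mb=0),
]
print([a.name for a in fifo_order(apps)])
print([a.name for a in fair_order(apps)])
```

Under the fair ordering the newly submitted small job is served first, which is why short jobs can finish in reasonable time even when a long job arrived earlier.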
Configuration and tuning
• Rough guidelines for when to use Fair and FIFO ordering policies:
Workloads | Policy
On-demand/interactive/exploratory | Fair
Predictable/recurring batch | FIFO
Mix of above two | Fair
• Configuration
– yarn.scheduler.capacity.<queue>.ordering-policy (“fifo” or “fair”, default “fifo”)
– yarn.scheduler.capacity.<queue>.ordering-policy.fair.enable-size-based-weight (true or false)
• Tuning
– Use max-am-resource-percent to avoid “peanut buttering” from having too many apps running at once
– Sometimes it’s necessary to separate large and small apps into different queues, or use size-based-weight, to avoid large app starvation
Docker container support
Docker container support – Overview
• Containers for the cluster
– Brings the sandboxing and dependency isolation of container technology to Hadoop
– Containers make it simple to use Hadoop resources for a wider range of applications
Docker container support – Status
• Done
– (V1) An initial implementation, translating Kubernetes to an Application Master that launches Docker containers from the cluster, met with success.
– (V2) A custom container launcher for Docker containers. This brought the capability more fully under the management of YARN, but a single cluster could not support both traditional YARN applications (MapReduce, etc.) and Docker concurrently.
• Next phase
– (V3) WIP: adding support for running Docker and traditional YARN applications side by side in a single cluster
It’s not all about memory
It’s not all about Memory - CPU
• What’s in a CPU
– Some workloads are CPU intensive. Without accounting for this, nodes may end up CPU bound, or CPU may be underutilized cluster-wide.
– CPU awareness at the scheduler level is enabled by selecting the DominantResourceCalculator.
– Dominant? “Dominant” stands for the “dominant factor”, or the “bottleneck”. In simplified terms, the resource type that is most constrained becomes the dominant factor for any given comparison or calculation.
– For example, if there is enough memory but not enough CPU for a resource request, the CPU component is dominant (and the answer is “No”).
– See https://www.cs.berkeley.edu/~alig/papers/drf.pdf for more detail
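The "dominant factor" idea can be sketched with two resource types. This is an illustration of the concept only, not the DominantResourceCalculator implementation; the function names and the cluster and request sizes are made up.

```python
def dominant_share(resource, cluster):
    """Largest fraction of any cluster resource that `resource` represents."""
    return max(resource[k] / cluster[k] for k in cluster)


def dominant_resource(resource, cluster):
    """Name of the resource type that is most constrained for `resource`."""
    return max(cluster, key=lambda k: resource[k] / cluster[k])


def fits(request, available):
    """A request fits only if every resource type fits."""
    return all(request[k] <= available[k] for k in request)


cluster = {"memory_mb": 100_000, "vcores": 100}
request = {"memory_mb": 4_000, "vcores": 8}

# 8 of 100 vcores is a larger fraction than 4,000 of 100,000 MB.
print(dominant_resource(request, cluster))
print(dominant_share(request, cluster))

# Enough memory but not enough CPU: the answer is "No".
print(fits(request, {"memory_mb": 50_000, "vcores": 4}))
```

Comparing requests by their dominant share, rather than by memory alone, is what keeps a CPU-heavy workload from looking "small" to the scheduler.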
It’s not all about Memory – CPU - Vcores
• What’s in a CPU
– The unit used to abstract CPU capability in YARN is the vcore.
– Vcore counts are configured per node in yarn-site.xml, typically one vcore per physical CPU.
– If some nodes’ CPUs outclass other nodes’, the number of vcores per physical CPU can be adjusted upward to compensate.
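The adjustment is simple arithmetic, sketched below. The helper name and the 1.5 ratio are hypothetical; the resulting number is what would be set for that node in yarn-site.xml (the yarn.nodemanager.resource.cpu-vcores property).

```python
def node_vcores(physical_cores: int, vcores_per_core: float = 1.0) -> int:
    """vcores to advertise for a node; the ratio is a per-node tuning choice."""
    return int(physical_cores * vcores_per_core)


# A standard node: one vcore per physical core.
print(node_vcores(16))
# A node with faster CPUs, assumed here to be worth 1.5 vcores per core.
print(node_vcores(16, vcores_per_core=1.5))
```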
Q & A
?

 
Apache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the unionApache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the union
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step Beyond
 
Driving Enterprise Data Governance for Big Data Systems through Apache Falcon
Driving Enterprise Data Governance for Big Data Systems through Apache FalconDriving Enterprise Data Governance for Big Data Systems through Apache Falcon
Driving Enterprise Data Governance for Big Data Systems through Apache Falcon
 
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015 Data Governance in Apache Falcon - Hadoop Summit Brussels 2015
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015
 
Debugging Apache Hadoop YARN Cluster in Production
Debugging Apache Hadoop YARN Cluster in ProductionDebugging Apache Hadoop YARN Cluster in Production
Debugging Apache Hadoop YARN Cluster in Production
 
Apache Hadoop YARN: best practices
Apache Hadoop YARN: best practicesApache Hadoop YARN: best practices
Apache Hadoop YARN: best practices
 
Apache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and FutureApache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and Future
 
YARN - Past, Present, & Future
YARN - Past, Present, & FutureYARN - Past, Present, & Future
YARN - Past, Present, & Future
 
Apache Hadoop 3.0 What's new in YARN and MapReduce
Apache Hadoop 3.0 What's new in YARN and MapReduceApache Hadoop 3.0 What's new in YARN and MapReduce
Apache Hadoop 3.0 What's new in YARN and MapReduce
 
A Multi Colored YARN
A Multi Colored YARNA Multi Colored YARN
A Multi Colored YARN
 
Apache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the unionApache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the union
 
Running Services on YARN
Running Services on YARNRunning Services on YARN
Running Services on YARN
 
The Unbearable Lightness of Ephemeral Processing
The Unbearable Lightness of Ephemeral ProcessingThe Unbearable Lightness of Ephemeral Processing
The Unbearable Lightness of Ephemeral Processing
 
Bikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarnBikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarn
 

Mehr von DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Mehr von DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

June 10 145pm hortonworks_tan & welch_v2

  • 1. Page1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Enabling diverse workload scheduling in YARN June, 2015 Wangda Tan, Hortonworks, (wangda@apache.com) Craig Welch, Hortonworks, (cwelch@hortonworks.com)
  • 2. About us
      Wangda Tan
      – 5+ years in the big data field (Hadoop, Open MPI, etc.)
      – Past: Pivotal (PHD team, brought OpenMPI/GraphLab to YARN); Alibaba (ODPS team, platform for distributed data mining)
      – Now: Apache Hadoop committer at Hortonworks, all in on YARN, spending most of his time on resource scheduling enhancements
      Craig Welch
      – YARN contributor
  • 3. Hadoop+YARN is the home of big data processing.
  • 4. Our workloads vary: service, batch, interactive/real-time.
  • 5. They have different CRAZY requirements: "I wanna be fast!"; "When the cluster is busy, don’t take away MY RESOURCES"; "A huge job needs to be scheduled at a special time".
  • 6. We want to make them AS HAPPY AS POSSIBLE to run together in YARN.
  • 7. Let’s start…
  • 8. Agenda today
      – Overview
      – Node Label
      – Resource Preemption
      – Reservation system
      – Pluggable behavior for Scheduler
      – Docker support
      – Resource scheduling beyond memory
  • 9. Overview
  • 10. Background
      – Resources are managed by a hierarchy of queues.
      – One queue can have multiple applications.
      – A container is the result of resource scheduling: a bundle of resources that can run one or more processes.
  • 11. How to manage your workload by queues
      – By organization: Marketing/Finance queues
      – By workload: Interactive/Batch queues
      – Hybrid: Finance-batch/Marketing-realtime queues
  • 12. Node Label
  • 13. Node Label – Overview
      – Types of node labels: node partition (since 2.6) and node constraints (WIP)
      – Node partition (today’s focus): one node belongs to only one partition; related to resource planning
      – Node constraints: one node can be assigned multiple constraints; not related to resource planning
  • 14. Node partition – Resource planning
      – Nodes belong to the “default partition” if no partition is specified.
      – Queues can be given different capacities on different partitions. For example, the sales queue can have one capacity on the GPU partition and another on the default partition.
      – A partition can be restricted to certain queues (an ACL for the partition). For example, only the sales queue can access the “large memory partition”.
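The per-partition capacities described on this slide can be illustrated with a small model. This is only a sketch: the partition sizes, queue names and percentages are made up for illustration, and the functions `guaranteed_gb` and `can_access` are hypothetical helpers, not part of any YARN API.

```python
# Hypothetical cluster: memory (GB) available in each node partition.
PARTITION_TOTAL_GB = {"default": 1000, "gpu": 200, "large_memory": 400}

# Hypothetical queue capacities, as a percentage of each partition.
# A queue that does not list a partition has no access to it (the "ACL").
QUEUE_CAPACITY_PCT = {
    "sales": {"default": 40, "gpu": 100, "large_memory": 100},
    "marketing": {"default": 60},
}


def can_access(queue, partition):
    """True if the queue is allowed to use nodes in the partition."""
    return partition in QUEUE_CAPACITY_PCT[queue]


def guaranteed_gb(queue, partition):
    """Memory guaranteed to the queue on the partition (0 without access)."""
    pct = QUEUE_CAPACITY_PCT[queue].get(partition, 0)
    return PARTITION_TOTAL_GB[partition] * pct / 100


if __name__ == "__main__":
    for queue in QUEUE_CAPACITY_PCT:
        for partition in PARTITION_TOTAL_GB:
            print(queue, partition, guaranteed_gb(queue, partition))
```

The point of the model is that a queue's capacity is not one number: it is looked up per partition, and a missing entry doubles as the access restriction.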
  • 15. Node partition – Exclusive vs. Non-exclusive
      [Figure labels: Snake partition, Bear partition, Default partition; Exclusive partition; Non-exclusive partition (“Use it when they're not at home”); Resource Request]
  • 16. Node Partition – Use cases & best practice
      – Dedicate nodes to running important services, e.g. HBase region servers run with Apache Slider.
      – Let specific organizations use the nodes with special hardware, e.g. you may want a queue dedicated to the marketing department to use 80% of the memory-heavy nodes.
      – Use non-exclusive node partitions to improve resource utilization.
      – Be careful with user limits, capacities, etc. to make sure jobs can be launched.
      I will cover more details about implementation & usage in Thursday morning’s session “YARN Node Labels” with Mayank Bansal from eBay.
  • 17. Resource Preemption
  • 18. Resource Preemption – Overview
      – Each queue has a configured minimum resource.
      – Because of that minimum, the preemption policy (the component that preempts resources) is used to ensure that when a queue is below its “minimum resource” and the cluster has no available resources, the policy can take resources back from other queues that use more than their own minimum.
      [Figure: queues A, B, C with 20%, 30%, 50%]
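The rule on this slide can be sketched as a small calculation. The 20/30/50 split comes from the slide; the usage and pending numbers, and the function `plan_preemption` itself, are assumptions made for illustration. It is a simplified model of the idea, not the policy YARN actually implements.

```python
def plan_preemption(guaranteed, used, pending):
    """Return {queue: amount to reclaim} for queues above their minimum.

    All three arguments map queue name -> resource units (e.g. percent of
    the cluster). Resources are reclaimed only when some queue is below its
    guaranteed minimum, still has demand, and the cluster has nothing free.
    """
    total = sum(guaranteed.values())
    free = total - sum(used.values())

    # How far each queue is below min(guarantee, what it actually wants).
    shortfall = {
        q: max(0, min(guaranteed[q], used[q] + pending[q]) - used[q])
        for q in guaranteed
    }
    needed = max(0, sum(shortfall.values()) - free)

    # Only usage above the guaranteed minimum may be taken back.
    surplus = {q: max(0, used[q] - guaranteed[q]) for q in guaranteed}
    total_surplus = sum(surplus.values())
    if needed == 0 or total_surplus == 0:
        return {}

    needed = min(needed, total_surplus)
    # Spread the reclaim across over-capacity queues in proportion to surplus.
    return {q: needed * s / total_surplus for q, s in surplus.items() if s > 0}


if __name__ == "__main__":
    guaranteed = {"A": 20, "B": 30, "C": 50}
    used = {"A": 0, "B": 50, "C": 50}      # B borrowed A's idle share
    pending = {"A": 20, "B": 0, "C": 0}    # A now wants its minimum back
    print(plan_preemption(guaranteed, used, pending))
```

In this example queue C sits exactly at its minimum, so nothing is taken from it; only queue B, which is over its minimum, gives resources back.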
  • 19. Resource Preemption – Example
      [Figures: when preemption is not enabled; when preemption is enabled]
  • 20. Resource Preemption – best practice
      – Configurations to control the pace of preemption:
          yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill
          yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round
          yarn.resourcemanager.monitor.capacity.preemption.natural_termination_factor
      – Configurations to control when or if preemption happens:
          yarn.resourcemanager.monitor.capacity.preemption.max_ignored_over_capacity (dead zone)
          yarn.scheduler.capacity.<queue-path>.disable_preemption
  • 21. Reservation System
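How three of these settings interact can be sketched roughly as follows. This is a deliberately simplified, single-queue reading of the knobs: the parameter names mirror the property suffixes above, but the function `preemptable_this_round`, the way the per-round cap is applied to one queue, and all numeric values are assumptions for illustration rather than YARN's actual algorithm or defaults.

```python
def preemptable_this_round(guaranteed, used, cluster_total,
                           max_ignored_over_capacity=0.1,
                           natural_termination_factor=0.2,
                           total_preemption_per_round=0.1):
    """Estimate how much may be reclaimed from one queue in one round.

    - Dead zone: usage within (1 + max_ignored_over_capacity) of the
      guarantee is ignored, which avoids churn around the boundary.
    - natural_termination_factor: only a fraction of the overage is claimed
      per round, on the assumption that some containers finish by themselves.
    - total_preemption_per_round: caps the round at a fraction of the cluster
      (simplification: applied here to a single queue).
    """
    if used <= guaranteed * (1 + max_ignored_over_capacity):
        return 0.0
    overage = used - guaranteed
    claim = overage * natural_termination_factor
    return min(claim, cluster_total * total_preemption_per_round)


if __name__ == "__main__":
    # Queue guaranteed 30 units in a 100-unit cluster.
    print(preemptable_this_round(30, 32, 100))  # inside the dead zone
    print(preemptable_this_round(30, 50, 100))  # well over capacity
```

A larger dead zone or a smaller termination factor makes preemption gentler and slower; the per-round cap bounds how disruptive any single round can be.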
  • 22. Reservation System – Overview
      – Reserving resources ahead of time is just like booking a table in a restaurant:
          “I need a table for X people at Y time”
          “Wait a moment … Reservation confirmed, sir”
          (After some time) “Your table is ready”
      – What the Reservation System does:
          The client sends a reservation request
          The RM checks the timetable
          The RM sends back a reservation confirmation ID
          The client is notified when the resources are ready
      – Enables more predictable start and run times for time-critical / resource-intensive applications
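The request / timetable check / confirmation ID steps above can be sketched with a toy timetable. Everything here is hypothetical: the class `ReservationPlan`, its hourly slots and the ID format are invented for illustration and do not correspond to the real YARN reservation API.

```python
class ReservationPlan:
    """Toy timetable: a fixed number of containers per time slot."""

    def __init__(self, capacity, slots):
        self.capacity = capacity          # containers available per slot
        self.reserved = [0] * slots       # containers already promised
        self.accepted = {}                # reservation id -> (start, duration, containers)
        self._next_id = 1

    def submit(self, start, duration, containers):
        """Return a confirmation ID, or None if the timetable cannot fit it."""
        end = start + duration
        if start < 0 or duration <= 0 or end > len(self.reserved):
            return None
        # "RM checks the timetable": every slot in the window must have room.
        if any(self.reserved[t] + containers > self.capacity
               for t in range(start, end)):
            return None
        for t in range(start, end):
            self.reserved[t] += containers
        reservation_id = f"reservation_{self._next_id}"
        self._next_id += 1
        self.accepted[reservation_id] = (start, duration, containers)
        return reservation_id


if __name__ == "__main__":
    plan = ReservationPlan(capacity=100, slots=24)
    print(plan.submit(start=1, duration=2, containers=80))  # accepted
    print(plan.submit(start=2, duration=1, containers=30))  # rejected: slot 2 is too full
    print(plan.submit(start=3, duration=1, containers=30))  # accepted
```

Because admission is decided against the whole time window up front, an accepted request knows in advance that its resources will be there at the start time.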
  • 23. Reservation System – Use cases
      – Gang scheduling
          Currently, YARN can only do gang scheduling from the application side (the application holds resources until it has enough to meet its requirements).
          Resources can be wasted this way, and there is a risk of deadlocks.
          The Reservation System (RS) lays the foundation for gang scheduling.
      – Workflow support
          I want to run jobs in stages:
          Stage 1 at 1 AM tomorrow, needs 10k containers
          Stage 2 after stage 1, needs 5k containers
          Stage 3 after stage 2, needs 2k containers
          You can submit such requests to the RS!
  • 24. Reservation System – Result & References
      – Before & after the Reservation System (reports from MSR): it increased cluster utilization a lot!
      – References
          Design / Discussion / Report: YARN-1051
          More detail about the example: YARN-2609
  • 25. Pluggable scheduler behavior
  • 26. Why
      – Problem
          It’s difficult to share functionality between schedulers
          Users cannot achieve the same behavior with all schedulers
          Fixes and enhancements tend to end up in one scheduler, not all, leading to fragmentation
          No simple mechanism exists to mix behaviors for a given feature in a single cluster
      – Solution
          Move to sharable, pluggable scheduler behavior
  • 27. How
      – The goal: recast scheduler behavior as policies. Candidates include:
          Resource limits for apps, users...
          Ordering for allocation and preemption
      – With this, we can:
          Maximize feature availability and reduce fragmentation
          Configure different queues for different workloads in a single cluster
      Flexible scheduler configuration, as simple as building with Legos!
  • 28. Ordering Policy of Capacity Scheduler
      – Pluggable ordering policies for LeafQueues in Capacity Scheduler
          Enables the implementation of different policies for ordering assignment and preemption of containers for applications
          Initial implementations include FIFO (Capacity Scheduler original behavior) and Fair
          User limits and queue capacity limits are still respected
      – Fair scheduling inside Capacity Scheduler
          Based on the fair sharing logic in FairScheduler
          Assigns containers to applications in order of least to greatest resource usage
          Allows many applications to make progress concurrently
          Lets short jobs finish in reasonable time while not starving long-running jobs
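The difference between the two orderings can be shown in a few lines. This is a minimal sketch of the idea only: the `App` record and both functions are hypothetical, and the real policies also account for user limits, queue capacity and (optionally) size-based weights.

```python
from dataclasses import dataclass


@dataclass
class App:
    app_id: int    # lower id = submitted earlier
    used_mb: int   # memory currently held by the application


def fifo_order(apps):
    """FIFO: offer containers to the earliest-submitted application first."""
    return sorted(apps, key=lambda a: a.app_id)


def fair_order(apps):
    """Fair: offer containers to the application using the least first.

    Ties fall back to submission order so the result is deterministic.
    """
    return sorted(apps, key=lambda a: (a.used_mb, a.app_id))


if __name__ == "__main__":
    apps = [App(1, 8192), App(2, 1024), App(3, 0)]
    print([a.app_id for a in fifo_order(apps)])
    print([a.app_id for a in fair_order(apps)])
```

Under the fair ordering the newly submitted application with no resources is served first, which is why short jobs can make progress while a large, earlier job is still running.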
  • 29. Configuration and tuning
      – Configuration
          yarn.scheduler.capacity.<queue>.ordering-policy (“fifo” or “fair”, default “fifo”)
          yarn.scheduler.capacity.<queue>.ordering-policy.fair.enable-size-based-weight (true or false)
      – Tuning
          Use max-am-resource-percent to avoid “peanut buttering” from having too many apps running at once
          Sometimes it’s necessary to separate large and small apps into different queues, or to use size-based-weight, to avoid large-app starvation
      – Rough guidelines for when to use Fair and FIFO ordering policies

          Workloads                              Policy
          On-demand/interactive/exploratory      Fair
          Predictable/recurring batch            FIFO
          Mix of the above two                   Fair

  • 30. Docker container support
  • 31. Docker container support – Overview
      – Containers for the cluster
          Brings the sandboxing and dependency isolation of container technology to Hadoop
          Containers make it simple to use Hadoop resources for a wider range of applications
  • 32. Docker container support – Status
      – Done
          (V1) The initial implementation, translating Kubernetes to an Application Master that launches Docker containers from the cluster, met with success.
          (V2) A custom container launcher for Docker containers. This brought the capability more fully under the management of YARN, but a single cluster could not support both traditional YARN applications (MapReduce, etc.) and Docker concurrently.
      – Next phase
          (V3) WIP: adding support for running Docker and traditional YARN applications side by side in a single cluster
  • 33. It’s not all about memory
  • 34. It’s not all about Memory – CPU
      – What’s in a CPU
          Some workloads are CPU intensive. Without accounting for this, nodes may end up CPU bound, or CPU may be underutilized cluster-wide.
          CPU awareness at the scheduler level is enabled by selecting the DominantResourceCalculator.
          Dominant? “Dominant” stands for the “dominant factor”, or the “bottleneck”. In simplified terms, the resource type that is the most constrained becomes the dominant factor for any given comparison or calculation.
          For example, if there is enough memory but not enough CPU for a resource request, the CPU component is dominant (and the answer is “No”).
          See https://www.cs.berkeley.edu/~alig/papers/drf.pdf for more detail.
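The “dominant factor” idea can be made concrete with two small functions. This is a sketch of the concept only: the resource names, the numbers and the helpers `dominant_share` and `fits` are assumptions for illustration and are not the DominantResourceCalculator implementation.

```python
def dominant_share(usage, cluster):
    """Largest fraction of any single resource type that `usage` consumes."""
    return max(usage[r] / cluster[r] for r in cluster)


def dominant_resource(usage, cluster):
    """Name of the resource type that is most constrained (the bottleneck)."""
    return max(cluster, key=lambda r: usage[r] / cluster[r])


def fits(request, available):
    """A request fits only if every resource type fits, not just memory."""
    return all(request[r] <= available[r] for r in request)


if __name__ == "__main__":
    cluster = {"memory_mb": 16384, "vcores": 8}
    app = {"memory_mb": 4096, "vcores": 6}
    # 25% of memory but 75% of vcores, so CPU is the dominant resource.
    print(dominant_resource(app, cluster), dominant_share(app, cluster))

    # Plenty of memory, not enough CPU: the answer is "No".
    print(fits({"memory_mb": 2048, "vcores": 4},
               {"memory_mb": 8192, "vcores": 2}))
```

A memory-only comparison would have accepted the last request; taking the most constrained resource into account is what prevents nodes from becoming CPU bound.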
  • 35. It’s not all about Memory – CPU – Vcores
      – What’s in a CPU
          The unit used to abstract CPU capability in YARN is the vcore.
          Vcore counts are configured per node in yarn-site.xml, typically 1 vcore per physical CPU.
          If some nodes’ CPUs outclass other nodes’, the number of vcores per physical CPU can be adjusted upward to compensate.
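The adjustment described above is simple arithmetic, sketched here under stated assumptions: the helper `node_vcores` and the 1.5 ratio are hypothetical, and the resulting number is what an operator would then set per node (in Apache Hadoop, typically via `yarn.nodemanager.resource.cpu-vcores` in yarn-site.xml).

```python
def node_vcores(physical_cores, vcores_per_core=1.0):
    """Vcores to advertise for a node.

    vcores_per_core is 1.0 for the baseline hardware and can be raised for
    nodes whose CPUs are faster than the baseline.
    """
    return int(physical_cores * vcores_per_core)


if __name__ == "__main__":
    print(node_vcores(16))        # baseline node, 1 vcore per physical core
    print(node_vcores(16, 1.5))   # faster node, advertised as 1.5 vcores per core
```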
  • 36. Q & A