SlideShare ist ein Scribd-Unternehmen logo
1 von 38
Downloaden Sie, um offline zu lesen
HADOOP ROBOT
Action and Remediation Center for eBay's Hadoop Clusters
Polo Li | 李健
eBay Cloud Services
About me
2
Agenda
• Hadoop @eBay
• Problem Statement
• What is Hadoop Robot
• Q&A
3
4
Hadoop @eBay
1-10 nodes
2007
100+ nodes
1000 + core
1 PB
2010
2011
1000+ node
10,000+ core
10+ PB
4000+ node
40,000+ core
50+ PB
2013
2015
10,000+ nodes
150,000+ cores
150+ PB
2009
10+ nodes
Hadoop @eBay
•10+ large Hadoop Clusters
•10,000+ nodes
•50,000+ jobs per day
•50,000,000+ tasks per day
5
Shared vs Dedicated Clusters
• Shared clusters
• Used primarily for analytics of user behavior and inventory
• Mix of batch and ad-hoc jobs
• Mix of MR, YARN, Hive, Pig, Cascading, etc.
• Hadoop and HBase security enabled
• Dedicated clusters
• Very specific use case like index building
• Tight SLAs for jobs (in order of minutes)
• Immediate revenue impact
• Usually smaller than our shared clusters, but still large (600+ nodes)
6
7
Hadoop Platform Stack
Agenda
• Hadoop @eBay
• Problem Statement
• What is Hadoop Robot
• Q&A
8
Problem Statements
• Long trouble shooting time
• Bad cluster performance
• Too many different skus, operating systems and metadatas
• Human resource Cost
• Cluster Availability
9
Traditional Trouble Shooting Pipeline
Step 1
• Check failed application task logs to find out the suspicious hadoop nodes
Step 2
• Check the suspicious hadoop node hardware && system status
Step 3
• Check hadoop metrics and hadoop daemon logs
Step 4
• Check hadoop source code
10
Victim or Perpetrator ?
Sometimes you think you've found the perpetrators, however it may turn out to be
the victim.
11
What may impact cluster performance?
• Hardware
• System
• Hadoop
• JVM
• …
12
Hardware issues
84%
3%
4%
4%
1%
1%
1%
2%
Hardware Issues distribution
Disk
Nic
Memory
Cpu
Chassis
Fan
Mobo
PSU
13
System issues
•High load
•Node reboot
•Disk full
•Network saturated fully
•OOM
•Kernel bug
•Orphan processes
•…
14
Hadoop Issues
•Hot spot
•Low hdfs locality
•High RPC call queue length
•Hadoop configuration inconsistent issues
•Big resource consuming applications
•Big log/hdfs output applications
•Bad Application scheduling
•…
15
About hadoop node
16
NODE
SKU
OS
META
DATA
Different SKU
•Cpu core
•Memory size
•Disk number
•Disk size
•Nic Speed
•…
17
Different OS
•Image profile
•Image version
•Patches
•…
18
Different Metadata
• Hadoop Metadata
•Daemons
•Packages
•Configurations
• OS Metadata
•Services
•Scripts/Tools
•Configurations
19
Human Resource Cost
• Detect node (varies based on admin
experiences)
• Node decommission (1-2 Hours)
• Vendor offline remediation (3-5 Days)
• Node reimage(45 Mins)
• Hadoop Installation and OS
configuration (15 Mins)
• Node restart (5 Mins)
• Disk burning test (6-8 Hours)
• Node health verification (15 Mins)
• Node recommission (15 Mins)
20
Cluster Availability
21
• Improve Availability
• Increase Live data node percentage
• Reduce job failures due to bad node
Agenda
• Hadoop @eBay
• Problem Statement
• What is Hadoop Robot
• Q&A
22
What is Hadoop Robot
Hadoop Robot is action and remediation center for eBay hadoop clusters:
• End-to-end automated remediation center
• API center for hadoop action and remediation
• Unified Hadoop Admin Console
• Real time maintenance view of Hadoop clusters
• Analytical insights into hardware maintenance data
23
End-to-end Automatic remediation center
•Hardware Maintenance
•Alert Detection
•Node Decommission
•Remediation
•Node Recommission
•Remove Failed Disk Volume
•Bad Disk Hot Swap
•Hadoop Daemon Restart
•Hadoop Abnormal Job Termination
•Hadoop Cluster Expansion
•…
24
Robot Architecture
25
Hardware Maintenance Workflow
• Alert detection
• Node decommission
• Vendor remediation (while node is offline)
• Node OS reprovisioning
• Hadoop Installation && OS configuration
• Node restart
• Burn-in test
• Node health verification
• Node recommission
26
Data Flow
27
Alerts Drive Workflow
28
Hardware and System Alerts
•Hardware
29
DISK
SMARTD_CHECK
SECTOR_CHECK
RW_CHECK
MOUNT_CHCEK
SYSLOG_CHECK
NIC
SPEED_CHCEK
DUPLEX_CHECK
FLAPPING_CHECK
DOWN_CHECK
NOT_FOUND_CHECK
MCELOG
CPU_CHECK
MEMORY_CHCEK
CPU_HIGH_TEMPERA
TURE_CHECK
MEMORY
MEMORY_SIZE
_CHECK
Hardware and System Alerts
•System
30
DISK
DISK_SPACE_CH
ECK
REBOOT
NODE_REBOOT_
CHCEK
PROCESS
ORPHAN_PROC
ESSES_CHECK
Metadata Setup - ansible
31
One button hadoop installation and system configuration
We use ansible playbook to install and
configure various OS and software
packages including Hadoop.
32
Copy the Hadoop Packages,
Configuration and Codes from
Source to the Destination Node
Update Symbolic Links to point
to the Latest Hadoop Code.
Restart Hadoop Daemon
Bonnie++
•We use bonnie++ to carry out a stress-test of the repaired hardware. This not only
puts a load on the I/O and disk subsystem, but it also can flush out CPU, RAM
and fan/cooling issues.
33
Cluster level status overview
34
Unified Hadoop Admin Console
35
Summit Job
36
Track Robot Job Activities
37
38
We are Hiring Now!
https://careers.ebayinc.com
Or contact me: jianli1@ebay.com

Weitere ähnliche Inhalte

Was ist angesagt?

HBaseCon 2012 | HBase, the Use Case in eBay Cassini
HBaseCon 2012 | HBase, the Use Case in eBay Cassini HBaseCon 2012 | HBase, the Use Case in eBay Cassini
HBaseCon 2012 | HBase, the Use Case in eBay Cassini Cloudera, Inc.
 
Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5Chris Nauroth
 
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014NoSQLmatters
 
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...Yahoo Developer Network
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
 
HBaseCon 2015: Elastic HBase on Mesos
HBaseCon 2015: Elastic HBase on MesosHBaseCon 2015: Elastic HBase on Mesos
HBaseCon 2015: Elastic HBase on MesosHBaseCon
 
HBaseCon 2012 | Solbase - Kyungseog Oh, Photobucket
HBaseCon 2012 | Solbase - Kyungseog Oh, PhotobucketHBaseCon 2012 | Solbase - Kyungseog Oh, Photobucket
HBaseCon 2012 | Solbase - Kyungseog Oh, PhotobucketCloudera, Inc.
 
Multi-tenant, Multi-cluster and Multi-container Apache HBase Deployments
Multi-tenant, Multi-cluster and Multi-container Apache HBase DeploymentsMulti-tenant, Multi-cluster and Multi-container Apache HBase Deployments
Multi-tenant, Multi-cluster and Multi-container Apache HBase DeploymentsDataWorks Summit
 
Hadoop - Lessons Learned
Hadoop - Lessons LearnedHadoop - Lessons Learned
Hadoop - Lessons Learnedtcurdt
 
HBaseCon 2013: Apache HBase Operations at Pinterest
HBaseCon 2013: Apache HBase Operations at PinterestHBaseCon 2013: Apache HBase Operations at Pinterest
HBaseCon 2013: Apache HBase Operations at PinterestCloudera, Inc.
 
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopTez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopDataWorks Summit
 
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in HadoopKudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoopjdcryans
 
NYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache HadoopNYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache Hadoopmarkgrover
 
Boosting Machine Learning with Redis Modules and Spark
Boosting Machine Learning with Redis Modules and SparkBoosting Machine Learning with Redis Modules and Spark
Boosting Machine Learning with Redis Modules and SparkDvir Volk
 
Managing multi tenant resource toward Hive 2.0
Managing multi tenant resource toward Hive 2.0Managing multi tenant resource toward Hive 2.0
Managing multi tenant resource toward Hive 2.0Kai Sasaki
 
Applications on Hadoop
Applications on HadoopApplications on Hadoop
Applications on Hadoopmarkgrover
 
Gruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in TelcoGruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in TelcoGruter
 
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBMBuilding and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBMLucidworks
 
Avoiding big data antipatterns
Avoiding big data antipatternsAvoiding big data antipatterns
Avoiding big data antipatternsgrepalex
 

Was ist angesagt? (20)

HBaseCon 2012 | HBase, the Use Case in eBay Cassini
HBaseCon 2012 | HBase, the Use Case in eBay Cassini HBaseCon 2012 | HBase, the Use Case in eBay Cassini
HBaseCon 2012 | HBase, the Use Case in eBay Cassini
 
Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5
 
tdtechtalk20160330johan
tdtechtalk20160330johantdtechtalk20160330johan
tdtechtalk20160330johan
 
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
 
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
HBaseCon 2015: Elastic HBase on Mesos
HBaseCon 2015: Elastic HBase on MesosHBaseCon 2015: Elastic HBase on Mesos
HBaseCon 2015: Elastic HBase on Mesos
 
HBaseCon 2012 | Solbase - Kyungseog Oh, Photobucket
HBaseCon 2012 | Solbase - Kyungseog Oh, PhotobucketHBaseCon 2012 | Solbase - Kyungseog Oh, Photobucket
HBaseCon 2012 | Solbase - Kyungseog Oh, Photobucket
 
Multi-tenant, Multi-cluster and Multi-container Apache HBase Deployments
Multi-tenant, Multi-cluster and Multi-container Apache HBase DeploymentsMulti-tenant, Multi-cluster and Multi-container Apache HBase Deployments
Multi-tenant, Multi-cluster and Multi-container Apache HBase Deployments
 
Hadoop - Lessons Learned
Hadoop - Lessons LearnedHadoop - Lessons Learned
Hadoop - Lessons Learned
 
HBaseCon 2013: Apache HBase Operations at Pinterest
HBaseCon 2013: Apache HBase Operations at PinterestHBaseCon 2013: Apache HBase Operations at Pinterest
HBaseCon 2013: Apache HBase Operations at Pinterest
 
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopTez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
 
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in HadoopKudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
 
NYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache HadoopNYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache Hadoop
 
Boosting Machine Learning with Redis Modules and Spark
Boosting Machine Learning with Redis Modules and SparkBoosting Machine Learning with Redis Modules and Spark
Boosting Machine Learning with Redis Modules and Spark
 
Managing multi tenant resource toward Hive 2.0
Managing multi tenant resource toward Hive 2.0Managing multi tenant resource toward Hive 2.0
Managing multi tenant resource toward Hive 2.0
 
Applications on Hadoop
Applications on HadoopApplications on Hadoop
Applications on Hadoop
 
Gruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in TelcoGruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in Telco
 
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBMBuilding and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBM
 
Avoiding big data antipatterns
Avoiding big data antipatternsAvoiding big data antipatterns
Avoiding big data antipatterns
 

Andere mochten auch

The Regacy Chapter 5.4b Ink - Emma
The Regacy Chapter 5.4b Ink - EmmaThe Regacy Chapter 5.4b Ink - Emma
The Regacy Chapter 5.4b Ink - Emmaregacylady
 
1.vitamines and coenzymes 31.03.2071
1.vitamines and coenzymes 31.03.20711.vitamines and coenzymes 31.03.2071
1.vitamines and coenzymes 31.03.2071RajDip Basnet
 
china Medical University Power Point Presentation
china Medical University Power Point Presentationchina Medical University Power Point Presentation
china Medical University Power Point PresentationSaju Bhaskar
 
From SOA to SCA and FraSCAti
From SOA to SCA and FraSCAtiFrom SOA to SCA and FraSCAti
From SOA to SCA and FraSCAtiphilippe_merle
 
11 12-2 学期校历
11 12-2 学期校历11 12-2 学期校历
11 12-2 学期校历share_stone
 
im watcing you
im watcing youim watcing you
im watcing youpisha
 
Proactive Professionalism Networking Today Nj 1
Proactive Professionalism Networking Today Nj 1Proactive Professionalism Networking Today Nj 1
Proactive Professionalism Networking Today Nj 1Suzanne Carbonaro
 
Blackwell Esteem AFSL
Blackwell Esteem AFSLBlackwell Esteem AFSL
Blackwell Esteem AFSLsamueltay77
 
Formula of failure and success
Formula of failure and successFormula of failure and success
Formula of failure and successUjjwal Panda
 
ZENworks Configuration Management
ZENworks Configuration ManagementZENworks Configuration Management
ZENworks Configuration ManagementRoel van Bueren
 
Know Your Supplier - Rubber & Tyre Machinery World May 2016 Special
Know Your Supplier - Rubber & Tyre Machinery World May 2016 SpecialKnow Your Supplier - Rubber & Tyre Machinery World May 2016 Special
Know Your Supplier - Rubber & Tyre Machinery World May 2016 SpecialRubber & Tyre Machinery World
 
Where can tell me who I am?
Where can tell me who I am?Where can tell me who I am?
Where can tell me who I am?seltzoid
 
Sovereignty, Free Will, and Salvation - Limited Atonement
Sovereignty, Free Will, and Salvation - Limited AtonementSovereignty, Free Will, and Salvation - Limited Atonement
Sovereignty, Free Will, and Salvation - Limited AtonementRobin Schumacher
 
API Exhibit
API ExhibitAPI Exhibit
API Exhibitpointerj
 
Sekolah kebangsaan jalan raja syed alwi
Sekolah kebangsaan jalan raja syed alwiSekolah kebangsaan jalan raja syed alwi
Sekolah kebangsaan jalan raja syed alwiNorshida Shida
 
3cork and kerry
3cork and kerry3cork and kerry
3cork and kerryriaenglish
 
Academic Word List - Sublist 3
Academic Word List - Sublist 3Academic Word List - Sublist 3
Academic Word List - Sublist 3Pham Duc
 

Andere mochten auch (20)

The Regacy Chapter 5.4b Ink - Emma
The Regacy Chapter 5.4b Ink - EmmaThe Regacy Chapter 5.4b Ink - Emma
The Regacy Chapter 5.4b Ink - Emma
 
1.vitamines and coenzymes 31.03.2071
1.vitamines and coenzymes 31.03.20711.vitamines and coenzymes 31.03.2071
1.vitamines and coenzymes 31.03.2071
 
Pqrc 2.4-a4-latest
Pqrc 2.4-a4-latestPqrc 2.4-a4-latest
Pqrc 2.4-a4-latest
 
china Medical University Power Point Presentation
china Medical University Power Point Presentationchina Medical University Power Point Presentation
china Medical University Power Point Presentation
 
From SOA to SCA and FraSCAti
From SOA to SCA and FraSCAtiFrom SOA to SCA and FraSCAti
From SOA to SCA and FraSCAti
 
11 12-2 学期校历
11 12-2 学期校历11 12-2 学期校历
11 12-2 学期校历
 
im watcing you
im watcing youim watcing you
im watcing you
 
Spinal cord trauma
Spinal cord traumaSpinal cord trauma
Spinal cord trauma
 
Proactive Professionalism Networking Today Nj 1
Proactive Professionalism Networking Today Nj 1Proactive Professionalism Networking Today Nj 1
Proactive Professionalism Networking Today Nj 1
 
Blackwell Esteem AFSL
Blackwell Esteem AFSLBlackwell Esteem AFSL
Blackwell Esteem AFSL
 
Formula of failure and success
Formula of failure and successFormula of failure and success
Formula of failure and success
 
ZENworks Configuration Management
ZENworks Configuration ManagementZENworks Configuration Management
ZENworks Configuration Management
 
Know Your Supplier - Rubber & Tyre Machinery World May 2016 Special
Know Your Supplier - Rubber & Tyre Machinery World May 2016 SpecialKnow Your Supplier - Rubber & Tyre Machinery World May 2016 Special
Know Your Supplier - Rubber & Tyre Machinery World May 2016 Special
 
Where can tell me who I am?
Where can tell me who I am?Where can tell me who I am?
Where can tell me who I am?
 
Sovereignty, Free Will, and Salvation - Limited Atonement
Sovereignty, Free Will, and Salvation - Limited AtonementSovereignty, Free Will, and Salvation - Limited Atonement
Sovereignty, Free Will, and Salvation - Limited Atonement
 
API Exhibit
API ExhibitAPI Exhibit
API Exhibit
 
R.T_article pdf
R.T_article pdfR.T_article pdf
R.T_article pdf
 
Sekolah kebangsaan jalan raja syed alwi
Sekolah kebangsaan jalan raja syed alwiSekolah kebangsaan jalan raja syed alwi
Sekolah kebangsaan jalan raja syed alwi
 
3cork and kerry
3cork and kerry3cork and kerry
3cork and kerry
 
Academic Word List - Sublist 3
Academic Word List - Sublist 3Academic Word List - Sublist 3
Academic Word List - Sublist 3
 

Ähnlich wie Hadoop Robot from eBay at China Hadoop Summit 2015

Infrastructure Around Hadoop
Infrastructure Around HadoopInfrastructure Around Hadoop
Infrastructure Around HadoopDataWorks Summit
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingGreat Wide Open
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introductionSandeep Singh
 
Top 10 lessons learned from deploying hadoop in a private cloud
Top 10 lessons learned from deploying hadoop in a private cloudTop 10 lessons learned from deploying hadoop in a private cloud
Top 10 lessons learned from deploying hadoop in a private cloudRogue Wave Software
 
Real-time searching of big data with Solr and Hadoop
Real-time searching of big data with Solr and HadoopReal-time searching of big data with Solr and Hadoop
Real-time searching of big data with Solr and HadoopRogue Wave Software
 
Scaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInScaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInDataWorks Summit
 
Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
Teradata Partners Conference Oct 2014   Big Data Anti-PatternsTeradata Partners Conference Oct 2014   Big Data Anti-Patterns
Teradata Partners Conference Oct 2014 Big Data Anti-PatternsDouglas Moore
 
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of AltiscaleDebugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of AltiscaleData Con LA
 
Facing enterprise specific challenges – utility programming in hadoop
Facing enterprise specific challenges – utility programming in hadoopFacing enterprise specific challenges – utility programming in hadoop
Facing enterprise specific challenges – utility programming in hadoopfann wu
 
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016StampedeCon
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014cdmaxime
 
OSDC 2016 - Tuning Linux for your Database by Colin Charles
OSDC 2016 - Tuning Linux for your Database by Colin CharlesOSDC 2016 - Tuning Linux for your Database by Colin Charles
OSDC 2016 - Tuning Linux for your Database by Colin CharlesNETWAYS
 
Piranha vs. mammoth predator appliances that chew up big data
Piranha vs. mammoth   predator appliances that chew up big dataPiranha vs. mammoth   predator appliances that chew up big data
Piranha vs. mammoth predator appliances that chew up big dataJack (Yaakov) Bezalel
 
MySQL in the Hosted Cloud
MySQL in the Hosted CloudMySQL in the Hosted Cloud
MySQL in the Hosted CloudColin Charles
 
Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)mundlapudi
 
DrupalCampLA 2014 - Drupal backend performance and scalability
DrupalCampLA 2014 - Drupal backend performance and scalabilityDrupalCampLA 2014 - Drupal backend performance and scalability
DrupalCampLA 2014 - Drupal backend performance and scalabilitycherryhillco
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
 

Ähnlich wie Hadoop Robot from eBay at China Hadoop Summit 2015 (20)

Infrastructure Around Hadoop
Infrastructure Around HadoopInfrastructure Around Hadoop
Infrastructure Around Hadoop
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed Debugging
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
 
Top 10 lessons learned from deploying hadoop in a private cloud
Top 10 lessons learned from deploying hadoop in a private cloudTop 10 lessons learned from deploying hadoop in a private cloud
Top 10 lessons learned from deploying hadoop in a private cloud
 
Real-time searching of big data with Solr and Hadoop
Real-time searching of big data with Solr and HadoopReal-time searching of big data with Solr and Hadoop
Real-time searching of big data with Solr and Hadoop
 
Scaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInScaling Hadoop at LinkedIn
Scaling Hadoop at LinkedIn
 
Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
Teradata Partners Conference Oct 2014   Big Data Anti-PatternsTeradata Partners Conference Oct 2014   Big Data Anti-Patterns
Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
 
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of AltiscaleDebugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale
 
Facing enterprise specific challenges – utility programming in hadoop
Facing enterprise specific challenges – utility programming in hadoopFacing enterprise specific challenges – utility programming in hadoop
Facing enterprise specific challenges – utility programming in hadoop
 
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
 
Achieving 100k Queries per Hour on Hive on Tez
Achieving 100k Queries per Hour on Hive on TezAchieving 100k Queries per Hour on Hive on Tez
Achieving 100k Queries per Hour on Hive on Tez
 
OSDC 2016 - Tuning Linux for your Database by Colin Charles
OSDC 2016 - Tuning Linux for your Database by Colin CharlesOSDC 2016 - Tuning Linux for your Database by Colin Charles
OSDC 2016 - Tuning Linux for your Database by Colin Charles
 
Piranha vs. mammoth predator appliances that chew up big data
Piranha vs. mammoth   predator appliances that chew up big dataPiranha vs. mammoth   predator appliances that chew up big data
Piranha vs. mammoth predator appliances that chew up big data
 
MySQL in the Hosted Cloud
MySQL in the Hosted CloudMySQL in the Hosted Cloud
MySQL in the Hosted Cloud
 
Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)
 
DrupalCampLA 2014 - Drupal backend performance and scalability
DrupalCampLA 2014 - Drupal backend performance and scalabilityDrupalCampLA 2014 - Drupal backend performance and scalability
DrupalCampLA 2014 - Drupal backend performance and scalability
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
002 Introduction to hadoop v3
002   Introduction to hadoop v3002   Introduction to hadoop v3
002 Introduction to hadoop v3
 
Oracle Big Data Cloud service
Oracle Big Data Cloud serviceOracle Big Data Cloud service
Oracle Big Data Cloud service
 

Kürzlich hochgeladen

Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 

Kürzlich hochgeladen (20)

Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 

Hadoop Robot from eBay at China Hadoop Summit 2015

  • 1. HADOOP ROBOT Action and Remediation Center for eBay's Hadoop Clusters Polo Li | 李健 eBay Cloud Services
  • 3. Agenda • Hadoop @eBay • Problem Statement • What is Hadoop Robot • Q&A 3
  • 4. 4 Hadoop @eBay 1-10 nodes 2007 100+ nodes 1000 + core 1 PB 2010 2011 1000+ node 10,000+ core 10+ PB 4000+ node 40,000+ core 50+ PB 2013 2015 10,000+ nodes 150,000+ cores 150+ PB 2009 10+ nodes
  • 5. Hadoop @eBay •10+ large Hadoop Clusters •10,000+ nodes •50,000+ jobs per day •50,000,000+ tasks per day 5
  • 6. Shared vs Dedicated Clusters • Shared clusters • Used primarily for analytics of user behavior and inventory • Mix of batch and ad-hoc jobs • Mix of MR, YARN, Hive, Pig, Cascading, etc. • Hadoop and HBase security enabled • Dedicated clusters • Very specific use case like index building • Tight SLAs for jobs (in order of minutes) • Immediate revenue impact • Usually smaller than our shared clusters, but still large (600+ nodes) 6
  • 8. Agenda • Hadoop @eBay • Problem Statement • What is Hadoop Robot • Q&A 8
  • 9. Problem Statements • Long trouble shooting time • Bad cluster performance • Too many different skus, operating systems and metadatas • Human resource Cost • Cluster Availability 9
  • 10. Traditional Trouble Shooting Pipeline Step 1 • Check failed application task logs to find out the suspicious hadoop nodes Step 2 • Check the suspicious hadoop node hardware && system status Step 3 • Check hadoop metrics and hadoop daemon logs Step 4 • Check hadoop source code 10
  • 11. Victim or Perpetrator ? Sometimes you think you've found the perpetrators, however it may turn out to be the victim. 11
  • 12. What may impact cluster performance? • Hardware • System • Hadoop • JVM • … 12
  • 13. Hardware issues 84% 3% 4% 4% 1% 1% 1% 2% Hardware Issues distribution Disk Nic Memory Cpu Chassis Fan Mobo PSU 13
  • 14. System issues •High load •Node reboot •Disk full •Network saturated fully •OOM •Kernel bug •Orphan processes •… 14
  • 15. Hadoop Issues •Hot spot •Low hdfs locality •High RPC call queue length •Hadoop configuration inconsistent issues •Big resource consuming applications •Big log/hdfs output applications •Bad Application scheduling •… 15
  • 17. Different SKU •Cpu core •Memory size •Disk number •Disk size •Nic Speed •… 17
  • 18. Different OS •Image profile •Image version •Patches •… 18
  • 19. Different Metadata • Hadoop Metadata •Daemons •Packages •Configurations • OS Metadata •Services •Scripts/Tools •Configurations 19
  • 20. Human Resource Cost • Detect node (varies based on admin experiences) • Node decommission (1-2 Hours) • Vendor offline remediation (3-5 Days) • Node reimage(45 Mins) • Hadoop Installation and OS configuration (15 Mins) • Node restart (5 Mins) • Disk burning test (6-8 Hours) • Node health verification (15 Mins) • Node recommission (15 Mins) 20
  • 21. Cluster Availability 21 • Improve Availability • Increase Live data node percentage • Reduce job failures due to bad node
  • 22. Agenda • Hadoop @eBay • Problem Statement • What is Hadoop Robot • Q&A 22
  • 23. What is Hadoop Robot Hadoop Robot is action and remediation center for eBay hadoop clusters: • End-to-end automated remediation center • API center for hadoop action and remediation • Unified Hadoop Admin Console • Real time maintenance view of Hadoop clusters • Analytical insights into hardware maintenance data 23
  • 24. End-to-end Automatic remediation center •Hardware Maintenance •Alert Detection •Node Decommission •Remediation •Node Recommission •Remove Failed Disk Volume •Bad Disk Hot Swap •Hadoop Daemon Restart •Hadoop Abnormal Job Termination •Hadoop Cluster Expansion •… 24
  • 26. Hardware Maintenance Workflow • Alert detection • Node decommission • Vendor remediation (while node is offline) • Node OS reprovisioning • Hadoop Installation && OS configuration • Node restart • Burn-in test • Node health verification • Node recommission 26
  • 29. Hardware and System Alerts •Hardware 29 DISK SMARTD_CHECK SECTOR_CHECK RW_CHECK MOUNT_CHCEK SYSLOG_CHECK NIC SPEED_CHCEK DUPLEX_CHECK FLAPPING_CHECK DOWN_CHECK NOT_FOUND_CHECK MCELOG CPU_CHECK MEMORY_CHCEK CPU_HIGH_TEMPERA TURE_CHECK MEMORY MEMORY_SIZE _CHECK
  • 30. Hardware and System Alerts •System 30 DISK DISK_SPACE_CH ECK REBOOT NODE_REBOOT_ CHCEK PROCESS ORPHAN_PROC ESSES_CHECK
  • 31. Metadata Setup - ansible 31
  • 32. One button hadoop installation and system configuration We use ansible playbook to install and configure various OS and software packages including Hadoop. 32 Copy the Hadoop Packages, Configuration and Codes from Source to the Destination Node Update Symbolic Links to point to the Latest Hadoop Code. Restart Hadoop Daemon
  • 33. Bonnie++ •We use bonnie++ to carry out a stress-test of the repaired hardware. This not only puts a load on the I/O and disk subsystem, but it also can flush out CPU, RAM and fan/cooling issues. 33
  • 34. Cluster level status overview 34
  • 35. Unified Hadoop Admin Console 35
  • 37. Track Robot Job Activities 37
  • 38. 38 We are Hiring Now! https://careers.ebayinc.com Or contact me: jianli1@ebay.com