Over the past year, a lot of progress has been made in advancing the Apache Hadoop platform at Yahoo. We underwent a massive infrastructure consolidation to lower the platform TCO. CaffeOnSpark was open-sourced for distributed deep learning on existing infrastructure with a combination of CPU- and GPU-based computing. Traditional compute on MapReduce continues to shift to Apache Tez and Apache Spark for lower processing time. Our internal security, multi-tenancy, and scale changes to Apache Storm were pushed to the community in Storm 0.10. Omid was open-sourced for managing transactions reliably on Apache HBase. Multi-tenancy with region groups, splittable META, ZooKeeper-less assignment manager, favored nodes with HDFS block placement, and support for humongous tables have taken Apache HBase scale to new heights. Dependency management in Apache Oozie for combinatorial, conditional, and optional processing gives our data pipelines teams increased flexibility in maintaining SLAs. A focus on ease of use and onboarding improvements has brought a whole new class of use cases and users to the platform. In this talk, we will provide a comprehensive overview of the platform technology stack, recent developments, and metrics, and share thoughts on where things are headed when it comes to big data at Yahoo.
3. Platform Evolution

[Chart: cluster growth by year, 2006–2016 — number of servers (0–50,000) and raw HDFS storage (0–800 PB)]

- Yahoo! commits to scaling Hadoop for production use
- Research workloads in search and advertising
- Production (modeling) with machine learning & WebMap
- Revenue systems with security, multi-tenancy, and SLAs
- Open sourced with Apache
- Hortonworks spinoff for enterprise hardening
- Nextgen Hadoop (H 0.23 YARN)
- New services (HBase, Storm, Spark, Hive)
- Increased user base with partitioned namespaces
- Apache H2.7 (scalable ML, latency, utilization, productivity)
4. Deployment Models

On-Premise
- Private (dedicated) clusters: large, demanding use cases; new technology not yet platformized; data movement and regulation issues
- Hosted multi-tenant (private cloud) clusters: source of truth for all of the org's data; app delivery agility; operational efficiency and cost savings through economies of scale
- Purpose-built big data clusters: for performance and tighter integration with the tech stack; value-added services such as monitoring, alerts, tuning, and common tools

Public Cloud
- Hosted compute clusters: when more cost effective than on-premise; when time to market/results matter; when data is already in the public cloud
9. Consolidated Research Cluster Characteristics – One Month Sample (2016)

[Chart: total vs. used compute (TB), 0–300 range, over the month]

- Consolidated cluster: 65 PB HDFS, 240 TB compute, 70% avg. utilization
- Before: 10,500 servers; after: 2,200 servers
- 40% decrease in TCO, 65% increase in compute capacity, 50% increase in avg. utilization
10. Common Hadoop Cluster Configuration

[Diagram: Rack 1 through Rack N of CPU servers with JBODs & 10GbE, connected over a network backplane]
11. New Hadoop Cluster Configuration

[Diagram: Rack 1 through Rack N of CPU servers with JBODs & 10GbE on the network backplane, plus GPU servers and hi-mem servers connected over 100 Gbps InfiniBand]
12. YARN Node Labels

[Diagram: a Hadoop cluster with nodes labeled x and y; jobs J1–J4 scheduled across Queue 1 (40%, label x), Queue 2 (40%, labels x, y), and Queue 3 (20%)]

yarn.scheduler.capacity.root.<queue name>.accessible-node-labels = <label name>
yarn.scheduler.capacity.root.<queue name>.default-node-label-expression sets the default label asked for by the queue
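As a sketch, the node-label settings above can be expressed in capacity-scheduler.xml; the queue name (queue1), label name (x), and capacity value here are illustrative assumptions from the diagram, not values stated in the deck:

```xml
<!-- capacity-scheduler.xml fragment: queue1 may run on nodes labeled "x" -->
<property>
  <name>yarn.scheduler.capacity.root.queue1.accessible-node-labels</name>
  <value>x</value>
</property>
<!-- share of label-x capacity granted to queue1 (illustrative) -->
<property>
  <name>yarn.scheduler.capacity.root.queue1.accessible-node-labels.x.capacity</name>
  <value>100</value>
</property>
<!-- containers from queue1 ask for label "x" by default -->
<property>
  <name>yarn.scheduler.capacity.root.queue1.default-node-label-expression</name>
  <value>x</value>
</property>
```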
18. CaffeOnSpark Architecture – Common Cluster

[Diagram] A Spark driver coordinates a set of Spark executors. Each executor feeds data from HDFS datasets and controls a Caffe instance (enhanced with multi-GPU/CPU support); a model synchronizer exchanges model updates across nodes over MPI on RDMA/TCP. The model output is written to HDFS.
19. CaffeOnSpark Architecture – Incremental Learning

Feature engineering: deep learning. Training classifiers: non-deep learning.

cos = new CaffeOnSpark(ctx)
conf = new Config(ctx, args).init()
dl_train_source = DataSource.getSource(conf, true)
cos.train(dl_train_source)                 // train DL model
lr_raw_source = DataSource.getSource(conf, false)
ext_df = cos.features(lr_raw_source)       // extract features via DL
lr_input_df = ext_df.withColumn("L", cos.floats2doubleUDF(ext_df(conf.label)))
  .withColumn("F", cos.floats2doublesUDF(ext_df(conf.features(0))))
lr = new LogisticRegression().setLabelCol("L").setFeaturesCol("F")
lr_model = lr.fit(lr_input_df)
...
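The same two-stage pattern (deep-learned features feeding a non-deep classifier) can be sketched in plain Python with no Spark or Caffe; the feature extractor and the tiny SGD logistic regression below are illustrative stand-ins, not CaffeOnSpark APIs:

```python
import math

def extract_features(x):
    # Stand-in for the deep-learning feature step (cos.features in the deck):
    # map a raw input to a small feature vector.
    return [x, x * x]

def train_logreg(data, epochs=200, lr=0.5):
    # Tiny SGD logistic regression: the "non-deep" classifier stage.
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for feats, label in data:
            z = sum(wi * fi for wi, fi in zip(w, feats)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            g = p - label                    # gradient of log loss w.r.t. z
            w = [wi - lr * g * fi for wi, fi in zip(w, feats)]
            b -= lr * g
    return w, b

def predict(w, b, feats):
    return 1 if sum(wi * fi for wi, fi in zip(w, feats)) + b > 0 else 0

# Stage 1: feature engineering; Stage 2: train the classical classifier.
raw = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
data = [(extract_features(x), y) for x, y in raw]
w, b = train_logreg(data)
preds = [predict(w, b, extract_features(x)) for x, _ in raw]
```

The point of the split is the same as on the slide: the expensive feature extraction runs once, and a cheap classical model is retrained on top as new labeled data arrives.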
23. Hadoop Compute Sources

- HDFS (file system and storage)
- YARN (resource management and scheduling)
- Execution engines: MapReduce (legacy), Tez (execution engine for Pig and Hive), Spark (alternate execution engine)
- Data processing: Pig (scripting), Hive (SQL), Java MR APIs, ML, custom apps on Slider
- Data management: Oozie
33. Security

Authentication
- Kerberos (users, processes)
- Delegation tokens (MapReduce, YARN, etc.)

Authorization
- HBase ACLs (Read, Write, Create, Admin)
- Grant permissions to a user or Unix group
- ACL at the table, column family, or column level
34. Region Server Groups

- Dedicated region servers for a set of tables
- Resource isolation (CPU, memory, IO, etc.)

[Diagram: RegionServer Group Foo (region servers 1–5) hosts TableA–TableF; RegionServer Group Bar (region servers 6–10) hosts Table1–Table6]
35. Namespaces

- Analogous to a "database"
- Namespace ACL to create tables
- Default group
- Quota on tables and regions

[Diagram: a namespace ties together a group, tables, a quota, and an ACL]
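An illustrative HBase shell session for the above; the namespace, group, and quota values are hypothetical, and the quota properties assume a build with namespace quotas enabled:

```
create_namespace 'analytics', {'hbase.namespace.quota.maxtables' => '10',
                               'hbase.namespace.quota.maxregions' => '500'}
create 'analytics:clicks', 'cf'              # table inside the namespace
grant '@analytics_team', 'RWC', '@analytics' # namespace-level ACL for a group
```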
36. Split Meta to Spread Load and Avoid Large Regions
39. Scaling HBase to Handle Millions of Regions on a Cluster

- Region server groups
- Split meta
- Split ZK
- Favored nodes
- Humongous tables
40. Transactions on HBase with Omid1

- A high-performance, fault-tolerant ACID transactional framework
- New Apache Incubator project: incubator.apache.org/projects/omid.html
- Handles millions of transactions per day for search and personalization products

1 Omid means "Hope" in Persian
44. Oozie Data Pipelines

[Diagram: Oozie, a message bus, HCatalog, a data producer, and HDFS]
1. Oozie queries/polls HCatalog for the partition
2. Oozie registers a topic with the message bus
   (The data producer writes data to HDFS via distcp, Pig, M/R, etc. — e.g., /data/click/2014/06/02 — and updates metadata:
   ALTER TABLE click ADD PARTITION (data='2014/06/02') LOCATION 'hdfs://data/click/2014/06/02')
3. HCatalog pushes a <New Partition> notification to the message bus
4. The message bus notifies Oozie of the new partition; Oozie starts the workflow
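A coordinator dataset keyed off the HCatalog partition above might look like the following sketch; the HCatalog server, port, and database name are hypothetical placeholders:

```xml
<datasets>
  <dataset name="click" frequency="${coord:days(1)}"
           initial-instance="2014-06-02T00:00Z" timezone="UTC">
    <!-- HCatalog partition dependency; host/port/db are illustrative -->
    <uri-template>hcat://hcat.example.com:9080/default/click/data=${YEAR}/${MONTH}/${DAY}</uri-template>
  </dataset>
</datasets>
```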
45. Large Scale Data Pipeline Requirements

Administrative
- One should be able to start, stop, and pause all related pipelines at the same time

Dependency Management
- The output of coordinator action "n+1" depends on coordinator action "n" (dataset dependency)
- If a dataset has a BCP instance, the workflow should run with either, whichever arrives first
- Start as soon as mandatory data is available; other feeds are optional
- Data is not guaranteed; start processing even if partial data is available

SLA Management
- Monitor pipeline processing to take immediate action in case of failures or SLA misses
- Pipeline owners should get notified if an SLA is missed

Multiple Providers
- If data is available from multiple providers, specify the provider priority
- Combine datasets from multiple providers to fill the gaps in data a single provider may have
47. BCP And Mandatory / Optional Feeds

Pull data from A or B. Specify the dataset as AorB. The action will start running as soon as either dataset A or B is available.

<input-logic>
  <or name="AorB">
    <data-in dataset="A" wait="10"/>
    <data-in dataset="B"/>
  </or>
</input-logic>

Dataset B is optional; Oozie will start processing as soon as A is available. It will include the dataset from A and whatever is available from B.

<input-logic>
  <and name="optional">
    <data-in dataset="A"/>
    <data-in dataset="B" min="0"/>
  </and>
</input-logic>
48. Data Not Guaranteed / Priority Among Dataset Instances

A has higher precedence than B, and B has higher precedence than C.

<input-logic>
  <or name="AorBorC">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
    <data-in dataset="C"/>
  </or>
</input-logic>

Oozie will start processing if the available A instances are >= 10. Min can also be combined with wait (as shown for dataset B).

<input-logic>
  <data-in dataset="A" min="10"/>
  <data-in dataset="B" min="10" wait="20"/>
</input-logic>
49. Combining Dataset From Multiple Providers

The combine function will first take instances from A, then go to B for whatever is missing in A.

<data-in name="A" dataset="dataset_A">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:latest(-1)}</end-instance>
</data-in>
<data-in name="B" dataset="dataset_B">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:current(-1)}</end-instance>
</data-in>
<input-logic>
  <combine name="AB">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
  </combine>
</input-logic>
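As a toy model (not Oozie code) of the combine semantics, the helper below picks each required instance from provider A when present and falls back to provider B, failing the dependency only when both providers have a gap; all names are hypothetical:

```python
def combine(instances, a_avail, b_avail):
    # For each required instance, prefer provider A, fall back to B;
    # if neither has it, the dependency is not satisfied.
    picked = {}
    for ts in instances:
        if ts in a_avail:
            picked[ts] = "A"
        elif ts in b_avail:
            picked[ts] = "B"
        else:
            return None  # gap in both providers
    return picked

hours = ["01", "02", "03", "04"]
a = {"01", "03", "04"}   # provider A is missing hour 02
b = {"01", "02", "03"}   # provider B fills that gap
result = combine(hours, a, b)
# hour 02 comes from B; every other hour comes from A
```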
65. THANK YOU
SUMEET SINGH (@sumeetksingh)
Sr. Director, Cloud and Big Data Platforms
Icon Courtesy – iconfinder.com (under Creative Commons)
Speaker Notes
JIRA 1976 (Oozie 4.3)
While ${coord:latest} allows skipping to available instances, the workflow will never trigger unless the specified number of instances is found. Min can also be combined with wait: if not all dependencies are met but the MIN dependencies are, Oozie keeps waiting for more instances until the wait time elapses or all data dependencies are met.
Protocols
- REST – Use python-requests and a custom client to streamline RESTful interface calls
- Thrift – Custom connection pooling and socket multiplexing to streamline Thrift calls

Accessibility
- Middleware – Make Hadoop interfaces accessible in request objects

Hue uses the CherryPy web server. You can use the following options to change the IP address and port that the web server listens on. The default setting is port 8888 on all configured IP addresses.
If you don't specify a secret key, your session cookies will not be secure. Hue will run, but it will also display error messages telling you to set the secret key.
You can configure Hue to serve over HTTPS. To do so, you must install "pyOpenSSL" within Hue's context and configure your keys.