SlideShare a Scribd company logo
1 of 66
HADOOP PLATFORM
AT YAHOO
A YEAR IN REVIEW
SUMEET SINGH (@sumeetksingh)
Sr. Director, Cloud and Big Data Platforms
Agenda
2
Platform Overview1
Infrastructure and Metrics2
CaffeOnSpark for Distributed DL3
Compute and Sketches4
Oozie6
Ease of Use7
Q&A8
HBase and Omid5
0
100
200
300
400
500
600
700
800
0
5,000
10,000
15,000
20,000
25,000
30,000
35,000
40,000
45,000
50,000
2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
RawHDFS(inPB)
#Servers
Year
Servers Storage
Yahoo!
Commits to
Scaling
Hadoop for
Production
Use
Research
Workloads
in Search
and
Advertising
Production
(Modeling)
with machine
learning &
WebMap
Revenue
Systems
with Security,
Multi-
tenancy, and
SLAs
Open
Sourced with
Apache
Hortonworks
Spinoff for
Enterprise
hardening
Nextgen
Hadoop
(H 0.23 YARN)
New Services
(HBase,
Storm, Spark,
Hive)
Increased
User-base
with
partitioned
namespaces
Apache H2.7
(Scalable ML, Latency,
Utilization, Productivity)
Platform Evolution
3
Deployment Models
Private (dedicated)
Clusters
Hosted Multi-tenant
(private cloud)
Clusters
Hosted Compute
Clusters
 Large demanding use
cases
 New technology not
yet platformized
 Data movement and
regulation issues
 When more cost
effective than on-
premise
 Time to market/
results matter
 Data already in
public cloud
 Source of truth for all
of orgs data
 App delivery agility
 Operational efficiency
and cost savings
through economies of
scale
On-Premise Public Cloud
Purpose-built
Big Data
Clusters
 For performance,
tighter integration
with tech stack
 Value added services
such as monitoring,
alerts, tuning and
common tools
4
Platform Today
ZK DBMS MON SSHOP LOG WH TOOLS
Apache / Open Source Projects Yahoo Projects
HDFS HBase HCat Kafka CMS DH
Pig Hive Oozie Hue GDM Big ML
YARN CS MR Tez Spark Storm
Services
Compute
Storage / Msg.
Tools
5
Technology Stack Assembly
ZK DBMS MON SSHOP LOG WH TOOLS
Apache Projects Yahoo Projects
HDFS HBase HCat Kafka CMS DH
Pig Hive Oozie Hue GDM Big ML
YARN CS MR Tez Spark Storm
Services
Compute
Storage / Msg.
Tools
HDFS
(File System)
YARN
(Scheduling, Resource Management)
Common
RHEL6 64-bit, JDK8
Platformized
Tech with
Production
Support
In-
progress,
Unmet
needs or
Apache
Alignment
6
Common Backplane
DataNode NodeManager
NameNode RM
DataNodes RegionServers
NameNode HBase Master Nimbus
Supervisor
Administration, Management and Monitoring
ZooKeeper
Pools
HTTP/HDFS/GDM
Load Proxies
Applications and Data
Data
Feeds
Data
Stores
Oozie
Server
HS2/
HCat
Network
Backplane
7
0
10
20
30
Cluster 1 (2,000 servers)
HDFS 12 PB
Compute 23 TB
Avg. Util: 26%
Research Cluster Consolidation
0
20
40
60
80
ComputeTotalandUsed(TB)
Cluster 3 (5,400 servers)
HDFS 36 PB
Compute 70 TB
Avg. Util: 59%
Cluster 2 (3,100 servers)
HDFS 21 PB
Compute 52 TB
Avg. Util: 40%
0
20
40
60
One Month Sample (2015)
Total Used
8
0
50
100
150
200
250
300
Consolidated Cluster
HDFS 65 PB
Compute 240 TB
Avg. Util: 70%
Consolidated Research Cluster Characteristics
One Month Sample (2016)
40% decrease in TCO
10,500
servers
2,200
servers
Before After
65% increase in compute capacity
50% increase in avg. utilization
Total Used
ComputeTotalandUsed(TB)
9
Common Hadoop Cluster Configuration
Rack 1
Network Backplane
CPU Servers
with JBODs
& 10GbE
Rack 2 Rack N
.
.
.
.
.
.
.
.
.
10
New Hadoop Cluster Configuration
Rack 1
Network Backplane
CPU Servers
with JBODs
& 10GbE
Rack 2 Rack N
100Gbps
InfiniBand
GPU Servers
Hi-Mem Servers
.
.
.
11
YARN Node Labels
J2J3
J4
Queue 1, 40%
Label x
Queue 2, 40%
Label x, y
J1
Queue 3, 20%
x x x x x x
x x x x x x
y y y y y y
y y y y y y
yarn.scheduler.capacity.root.<queue name>.accessible-node-labels = <label name>
yarn.scheduler.capacity.root.<label name>.default-node-label-expression sets the default label asked for by queue
Hadoop Cluster
12
Agenda
Platform Overview1
Infrastructure and Metrics2
CaffeOnSpark for Distributed DL3
Compute and Sketches4
Oozie6
Ease of Use7
Q&A8
HBase and Omid5
13
CaffeOnSpark – Distributed Deep Learning
CaffeOnSpark
for
DL
MLLib
for
non-DL
Hive or
SparkSQL
Spark
YARN (RM and Scheduling)
HDFS (Datasets)
. . .
14
Few Use Cases – Yahoo Weather
15
Few Use Cases – Flickr Facial Recognition
16
Few Use Cases – Flickr Scene Detection
17
CaffeOnSpark Architecture – Common Cluster
Spark Driver
Caffe
(enhanced with
multi-GPU/CPU)
Model
Synchronizer
(across nodes)
HDFS
Datasets
Spark
Executor
(for data feeding
and control)
Caffe
(enhanced with
multi-GPU/CPU)
Model
Synchronizer
(across nodes)
HDFS
Datasets
Spark
Executor
(for data feeding
and control)
Caffe
(enhanced with
multi-GPU/CPU)
Model
Synchronizer
(across nodes)
HDFS
Datasets
Spark
Executor
(for data feeding
and control)
Model
O/P on
HDFS
MPI on RDMA / TCP
18
CaffeOnSpark Architecture – Incremental Learning
cos = new CaffeOnSpark(ctx)
conf = new Config(ctx, args).init()
dl_train_source = DataSource.getSource(conf, true)
cos.train(dl_train_source) //training DL model
lr_raw_source = DataSource.getSource(conf, false)
ext_df = cos.features(lr_raw_source) // extract features via DL
Feature
Engineering:
DeepLearning
19
CaffeOnSpark Architecture – Incremental Learning
cos = new CaffeOnSpark(ctx)
conf = new Config(ctx, args).init()
dl_train_source = DataSource.getSource(conf, true)
cos.train(dl_train_source) //training DL model
lr_raw_source = DataSource.getSource(conf, false)
ext_df = cos.features(lr_raw_source) // extract features via DL
vlr_input=ext_df.withColumn(“L",cos.floats2doubleUDF(ext_df(conf.label))
)
.withColumn(“F",cos.floats2doublesUDF(ext_df(conf.features(0))))
lr = new LogisticRegression().setLabelCol(”L").setFeaturesCol(”F")
lr_model = lr.fit(lr_input_df) …
Feature
Engineering:
DeepLearning
20
TrainClassifiers:
Non-deep
Learning
CaffeOnSpark Architecture – Single Command
spark-submit
--num-executors #Exes
--class CaffeOnSpark
my-caffe-on-spark.jar
-devices #GPUs
-model dl_model_file
-output lr_model_file
21
Distributed Deep Learning
Apache
License
Existing
Clusters
Powerful
DL Platform
Fully
Distributed
High-level
API
Incremental
Learning
CaffeOnSpark
github.com/yahoo/caffeonspark
22
Agenda
Platform Overview1
Infrastructure and Metrics2
CaffeOnSpark for Distributed DL3
Compute and Sketches4
Oozie6
Ease of Use7
Q&A8
HBase and Omid5
23
Hadoop Compute Sources
HDFS
(File System and Storage)
Pig
(Scripting)
Hive
(SQL)
Java MR APIs
YARN
(Resource Management and Scheduling)
Tez
(Execution Engine for
Pig and Hive)
Spark
(Alternate Exec Engine)
MapReduce
(Legacy)
Data Processing
ML
Custom App on
Slider
Oozie
Data
Management
24
Compute Growth
13.3
20.4
23.8
27.2
32.3
34.1
39.1
10
15
20
25
30
35
40
45 Mar-13
Apr-13
May-13
Jun-13
Jul-13
Aug-13
Sep-13
Oct-13
Nov-13
Dec-13
Jan-14
Feb-14
Mar-14
Apr-14
May-14
Jun-14
Jul-14
Aug-14
Sep-14
Oct-14
Nov-14
Dec-14
Jan-15
Feb-15
Mar-15
Apr-15
May-15
Jun-15
Jul-15
Aug-15
Sep-15
Oct-15
Nov-15
Dec-15
Jan-16
Feb-16
Mar-16
#MR,Tez,SparkJobs(inmillions)
25
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Pushing Batch Compute Boundaries%ofTotalCompute(memory-sec)
Q1 2016
MapReduce Tez Spark
112 Million Batch Jobs in Q1’16
Jan 78%
Mar 67%
Mar 21% 12%Jan 8% 14%
26
Multi-tenant Apache Storm
27
Recent Apache Storm Developments at Yahoo
MT & RA
Scheduler
Dist. Cache
API
8 x
Throughput
Improved
Debuggability
1 github.com/yahoo/streaming-benchmarks
Pacemaker
Server
Streaming
Benchmark 1
28
Data Sketches Algorithms
Data Sketches Algorithms Library
datasketches.github.io
 Good enough approximate answers
for problem queries
 Streamable
 Approximate with predictable error
 Sub-linear in size
 Mergeable / additive
 Highly parallelizable
 Maven deployable
Characteristics
29
Distinct Count Sketch, High-level View
Big Data
Stream
Transform Data Structure Estimator
Result + / - ε
White
Noise
Basic Sketch Elements
30
Data Sketches Algorithms
Data Sketches Algorithms Library
datasketches.github.io
31
Agenda
Platform Overview1
Infrastructure and Metrics2
CaffeOnSpark for Distributed DL3
Compute and Sketches4
Oozie6
Ease of Use7
Q&A8
HBase and Omid5
32
Apache HBase at Yahoo
 Security
 Isolated Deployment
 Multi-tenant
 Region Server Group
 Namespace
 Unsupported Features
HBase
Client
HBase
Client
JobTracker Namenode
TaskTracker
DataNode
Namenode
RegionServer
DataNode
RegionServer
DataNode
RegionServer
DataNode
HBase Master
Zookeeper
Quorum
HBase
Client
MR Client
M/R Task
TaskTracker
DataNode
M/R Task
TaskTracker
DataNode
MR Task
Compute Cluster HBase Cluster
Gateway/Launcher
Rest Proxy
HTTP
Client
33
Security
 Authentication
 Kerberos (users, processes)
 Delegation Token (MapReduce, YARN, etc.)
 Authorization
 HBase ACLs (Read, Write, Create, Admin)
 Grant permissions to User or Unix Group
 ACL for Table, Column Family or Column
34
Region Server Groups
 Dedicated region servers for a set of tables
 Resource Isolation (CPU, Memory, IO, etc)
RegionServer
Group Foo
RegionServer
RegionServer
RegionServer
Region Server 1...5
TableA TableB TableC
TableD TableE TableF
RegionServer
Group Bar
RegionServer
RegionServer
RegionServer
Region Server 6…10
Table1 Table2 Table3
Table4 Table5 Table6
35
Namespaces
 Analogous to “Database”
 Namespace ACL to create tables
 Default group
 Quota
 Tables
 Regions
Namespace
Group Tables Quota ACL
36
Split Meta to Spread Load and Avoid Large Regions
37
Favored Nodes for HDFS Locality
38
Humongous Tables
39
Scaling HBase to Handle Millions of Regions on a Cluster
Region Server
Groups
Split
Meta
Split
ZK
Favored
Nodes
Humongous
Tables
40
Transactions on HBase with Omid1
Highly performant and fault tolerant ACID
transactional framework
New Apache Incubator project
incubator.apache.org/projects/omid.html
Handles million of transactions per day for
search and personalization products
1 Omid stands for “Hope” in Persian
41
Omid Components
42
Omid Data Model
43
Agenda
Platform Overview1
Infrastructure and Metrics2
CaffeOnSpark for Distributed DL3
Compute and Sketches4
Oozie6
Ease of Use7
Q&A8
HBase and Omid5
44
Oozie Data Pipelines
Oozie
Message Bus
HCatalog
3. Push notification
<New Partition>
2. Register Topic
4. Notify New Partition
Data Producer HDFS
Produce data (distcp, pig, M/R..)
/data/click/2014/06/02
1. Query/Poll Partition
Start workflow
Update metadata
(ALTER TABLE click ADD PARTITION(data=‘2014/06/02’)
location ’hdfs://data/click/2014/06/02’)
45
Large Scale Data Pipeline Requirements
Administrative
 One should be able to start, stop and pause
all related pipelines at a same time
Dependency Management
 Output of a coordinator “n+1” action is
dependent on coordinator “n” action (dataset
dependency)
 If dataset has a BCP instance, workflow
should run with either, whichever arrives first
 Start as soon as mandatory data is available,
other feeds are optional
 Data is not guaranteed, start processing
even if partial data is available
SLA Management
 Monitor pipeline processing to take
immediate action in case of failures or
SLA misses
 Pipelines owners should get notified if
an SLA is missed
Multiple Providers
 If data is available from multiple
providers, I want to specify the provider
priority
 Combine datasets from multiple
providers to fill the gaps in data a single
provider may have
46
Large Scale Data Pipeline Requirements
Administrative
 One should be able to start, stop and pause
all related pipelines at a same time
Dependency Management
 Output of a coordinator “n+1” action is
dependent on coordinator “n” action (dataset
dependency)
 If dataset has a BCP instance, workflow
should run with either, whichever arrives first
 Start as soon as mandatory data is available,
other feeds are optional
 Data is not guaranteed, start processing
even if partial data is available
SLA Management
 Monitor pipeline processing to take
immediate action in case of failures or
SLA misses
 Pipelines owners should get notified if
an SLA is missed
Multiple Providers
 If data is available from multiple
providers, I want to specify the provider
priority
 Combine datasets from multiple
providers to fill the gaps in data a single
provider may have
47
BCP And Mandatory / Optional Feeds
Pull data from A or B. Specify dataset as
AorB. Action will start running as soon
either dataset A or B is available.
<input-logic>
<or name=“AorB”>
<data-in dataset="A” wait=“10”/>
<data-in dataset="B"/>
</or>
</input-logic>
Dataset B is optional, Oozie will start
processing as soon as A is available. It
will include dataset from A and whatever
is available from B.
<input-logic>
<and name="optional
<data-in dataset="A"/>
<data-in dataset="B" min=”0”/>
</and>
</input-logic>
48
Data Not Guaranteed / Priority Among Dataset Instances
A will have higher precedence over B
and B will have higher precedence
over C.
<input-logic>
<or name="AorBorC">
<data-in dataset="A"/>
<data-in dataset="B"/>
<data-in dataset="C”/>
</or>
</input-logic>
49
Oozie will start processing if available A
instances are >= 10. Min can also be
combined with wait (as shown for dataset B).
<input-logic>
<data-in dataset="A" min=”10”/>
<data-in dataset=“B” min =“10”
wait=“20”/>
</input-logic>
Combining Dataset From Multiple Providers
Combine function will first check instances from A and go to B next for whatever is
missing in A.
<data-in name="A" dataset="dataset_A">
<start-instance> ${coord:CURRENT(-5)} </start-instance>
<end-instance> ${coord:latest(-1)} </end-instance>
</data-in>
<data-in name="B" dataset="dataset_B">
<start-instance>${coord:CURRENT(-5)}</start-instance>
<end-instance>${coord:CURRENT(-1)}</end-instance>
</data-in>
<input-logic>
<combine name="AB">
<data-in dataset="A"/>
<data-in dataset="B"/>
</combine>
</input-logic>
50
Agenda
Platform Overview1
Infrastructure and Metrics2
CaffeOnSpark for Distributed DL3
Compute and Sketches4
Oozie6
Ease of Use7
Q&A8
HBase and Omid5
51
Automated Onboarding / Collaboration Portal
52
Built for Tenant Transparency
53
Queue Utilization Dashboard
54
Data Discovery and Access
55
Audits, Compliance, and Efficiency
Starling
FS, Job, Task logs
Cluster 1 Cluster 2 Cluster n...
CF, Region, Action, Query Stats
Cluster 1 Cluster 2 Cluster n...
DB, Tbl., Part., Colmn. Access Stats
...MS 1 MS 2 MS n
GDM
Data Defn., Flow, Feed, Source
F 1 F 2 F n
Log Warehouse
Log Sources
56
Audits, Compliance, and Efficiency (cont’d)
Data Discovery and Access
Public
Non-sensitive
Financial $
Governance
Classification
No addn. reqmt.
LMS Integration
Stock Admin
Integration
Approval Flow
Restricted
57
Hosted UI – Hue as a Service
WSGI
Hue-1.Cluster-1 (Hot)
VIPUsers
HS2
Hue
MySQL DB
(HA)
Hadoop Cluster
HCat
Meta
Oozie
Server
YARN
RM
Web
HDFS
NMs
WSGI
Hue-2.Cluster-1 (hot)
HS2
IdP
SAML
Auth.
Serving pages and static content
Cookies, saved queries,
workflows etc.
FullStackHA
REST / Thrift
(jQuery, Bootstrap, Knockout.js, Love)
58
Going Forward
Increased
Intelligence
Greater
Speed
Higher
Efficiency
Necessary
Scale
59
Increased Intelligence
GBDT FTRL SGD
Deep
Learning
Random
Forests
ML Libraries
Click
Prediction Search RankingKeyword Auctions Ad
Relevance Abuse Detection
Applications
Proven to
Work at Scale
Solve Complex
Problems
YARN (Resource Manager)
Heterogeneous
Scheduling
Long-running
Services
GPUs
Large
Memory Support
Core Grid
Enhancements
…
Parameter ServerGlobally Shared
Parameters
Compute Engines
Distributed
Processing
…
60
Greater Speed
DeData
Management
Ease of
Use
Productivity
Dimensions
Real-time
Pipelines
Unified Metadata &
Lineage
Fine-grained
Access Control
Self-serve Data
Movement
SLA & Cost
Transparency
Intuitive
UIs
Planning &
Collab. Tools
Central Grid
Portal
Improvements
Query times
< 1 sec
4x Speedups in
ETL
SQL on
HBase
Limitless BI
Clients
Analytics, BI &
Reporting
61
Higher Efficiency
Achieve five 9’s availability and 70% average compute utilization across clusters
62
Hadoop Users at Yahoo
Slingstone & Aviate Mail Anti-Spam
Gemini Campaign
Mgmt.
Search Assist
Audience Analytics Flickr YAM+ & Targeting Membership Abuse
… and many more.
63
Yahoo at the Apache Open Source Foundation
10 Committers (6 PMC)
3 Committers (3 PMC)
3 Committers (2 PMC)
6 Committer (5 PMC)
1 Committer
3 Committers (2 PMCs)
7 Committers (6 PMCs)
1 2
43
5 6
7 8
1 Committer
64
Join Us @ yahoohadoop.tumblr.com
65
THANK YOU
SUMEET SINGH (@sumeetksingh)
Sr. Director, Cloud and Big Data Platforms
Icon Courtesy – iconfinder.com (under Creative Commons)

More Related Content

What's hot

Hadoop Query Performance Smackdown
Hadoop Query Performance SmackdownHadoop Query Performance Smackdown
Hadoop Query Performance SmackdownDataWorks Summit
 
Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101Mark Kromer
 
Apache Flink Adoption at Shopify
Apache Flink Adoption at ShopifyApache Flink Adoption at Shopify
Apache Flink Adoption at ShopifyYaroslav Tkachenko
 
Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Databricks
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesNishith Agarwal
 
Machine Learning with Spark MLlib
Machine Learning with Spark MLlibMachine Learning with Spark MLlib
Machine Learning with Spark MLlibTodd McGrath
 
Performance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache SparkPerformance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache SparkDataWorks Summit
 
(Big) Data Serialization with Avro and Protobuf
(Big) Data Serialization with Avro and Protobuf(Big) Data Serialization with Avro and Protobuf
(Big) Data Serialization with Avro and ProtobufGuido Schmutz
 
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)Yongho Ha
 
Welcome to the Flink Community!
Welcome to the Flink Community!Welcome to the Flink Community!
Welcome to the Flink Community!Flink Forward
 
Data Analysis & Visualization using MS. Excel
Data Analysis & Visualization using MS. ExcelData Analysis & Visualization using MS. Excel
Data Analysis & Visualization using MS. ExcelFrehiwot Mulugeta
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Time Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
Time Series Analytics with Spark: Spark Summit East talk by Simon OuelletteTime Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
Time Series Analytics with Spark: Spark Summit East talk by Simon OuelletteSpark Summit
 
Hive ICDE 2010
Hive ICDE 2010Hive ICDE 2010
Hive ICDE 2010ragho
 
Deep Dive into Apache Kafka
Deep Dive into Apache KafkaDeep Dive into Apache Kafka
Deep Dive into Apache Kafkaconfluent
 
Reduce Amazon RDS Costs up to 50% with Proxies
Reduce Amazon RDS Costs up to 50% with ProxiesReduce Amazon RDS Costs up to 50% with Proxies
Reduce Amazon RDS Costs up to 50% with ProxiesJohn Varghese
 

What's hot (20)

Hadoop Query Performance Smackdown
Hadoop Query Performance SmackdownHadoop Query Performance Smackdown
Hadoop Query Performance Smackdown
 
Apache Hive - Introduction
Apache Hive - IntroductionApache Hive - Introduction
Apache Hive - Introduction
 
Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101
 
Database system
Database systemDatabase system
Database system
 
Hadoop HDFS.ppt
Hadoop HDFS.pptHadoop HDFS.ppt
Hadoop HDFS.ppt
 
HBASE Overview
HBASE OverviewHBASE Overview
HBASE Overview
 
Apache Flink Adoption at Shopify
Apache Flink Adoption at ShopifyApache Flink Adoption at Shopify
Apache Flink Adoption at Shopify
 
Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Machine Learning with Spark MLlib
Machine Learning with Spark MLlibMachine Learning with Spark MLlib
Machine Learning with Spark MLlib
 
Performance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache SparkPerformance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache Spark
 
(Big) Data Serialization with Avro and Protobuf
(Big) Data Serialization with Avro and Protobuf(Big) Data Serialization with Avro and Protobuf
(Big) Data Serialization with Avro and Protobuf
 
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
 
Welcome to the Flink Community!
Welcome to the Flink Community!Welcome to the Flink Community!
Welcome to the Flink Community!
 
Data Analysis & Visualization using MS. Excel
Data Analysis & Visualization using MS. ExcelData Analysis & Visualization using MS. Excel
Data Analysis & Visualization using MS. Excel
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Time Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
Time Series Analytics with Spark: Spark Summit East talk by Simon OuelletteTime Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
Time Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
 
Hive ICDE 2010
Hive ICDE 2010Hive ICDE 2010
Hive ICDE 2010
 
Deep Dive into Apache Kafka
Deep Dive into Apache KafkaDeep Dive into Apache Kafka
Deep Dive into Apache Kafka
 
Reduce Amazon RDS Costs up to 50% with Proxies
Reduce Amazon RDS Costs up to 50% with ProxiesReduce Amazon RDS Costs up to 50% with Proxies
Reduce Amazon RDS Costs up to 50% with Proxies
 

Viewers also liked

Process Scheduling on Hadoop at Expedia
Process Scheduling on Hadoop at ExpediaProcess Scheduling on Hadoop at Expedia
Process Scheduling on Hadoop at Expediahuguk
 
BIG Data & Hadoop Applications in Social Media
BIG Data & Hadoop Applications in Social MediaBIG Data & Hadoop Applications in Social Media
BIG Data & Hadoop Applications in Social MediaSkillspeed
 
Pinterest hadoop summit_talk
Pinterest hadoop summit_talkPinterest hadoop summit_talk
Pinterest hadoop summit_talkKrishna Gade
 
August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieAugust 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieYahoo Developer Network
 
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...Yahoo Developer Network
 
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector Yahoo Developer Network
 
IT業界のリーディングカンパニーとして描く「少し先の未来」〜Yahoo! JAPANの事例を通して〜#a11yfuture
IT業界のリーディングカンパニーとして描く「少し先の未来」〜Yahoo! JAPANの事例を通して〜#a11yfutureIT業界のリーディングカンパニーとして描く「少し先の未来」〜Yahoo! JAPANの事例を通して〜#a11yfuture
IT業界のリーディングカンパニーとして描く「少し先の未来」〜Yahoo! JAPANの事例を通して〜#a11yfutureYahoo!デベロッパーネットワーク
 
ユーザー企業内製CSIRTにおける対応のポイント
ユーザー企業内製CSIRTにおける対応のポイントユーザー企業内製CSIRTにおける対応のポイント
ユーザー企業内製CSIRTにおける対応のポイントRecruit Technologies
 
What i learned from translation of the sre ryuji tamagawa
What i learned from translation of the sre ryuji tamagawaWhat i learned from translation of the sre ryuji tamagawa
What i learned from translation of the sre ryuji tamagawaRakuten Group, Inc.
 
Rakutenとsreと私 yanagimoto koichi
Rakutenとsreと私 yanagimoto koichiRakutenとsreと私 yanagimoto koichi
Rakutenとsreと私 yanagimoto koichiRakuten Group, Inc.
 
Kafka Connect(Japanese)
Kafka Connect(Japanese)Kafka Connect(Japanese)
Kafka Connect(Japanese)Roman Shtykh
 
ビックデータ処理技術の全体像とリクルートでの使い分け
ビックデータ処理技術の全体像とリクルートでの使い分けビックデータ処理技術の全体像とリクルートでの使い分け
ビックデータ処理技術の全体像とリクルートでの使い分けTetsutaro Watanabe
 
Struggling with BIGDATA -リクルートおけるデータサイエンス/エンジニアリング-
Struggling with BIGDATA -リクルートおけるデータサイエンス/エンジニアリング-Struggling with BIGDATA -リクルートおけるデータサイエンス/エンジニアリング-
Struggling with BIGDATA -リクルートおけるデータサイエンス/エンジニアリング-Recruit Technologies
 
新卒2年目が鍛えられたコードレビュー道場
新卒2年目が鍛えられたコードレビュー道場新卒2年目が鍛えられたコードレビュー道場
新卒2年目が鍛えられたコードレビュー道場Recruit Technologies
 
Company Recommendation for New Graduates via Implicit Feedback Multiple Matri...
Company Recommendation for New Graduates via Implicit Feedback Multiple Matri...Company Recommendation for New Graduates via Implicit Feedback Multiple Matri...
Company Recommendation for New Graduates via Implicit Feedback Multiple Matri...Recruit Technologies
 

Viewers also liked (20)

Process Scheduling on Hadoop at Expedia
Process Scheduling on Hadoop at ExpediaProcess Scheduling on Hadoop at Expedia
Process Scheduling on Hadoop at Expedia
 
BIG Data & Hadoop Applications in Social Media
BIG Data & Hadoop Applications in Social MediaBIG Data & Hadoop Applications in Social Media
BIG Data & Hadoop Applications in Social Media
 
Pinterest hadoop summit_talk
Pinterest hadoop summit_talkPinterest hadoop summit_talk
Pinterest hadoop summit_talk
 
August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieAugust 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache Oozie
 
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
 
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
 
IT業界のリーディングカンパニーとして描く「少し先の未来」〜Yahoo! JAPANの事例を通して〜#a11yfuture
IT業界のリーディングカンパニーとして描く「少し先の未来」〜Yahoo! JAPANの事例を通して〜#a11yfutureIT業界のリーディングカンパニーとして描く「少し先の未来」〜Yahoo! JAPANの事例を通して〜#a11yfuture
IT業界のリーディングカンパニーとして描く「少し先の未来」〜Yahoo! JAPANの事例を通して〜#a11yfuture
 
ユーザー企業内製CSIRTにおける対応のポイント
ユーザー企業内製CSIRTにおける対応のポイントユーザー企業内製CSIRTにおける対応のポイント
ユーザー企業内製CSIRTにおける対応のポイント
 
What i learned from translation of the sre ryuji tamagawa
What i learned from translation of the sre ryuji tamagawaWhat i learned from translation of the sre ryuji tamagawa
What i learned from translation of the sre ryuji tamagawa
 
Rakutenとsreと私 yanagimoto koichi
Rakutenとsreと私 yanagimoto koichiRakutenとsreと私 yanagimoto koichi
Rakutenとsreと私 yanagimoto koichi
 
Yahoo! JAPANのデータ基盤とHadoop #dbts2016
Yahoo! JAPANのデータ基盤とHadoop #dbts2016Yahoo! JAPANのデータ基盤とHadoop #dbts2016
Yahoo! JAPANのデータ基盤とHadoop #dbts2016
 
Yahoo! JAPANを支えるビッグデータプラットフォーム技術
Yahoo! JAPANを支えるビッグデータプラットフォーム技術Yahoo! JAPANを支えるビッグデータプラットフォーム技術
Yahoo! JAPANを支えるビッグデータプラットフォーム技術
 
Prestoクエリログの保存/分析機能の構築 #yjdsnight
Prestoクエリログの保存/分析機能の構築 #yjdsnightPrestoクエリログの保存/分析機能の構築 #yjdsnight
Prestoクエリログの保存/分析機能の構築 #yjdsnight
 
Yahoo! JAPANにおけるオンライン機械学習実例 #streamctjp
Yahoo! JAPANにおけるオンライン機械学習実例 #streamctjpYahoo! JAPANにおけるオンライン機械学習実例 #streamctjp
Yahoo! JAPANにおけるオンライン機械学習実例 #streamctjp
 
Kafka Connect(Japanese)
Kafka Connect(Japanese)Kafka Connect(Japanese)
Kafka Connect(Japanese)
 
ビックデータ処理技術の全体像とリクルートでの使い分け
ビックデータ処理技術の全体像とリクルートでの使い分けビックデータ処理技術の全体像とリクルートでの使い分け
ビックデータ処理技術の全体像とリクルートでの使い分け
 
Apache Big Data Miami 2017 - Hadoop Source Code Reading #23 #hadoopreading
Apache Big Data Miami 2017 - Hadoop Source Code Reading #23 #hadoopreadingApache Big Data Miami 2017 - Hadoop Source Code Reading #23 #hadoopreading
Apache Big Data Miami 2017 - Hadoop Source Code Reading #23 #hadoopreading
 
Struggling with BIGDATA -リクルートおけるデータサイエンス/エンジニアリング-
Struggling with BIGDATA -リクルートおけるデータサイエンス/エンジニアリング-Struggling with BIGDATA -リクルートおけるデータサイエンス/エンジニアリング-
Struggling with BIGDATA -リクルートおけるデータサイエンス/エンジニアリング-
 
新卒2年目が鍛えられたコードレビュー道場
新卒2年目が鍛えられたコードレビュー道場新卒2年目が鍛えられたコードレビュー道場
新卒2年目が鍛えられたコードレビュー道場
 
Company Recommendation for New Graduates via Implicit Feedback Multiple Matri...
Company Recommendation for New Graduates via Implicit Feedback Multiple Matri...Company Recommendation for New Graduates via Implicit Feedback Multiple Matri...
Company Recommendation for New Graduates via Implicit Feedback Multiple Matri...
 

Similar to Hadoop Platform at Yahoo: A Year in Review

Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review
Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review
Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review Sumeet Singh
 
sudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJAsudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJANicolas Poggi
 
What it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesWhat it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesDataWorks Summit
 
Hadoop and big data training
Hadoop and big data trainingHadoop and big data training
Hadoop and big data trainingagiamas
 
Hadoop - Past, Present and Future - v1.1
Hadoop - Past, Present and Future - v1.1Hadoop - Past, Present and Future - v1.1
Hadoop - Past, Present and Future - v1.1Big Data Joe™ Rossi
 
Tajo_Meetup_20141120
Tajo_Meetup_20141120Tajo_Meetup_20141120
Tajo_Meetup_20141120Hyoungjun Kim
 
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Sumeet Singh
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...Debraj GuhaThakurta
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...Debraj GuhaThakurta
 
Experience sql server on l inux and docker
Experience sql server on l inux and dockerExperience sql server on l inux and docker
Experience sql server on l inux and dockerBob Ward
 
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...E-Commerce Brasil
 
Introduction to PowerAI - The Enterprise AI Platform
Introduction to PowerAI - The Enterprise AI PlatformIntroduction to PowerAI - The Enterprise AI Platform
Introduction to PowerAI - The Enterprise AI PlatformIndrajit Poddar
 
Eric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceEric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceHortonworks
 
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...Cloudera, Inc.
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1Thanh Nguyen
 
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)Claudiu Barbura
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...Jürgen Ambrosi
 
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Lablup Inc.
 

Similar to Hadoop Platform at Yahoo: A Year in Review (20)

Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review
Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review
Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review
 
sudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJAsudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJA
 
What it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesWhat it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! Perspectives
 
Hadoop and big data training
Hadoop and big data trainingHadoop and big data training
Hadoop and big data training
 
Hadoop - Past, Present and Future - v1.1
Hadoop - Past, Present and Future - v1.1Hadoop - Past, Present and Future - v1.1
Hadoop - Past, Present and Future - v1.1
 
Tajo_Meetup_20141120
Tajo_Meetup_20141120Tajo_Meetup_20141120
Tajo_Meetup_20141120
 
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
Experience sql server on l inux and docker
Experience sql server on l inux and dockerExperience sql server on l inux and docker
Experience sql server on l inux and docker
 
Hortonworks.bdb
Hortonworks.bdbHortonworks.bdb
Hortonworks.bdb
 
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...
 
Introduction to PowerAI - The Enterprise AI Platform
Introduction to PowerAI - The Enterprise AI PlatformIntroduction to PowerAI - The Enterprise AI Platform
Introduction to PowerAI - The Enterprise AI Platform
 
Eric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceEric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers Conference
 
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1
 
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
 
Huhadoop - v1.1
Huhadoop - v1.1Huhadoop - v1.1
Huhadoop - v1.1
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
 
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
 

More from DataWorks Summit/Hadoop Summit

Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerDataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformDataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLDataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...DataWorks Summit/Hadoop Summit
 

More from DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 

Recently uploaded

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 

Recently uploaded (20)

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 

Hadoop Platform at Yahoo: A Year in Review

  • 1. HADOOP PLATFORM AT YAHOO A YEAR IN REVIEW SUMEET SINGH (@sumeetksingh) Sr. Director, Cloud and Big Data Platforms
  • 2. Agenda 2 Platform Overview1 Infrastructure and Metrics2 CaffeOnSpark for Distributed DL3 Compute and Sketches4 Oozie6 Ease of Use7 Q&A8 HBase and Omid5
  • 3. 0 100 200 300 400 500 600 700 800 0 5,000 10,000 15,000 20,000 25,000 30,000 35,000 40,000 45,000 50,000 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 RawHDFS(inPB) #Servers Year Servers Storage Yahoo! Commits to Scaling Hadoop for Production Use Research Workloads in Search and Advertising Production (Modeling) with machine learning & WebMap Revenue Systems with Security, Multi- tenancy, and SLAs Open Sourced with Apache Hortonworks Spinoff for Enterprise hardening Nextgen Hadoop (H 0.23 YARN) New Services (HBase, Storm, Spark, Hive) Increased User-base with partitioned namespaces Apache H2.7 (Scalable ML, Latency, Utilization, Productivity) Platform Evolution 3
  • 4. Deployment Models Private (dedicated) Clusters Hosted Multi-tenant (private cloud) Clusters Hosted Compute Clusters  Large demanding use cases  New technology not yet platformized  Data movement and regulation issues  When more cost effective than on- premise  Time to market/ results matter  Data already in public cloud  Source of truth for all of orgs data  App delivery agility  Operational efficiency and cost savings through economies of scale On-Premise Public Cloud Purpose-built Big Data Clusters  For performance, tighter integration with tech stack  Value added services such as monitoring, alerts, tuning and common tools 4
  • 5. Platform Today ZK DBMS MON SSHOP LOG WH TOOLS Apache / Open Source Projects Yahoo Projects HDFS HBase HCat Kafka CMS DH Pig Hive Oozie Hue GDM Big ML YARN CS MR Tez Spark Storm Services Compute Storage / Msg. Tools 5
  • 6. Technology Stack Assembly ZK DBMS MON SSHOP LOG WH TOOLS Apache Projects Yahoo Projects HDFS HBase HCat Kafka CMS DH Pig Hive Oozie Hue GDM Big ML YARN CS MR Tez Spark Storm Services Compute Storage / Msg. Tools HDFS (File System) YARN (Scheduling, Resource Management) Common RHEL6 64-bit, JDK8 Platformized Tech with Production Support In- progress, Unmet needs or Apache Alignment 6
  • 7. Common Backplane DataNode NodeManager NameNode RM DataNodes RegionServers NameNode HBase Master Nimbus Supervisor Administration, Management and Monitoring ZooKeeper Pools HTTP/HDFS/GDM Load Proxies Applications and Data Data Feeds Data Stores Oozie Server HS2/ HCat Network Backplane 7
  • 8. 0 10 20 30 Cluster 1 (2,000 servers) HDFS 12 PB Compute 23 TB Avg. Util: 26% Research Cluster Consolidation 0 20 40 60 80 ComputeTotalandUsed(TB) Cluster 3 (5,400 servers) HDFS 36 PB Compute 70 TB Avg. Util: 59% Cluster 2 (3,100 servers) HDFS 21 PB Compute 52 TB Avg. Util: 40% 0 20 40 60 One Month Sample (2015) Total Used 8
  • 9. 0 50 100 150 200 250 300 Consolidated Cluster HDFS 65 PB Compute 240 TB Avg. Util: 70% Consolidated Research Cluster Characteristics One Month Sample (2016) 40% decrease in TCO 10,500 servers 2,200 servers Before After 65% increase in compute capacity 50% increase in avg. utilization Total Used ComputeTotalandUsed(TB) 9
  • 10. Common Hadoop Cluster Configuration Rack 1 Network Backplane CPU Servers with JBODs & 10GbE Rack 2 Rack N . . . . . . . . . 10
  • 11. New Hadoop Cluster Configuration Rack 1 Network Backplane CPU Servers with JBODs & 10GbE Rack 2 Rack N 100Gbps InfiniBand GPU Servers Hi-Mem Servers . . . 11
  • 12. YARN Node Labels J2J3 J4 Queue 1, 40% Label x Queue 2, 40% Label x, y J1 Queue 3, 20% x x x x x x x x x x x x y y y y y y y y y y y y yarn.scheduler.capacity.root.<queue name>.accessible-node-labels = <label name> yarn.scheduler.capacity.root.<label name>.default-node-label-expression sets the default label asked for by queue Hadoop Cluster 12
  • 13. Agenda Platform Overview1 Infrastructure and Metrics2 CaffeOnSpark for Distributed DL3 Compute and Sketches4 Oozie6 Ease of Use7 Q&A8 HBase and Omid5 13
  • 14. CaffeOnSpark – Distributed Deep Learning CaffeOnSpark for DL MLLib for non-DL Hive or SparkSQL Spark YARN (RM and Scheduling) HDFS (Datasets) . . . 14
  • 15. Few Use Cases – Yahoo Weather 15
  • 16. Few Use Cases – Flickr Facial Recognition 16
  • 17. Few Use Cases – Flickr Scene Detection 17
  • 18. CaffeOnSpark Architecture – Common Cluster Spark Driver Caffe (enhanced with multi-GPU/CPU) Model Synchronizer (across nodes) HDFS Datasets Spark Executor (for data feeding and control) Caffe (enhanced with multi-GPU/CPU) Model Synchronizer (across nodes) HDFS Datasets Spark Executor (for data feeding and control) Caffe (enhanced with multi-GPU/CPU) Model Synchronizer (across nodes) HDFS Datasets Spark Executor (for data feeding and control) Model O/P on HDFS MPI on RDMA / TCP 18
  • 19. CaffeOnSpark Architecture – Incremental Learning cos = new CaffeOnSpark(ctx) conf = new Config(ctx, args).init() dl_train_source = DataSource.getSource(conf, true) cos.train(dl_train_source) //training DL model lr_raw_source = DataSource.getSource(conf, false) ext_df = cos.features(lr_raw_source) // extract features via DL Feature Engineering: DeepLearning 19
  • 20. CaffeOnSpark Architecture – Incremental Learning cos = new CaffeOnSpark(ctx) conf = new Config(ctx, args).init() dl_train_source = DataSource.getSource(conf, true) cos.train(dl_train_source) //training DL model lr_raw_source = DataSource.getSource(conf, false) ext_df = cos.features(lr_raw_source) // extract features via DL vlr_input=ext_df.withColumn(“L",cos.floats2doubleUDF(ext_df(conf.label)) ) .withColumn(“F",cos.floats2doublesUDF(ext_df(conf.features(0)))) lr = new LogisticRegression().setLabelCol(”L").setFeaturesCol(”F") lr_model = lr.fit(lr_input_df) … Feature Engineering: DeepLearning 20 TrainClassifiers: Non-deep Learning
  • 21. CaffeOnSpark Architecture – Single Command spark-submit --num-executors #Exes --class CaffeOnSpark my-caffe-on-spark.jar -devices #GPUs -model dl_model_file -output lr_model_file 21
  • 22. Distributed Deep Learning Apache License Existing Clusters Powerful DL Platform Fully Distributed High-level API Incremental Learning CaffeOnSpark github.com/yahoo/caffeonspark 22
  • 23. Agenda Platform Overview1 Infrastructure and Metrics2 CaffeOnSpark for Distributed DL3 Compute and Sketches4 Oozie6 Ease of Use7 Q&A8 HBase and Omid5 23
  • 24. Hadoop Compute Sources HDFS (File System and Storage) Pig (Scripting) Hive (SQL) Java MR APIs YARN (Resource Management and Scheduling) Tez (Execution Engine for Pig and Hive) Spark (Alternate Exec Engine) MapReduce (Legacy) Data Processing ML Custom App on Slider Oozie Data Management 24
  • 26. 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Pushing Batch Compute Boundaries%ofTotalCompute(memory-sec) Q1 2016 MapReduce Tez Spark 112 Million Batch Jobs in Q1’16 Jan 78% Mar 67% Mar 21% 12%Jan 8% 14% 26
  • 28. Recent Apache Storm Developments at Yahoo MT & RA Scheduler Dist. Cache API 8 x Throughput Improved Debuggability 1 github.com/yahoo/streaming-benchmarks Pacemaker Server Streaming Benchmark 1 28
  • 29. Data Sketches Algorithms Data Sketches Algorithms Library datasketches.github.io  Good enough approximate answers for problem queries  Streamable  Approximate with predictable error  Sub-linear in size  Mergeable / additive  Highly parallelizable  Maven deployable Characteristics 29
  • 30. Distinct Count Sketch, High-level View Big Data Stream Transform Data Structure Estimator Result + / - ε White Noise Basic Sketch Elements 30
  • 31. Data Sketches Algorithms Data Sketches Algorithms Library datasketches.github.io 31
  • 32. Agenda Platform Overview1 Infrastructure and Metrics2 CaffeOnSpark for Distributed DL3 Compute and Sketches4 Oozie6 Ease of Use7 Q&A8 HBase and Omid5 32
  • 33. Apache HBase at Yahoo  Security  Isolated Deployment  Multi-tenant  Region Server Group  Namespace  Unsupported Features HBase Client HBase Client JobTracker Namenode TaskTracker DataNode Namenode RegionServer DataNode RegionServer DataNode RegionServer DataNode HBase Master Zookeeper Quorum HBase Client MR Client M/R Task TaskTracker DataNode M/R Task TaskTracker DataNode MR Task Compute Cluster HBase Cluster Gateway/Launcher Rest Proxy HTTP Client 33
  • 34. Security  Authentication  Kerberos (users, processes)  Delegation Token (MapReduce, YARN, etc.)  Authorization  HBase ACLs (Read, Write, Create, Admin)  Grant permissions to User or Unix Group  ACL for Table, Column Family or Column 34
  • 35. Region Server Groups  Dedicated region servers for a set of tables  Resource Isolation (CPU, Memory, IO, etc) RegionServer Group Foo RegionServer RegionServer RegionServer Region Server 1...5 TableA TableB TableC TableD TableE TableF RegionServer Group Bar RegionServer RegionServer RegionServer Region Server 6…10 Table1 Table2 Table3 Table4 Table5 Table6 35
  • 36. Namespaces  Analogous to “Database”  Namespace ACL to create tables  Default group  Quota  Tables  Regions Namespace Group Tables Quota ACL 36
  • 37. Split Meta to Spread Load and Avoid Large Regions 37
  • 38. Favored Nodes for HDFS Locality 38
  • 40. Scaling HBase to Handle Millions of Regions on a Cluster Region Server Groups Split Meta Split ZK Favored Nodes Humongous Tables 40
  • 41. Transactions on HBase with Omid1 Highly performant and fault tolerant ACID transactional framework New Apache Incubator project incubator.apache.org/projects/omid.html Handles million of transactions per day for search and personalization products 1 Omid stands for “Hope” in Persian 41
  • 44. Agenda Platform Overview1 Infrastructure and Metrics2 CaffeOnSpark for Distributed DL3 Compute and Sketches4 Oozie6 Ease of Use7 Q&A8 HBase and Omid5 44
  • 45. Oozie Data Pipelines Oozie Message Bus HCatalog 3. Push notification <New Partition> 2. Register Topic 4. Notify New Partition Data Producer HDFS Produce data (distcp, pig, M/R..) /data/click/2014/06/02 1. Query/Poll Partition Start workflow Update metadata (ALTER TABLE click ADD PARTITION(data=‘2014/06/02’) location ’hdfs://data/click/2014/06/02’) 45
  • 46. Large Scale Data Pipeline Requirements Administrative  One should be able to start, stop and pause all related pipelines at a same time Dependency Management  Output of a coordinator “n+1” action is dependent on coordinator “n” action (dataset dependency)  If dataset has a BCP instance, workflow should run with either, whichever arrives first  Start as soon as mandatory data is available, other feeds are optional  Data is not guaranteed, start processing even if partial data is available SLA Management  Monitor pipeline processing to take immediate action in case of failures or SLA misses  Pipelines owners should get notified if an SLA is missed Multiple Providers  If data is available from multiple providers, I want to specify the provider priority  Combine datasets from multiple providers to fill the gaps in data a single provider may have 46
  • 47. Large Scale Data Pipeline Requirements Administrative  One should be able to start, stop and pause all related pipelines at a same time Dependency Management  Output of a coordinator “n+1” action is dependent on coordinator “n” action (dataset dependency)  If dataset has a BCP instance, workflow should run with either, whichever arrives first  Start as soon as mandatory data is available, other feeds are optional  Data is not guaranteed, start processing even if partial data is available SLA Management  Monitor pipeline processing to take immediate action in case of failures or SLA misses  Pipelines owners should get notified if an SLA is missed Multiple Providers  If data is available from multiple providers, I want to specify the provider priority  Combine datasets from multiple providers to fill the gaps in data a single provider may have 47
  • 48. BCP And Mandatory / Optional Feeds Pull data from A or B. Specify dataset as AorB. Action will start running as soon either dataset A or B is available. <input-logic> <or name=“AorB”> <data-in dataset="A” wait=“10”/> <data-in dataset="B"/> </or> </input-logic> Dataset B is optional, Oozie will start processing as soon as A is available. It will include dataset from A and whatever is available from B. <input-logic> <and name="optional <data-in dataset="A"/> <data-in dataset="B" min=”0”/> </and> </input-logic> 48
  • 49. Data Not Guaranteed / Priority Among Dataset Instances A will have higher precedence over B and B will have higher precedence over C. <input-logic> <or name="AorBorC"> <data-in dataset="A"/> <data-in dataset="B"/> <data-in dataset="C”/> </or> </input-logic> 49 Oozie will start processing if available A instances are >= 10. Min can also be combined with wait (as shown for dataset B). <input-logic> <data-in dataset="A" min=”10”/> <data-in dataset=“B” min =“10” wait=“20”/> </input-logic>
  • 50. Combining Dataset From Multiple Providers Combine function will first check instances from A and go to B next for whatever is missing in A. <data-in name="A" dataset="dataset_A"> <start-instance> ${coord:CURRENT(-5)} </start-instance> <end-instance> ${coord:latest(-1)} </end-instance> </data-in> <data-in name="B" dataset="dataset_B"> <start-instance>${coord:CURRENT(-5)}</start-instance> <end-instance>${coord:CURRENT(-1)}</end-instance> </data-in> <input-logic> <combine name="AB"> <data-in dataset="A"/> <data-in dataset="B"/> </combine> </input-logic> 50
  • 51. Agenda Platform Overview1 Infrastructure and Metrics2 CaffeOnSpark for Distributed DL3 Compute and Sketches4 Oozie6 Ease of Use7 Q&A8 HBase and Omid5 51
  • 52. Automated Onboarding / Collaboration Portal 52
  • 53. Built for Tenant Transparency 53
  • 55. Data Discovery and Access 55
  • 56. Audits, Compliance, and Efficiency Starling FS, Job, Task logs Cluster 1 Cluster 2 Cluster n... CF, Region, Action, Query Stats Cluster 1 Cluster 2 Cluster n... DB, Tbl., Part., Colmn. Access Stats ...MS 1 MS 2 MS n GDM Data Defn., Flow, Feed, Source F 1 F 2 F n Log Warehouse Log Sources 56
  • 57. Audits, Compliance, and Efficiency (cont’d) Data Discovery and Access Public Non-sensitive Financial $ Governance Classification No addn. reqmt. LMS Integration Stock Admin Integration Approval Flow Restricted 57
  • 58. Hosted UI – Hue as a Service WSGI Hue-1.Cluster-1 (Hot) VIPUsers HS2 Hue MySQL DB (HA) Hadoop Cluster HCat Meta Oozie Server YARN RM Web HDFS NMs WSGI Hue-2.Cluster-1 (hot) HS2 IdP SAML Auth. Serving pages and static content Cookies, saved queries, workflows etc. FullStackHA REST / Thrift (jQuery, Bootstrap, Knockout.js, Love) 58
  • 60. Increased Intelligence GBDT FTRL SGD Deep Learning Random Forests ML Libraries Click Prediction Search RankingKeyword Auctions Ad Relevance Abuse Detection Applications Proven to Work at Scale Solve Complex Problems YARN (Resource Manager) Heterogeneous Scheduling Long-running Services GPUs Large Memory Support Core Grid Enhancements … Parameter ServerGlobally Shared Parameters Compute Engines Distributed Processing … 60
  • 61. Greater Speed DeData Management Ease of Use Productivity Dimensions Real-time Pipelines Unified Metadata & Lineage Fine-grained Access Control Self-serve Data Movement SLA & Cost Transparency Intuitive UIs Planning & Collab. Tools Central Grid Portal Improvements Query times < 1 sec 4x Speedups in ETL SQL on HBase Limitless BI Clients Analytics, BI & Reporting 61
  • 62. Higher Efficiency Achieve five 9’s availability and 70% average compute utilization across clusters 62
  • 63. Hadoop Users at Yahoo Slingstone & Aviate Mail Anti-Spam Gemini Campaign Mgmt. Search Assist Audience Analytics Flickr YAM+ & Targeting Membership Abuse … and many more. 63
  • 64. Yahoo at the Apache Open Source Foundation 10 Committers (6 PMC) 3 Committers (3 PMC) 3 Committers (2 PMC) 6 Committer (5 PMC) 1 Committer 3 Committers (2 PMCs) 7 Committers (6 PMCs) 1 2 43 5 6 7 8 1 Committer 64
  • 65. Join Us @ yahoohadoop.tumblr.com 65
  • 66. THANK YOU SUMEET SINGH (@sumeetksingh) Sr. Director, Cloud and Big Data Platforms Icon Courtesy – iconfinder.com (under Creative Commons)

Editor's Notes

  1. JIRA 1976 (Oozie 4.3)
  2. While $coord:latest allows skipping to available ones, the workflow will never trigger unless mentioned number of instances are found. Min can be also combined with wait. If all dependencies are not met and if we have met MIN dependencies and then Oozie keeps on waiting for more instance till wait time elapses or all data dependencies are met.
  3. (30 secs) T: 2 min 30 secs xyz
  4. Protocols REST – Use pyhton-requests and a custom client to streamline RESTful interface calls Thrift – Custom connection pooling and socket multiplexing to streamline thrift calls Accessibility Middleware – Make Hadoop interfaces accessible in request objects Hue uses CherryPy web server. You can use the following options to change the IP address and port that the web server listens on. The default setting is port 8888 on all configured IP addresses. If you don’t specify a secret key, your session cookies will not be secure. Hue will run but it will also display error messages telling you to set the secret key. You can configure Hue to serve over HTTPS. To do so, you must install "pyOpenSSL" within Hue’s context and configure your keys.
  5. Protocols REST – Use pyhton-requests and a custom client to streamline RESTful interface calls Thrift – Custom connection pooling and socket multiplexing to streamline thrift calls Accessibility Middleware – Make Hadoop interfaces accessible in request objects Hue uses CherryPy web server. You can use the following options to change the IP address and port that the web server listens on. The default setting is port 8888 on all configured IP addresses. If you don’t specify a secret key, your session cookies will not be secure. Hue will run but it will also display error messages telling you to set the secret key. You can configure Hue to serve over HTTPS. To do so, you must install "pyOpenSSL" within Hue’s context and configure your keys.