HADOOP PLATFORM
AT YAHOO
A YEAR IN REVIEW
SUMEET SINGH (@sumeetksingh)
Sr. Director, Cloud and Big Data Platforms
Agenda
1. Platform Overview
2. Infrastructure and Metrics
3. CaffeOnSpark for Distributed DL
4. Compute and Sketches
5. HBase and Omid
6. Oozie
7. Ease of Use
8. Q&A
Platform Evolution
[Chart: number of servers (0 to 50,000) and raw HDFS storage (0 to 800 PB) by year, 2006 to 2016]
Milestones:
• Yahoo! commits to scaling Hadoop for production use
• Research workloads in Search and Advertising
• Production (modeling) with machine learning & WebMap
• Revenue systems with security, multi-tenancy, and SLAs
• Open sourced with Apache
• Hortonworks spinoff for enterprise hardening
• Nextgen Hadoop (H 0.23 YARN)
• New services (HBase, Storm, Spark, Hive)
• Increased user base with partitioned namespaces
• Apache H2.7 (scalable ML, latency, utilization, productivity)
Deployment Models
On-Premise
Private (dedicated) Clusters
• Large, demanding use cases
• New technology not yet platformized
• Data movement and regulation issues
Hosted Multi-tenant (private cloud) Clusters
• Source of truth for all of the org's data
• App delivery agility
• Operational efficiency and cost savings through economies of scale
Public Cloud
Hosted Compute Clusters
• When more cost effective than on-premise
• Time to market / results matter
• Data already in public cloud
Purpose-built Big Data Clusters
• For performance, tighter integration with tech stack
• Value-added services such as monitoring, alerts, tuning, and common tools
Platform Today
• Services: ZK, DBMS, MON, SSHOP, LOG, WH, TOOLS
• Tools: Pig, Hive, Oozie, Hue, GDM, Big ML
• Compute: YARN, CS, MR, Tez, Spark, Storm
• Storage / Msg.: HDFS, HBase, HCat, Kafka, CMS, DH
Legend: Apache / Open Source Projects; Yahoo Projects
Technology Stack Assembly
• Services: ZK, DBMS, MON, SSHOP, LOG, WH, TOOLS
• Tools: Pig, Hive, Oozie, Hue, GDM, Big ML
• Compute: YARN, CS, MR, Tez, Spark, Storm
• Storage / Msg.: HDFS, HBase, HCat, Kafka, CMS, DH
• HDFS (File System), YARN (Scheduling, Resource Management), Common
• RHEL6 64-bit, JDK8
Legend: Apache Projects; Yahoo Projects; Platformized Tech with Production Support; In-progress, Unmet needs or Apache Alignment
Common Backplane
[Diagram: components attached to a shared Network Backplane]
• NameNode, RM, DataNode, NodeManager
• NameNode, HBase Master, DataNodes, RegionServers
• Nimbus, Supervisor
• Administration, Management and Monitoring
• ZooKeeper Pools
• HTTP/HDFS/GDM Load Proxies
• Applications and Data: Data Feeds, Data Stores
• Oozie Server
• HS2/HCat
Research Cluster Consolidation
[Charts: compute total and used (TB) per cluster, one month sample (2015)]
• Cluster 1 (2,000 servers): HDFS 12 PB, Compute 23 TB, Avg. Util: 26%
• Cluster 2 (3,100 servers): HDFS 21 PB, Compute 52 TB, Avg. Util: 40%
• Cluster 3 (5,400 servers): HDFS 36 PB, Compute 70 TB, Avg. Util: 59%
Consolidated Research Cluster Characteristics
[Chart: compute total and used (TB), one month sample (2016)]
• Consolidated Cluster: HDFS 65 PB, Compute 240 TB, Avg. Util: 70%
• Servers: 10,500 before, 2,200 after
• 40% decrease in TCO
• 65% increase in compute capacity
• 50% increase in avg. utilization
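The capacity and utilization gains quoted for the consolidated cluster can be checked against the per-cluster figures from the 2015 sample. The following is a rough sketch, assuming the "before" utilization is a compute-weighted average of the three clusters; the slides do not say how the 50% figure was derived, and the function and key names are illustrative only.

```python
# Figures copied from the slides: (compute in TB, average utilization).
BEFORE = {
    "Cluster 1": (23, 0.26),
    "Cluster 2": (52, 0.40),
    "Cluster 3": (70, 0.59),
}
AFTER_COMPUTE_TB = 240
AFTER_UTIL = 0.70


def summarize(before, after_compute_tb, after_util):
    """Compare the consolidated cluster with the three original clusters."""
    total_before = sum(tb for tb, _ in before.values())
    used_before = sum(tb * util for tb, util in before.values())
    # Assumption: weight each cluster's utilization by its compute capacity.
    weighted_util = used_before / total_before
    return {
        "total_before_tb": total_before,
        "capacity_increase": after_compute_tb / total_before - 1,
        "weighted_util_before": weighted_util,
        "util_increase": after_util / weighted_util - 1,
    }


summary = summarize(BEFORE, AFTER_COMPUTE_TB, AFTER_UTIL)
for key, value in summary.items():
    print(f"{key}: {value:.3f}")
```

Under that assumption, the combined compute before consolidation is 145 TB, so 240 TB is an increase of about 65.5%, and the weighted utilization rises from about 47% to 70%, an increase of about 49%. Both are in line with the rounded figures on the slide.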
Common Hadoop Cluster Configuration
[Diagram: Rack 1, Rack 2, ... Rack N of CPU servers with JBODs & 10GbE, connected by a Network Backplane]
New Hadoop Cluster Configuration
[Diagram: Rack 1, Rack 2, ... Rack N of CPU servers with JBODs & 10GbE on the Network Backplane, plus 100Gbps InfiniBand, GPU Servers, and Hi-Mem Servers]
YARN Node Labels
[Diagram: a Hadoop cluster whose nodes are labeled x or y; jobs J1 to J4 are submitted to three queues]
• Queue 1, 40%: Label x
• Queue 2, 40%: Label x, y
• Queue 3, 20%
yarn.scheduler.capacity.root.<queue name>.accessible-node-labels = <label name>
yarn.scheduler.capacity.root.<queue name>.default-node-label-expression sets the default label asked for by the queue
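The two capacity-scheduler properties above can be generated for the three queues in the diagram. This is a minimal sketch, not a complete configuration: the queue names queue1, queue2, and queue3 and the default expression for queue1 are hypothetical, and per-label capacity settings that a real deployment also needs are left out.

```python
def node_label_properties(queues):
    """Build capacity-scheduler property names and values for child queues of root.

    `queues` maps a queue name to a dict with:
      capacity: share of the parent queue, in percent
      labels:   node labels the queue may access (may be empty)
      default:  optional default node label expression
    """
    props = {"yarn.scheduler.capacity.root.queues": ",".join(queues)}
    for name, spec in queues.items():
        prefix = f"yarn.scheduler.capacity.root.{name}"
        props[f"{prefix}.capacity"] = str(spec["capacity"])
        if spec.get("labels"):
            props[f"{prefix}.accessible-node-labels"] = ",".join(spec["labels"])
        if spec.get("default"):
            props[f"{prefix}.default-node-label-expression"] = spec["default"]
    return props


# Hypothetical queue names mirroring Queue 1, Queue 2, and Queue 3 on the slide.
QUEUES = {
    "queue1": {"capacity": 40, "labels": ["x"], "default": "x"},  # default is assumed
    "queue2": {"capacity": 40, "labels": ["x", "y"]},
    "queue3": {"capacity": 20, "labels": []},
}

properties = node_label_properties(QUEUES)
for key in sorted(properties):
    print(f"{key} = {properties[key]}")
```

Keeping the queue definitions in one structure makes it easy to check that the capacities of sibling queues add up to 100 before the properties are written to the scheduler configuration.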
CaffeOnSpark – Distributed Deep Learning
[Diagram: software stack, top to bottom]
• CaffeOnSpark for DL, MLLib for non-DL, Hive or SparkSQL, ...
• Spark
• YARN (RM and Scheduling)
• HDFS (Datasets)
Few Use Cases – Yahoo Weather
Few Use Cases – Flickr Facial Recognition
Few Use Cases – Flickr Scene Detection
CaffeOnSpark Architecture – Common Cluster
[Diagram: a Spark Driver and several identical Spark Executors]
• Each Spark Executor (for data feeding and control) hosts Caffe (enhanced with multi-GPU/CPU) and a Model Synchronizer (across nodes), and reads datasets from HDFS
• MPI on RDMA / TCP
• Model O/P on HDFS
CaffeOnSpark Architecture – Incremental Learning
// Feature engineering: deep learning
val cos = new CaffeOnSpark(ctx)
val conf = new Config(ctx, args).init()
val dl_train_source = DataSource.getSource(conf, true)
cos.train(dl_train_source) // training DL model
val lr_raw_source = DataSource.getSource(conf, false)
val ext_df = cos.features(lr_raw_source) // extract features via DL
// Train classifiers: non-deep learning
val lr_input_df = ext_df
  .withColumn("L", cos.floats2doubleUDF(ext_df(conf.label)))
  .withColumn("F", cos.floats2doublesUDF(ext_df(conf.features(0))))
val lr = new LogisticRegression().setLabelCol("L").setFeaturesCol("F")
val lr_model = lr.fit(lr_input_df)
…
CaffeOnSpark Architecture – Single Command
spark-submit
    --num-executors #Exes
    --class CaffeOnSpark
    my-caffe-on-spark.jar
    -devices #GPUs
    -model dl_model_file
    -output lr_model_file
Distributed Deep Learning
CaffeOnSpark: github.com/yahoo/caffeonspark
• Apache License
• Existing Clusters
• Powerful DL Platform
• Fully Distributed
• High-level API
• Incremental Learning
Hadoop Compute Sources
[Diagram: compute stack]
• Data Processing: Pig (Scripting), Hive (SQL), Java MR APIs
• ML
• Custom App on Slider
• Oozie, Data Management
• Tez (Execution Engine for Pig and Hive), Spark (Alternate Exec Engine), MapReduce (Legacy)
• YARN (Resource Management and Scheduling)
• HDFS (File System and Storage)
Compute Growth
[Chart: # MR, Tez, Spark jobs (in millions) per month, Mar-13 to Mar-16; labeled data points: 13.3, 20.4, 23.8, 27.2, 32.3, 34.1, 39.1]
Pushing Batch Compute Boundaries
[Chart: % of total compute (memory-sec) by engine, Q1 2016]
112 Million Batch Jobs in Q1'16
• MapReduce: Jan 78%, Mar 67%
• Tez: Jan 8%, Mar 21%
• Spark: Jan 14%, Mar 12%
Multi-tenant Apache Storm
Recent Apache Storm Developments at Yahoo
• MT & RA Scheduler
• Dist. Cache API
• 8x Throughput
• Improved Debuggability
• Pacemaker Server
• Streaming Benchmark (1)
(1) github.com/yahoo/streaming-benchmarks
Data Sketches Algorithms
Data Sketches Algorithms Library: datasketches.github.io
Characteristics
• Good enough approximate answers for problem queries
• Streamable
• Approximate with predictable error
• Sub-linear in size
• Mergeable / additive
• Highly parallelizable
• Maven deployable
Distinct Count Sketch, High-level View
Basic Sketch Elements: Big Data Stream → Transform (White Noise) → Data Structure → Estimator → Result +/- ε
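The transform, data structure, and estimator stages can be illustrated with a textbook K-Minimum-Values (KMV) distinct-count sketch. This is a minimal sketch under stated assumptions, not the DataSketches library's implementation: the class name KMVSketch, the SHA-256 based hash, and the default k = 256 are all choices made for illustration.

```python
import hashlib
import heapq


class KMVSketch:
    """Keeps only the k smallest hash values seen, so its size is bounded by k."""

    def __init__(self, k=256):
        self.k = k
        self._values = set()

    @staticmethod
    def _hash01(item):
        # Transform: map each item to a pseudo-random point in [0, 1).
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def update(self, item):
        # Data structure: retain the k smallest distinct hash values.
        h = self._hash01(item)
        if h in self._values:
            return
        if len(self._values) < self.k:
            self._values.add(h)
            return
        largest = max(self._values)
        if h < largest:
            self._values.remove(largest)
            self._values.add(h)

    def estimate(self):
        # Estimator: exact below k items, otherwise (k - 1) / k-th smallest hash.
        if len(self._values) < self.k:
            return float(len(self._values))
        return (self.k - 1) / max(self._values)

    def merge(self, other):
        # Mergeable: the k smallest values of the union summarize both streams.
        merged = KMVSketch(min(self.k, other.k))
        merged._values = set(
            heapq.nsmallest(merged.k, self._values | other._values)
        )
        return merged


if __name__ == "__main__":
    left, right = KMVSketch(), KMVSketch()
    for i in range(10_000):
        left.update(f"user-{i}")
    for i in range(5_000, 15_000):
        right.update(f"user-{i}")
    print(round(left.merge(right).estimate()))
```

The merged sketch holds the same values as a sketch built over the combined stream, which is what makes the approach parallelizable: partitions can be sketched independently and combined afterwards. The relative error shrinks roughly as 1/sqrt(k), so accuracy is traded against a fixed memory budget.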
Apache HBase at Yahoo
• Security
• Isolated Deployment
• Multi-tenant
  • Region Server Group
  • Namespace
• Unsupported Features
[Diagram: a Compute Cluster (Gateway/Launcher, MR Client, HBase Client, JobTracker, Namenode, TaskTrackers with DataNodes running M/R tasks) accessing a separate HBase Cluster (HBase Master, Zookeeper Quorum, Namenode, RegionServers with DataNodes); HTTP clients connect through a Rest Proxy]
Security
• Authentication
  • Kerberos (users, processes)
  • Delegation Token (MapReduce, YARN, etc.)
• Authorization
  • HBase ACLs (Read, Write, Create, Admin)
  • Grant permissions to User or Unix Group
  • ACL for Table, Column Family or Column
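The authorization bullets describe grants made to a user or Unix group at table, column family, or column scope. The toy model below sketches how such scoped grants could be evaluated. It is not HBase's access controller: the Grant type, the is_allowed function, the single-letter action codes, and the "@" prefix used here to mark group principals are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Grant:
    principal: str                   # user name, or "@group" for a Unix group
    actions: FrozenSet[str]          # subset of {"R", "W", "C", "A"}
    table: str
    family: Optional[str] = None     # None covers every column family
    qualifier: Optional[str] = None  # None covers every column in the family


def is_allowed(grants: Iterable[Grant], user: str, groups: Iterable[str],
               action: str, table: str, family: str, qualifier: str) -> bool:
    """Return True if any grant for the user or their groups covers the cell."""
    principals = {user} | {f"@{group}" for group in groups}
    for grant in grants:
        if grant.principal not in principals or action not in grant.actions:
            continue
        if grant.table != table:
            continue
        if grant.family is not None and grant.family != family:
            continue
        if grant.qualifier is not None and grant.qualifier != qualifier:
            continue
        return True
    return False


# Hypothetical grants: alice may read and write all of TableA, while members
# of the "ads" group may only read one column.
GRANTS = [
    Grant("alice", frozenset("RW"), "TableA"),
    Grant("@ads", frozenset("R"), "TableA", "cf1", "clicks"),
]

print(is_allowed(GRANTS, "bob", ["ads"], "R", "TableA", "cf1", "clicks"))
```

A broader grant (table scope) covers every narrower request beneath it, while a column-scoped grant covers only that column, which is the behavior the scope list on the slide implies.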
Region Server Groups
• Dedicated region servers for a set of tables
• Resource isolation (CPU, memory, IO, etc.)
[Diagram: two region server groups]
• RegionServer Group Foo: Region Servers 1...5 hosting TableA, TableB, TableC, TableD, TableE, TableF
• RegionServer Group Bar: Region Servers 6...10 hosting Table1, Table2, Table3, Table4, Table5, Table6
Namespaces
• Analogous to “Database”
• Namespace ACL to create tables
• Default group
• Quota
  • Tables
  • Regions
[Diagram: a Namespace with Group, Tables, Quota, and ACL]
Split Meta to Spread Load and Avoid Large Regions
Favored Nodes for HDFS Locality
Humongous Tables
Scaling HBase to Handle Millions of Regions on a Cluster
• Region Server Groups
• Split Meta
• Split ZK
• Favored Nodes
• Humongous Tables
Transactions on HBase with Omid (1)
• Highly performant and fault-tolerant ACID transactional framework
• New Apache Incubator project: incubator.apache.org/projects/omid.html
• Handles millions of transactions per day for search and personalization products
(1) Omid stands for “Hope” in Persian
Omid Components
Omid Data Model
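Oozie Data Pipelines
[Diagram: Data Producer, HDFS, HCatalog, Message Bus, and Oozie]
• The data producer produces data (distcp, pig, M/R..) on HDFS at /data/click/2014/06/02 and updates metadata in HCatalog:
  ALTER TABLE click ADD PARTITION(data='2014/06/02') location 'hdfs://data/click/2014/06/02'
1. Oozie queries/polls HCatalog for the partition
2. Oozie registers a topic with the Message Bus
3. HCatalog pushes a <New Partition> notification to the Message Bus
4. The Message Bus notifies Oozie of the new partition
• Oozie starts the workflow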
Large Scale Data Pipeline Requirements
Administrative
• One should be able to start, stop, and pause all related pipelines at the same time
Dependency Management
• Output of a coordinator “n+1” action is dependent on coordinator “n” action (dataset dependency)
• If a dataset has a BCP instance, the workflow should run with either, whichever arrives first
• Start as soon as mandatory data is available; other feeds are optional
• Data is not guaranteed; start processing even if only partial data is available
SLA Management
• Monitor pipeline processing to take immediate action in case of failures or SLA misses
• Pipeline owners should get notified if an SLA is missed
Multiple Providers
• If data is available from multiple providers, I want to specify the provider priority
• Combine datasets from multiple providers to fill the gaps in data a single provider may have
BCP And Mandatory / Optional Feeds
Pull data from A or B. Specify the dataset as AorB. The action will start running as soon as either dataset A or B is available.
<input-logic>
  <or name="AorB">
    <data-in dataset="A" wait="10"/>
    <data-in dataset="B"/>
  </or>
</input-logic>
Dataset B is optional. Oozie will start processing as soon as A is available. It will include the dataset from A and whatever is available from B.
<input-logic>
  <and name="optional">
    <data-in dataset="A"/>
    <data-in dataset="B" min="0"/>
  </and>
</input-logic>
Data Not Guaranteed / Priority Among Dataset Instances
A will have higher precedence over B, and B will have higher precedence over C.
<input-logic>
  <or name="AorBorC">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
    <data-in dataset="C"/>
  </or>
</input-logic>
Oozie will start processing if available A instances are >= 10. Min can also be combined with wait (as shown for dataset B).
<input-logic>
  <data-in dataset="A" min="10"/>
  <data-in dataset="B" min="10" wait="20"/>
</input-logic>
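The precedence and minimum-instance rules described above can be modeled in a few lines. This is a toy model of the stated behavior, not Oozie's implementation: the function names are hypothetical, availability is reduced to a count of resolved instances per dataset, and the wait attribute is not modeled.

```python
from typing import Dict, List, Optional


def pick_by_priority(available: Dict[str, int], priority: List[str],
                     min_instances: int = 1) -> Optional[str]:
    """Model of <or>: return the first dataset, in priority order, that is ready."""
    for name in priority:
        if available.get(name, 0) >= min_instances:
            return name
    return None


def ready_with_minimums(available: Dict[str, int],
                        minimums: Dict[str, int]) -> bool:
    """Model of min: every dataset must have at least its minimum instance count.

    A minimum of 0 makes a dataset optional.
    """
    return all(available.get(name, 0) >= need for name, need in minimums.items())


# Dataset A has not arrived, so B is chosen even though C is also available.
print(pick_by_priority({"A": 0, "B": 3, "C": 5}, ["A", "B", "C"]))

# B is optional (min 0), so processing can start once A is available.
print(ready_with_minimums({"A": 1, "B": 0}, {"A": 1, "B": 0}))
```

Treating an optional feed as a minimum of zero keeps both cases in one rule: the action becomes ready once every dataset meets its own threshold.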
Combining Dataset From Multiple Providers
The combine function will first check instances from A and go to B next for whatever is missing in A.
<data-in name="A" dataset="dataset_A">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:latest(-1)}</end-instance>
</data-in>
<data-in name="B" dataset="dataset_B">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:current(-1)}</end-instance>
</data-in>
<input-logic>
  <combine name="AB">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
  </combine>
</input-logic>
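The gap-filling behavior of combine can be sketched as a per-instance lookup across providers in order. This is an illustrative model only, not Oozie's code: the combine function below and its arguments are hypothetical, and instances are identified by their offsets, mirroring the range from current(-5) to current(-1).

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple


def combine(wanted: Iterable[int],
            providers: List[Tuple[str, Set[int]]]) -> Dict[int, Optional[str]]:
    """For each wanted instance, pick the first provider (in order) that has it."""
    resolved = {}
    for instance in wanted:
        resolved[instance] = next(
            (name for name, have in providers if instance in have), None
        )
    return resolved


# Provider A is missing instances -4 and -2; provider B fills -4 but not -2.
wanted_instances = range(-5, 0)
result = combine(
    wanted_instances,
    [("A", {-5, -3, -1}), ("B", {-5, -4, -3})],
)
for instance, provider in result.items():
    print(instance, provider)
```

Instances that no provider can supply map to None, which makes remaining gaps explicit before processing starts.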
Automated Onboarding / Collaboration Portal
Built for Tenant Transparency
Queue Utilization Dashboard
Data Discovery and Access
Audits, Compliance, and Efficiency
[Diagram: Log Sources feeding a Log Warehouse]
• Starling
• FS, Job, Task logs (Cluster 1, Cluster 2, ... Cluster n)
• CF, Region, Action, Query Stats (Cluster 1, Cluster 2, ... Cluster n)
• DB, Tbl., Part., Colmn. Access Stats (MS 1, MS 2, ... MS n)
• GDM
• Data Defn., Flow, Feed, Source (F 1, F 2, ... F n)
Audits, Compliance, and Efficiency (cont’d)
Data Discovery and Access
• Governance Classification: Public, Non-sensitive, Financial $, Restricted
• No addn. reqmt., LMS Integration, Stock Admin Integration, Approval Flow
Hosted UI – Hue as a Service
[Diagram: full-stack HA deployment of Hue]
• Users reach Hue through a VIP, with SAML Auth. against an IdP
• Two hot instances, Hue-1.Cluster-1 and Hue-2.Cluster-1, each running under WSGI and serving pages and static content (jQuery, Bootstrap, Knockout.js, Love)
• Hue MySQL DB (HA): cookies, saved queries, workflows etc.
• REST / Thrift to the Hadoop Cluster: HS2, HCat Meta, Oozie Server, YARN RM, WebHDFS, NMs
Going Forward
• Increased Intelligence
• Greater Speed
• Higher Efficiency
• Necessary Scale
Increased Intelligence
• Applications: Click Prediction, Search Ranking, Keyword Auctions, Ad Relevance, Abuse Detection
• ML Libraries: GBDT, FTRL, SGD, Deep Learning, Random Forests, …
• Proven to Work at Scale; Solve Complex Problems
• Compute Engines: Distributed Processing, Parameter Server (Globally Shared Parameters), …
• Core Grid Enhancements: YARN (Resource Manager) with Heterogeneous Scheduling, Long-running Services, GPUs, Large Memory Support
Greater Speed
Productivity Dimensions and Improvements
• Data Management: Real-time Pipelines, Unified Metadata & Lineage, Fine-grained Access Control, Self-serve Data Movement
• Ease of Use: SLA & Cost Transparency, Intuitive UIs, Planning & Collab. Tools, Central Grid Portal
• Analytics, BI & Reporting: Query times < 1 sec, 4x Speedups in ETL, SQL on HBase, Limitless BI Clients
Higher Efficiency
Achieve five 9’s availability and 70% average compute utilization across clusters
Hadoop Users at Yahoo
Slingstone & Aviate, Mail Anti-Spam, Gemini Campaign Mgmt., Search Assist, Audience Analytics, Flickr, YAM+ & Targeting, Membership Abuse, … and many more.
Yahoo at the Apache Open Source Foundation
• 10 Committers (6 PMC)
• 3 Committers (3 PMC)
• 3 Committers (2 PMC)
• 6 Committers (5 PMC)
• 1 Committer
• 3 Committers (2 PMC)
• 7 Committers (6 PMC)
• 1 Committer
Join Us @ yahoohadoop.tumblr.com
THANK YOU
Icon Courtesy – iconfinder.com (under Creative Commons)

Weitere ähnliche Inhalte

Was ist angesagt?

Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Sumeet Singh
 
Hive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesHive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesDataWorks Summit
 
Hadoop in the Cloud: Real World Lessons from Enterprise Customers
Hadoop in the Cloud: Real World Lessons from Enterprise CustomersHadoop in the Cloud: Real World Lessons from Enterprise Customers
Hadoop in the Cloud: Real World Lessons from Enterprise CustomersDataWorks Summit/Hadoop Summit
 
Hadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezHadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezJan Pieter Posthuma
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecasesudhakara st
 
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRHadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRDouglas Bernardini
 
Integration of HIve and HBase
Integration of HIve and HBaseIntegration of HIve and HBase
Integration of HIve and HBaseHortonworks
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Uwe Printz
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce FrameworkEdureka!
 
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production SuccessAllen Day, PhD
 
Hadoop Architecture and HDFS
Hadoop Architecture and HDFSHadoop Architecture and HDFS
Hadoop Architecture and HDFSEdureka!
 
YARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute PlatformYARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute PlatformBikas Saha
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideDanairat Thanabodithammachari
 
What it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesWhat it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesDataWorks Summit
 
Benchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataBenchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataNicolas Poggi
 
Basics of big data analytics hadoop
Basics of big data analytics hadoopBasics of big data analytics hadoop
Basics of big data analytics hadoopAmbuj Kumar
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作James Chen
 
Hadoop Developer
Hadoop DeveloperHadoop Developer
Hadoop DeveloperEdureka!
 

Was ist angesagt? (20)

Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
 
Hive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesHive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenches
 
Hadoop in the Cloud: Real World Lessons from Enterprise Customers
Hadoop in the Cloud: Real World Lessons from Enterprise CustomersHadoop in the Cloud: Real World Lessons from Enterprise Customers
Hadoop in the Cloud: Real World Lessons from Enterprise Customers
 
Hadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezHadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to Tez
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
 
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRHadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
 
Big Data Benchmarking
Big Data BenchmarkingBig Data Benchmarking
Big Data Benchmarking
 
Integration of HIve and HBase
Integration of HIve and HBaseIntegration of HIve and HBase
Integration of HIve and HBase
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
 
Hadoop Family and Ecosystem
Hadoop Family and EcosystemHadoop Family and Ecosystem
Hadoop Family and Ecosystem
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce Framework
 
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
 
Hadoop Architecture and HDFS
Hadoop Architecture and HDFSHadoop Architecture and HDFS
Hadoop Architecture and HDFS
 
YARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute PlatformYARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute Platform
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guide
 
What it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesWhat it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! Perspectives
 
Benchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataBenchmarking Hadoop and Big Data
Benchmarking Hadoop and Big Data
 
Basics of big data analytics hadoop
Basics of big data analytics hadoopBasics of big data analytics hadoop
Basics of big data analytics hadoop
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作
 
Hadoop Developer
Hadoop DeveloperHadoop Developer
Hadoop Developer
 

Andere mochten auch

Hadoop Summit Europe 2015 - YARN Present and Future
Hadoop Summit Europe 2015 - YARN Present and FutureHadoop Summit Europe 2015 - YARN Present and Future
Hadoop Summit Europe 2015 - YARN Present and FutureVinod Kumar Vavilapalli
 
Big data hadoop ecosystem and nosql
Big data hadoop ecosystem and nosqlBig data hadoop ecosystem and nosql
Big data hadoop ecosystem and nosqlKhanderao Kand
 
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and Future
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and FutureHadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and Future
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and FutureVinod Kumar Vavilapalli
 
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of HadoopApache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of HadoopHortonworks
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsHortonworks
 
Solidos exercicios resolvidos
Solidos exercicios resolvidosSolidos exercicios resolvidos
Solidos exercicios resolvidosHelena Borralho
 

Andere mochten auch (10)

Hadoop Summit Europe 2015 - YARN Present and Future
Hadoop Summit Europe 2015 - YARN Present and FutureHadoop Summit Europe 2015 - YARN Present and Future
Hadoop Summit Europe 2015 - YARN Present and Future
 
Big data hadoop ecosystem and nosql
Big data hadoop ecosystem and nosqlBig data hadoop ecosystem and nosql
Big data hadoop ecosystem and nosql
 
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and Future
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and FutureHadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and Future
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and Future
 
Setting up Hadoop YARN Clustering
Setting up Hadoop YARN ClusteringSetting up Hadoop YARN Clustering
Setting up Hadoop YARN Clustering
 
Apache Hadoop 3.0 What's new in YARN and MapReduce
Apache Hadoop 3.0 What's new in YARN and MapReduceApache Hadoop 3.0 What's new in YARN and MapReduce
Apache Hadoop 3.0 What's new in YARN and MapReduce
 
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of HadoopApache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data Applications
 
Apache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and FutureApache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and Future
 
Solidos exercicios resolvidos
Solidos exercicios resolvidosSolidos exercicios resolvidos
Solidos exercicios resolvidos
 
Hadoop 1.x vs 2
Hadoop 1.x vs 2Hadoop 1.x vs 2
Hadoop 1.x vs 2
 

Ähnlich wie Hadoop Platform at Yahoo: A Year in Review

sudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJAsudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJANicolas Poggi
 
Experience sql server on l inux and docker
Experience sql server on l inux and dockerExperience sql server on l inux and docker
Experience sql server on l inux and dockerBob Ward
 
Hadoop - Past, Present and Future - v1.1
Hadoop - Past, Present and Future - v1.1Hadoop - Past, Present and Future - v1.1
Hadoop - Past, Present and Future - v1.1Big Data Joe™ Rossi
 
Hadoop and big data training
Hadoop and big data trainingHadoop and big data training
Hadoop and big data trainingagiamas
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...Jürgen Ambrosi
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)outstanding59
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldRichard McDougall
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)outstanding59
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...Debraj GuhaThakurta
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...Debraj GuhaThakurta
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaCloudera, Inc.
 
Tajo_Meetup_20141120
Tajo_Meetup_20141120Tajo_Meetup_20141120
Tajo_Meetup_20141120Hyoungjun Kim
 
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...Yahoo Developer Network
 
Brk2051 sql server on linux and docker
Brk2051 sql server on linux and dockerBrk2051 sql server on linux and docker
Brk2051 sql server on linux and dockerBob Ward
 
Spark meetup v2.0.5
Spark meetup v2.0.5Spark meetup v2.0.5
Spark meetup v2.0.5Yan Zhou
 

Ähnlich wie Hadoop Platform at Yahoo: A Year in Review (20)

Hadoop Platform at Yahoo
Hadoop Platform at YahooHadoop Platform at Yahoo
Hadoop Platform at Yahoo
 
sudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJAsudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJA
 
Experience sql server on l inux and docker
Experience sql server on l inux and dockerExperience sql server on l inux and docker
Hadoop Platform at Yahoo: A Year in Review

  • 1. HADOOP PLATFORM AT YAHOO A YEAR IN REVIEW SUMEET SINGH (@sumeetksingh) Sr. Director, Cloud and Big Data Platforms
  • 2. Agenda: 1. Platform Overview; 2. Infrastructure and Metrics; 3. CaffeOnSpark for Distributed DL; 4. Compute and Sketches; 5. HBase and Omid; 6. Oozie; 7. Ease of Use; 8. Q&A
  • 3. Platform Evolution (chart: number of servers and raw HDFS storage in PB by year, 2006 to 2016). Milestones: Yahoo! commits to scaling Hadoop for production use; research workloads in Search and Advertising; production (modeling) with machine learning & WebMap; revenue systems with security, multi-tenancy, and SLAs; open sourced with Apache; Hortonworks spinoff for enterprise hardening; next-gen Hadoop (H 0.23 YARN); new services (HBase, Storm, Spark, Hive); increased user base with partitioned namespaces; Apache H2.7 (scalable ML, latency, utilization, productivity)
  • 4. Deployment Models (On-Premise and Public Cloud). Private (dedicated) clusters: large demanding use cases; new technology not yet platformized; data movement and regulation issues. Hosted multi-tenant (private cloud) clusters: source of truth for all of the org's data; app delivery agility; operational efficiency and cost savings through economies of scale. Hosted compute clusters: when more cost-effective than on-premise; time to market/results matter; data already in public cloud. Purpose-built big data clusters: for performance and tighter integration with the tech stack; value-added services such as monitoring, alerts, tuning, and common tools
  • 5. Platform Today (stack diagram of Apache / open source projects and Yahoo projects). Services: ZK, DBMS, MON, SSHOP, LOG, WH, TOOLS. Tools: Pig, Hive, Oozie, Hue, GDM, Big ML. Compute: YARN, CS, MR, Tez, Spark, Storm. Storage / Msg.: HDFS, HBase, HCat, Kafka, CMS, DH
  • 6. Technology Stack Assembly (stack diagram of Apache projects and Yahoo projects). Services: ZK, DBMS, MON, SSHOP, LOG, WH, TOOLS. Tools: Pig, Hive, Oozie, Hue, GDM, Big ML. Compute: YARN, CS, MR, Tez, Spark, Storm. Storage / Msg.: HDFS, HBase, HCat, Kafka, CMS, DH. Base layers: HDFS (file system); YARN (scheduling, resource management); Common; RHEL6 64-bit, JDK8. Legend: platformized tech with production support; in-progress, unmet needs, or Apache alignment
  • 7. Common Backplane (diagram). Components on the network backplane: DataNode, NodeManager, NameNode, RM; DataNodes, RegionServers, NameNode, HBase Master; Nimbus, Supervisor; Administration, Management and Monitoring; ZooKeeper pools; HTTP/HDFS/GDM load proxies; applications and data; data feeds; data stores; Oozie server; HS2/HCat
  • 8. Research Cluster Consolidation (charts: total and used compute in TB, one-month sample, 2015). Cluster 1 (2,000 servers): HDFS 12 PB, compute 23 TB, avg. util. 26%. Cluster 2 (3,100 servers): HDFS 21 PB, compute 52 TB, avg. util. 40%. Cluster 3 (5,400 servers): HDFS 36 PB, compute 70 TB, avg. util. 59%
  • 9. Consolidated Research Cluster Characteristics (chart: total and used compute in TB, one-month sample, 2016). Consolidated cluster: HDFS 65 PB, compute 240 TB, avg. util. 70%. Before: 10,500 servers; after: 2,200 servers. 40% decrease in TCO; 65% increase in compute capacity; 50% increase in avg. utilization
  • 10. Common Hadoop Cluster Configuration (diagram): racks 1 through N of CPU servers with JBODs & 10GbE on a network backplane
  • 11. New Hadoop Cluster Configuration (diagram): racks 1 through N of CPU servers with JBODs & 10GbE on a network backplane, with GPU servers, hi-mem servers, and 100Gbps InfiniBand added
  • 12. YARN Node Labels (diagram: a Hadoop cluster whose nodes carry label x or y, running jobs J1 to J4). Queue 1, 40%, label x; Queue 2, 40%, labels x, y; Queue 3, 20%. yarn.scheduler.capacity.root.<queue name>.accessible-node-labels = <label name> sets the labels a queue may access; yarn.scheduler.capacity.root.<queue name>.default-node-label-expression sets the default label requested by the queue
  • 14. CaffeOnSpark – Distributed Deep Learning (stack diagram): CaffeOnSpark for DL, MLlib for non-DL, and Hive or SparkSQL, on Spark; YARN (RM and scheduling); HDFS (datasets)
  • 15. Few Use Cases – Yahoo Weather
  • 16. Few Use Cases – Flickr Facial Recognition
  • 17. Few Use Cases – Flickr Scene Detection
  • 18. CaffeOnSpark Architecture – Common Cluster (diagram): a Spark driver and several Spark executors (for data feeding and control); each executor runs Caffe (enhanced with multi-GPU/CPU) and a model synchronizer (across nodes) and reads HDFS datasets; MPI on RDMA / TCP; model output on HDFS
  • 19. CaffeOnSpark Architecture – Incremental Learning. Feature engineering with deep learning, then classifier training with non-deep learning:
        val cos = new CaffeOnSpark(ctx)
        val conf = new Config(ctx, args).init()
        val dl_train_source = DataSource.getSource(conf, true)
        cos.train(dl_train_source) // train the DL model
        val lr_raw_source = DataSource.getSource(conf, false)
        val ext_df = cos.features(lr_raw_source) // extract features via DL
        val lr_input_df = ext_df.withColumn("L", cos.floats2doubleUDF(ext_df(conf.label)))
          .withColumn("F", cos.floats2doublesUDF(ext_df(conf.features(0))))
        val lr = new LogisticRegression().setLabelCol("L").setFeaturesCol("F")
        val lr_model = lr.fit(lr_input_df)
        …
  • 20. CaffeOnSpark Architecture – Single Command: spark-submit --num-executors #Exes --class CaffeOnSpark my-caffe-on-spark.jar -devices #GPUs -model dl_model_file -output lr_model_file
  • 21. CaffeOnSpark (github.com/yahoo/caffeonspark): distributed deep learning; Apache license; existing clusters; powerful DL platform; fully distributed; high-level API; incremental learning
  • 23. Hadoop Compute Sources (stack diagram): HDFS (file system and storage); YARN (resource management and scheduling); Tez (execution engine for Pig and Hive); Spark (alternate exec engine); MapReduce (legacy); Pig (scripting); Hive (SQL); Java MR APIs; data processing; ML; custom app on Slider; Oozie; data management
  • 25. Pushing Batch Compute Boundaries (chart: % of total compute in memory-sec, Q1 2016, for MapReduce, Tez, and Spark). 112 million batch jobs in Q1'16. Chart labels as extracted: Jan 78%, Mar 67%; Mar 21%, 12%; Jan 8%, 14%
  • 27. Recent Apache Storm Developments at Yahoo: MT & RA scheduler; dist. cache API; 8x throughput; improved debuggability; Pacemaker server; streaming benchmark (github.com/yahoo/streaming-benchmarks)
  • 28. Data Sketches Algorithms Library (datasketches.github.io): good enough approximate answers for problem queries. Characteristics: streamable; approximate with predictable error; sub-linear in size; mergeable / additive; highly parallelizable; Maven deployable
  • 29. Distinct Count Sketch, High-level View. Basic sketch elements: big data stream; transform (white noise); data structure; estimator; result +/- ε
  • 30. Data Sketches Algorithms Library (datasketches.github.io)
  • 32. Apache HBase at Yahoo: security; isolated deployment; multi-tenant; region server groups; namespaces; unsupported features. Diagram: a compute cluster (JobTracker, NameNode, TaskTrackers and DataNodes running M/R tasks) and a separate HBase cluster (NameNode, HBase Master, ZooKeeper quorum, RegionServers and DataNodes), reached from a gateway/launcher by HBase clients and an MR client, and by an HTTP client through a REST proxy
  • 33. Security. Authentication: Kerberos (users, processes); delegation tokens (MapReduce, YARN, etc.). Authorization: HBase ACLs (Read, Write, Create, Admin); grant permissions to a user or Unix group; ACLs for table, column family, or column
  • 34. Region Server Groups: dedicated region servers for a set of tables; resource isolation (CPU, memory, IO, etc.). Example: RegionServer group Foo (region servers 1 to 5) serves TableA through TableF; RegionServer group Bar (region servers 6 to 10) serves Table1 through Table6
  • 35. Namespaces: analogous to "database"; namespace ACL to create tables; default group; quota (tables, regions). Diagram: a namespace with its group, tables, quota, and ACL
  • 36. Split Meta to Spread Load and Avoid Large Regions
  • 37. Favored Nodes for HDFS Locality
  • 39. Scaling HBase to Handle Millions of Regions on a Cluster: region server groups; split meta; split ZK; favored nodes; humongous tables
  • 40. Transactions on HBase with Omid ("Omid" stands for "hope" in Persian): a highly performant and fault-tolerant ACID transactional framework; new Apache Incubator project (incubator.apache.org/projects/omid.html); handles millions of transactions per day for search and personalization products
  • 44. Oozie Data Pipelines (diagram: Oozie, message bus, HCatalog, data producer, HDFS). The data producer produces data (distcp, Pig, M/R, ...) on HDFS at /data/click/2014/06/02 and updates metadata: ALTER TABLE click ADD PARTITION(data='2014/06/02') location 'hdfs://data/click/2014/06/02'. Numbered steps: 1. query/poll partition; 2. register topic; 3. push notification <New Partition>; 4. notify new partition. Oozie then starts the workflow
  • 45. Large Scale Data Pipeline Requirements. Administrative: one should be able to start, stop, and pause all related pipelines at the same time. Dependency management: the output of coordinator action "n+1" depends on coordinator action "n" (dataset dependency); if a dataset has a BCP instance, the workflow should run with either, whichever arrives first; start as soon as mandatory data is available, since other feeds are optional; data is not guaranteed, so start processing even if only partial data is available. SLA management: monitor pipeline processing to take immediate action on failures or SLA misses; pipeline owners should be notified if an SLA is missed. Multiple providers: if data is available from multiple providers, users should be able to specify the provider priority; combine datasets from multiple providers to fill the gaps a single provider may have
  • 47. BCP and Mandatory / Optional Feeds. Pull data from A or B: specify the dataset as AorB, and the action starts running as soon as either dataset A or B is available.
        <input-logic>
          <or name="AorB">
            <data-in dataset="A" wait="10"/>
            <data-in dataset="B"/>
          </or>
        </input-logic>
    Dataset B is optional: Oozie starts processing as soon as A is available, and includes the data from A plus whatever is available from B.
        <input-logic>
          <and name="optional">
            <data-in dataset="A"/>
            <data-in dataset="B" min="0"/>
          </and>
        </input-logic>
  • 48. Data Not Guaranteed / Priority Among Dataset Instances. A has higher precedence than B, and B has higher precedence than C.
        <input-logic>
          <or name="AorBorC">
            <data-in dataset="A"/>
            <data-in dataset="B"/>
            <data-in dataset="C"/>
          </or>
        </input-logic>
    Oozie starts processing once at least 10 instances of A are available. Min can also be combined with wait (as shown for dataset B).
        <input-logic>
          <data-in dataset="A" min="10"/>
          <data-in dataset="B" min="10" wait="20"/>
        </input-logic>
  • 49. Combining Datasets From Multiple Providers. The combine function first checks instances from A, then goes to B for whatever is missing in A.
        <data-in name="A" dataset="dataset_A">
          <start-instance>${coord:current(-5)}</start-instance>
          <end-instance>${coord:latest(-1)}</end-instance>
        </data-in>
        <data-in name="B" dataset="dataset_B">
          <start-instance>${coord:current(-5)}</start-instance>
          <end-instance>${coord:current(-1)}</end-instance>
        </data-in>
        <input-logic>
          <combine name="AB">
            <data-in dataset="A"/>
            <data-in dataset="B"/>
          </combine>
        </input-logic>
  • 51. Automated Onboarding / Collaboration Portal
  • 52. Built for Tenant Transparency
  • 54. Data Discovery and Access
  • 55. Audits, Compliance, and Efficiency (diagram): the Starling log warehouse and its log sources: FS, job, and task logs (clusters 1 to n); CF, region, action, and query stats (clusters 1 to n); DB, table, partition, and column access stats (MS 1 to n); GDM data definition, flow, feed, and source (F 1 to n)
  • 56. Audits, Compliance, and Efficiency (cont'd): Data Discovery and Access. Classification: public; non-sensitive; financial ($); restricted. Governance: no additional requirement; LMS integration; Stock Admin integration; approval flow
  • 57. Hosted UI – Hue as a Service (diagram, full-stack HA): users reach a VIP in front of two hot Hue instances (Hue-1.Cluster-1 and Hue-2.Cluster-1, each on WSGI) serving pages and static content (jQuery, Bootstrap, Knockout.js, Love); IdP with SAML auth; Hue MySQL DB (HA) for cookies, saved queries, workflows, etc.; REST / Thrift to the Hadoop cluster (HS2, HCat Meta, Oozie server, YARN RM, WebHDFS, NMs)
  • 59. Increased Intelligence: solve complex problems; proven to work at scale. Applications: click prediction; search ranking; keyword auctions; ad relevance; abuse detection. ML libraries: GBDT; FTRL; SGD; deep learning; random forests. Compute engines: distributed processing; parameter server (globally shared parameters); … Core grid enhancements on YARN (resource manager): heterogeneous scheduling; long-running services; GPUs; large memory support; …
  • 60. Greater Speed. Productivity Dimensions and Improvements. Data Management: Real-time Pipelines, Unified Metadata & Lineage, Fine-grained Access Control, Self-serve Data Movement. Ease of Use: SLA & Cost Transparency, Intuitive UIs, Planning & Collab. Tools, Central Grid Portal. Analytics, BI & Reporting: Query times < 1 sec, 4x Speedups in ETL, SQL on HBase, Limitless BI Clients.
  • 61. Higher Efficiency: Achieve five 9’s availability and 70% average compute utilization across clusters
  • 62. Hadoop Users at Yahoo: Slingstone & Aviate Mail Anti-Spam Gemini Campaign Mgmt. Search Assist Audience Analytics Flickr YAM+ & Targeting Membership Abuse … and many more.
  • 63. Yahoo at the Apache Open Source Foundation (eight projects, shown as logos on the slide): 10 Committers (6 PMC); 3 Committers (3 PMC); 3 Committers (2 PMC); 6 Committers (5 PMC); 1 Committer; 3 Committers (2 PMC); 7 Committers (6 PMC); 1 Committer
  • 64. Join Us @ yahoohadoop.tumblr.com
  • 65. THANK YOU SUMEET SINGH (@sumeetksingh) Sr. Director, Cloud and Big Data Platforms Icon Courtesy – iconfinder.com (under Creative Commons)
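
The combine behaviour described on slide 49 (take each instance from A, fall back to B only for what A is missing) can be illustrated with a small simulation. This is a minimal sketch of the described semantics, not Oozie's implementation: the function name `combine`, the instance labels, and the availability sets are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Set


def combine(instances: List[str],
            available: Dict[str, Set[str]],
            order: List[str]) -> Dict[str, Optional[str]]:
    """Map each requested instance to the first provider (in priority
    order) that has it, or None if no provider has it."""
    resolved: Dict[str, Optional[str]] = {}
    for inst in instances:
        resolved[inst] = next(
            (p for p in order if inst in available.get(p, set())), None
        )
    return resolved


# Hypothetical instance labels standing in for the five-instance window
# requested on the slide.
instances = ["T-5", "T-4", "T-3", "T-2", "T-1"]

# Hypothetical availability: A is missing T-3 and T-1, B has both.
available = {
    "A": {"T-5", "T-4", "T-2"},
    "B": {"T-4", "T-3", "T-1"},
}

resolved = combine(instances, available, order=["A", "B"])
for inst, provider in resolved.items():
    print(inst, "->", provider)
```

Listing A before B in `order` mirrors the order of the `<data-in>` elements inside `<combine>`: B is consulted only for instances that A cannot supply.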

Editor's notes

  1. JIRA OOZIE-1976 (Oozie 4.3)
  2. While ${coord:latest} allows skipping to the instances that are available, the workflow never triggers unless the specified number of instances is found. min can also be combined with wait: if not all dependencies are met but the min dependencies have been met, Oozie keeps waiting for more instances until the wait time elapses or all data dependencies are met.
  3. (30 secs) T: 2 min 30 secs xyz
  4. Protocols: REST – uses python-requests and a custom client to streamline RESTful interface calls. Thrift – custom connection pooling and socket multiplexing to streamline Thrift calls. Accessibility: Middleware – makes Hadoop interfaces accessible in request objects. Hue uses the CherryPy web server. Configuration options let you change the IP address and port that the web server listens on; the default is port 8888 on all configured IP addresses. If you don’t specify a secret key, your session cookies will not be secure: Hue will still run, but it will display error messages telling you to set the secret key. You can also configure Hue to serve over HTTPS; to do so, install "pyOpenSSL" within Hue’s context and configure your keys.
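
The min/wait rule in note 2 can be written out as a small decision function. This is a sketch of the rule as the note states it, not Oozie's code: the function name `should_trigger` and its parameters are hypothetical, and the reference point from which Oozie measures the wait time is deliberately not modelled (the caller passes in the elapsed minutes).

```python
def should_trigger(found: int, required: int, minimum: int,
                   elapsed_min: float, wait_min: float) -> bool:
    """Decide whether an action may start under the min/wait rule.

    found       -- number of data instances currently available
    required    -- number of instances in the full dependency range
    minimum     -- the 'min' threshold from the input logic
    elapsed_min -- minutes waited so far (reference point is an assumption)
    wait_min    -- the 'wait' threshold, in minutes
    """
    if found >= required:
        # All data dependencies are met: no reason to keep waiting.
        return True
    if found < minimum:
        # Fewer than 'min' instances: never trigger, however long we wait.
        return False
    # 'min' is met but not everything has arrived: keep waiting for more
    # instances until the wait time has elapsed.
    return elapsed_min >= wait_min


# Hypothetical scenarios: 5 instances requested, min=2, wait=10 minutes.
print(should_trigger(found=5, required=5, minimum=2, elapsed_min=0, wait_min=10))
print(should_trigger(found=1, required=5, minimum=2, elapsed_min=100, wait_min=10))
print(should_trigger(found=3, required=5, minimum=2, elapsed_min=5, wait_min=10))
print(should_trigger(found=3, required=5, minimum=2, elapsed_min=10, wait_min=10))
```

The second scenario shows why `latest` alone is not enough: with fewer than `min` instances present, no amount of waiting starts the workflow.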