Enabling Exploratory Analytics of Data in
Shared-service Hadoop Clusters
Presented by Sagi Zelnick, Principal Architect @ Yahoo, and Ledion Bitincka, Principal Architect @ Splunk
Hadoop Summit, June 2014, San Jose, CA
About your speakers
2 Yahoo Proprietary
Sagi Zelnick, Principal Architect, Yahoo
Ledion Bitincka, Principal Architect, Splunk
Background
• Hadoop @ Yahoo: 8+ years of innovation
• Hunk @ Yahoo: organization-wide investment for next 3+ years
• Yahoo providing Hunk as a self-service to explore, analyze & visualize data in HDFS
• Hunk allows visual browsing of very complex tables (250+ fields)
• Rapid prototyping for new jobs with almost instant results for searches, without having to wait for the entire job/query to finish
• Cuts down on development cycles through faster interaction with results
History of Hadoop innovation @ Yahoo
Over 600PB of Hadoop storage (over half an exabyte)
• Very large clusters used by many groups across the enterprise
• More than 40,000 individual datanodes
• Hadoop is provided as a service
• Multiple cluster types such as research, dev, sandbox and production
• Services such as HBase, Hive, Oozie, etc.
• Users are free to run jobs, but have resource constraints
• Maintained by Grid Operations Group
Improving visibility & providing operational insights with Hunk
• We pointed Hunk at many operational logs and event data we already have on the grid
• This includes system metrics, HDFS ops, JVM stats and YARN metrics
• Created instrumentation to measure usage per user and job
• Analyzed terabytes of NameNode audit logs
• Job history leveraged for visualizing usage/growth and historical views
• Custom events for HBase statistics
Tracking Hadoop performance and metrics in Hunk
Use Case | Customer | Benefits
Namenode metrics, block ops, memory usage | Research, Dev | Improved performance and stability
System/Hadoop metrics of ~40,000 individual datanodes | Grid Ops / Grid Customers | Identify slow tasks/nodes when debugging
Historical insights into resource consumption | All Grid Customers | Track organic growth
Generate reports on job performance | All Grid Customers | Improved job SLAs
HBase metrics | All Grid Customers | Track region/RS/table metrics…
Track job logs in near real-time | All Grid Customers / Ops | Detect and search for errors directly from the YARN job logs for troubleshooting
Tracking Hadoop performance and metrics continued
Use Case | Customer | Benefits
Find dataset instances/files that have never been accessed after creation | Data Storage Efficiency Team, SE | Savings via reduction of storage costs
How is each user/team using compute and disk capacity on a cluster? | Management / Grid Customers | Metering / Chargeback
Replace ad hoc and legacy solutions for analyzing cluster usage | SE / Grid Solutions / Grid Performance / Hadoop Core Development Team | Improved Grid utilization and cost reduction
Generate reports on cluster performance, utilization of available capacity, etc. | SE / Grid Solutions / Grid Performance / Hadoop Core Development Team | Data-mining for product improvements and best practices
Determine KPIs of Hadoop stack components (Pig, Oozie, etc.) | SE / Grid Solutions / Hadoop Stack Development Team | Feedback for product improvements
Find efficacy of various heuristics in Hadoop (data-locality of Tasks, replication of blocks, etc.) | Hadoop Stack Development Team | Fine-tune heuristics for better efficiency
Sample search in Hunk
Measuring NameNode performance pre & post upgrades
• Historical visualizations of all operations
• Search data in Hunk from billions of NameNode events
• Measure JVM and memory usage
• Insights into operational performance
New Search
index="simon_blue_new_all" this_cluster="dilithiumblue*" (log_subtype="DFS" #hdfs=hdfs) | timechart span=1h avg(number*) as num_*
Last 7 days
✓ 10,086 events (5/15/14 1:00:00.000 AM to 5/22/14 1:36:34.000 AM)
[Chart: hourly averages by _time, Fri May 16 to Tue May 20, 2014; y-axis 200,000,000 to 600,000,000; series: num_BlockReports, num_CopyBlockOperations, num_HeartBeats, num_ReadBlockOperations, num_ReadMetadataOperations, num_ReplaceBlockOperations, num_WriteBlockOperations, num_blockChecksumOp]
First result row:
_time | num_BlockReports | num_CopyBlockOperations | num_HeartBeats | num_ReadBlockOperations | num_ReadMetadataOperations | num_ReplaceBlockOperations | num_WriteBlockOperations | num_blockChecksumOp
2014-05-15 01:00 | 1124437.735902 | 46721126.819672 | 514957.384098 | 12930433.077869 | 0.000000 | 94210832.786885 | 63512425.967213 | 13975.306557
Sample visualization in Hunk
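The search on this slide averages every field matching number* into one-hour buckets. The aggregation can be sketched in plain Python as follows. This is only an illustration of the bucketing idea, not how Hunk executes the search; the event dictionaries and the field name numberHeartBeats (inferred from the num_HeartBeats output column) are assumptions.

```python
from collections import defaultdict
from datetime import datetime


def hourly_average(events, field):
    """Average `field` per one-hour bucket, similar in spirit to
    `timechart span=1h avg(field)`."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for event in events:
        if field not in event:
            continue  # skip events that do not carry this metric
        # Truncate the timestamp to the start of its hour.
        bucket = event["_time"].replace(minute=0, second=0, microsecond=0)
        sums[bucket] += event[field]
        counts[bucket] += 1
    return {bucket: sums[bucket] / counts[bucket] for bucket in sorted(sums)}


# Hypothetical sample events; real events would come from the metrics logs.
sample_events = [
    {"_time": datetime(2014, 5, 15, 1, 5), "numberHeartBeats": 100.0},
    {"_time": datetime(2014, 5, 15, 1, 35), "numberHeartBeats": 300.0},
    {"_time": datetime(2014, 5, 15, 2, 10), "numberHeartBeats": 50.0},
]
averages = hourly_average(sample_events, "numberHeartBeats")
for bucket, value in averages.items():
    print(bucket.isoformat(), value)
```

Changing the truncation step (for example, rounding minutes down to a multiple of 5) would mimic a different span value.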
✓ 2,753 events (5/20/14 1:14:21.000 AM to 5/22/14 1:14:21.000 AM)
[Chart: values by _time, Tue May 20 to Wed May 21, 2014; y-axis 250,000,000 to 1,000,000,000; series: num_BlockReports, num_CopyBlockOperations, num_HeartBeats, num_ReadBlockOperations, num_ReadMetadataOperations, num_ReplaceBlockOperations, num_WriteBlockOperations, num_blockChecksumOp]
Sample troubleshooting in Hunk of 750 million events
New Search
index="simon_blue_new_all" this_cluster="dilithiumblue*" (log_subtype="JVM" ProcessName="NameNode") | timechart span=5m avg(Threads*) as threads_*
Last 2 days
✓ 8,463 events (5/20/14 12:00:00.000 AM to 5/22/14 12:00:00.000 AM)
[Chart: 5-minute averages by _time, Tue May 20 to Wed May 21, 2014; y-axis 200 to 400; series: threads_Blocked, threads_New, threads_Runnable, threads_Terminated, threads_TimedWaiting, threads_Waiting]
_time | threads_Blocked | threads_New | threads_Runnable | threads_Terminated | threads_TimedWaiting | threads_Waiting
2014-05-20 00:00:00 | 72.360000 | 10.638333 | 5.485833 | 0.000000 | 21.208333 | 78.555000
2014-05-20 00:05:00 | 70.177333 | 10.554667 | 5.277333 | 0.000000 | 20.744667 | 76.578000
2014-05-20 00:10:00 | 70.211333 | 9.998667 | 5.022000 | 0.000000 | 19.333333 | 73.766667
Big picture plus granular details
Analyzing NameNode RPC calls
• Who is making what RPC call (open, listStatus, create, etc.)
• How often are they making these RPC calls
• Which IP/host are they coming from
• Search and visualize historical data from billions of events
• Prevent NameNode abuse/misuse
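The questions above (who, which call, how often, from where) amount to grouping audit events by user, command and client address. A minimal Python sketch of that grouping follows, assuming HDFS audit log lines with key=value fields such as ugi, ip and cmd. The sample lines, user names and addresses are made up for illustration, and this is not the Hunk search used in the talk.

```python
import re
from collections import Counter

# Matches key=value pairs such as ugi=alice, ip=/10.0.0.1, cmd=open.
AUDIT_FIELD = re.compile(r"(\w+)=(\S+)")


def parse_audit_line(line):
    """Extract user, command and client IP from one audit log line."""
    fields = dict(AUDIT_FIELD.findall(line))
    if "ugi" not in fields or "cmd" not in fields:
        return None  # not an audit event we can attribute
    return {
        "user": fields["ugi"],
        "cmd": fields["cmd"],
        "ip": fields.get("ip", "").lstrip("/"),
    }


def count_rpc_calls(lines):
    """Count calls per (user, command, ip) combination."""
    counts = Counter()
    for line in lines:
        record = parse_audit_line(line)
        if record is not None:
            counts[(record["user"], record["cmd"], record["ip"])] += 1
    return counts


# Hypothetical sample lines in the usual audit log layout.
sample_lines = [
    "2014-05-20 00:00:01,000 INFO FSNamesystem.audit: allowed=true ugi=alice (auth:SIMPLE) ip=/10.0.0.1 cmd=open src=/data/a dst=null perm=null",
    "2014-05-20 00:00:02,000 INFO FSNamesystem.audit: allowed=true ugi=alice (auth:SIMPLE) ip=/10.0.0.1 cmd=open src=/data/b dst=null perm=null",
    "2014-05-20 00:00:03,000 INFO FSNamesystem.audit: allowed=true ugi=bob (auth:SIMPLE) ip=/10.0.0.2 cmd=listStatus src=/data dst=null perm=null",
    "a line that is not an audit event",
]
rpc_counts = count_rpc_calls(sample_lines)
for (user, cmd, ip), n in rpc_counts.most_common():
    print(user, cmd, ip, n)
```

Sorting by count, as most_common() does here, surfaces the heaviest callers first, which is the starting point for spotting NameNode misuse.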
Visualizing 834 million discrete events …
… continued
Queue insights
• Each Hadoop job runs in a specific queue
• We track every aspect of the YARN framework
• Immediate queue performance and configuration profiling via job history server
• Historical views and trends that enable better capacity management
• Improved queue utilization and allocation management
New Search
index="jobsummary_logs_all_red" cluster="dilithium*" | eval total_slot_seconds=(mapSlotSeconds + reduceSlotSeconds) | eval gb_hours=((total_slot_seconds * 0.5) / 3600) | eval gb_hours=round(gb_hours) | timechart span=6h sum(gb_hours) as gb_hours by queue
Last 7 days
✓ 1,175,726 events (5/20/14 8:00:00.000 PM to 5/27/14 8:26:26.000 PM)
[Chart: gb_hours per queue by _time, Wed May 21 to Mon May 26, 2014; y-axis 200,000 to 600,000; legend (labels truncated in the source): OTH, apg_dai, apg_dail, apg_hou, apg_ho, apg_hourl, apg, curveb, curveb, sling, sling]
Visualizing queues
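The arithmetic in the queue search can be checked by hand: map and reduce slot seconds are summed, multiplied by 0.5, divided by 3600 to convert seconds to hours, rounded, and then summed per queue. A small Python sketch of that calculation follows. Reading the 0.5 factor as gigabytes per slot is an assumption, and the job records and queue names are hypothetical.

```python
from collections import defaultdict

# The 0.5 factor comes from the search; treating it as GB per slot is an assumption.
GB_PER_SLOT = 0.5
SECONDS_PER_HOUR = 3600


def gb_hours(map_slot_seconds, reduce_slot_seconds):
    """GB-hours for one job, following the eval steps in the search."""
    total_slot_seconds = map_slot_seconds + reduce_slot_seconds
    # Python's round() may treat exact .5 ties differently from the search language.
    return round(total_slot_seconds * GB_PER_SLOT / SECONDS_PER_HOUR)


def gb_hours_by_queue(jobs):
    """Sum per-job GB-hours for each queue."""
    totals = defaultdict(int)
    for job in jobs:
        totals[job["queue"]] += gb_hours(
            job["mapSlotSeconds"], job["reduceSlotSeconds"]
        )
    return dict(totals)


# Hypothetical job summary records.
sample_jobs = [
    {"queue": "queue_a", "mapSlotSeconds": 36000, "reduceSlotSeconds": 36000},
    {"queue": "queue_a", "mapSlotSeconds": 7200, "reduceSlotSeconds": 0},
    {"queue": "queue_b", "mapSlotSeconds": 14400, "reduceSlotSeconds": 14400},
]
usage = gb_hours_by_queue(sample_jobs)
print(usage)
```

For example, a job with 72,000 total slot seconds works out to 72,000 × 0.5 / 3600 = 10 GB-hours.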
Creating job reports per user
• Each job is unique and so are the map and reduce elements
• How to start analyzing jobs?
• Historical job performance and profiling enables in-depth performance tuning
• Long-term historical views and trending of growth
More data to tap into with the metastore/hive sources
• We will provide Hunk as a self-service to explore & visualize data in HDFS
• Using the metastore we can set up virtual indexes to any table(s) in Hive, without the need to define the schema up-front
• Allows for visually browsing very complex tables (250+ fields)
• Rapid prototyping for new jobs with almost instant results for searches, without having to wait for the entire job/query to finish
• Cuts down on development cycles through faster interaction with results
• Built-in graphs/charts make for a powerful solution for many situations
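The idea of seeing results before the whole job finishes can be illustrated with a running aggregate that is emitted after each chunk of input is processed. The Python sketch below is a conceptual illustration only and does not describe how Hunk implements previews; the split contents and the cmd field are hypothetical.

```python
from collections import Counter


def preview_counts(splits, key):
    """Yield a running count per value of `key` after each input split,
    so a caller can display partial results while later splits are pending."""
    running = Counter()
    for split in splits:
        running.update(event[key] for event in split)
        yield dict(running)  # snapshot of the results so far


# Hypothetical input, already divided into splits.
sample_splits = [
    [{"cmd": "open"}, {"cmd": "create"}],
    [{"cmd": "open"}],
    [{"cmd": "listStatus"}, {"cmd": "open"}],
]
previews = list(preview_counts(sample_splits, "cmd"))
for number, snapshot in enumerate(previews, start=1):
    print("after split", number, snapshot)
```

Each snapshot is a valid partial answer, and the last one equals the result of processing all input at once.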
Hunk + Hadoop Demo
© 2014 Splunk Inc.
Meet Hunk 6.1
Integrated Analytics Platform
• Full-featured, Integrated Product
• Insights for Everyone
• Works with What You Have Today
Explore, Analyze, Visualize, Dashboards, Share
Hadoop Clusters | NoSQL and Other Data Stores
Hadoop Client Libraries | Streaming Resource Libraries for Diverse Data Stores
Fast Deployment and Configuration
Just point at Hadoop
• Certified integrations to all major Hadoop distributions
• Choose 1st-gen MapReduce or YARN
• Create Virtual Indexes across one or more clusters
• From download to searching data in < 60 minutes
Connect to one or multiple Hadoop clusters
YARN certified
Interactive Search and Results Preview
Rapidly interact with data
• Powerful Search Processing Language (SPL™)
• Ad hoc exploratory analytics across massive datasets
• Preview results
• No fixed schema
• No requirement to “understand” data upfront
Search interface
Preview results
Drill down to raw data
Pause or stop MapReduce jobs
Powerful Dashboards for Self-Service Analytics
Interactive Dashboards and Charts
• Easy-to-use dashboard editor
• Chart overlay
• Pan and zoom
• In-dashboard drilldown
• Embed charts and dashboards in 3rd party apps
• Reuse skills with Splunk Enterprise 6.1 and Hunk 6.1
Hive Data Support
Supported File Formats
• Text files
• Sequence files
• RCFile
• ORC files
• Parquet
Role-based Security for Shared Clusters
Pass-through Authentication
• Provide role-based security for Hadoop clusters
• Access Hadoop resources under security and compliance
• Integrates with Kerberos for Hadoop security
Users: Business Analyst, Marketing Analyst, Sys Admin
Business Analyst | Queue: Biz Analytics
Marketing Analyst | Queue: Marketing
Sys Admin2 | Queue: Prod
Build Analytics-Rich Big Data Apps
Powerful Developer Environment
• Use a standards-based web framework and REST API
• Customize dashboards and UIs with Simple XML, JavaScript or Django
• Choose among SDKs
• One integration for both Splunk Enterprise and Hunk
Hunk: One Integrated Platform
• FULL-FEATURED ANALYTICS: Explore, analyze and visualize data in one integrated platform
• FAST TO DEPLOY AND DRIVE VALUE: Point Hunk at your storage clusters and explore data immediately
• INTERACTIVE SEARCH: Preview results as MapReduce jobs run and accelerate reports with no fixed schemas
• RICH DEVELOPER ENVIRONMENT: Build big data apps using standard web languages and frameworks
Questions/Comments?
Sagi Zelnick – Principal Architect
Email: zelnicks@yahoo-inc.com
Ledion Bitincka – Principal Architect
Email: lbitincka@splunk.com

Weitere ähnliche Inhalte

Was ist angesagt?

Finding URL pattern with MapReduce and Apache Hadoop
Finding URL pattern with MapReduce and Apache HadoopFinding URL pattern with MapReduce and Apache Hadoop
Finding URL pattern with MapReduce and Apache Hadoop
Nushrat
 
Start Getting Your Feet Wet in Open Source Machine and Deep Learning
Start Getting Your Feet Wet in Open Source Machine and Deep Learning Start Getting Your Feet Wet in Open Source Machine and Deep Learning
Start Getting Your Feet Wet in Open Source Machine and Deep Learning
Ian Gomez
 
SplunkLive! Washington DC May 2013 - Big Data Architectural Patterns
SplunkLive! Washington DC May 2013 - Big Data Architectural PatternsSplunkLive! Washington DC May 2013 - Big Data Architectural Patterns
SplunkLive! Washington DC May 2013 - Big Data Architectural Patterns
Splunk
 
Hadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game ForeverHadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game Forever
DataWorks Summit
 

Was ist angesagt? (20)

Webinar: MongoDB + Hadoop
Webinar: MongoDB + HadoopWebinar: MongoDB + Hadoop
Webinar: MongoDB + Hadoop
 
Using MongoDB with Hadoop & Spark
Using MongoDB with Hadoop & SparkUsing MongoDB with Hadoop & Spark
Using MongoDB with Hadoop & Spark
 
Splunk Ninjas: New features, pivot, and search dojo
Splunk Ninjas: New features, pivot, and search dojoSplunk Ninjas: New features, pivot, and search dojo
Splunk Ninjas: New features, pivot, and search dojo
 
Spark and Cassandra: An Amazing Apache Love Story by Patrick McFadin
Spark and Cassandra: An Amazing Apache Love Story by Patrick McFadinSpark and Cassandra: An Amazing Apache Love Story by Patrick McFadin
Spark and Cassandra: An Amazing Apache Love Story by Patrick McFadin
 
Finding URL pattern with MapReduce and Apache Hadoop
Finding URL pattern with MapReduce and Apache HadoopFinding URL pattern with MapReduce and Apache Hadoop
Finding URL pattern with MapReduce and Apache Hadoop
 
Applying Machine Learning using H2O
Applying Machine Learning using H2OApplying Machine Learning using H2O
Applying Machine Learning using H2O
 
Interactive query using hadoop
Interactive query using hadoopInteractive query using hadoop
Interactive query using hadoop
 
Start Getting Your Feet Wet in Open Source Machine and Deep Learning
Start Getting Your Feet Wet in Open Source Machine and Deep Learning Start Getting Your Feet Wet in Open Source Machine and Deep Learning
Start Getting Your Feet Wet in Open Source Machine and Deep Learning
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
SplunkLive! Washington DC May 2013 - Big Data Architectural Patterns
SplunkLive! Washington DC May 2013 - Big Data Architectural PatternsSplunkLive! Washington DC May 2013 - Big Data Architectural Patterns
SplunkLive! Washington DC May 2013 - Big Data Architectural Patterns
 
Hadoop - Architectural road map for Hadoop Ecosystem
Hadoop -  Architectural road map for Hadoop EcosystemHadoop -  Architectural road map for Hadoop Ecosystem
Hadoop - Architectural road map for Hadoop Ecosystem
 
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
 
Hpdw 2015-v10-paper
Hpdw 2015-v10-paperHpdw 2015-v10-paper
Hpdw 2015-v10-paper
 
Druid: Under the Covers (Virtual Meetup)
Druid: Under the Covers (Virtual Meetup)Druid: Under the Covers (Virtual Meetup)
Druid: Under the Covers (Virtual Meetup)
 
Hadoop World 2011: Storing and Indexing Social Media Content in the Hadoop Ec...
Hadoop World 2011: Storing and Indexing Social Media Content in the Hadoop Ec...Hadoop World 2011: Storing and Indexing Social Media Content in the Hadoop Ec...
Hadoop World 2011: Storing and Indexing Social Media Content in the Hadoop Ec...
 
Intro to Python Data Analysis in Wakari
Intro to Python Data Analysis in WakariIntro to Python Data Analysis in Wakari
Intro to Python Data Analysis in Wakari
 
Hadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game ForeverHadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game Forever
 
Interactive Realtime Dashboards on Data Streams using Kafka, Druid and Superset
Interactive Realtime Dashboards on Data Streams using Kafka, Druid and SupersetInteractive Realtime Dashboards on Data Streams using Kafka, Druid and Superset
Interactive Realtime Dashboards on Data Streams using Kafka, Druid and Superset
 
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study
 
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
 

Andere mochten auch

Netflix: A Case Study
Netflix: A Case StudyNetflix: A Case Study
Netflix: A Case Study
Morgan Miller
 

Andere mochten auch (7)

Maintaining Low Latency While Maximizing Throughput on a Single Cluster
Maintaining Low Latency While Maximizing Throughput on a Single ClusterMaintaining Low Latency While Maximizing Throughput on a Single Cluster
Maintaining Low Latency While Maximizing Throughput on a Single Cluster
 
My sql cluster case study apr16
My sql cluster case study apr16My sql cluster case study apr16
My sql cluster case study apr16
 
The mysqlnd replication and load balancing plugin
The mysqlnd replication and load balancing pluginThe mysqlnd replication and load balancing plugin
The mysqlnd replication and load balancing plugin
 
Buckle promotional campaign
Buckle promotional campaignBuckle promotional campaign
Buckle promotional campaign
 
Netflix: A Case Study
Netflix: A Case StudyNetflix: A Case Study
Netflix: A Case Study
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
MBA case study presentation template
MBA case study presentation templateMBA case study presentation template
MBA case study presentation template
 

Ähnlich wie Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters

Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Cloudera, Inc.
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
nzhang
 
Hw09 Hadoop Applications At Yahoo!
Hw09   Hadoop Applications At Yahoo!Hw09   Hadoop Applications At Yahoo!
Hw09 Hadoop Applications At Yahoo!
Cloudera, Inc.
 
Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009
yhadoop
 

Ähnlich wie Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters (20)

Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters
Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop ClustersYahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters
Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters
 
Testing Big Data: Automated ETL Testing of Hadoop
Testing Big Data: Automated ETL Testing of HadoopTesting Big Data: Automated ETL Testing of Hadoop
Testing Big Data: Automated ETL Testing of Hadoop
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
 
Testing Big Data: Automated Testing of Hadoop with QuerySurge
Testing Big Data: Automated  Testing of Hadoop with QuerySurgeTesting Big Data: Automated  Testing of Hadoop with QuerySurge
Testing Big Data: Automated Testing of Hadoop with QuerySurge
 
Modernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APSModernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APS
 
What it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesWhat it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! Perspectives
 
Analysis of historical movie data by BHADRA
Analysis of historical movie data by BHADRAAnalysis of historical movie data by BHADRA
Analysis of historical movie data by BHADRA
 
Architecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchArchitecting the Future of Big Data and Search
Architecting the Future of Big Data and Search
 
Sparkflows - Build E2E Data Analytics Use Cases in less than 30 mins
Sparkflows - Build E2E Data Analytics Use Cases in less than 30 minsSparkflows - Build E2E Data Analytics Use Cases in less than 30 mins
Sparkflows - Build E2E Data Analytics Use Cases in less than 30 mins
 
What is hadoop
What is hadoopWhat is hadoop
What is hadoop
 
Atlanta Data Science Meetup | Qubole slides
Atlanta Data Science Meetup | Qubole slidesAtlanta Data Science Meetup | Qubole slides
Atlanta Data Science Meetup | Qubole slides
 
Hortonworks.bdb
Hortonworks.bdbHortonworks.bdb
Hortonworks.bdb
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouse
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouse
 
Splunk hunkbeta
Splunk hunkbetaSplunk hunkbeta
Splunk hunkbeta
 
Hadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezHadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to Tez
 
Hw09 Hadoop Applications At Yahoo!
Hw09   Hadoop Applications At Yahoo!Hw09   Hadoop Applications At Yahoo!
Hw09 Hadoop Applications At Yahoo!
 
Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009
 
Solution Brief: Big Data Lab Accelerator
Solution Brief: Big Data Lab AcceleratorSolution Brief: Big Data Lab Accelerator
Solution Brief: Big Data Lab Accelerator
 

Mehr von DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

Mehr von DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Kürzlich hochgeladen

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Kürzlich hochgeladen (20)

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Evaluating the top large language models.pdf

Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters

  • 1. Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters • Presented by Sagi Zelnick, Principal Architect @ Yahoo, and Ledion Bitincka, Principal Architect @ Splunk • Hadoop Summit, June 2014, San Jose, CA
  • 2. About your speakers • Sagi Zelnick, Principal Architect, Yahoo • Ledion Bitincka, Principal Architect, Splunk • (Yahoo Proprietary)
  • 3. Background • Hadoop @ Yahoo: 8+ years of innovation • Hunk @ Yahoo: organization-wide investment for the next 3+ years • Yahoo is providing Hunk as a self-service to explore, analyze & visualize data in HDFS • Hunk allows visual browsing of very complex tables (250+ fields) • Rapid prototyping for new jobs with almost instant results for searches, without having to wait for the entire job/query to finish • Cuts down on development cycles through faster interaction with results
  • 4. History of Hadoop innovation @ Yahoo
  • 5. Over 600PB of Hadoop storage (over half an exabyte) • Very large clusters used by many groups across the enterprise • More than 40,000 individual datanodes • Hadoop is provided as a service • Multiple cluster types such as research, dev, sandbox and production • Services such as HBase, Hive, Oozie, etc… • Users are free to run jobs, but have resource constraints • Maintained by Grid Operations Group
  • 6. Improving visibility & providing operational insights with Hunk • We pointed Hunk at many operational logs and event data we already have on the grid • This includes system metrics, HDFS ops, JVM stats and YARN metrics • Created instrumentation to measure usage per user and job • Analyzed terabytes of NameNode audit logs • Job history leveraged for visualizing usage/growth and historical views • Custom events for HBase statistics
  • 7. Tracking Hadoop performance and metrics in Hunk
      Use Case | Customer | Benefits
      Namenode metrics, block ops, memory usage | Research, Dev | Improved performance and stability
      System/Hadoop metrics of ~40,000 individual datanodes | Grid Ops / Grid Customers | Identify slow tasks/nodes when debugging
      Historical insights into resource consumption | All Grid Customers | Track organic growth
      Generate reports on job performance | All Grid Customers | Improved job SLAs
      HBase metrics | All Grid Customers | Track region/RS/table metrics…
      Track job logs in near real-time | All Grid Customers / Ops | Detect and search for errors directly from the YARN job logs for troubleshooting
  • 8. Tracking Hadoop performance and metrics continued
      Use Case | Customer | Benefits
      Find dataset instances/files that have never been accessed after creation | Data Storage Efficiency Team, SE | Savings via reduction of storage-costs
      How is each user/team using compute and disk capacity on a cluster? | Management / Grid Customers | Metering / Chargeback
      Replace ad hoc and legacy solutions for analyzing cluster-usage | SE / Grid Solutions / Grid Performance / Hadoop Core Development Team | Improved Grid-utilization and cost-reduction
      Generate reports on cluster performance, utilization of available capacity, etc. | SE / Grid Solutions / Grid Performance / Hadoop Core Development Team | Data-mining for product improvements and best-practices
      Determine KPIs of Hadoop stack components (Pig, Oozie, etc.) | SE / Grid Solutions / Hadoop Stack Development Team | Feedback for product improvements
      Find efficacy of various heuristics in Hadoop (data-locality of Tasks, replication of blocks, etc.) | Hadoop Stack Development Team | Fine-tune heuristics for better efficiency
  • 9. Sample search in Hunk
  • 10. Measuring NameNode performance pre & post upgrades • Historical visualizations of all operations • Search data in Hunk from billions of NameNode events • Measure JVM and memory usage • Insights into operational performance
  • 11. Sample visualization in Hunk
      Search: index="simon_blue_new_all" this_cluster="dilithiumblue*" (log_subtype="DFS" #hdfs=hdfs) | timechart span=1h avg(number*) as num_*
      Time range: Last 7 days • 10,086 events (5/15/14 1:00:00.000 AM to 5/22/14 1:36:34.000 AM)
      [Chart: num_BlockReports, num_CopyBlockOperations, num_HeartBeats, num_ReadBlockOperations, num_ReadMetadataOperations, num_ReplaceBlockOperations, num_WriteBlockOperations and num_blockChecksumOp over _time, Fri May 16 to Tue May 20 2014]
      First result row (2014-05-15 01:00): num_BlockReports=1124437.735902, num_CopyBlockOperations=46721126.819672, num_HeartBeats=514957.384098, num_ReadBlockOperations=12930433.077869, num_ReadMetadataOperations=0.000000, num_ReplaceBlockOperations=94210832.786885, num_WriteBlockOperations=63512425.967213, num_blockChecksumOp=13975.306557
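The slide 11 search buckets events by hour and averages each metric per bucket. A minimal Python sketch of that bucketing idea follows. It is only an illustration, not how Hunk executes the search: the event dictionaries, the field name numberHeartBeats and the sample values are made up for the example, and unlike timechart the sketch does not emit rows for empty buckets.

```python
from collections import defaultdict
from datetime import datetime


def hourly_average(events, field):
    """Average `field` per one-hour bucket (roughly `timechart span=1h avg(field)`)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for event in events:
        if field not in event:
            continue  # events without the field do not contribute to the average
        # Truncate the timestamp to the start of its hour to get the bucket key.
        bucket = event["_time"].replace(minute=0, second=0, microsecond=0)
        sums[bucket] += event[field]
        counts[bucket] += 1
    return {bucket: sums[bucket] / counts[bucket] for bucket in sorted(sums)}


# Hypothetical events; the field name and values are assumptions for illustration.
events = [
    {"_time": datetime(2014, 5, 15, 1, 5), "numberHeartBeats": 500000.0},
    {"_time": datetime(2014, 5, 15, 1, 40), "numberHeartBeats": 530000.0},
    {"_time": datetime(2014, 5, 15, 2, 10), "numberHeartBeats": 510000.0},
]
result = hourly_average(events, "numberHeartBeats")
```

Each key of `result` is the start of an hour, which corresponds to the `_time` column in the slide's result table.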
  • 12. Sample troubleshooting in Hunk of 750 million events • 2,753 events (5/20/14 1:14:21.000 AM to 5/22/14 1:14:21.000 AM) • [Chart: num_BlockReports, num_CopyBlockOperations, num_HeartBeats, num_ReadBlockOperations, num_ReadMetadataOperations, num_ReplaceBlockOperations, num_WriteBlockOperations and num_blockChecksumOp over _time, Tue May 20 to Wed May 21 2014; y-axis up to 1,000,000,000]
  • 13. Big picture plus granular details
      Search: index="simon_blue_new_all" this_cluster="dilithiumblue*" (log_subtype="JVM" ProcessName="NameNode") | timechart span=5m avg(Threads*) as threads_*
      Time range: Last 2 days • 8,463 events (5/20/14 12:00:00.000 AM to 5/22/14 12:00:00.000 AM)
      [Chart: threads_Blocked, threads_New, threads_Runnable, threads_Terminated, threads_TimedWaiting and threads_Waiting over _time, Tue May 20 to Wed May 21 2014]
      _time | threads_Blocked | threads_New | threads_Runnable | threads_Terminated | threads_TimedWaiting | threads_Waiting
      2014-05-20 00:00:00 | 72.360000 | 10.638333 | 5.485833 | 0.000000 | 21.208333 | 78.555000
      2014-05-20 00:05:00 | 70.177333 | 10.554667 | 5.277333 | 0.000000 | 20.744667 | 76.578000
      2014-05-20 00:10:00 | 70.211333 | 9.998667 | 5.022000 | 0.000000 | 19.333333 | 73.766667
  • 14. Analyzing NameNode RPC calls • Who is making which RPC call (open, listStatus, create, etc.) • How often they are making these RPC calls • Which IP/host the calls are coming from • Search and visualize historical data from billions of events • Prevent NameNode abuse/misuse
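The questions on slide 14 (who, which call, how often, from where) come down to grouping audit events by user, command and client address. The Python sketch below illustrates that grouping outside of Hunk. It assumes the usual key=value layout of HDFS NameNode audit log lines (ugi=, ip=, cmd=); the sample lines, user names and addresses are invented, and the talk does not show the searches Yahoo actually ran for this.

```python
import re
from collections import Counter

# Assumed audit-line layout: "... ugi=<user> (auth:...)  ip=/<addr>  cmd=<op>  src=...".
AUDIT_RE = re.compile(
    r"ugi=(?P<user>\S+).*?\bip=/(?P<ip>\S+).*?\bcmd=(?P<cmd>\S+)"
)


def count_rpc_calls(lines):
    """Count audit events per (user, command, client IP); unparsable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = AUDIT_RE.search(line)
        if match:
            counts[(match.group("user"), match.group("cmd"), match.group("ip"))] += 1
    return counts


# Made-up sample lines for illustration only.
sample = [
    "2014-05-20 00:00:01,123 INFO FSNamesystem.audit: allowed=true\tugi=alice (auth:KERBEROS)\tip=/10.0.0.1\tcmd=open\tsrc=/data/a\tdst=null\tperm=null",
    "2014-05-20 00:00:02,456 INFO FSNamesystem.audit: allowed=true\tugi=alice (auth:KERBEROS)\tip=/10.0.0.1\tcmd=open\tsrc=/data/b\tdst=null\tperm=null",
    "2014-05-20 00:00:03,789 INFO FSNamesystem.audit: allowed=true\tugi=bob (auth:KERBEROS)\tip=/10.0.0.2\tcmd=listStatus\tsrc=/data\tdst=null\tperm=null",
    "this line is not an audit event",
]
counts = count_rpc_calls(sample)
```

Sorting `counts.most_common()` puts the heaviest (user, command, IP) combinations first, which is the kind of view that helps spot NameNode abuse or misuse.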
  • 15. Visualizing 834 million discrete events …
  • 16. … continued (Yahoo Confidential & Proprietary)
  • 17. Queue insights • Each Hadoop job runs in a specific queue • We track every aspect of the YARN framework • Immediate queue performance and configuration profiling via job history server • Historical views and trends that enable better capacity management • Improved queue utilization and allocation management
  • 18. Visualizing queues
      Search: index="jobsummary_logs_all_red" cluster="dilithium*" | eval total_slot_seconds=(mapSlotSeconds + reduceSlotSeconds) | eval gb_hours=((total_slot_seconds * 0.5) / 3600) | eval gb_hours=round(gb_hours) | timechart span=6h sum(gb_hours) as gb_hours by queue
      Time range: Last 7 days • 1,175,726 events (5/20/14 8:00:00.000 PM to 5/27/14 8:26:26.000 PM)
      [Chart: gb_hours per queue over _time, Wed May 21 to Mon May 26 2014; y-axis ticks 200,000 to 600,000; queue names in the legend are truncated in the source]
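The eval chain in the slide 18 search turns slot-seconds into rounded GB-hours per job and then sums them per queue. The same arithmetic can be sketched in Python as below. Reading the 0.5 constant as gigabytes per slot is an assumption (the slide shows only the number), the queue names and job values are hypothetical, and the sketch leaves out the 6-hour time bucketing that timechart adds. Exact halves are rounded up here; whether SPL's round() treats them identically has not been checked.

```python
import math
from collections import defaultdict


def gb_hours(map_slot_seconds, reduce_slot_seconds, gb_per_slot=0.5):
    """Convert a job's slot-seconds to whole GB-hours, mirroring the slide's eval steps."""
    total_slot_seconds = map_slot_seconds + reduce_slot_seconds
    raw = (total_slot_seconds * gb_per_slot) / 3600
    return math.floor(raw + 0.5)  # round half up for non-negative values


def gb_hours_by_queue(jobs):
    """Sum per-job GB-hours for each queue (the `sum(gb_hours) ... by queue` part)."""
    totals = defaultdict(int)
    for job in jobs:
        totals[job["queue"]] += gb_hours(job["mapSlotSeconds"], job["reduceSlotSeconds"])
    return dict(totals)


# Hypothetical job summary records; queue names and values are made up.
jobs = [
    {"queue": "queue_a", "mapSlotSeconds": 7200, "reduceSlotSeconds": 7200},
    {"queue": "queue_a", "mapSlotSeconds": 36000, "reduceSlotSeconds": 0},
    {"queue": "queue_b", "mapSlotSeconds": 3600, "reduceSlotSeconds": 1800},
]
totals = gb_hours_by_queue(jobs)
```

Because rounding happens per job before the sum, as in the slide's search, many short jobs each contribute either 0 or 1 GB-hour rather than their fractional share.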
  • 19. Creating job reports per user • Each job is unique and so are the map and reduce elements • How to start analyzing jobs? • Historical job performance and profiling enables in-depth performance tuning • Long-term historical views and trending of growth
  • 20. More data to tap into with the metastore/hive sources • We will provide Hunk as a self-service to explore & visualize data in HDFS • Using the metastore we can set up virtual indexes to any table(s) in Hive, without the need to define the schema up-front • Allows for visually browsing very complex tables (250+ fields) • Rapid prototyping for new jobs with almost instant results for searches, without having to wait for the entire job/query to finish • Cuts down on development cycles through faster interaction with results • Built-in graphs/charts make for a powerful solution in many situations
  • 21. Hunk + Hadoop Demo
  • 27. © 2014 Splunk Inc. Meet Hunk 6.1
  • 28. Integrated Analytics Platform • Full-featured, Integrated Product • Insights for Everyone • Works with What You Have Today • Explore, Visualize, Dashboards, Share, Analyze • Hadoop Clusters • NoSQL and Other Data Stores • Hadoop Client Libraries • Streaming Resource Libraries for Diverse Data Stores
  • 29. Fast Deployment and Configuration • Just point at Hadoop • Certified integrations to all major Hadoop distributions • Choose 1st-gen MapReduce or YARN • Create Virtual Indexes across one or more clusters • From download to searching data in < 60 minutes • Connect to one or multiple Hadoop clusters • YARN certified
  • 30. Interactive Search and Results Preview • Rapidly interact with data • Powerful Search Processing Language (SPL™) • Ad hoc exploratory analytics across massive datasets • Preview results • No fixed schema • No requirement to “understand” data upfront • Screenshot callouts: Search interface; Preview results; Drill down to raw data; Pause or stop MapReduce jobs
  • 31. Powerful Dashboards for Self-Service Analytics • Interactive Dashboards and Charts • Easy-to-use dashboard editor • Chart overlay • Pan and zoom • In-dashboard drilldown • Embed charts and dashboards in 3rd party apps • Reuse skills with Splunk Enterprise 6.1 and Hunk 6.1
  • 32. Hive Data Support • Supported File Formats • Text files • Sequence files • RCFile • ORC files • Parquet
  • 33. Role-based Security for Shared Clusters • Pass-through Authentication • Provide role-based security for Hadoop clusters • Access Hadoop resources under security and compliance • Integrates with Kerberos for Hadoop security • Diagram labels: Business Analyst (Queue: Biz Analytics), Marketing Analyst (Queue: Marketing), Sys Admin (Queue: Prod)
  • 34. Powerful Developer Environment • Build Analytics-Rich Big Data Apps • Use a standards-based web framework and REST API • Customize dashboards and UIs with Simple XML, JavaScript or Django • Choose among SDKs • One integration for both Splunk Enterprise and Hunk
  • 35. Hunk: One Integrated Platform • FULL-FEATURED ANALYTICS: Explore, analyze and visualize data in one integrated platform • FAST TO DEPLOY AND DRIVE VALUE: Point Hunk at your storage clusters and explore data immediately • INTERACTIVE SEARCH: Preview results as MapReduce jobs run and accelerate reports with no fixed schemas • RICH DEVELOPER ENVIRONMENT: Build big data apps using standard web languages and frameworks
  • 36. Questions/Comments? • Sagi Zelnick – Principal Architect, Email: zelnicks@yahoo-inc.com • Ledion Bitincka – Principal Architect, Email: lbitincka@splunk.com