2. Overview
Build a system/framework to aggregate and visualize user/system
insights from production application data sets.
Motivation: production servers generate huge volumes of logs, and
we realize that capturing all of our data is now economical and valuable.
Implementation
Hadoop & HBase/Hive
Aggregators
Scribe / Flume
Visualization
Hive with MySQL, or Splunk
Evaluate methodologies like
SHARK (suitable for the scale of a million events per hour and
terabytes of data store)
3. Use Cases
All about end-user-level metrics/events in production.
Track outbound/inbound mails, which number in the billions.
Consolidate scattered data sets spread across multiple application
servers.
Mail delivery percentiles, based on latency buckets.
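The latency-bucket percentile idea above can be sketched as follows; the bucket boundaries, counts, and function name are illustrative assumptions, not the actual pipeline's values.

```python
# Hypothetical sketch: estimating a delivery-latency percentile from
# pre-aggregated latency buckets rather than raw per-mail latencies.

def percentile_from_buckets(buckets, pct):
    """buckets: list of (upper_bound_seconds, count), sorted by bound.
    Returns the first bucket bound at or above the requested percentile."""
    total = sum(count for _, count in buckets)
    threshold = total * pct / 100.0
    running = 0
    for bound, count in buckets:
        running += count
        if running >= threshold:
            return bound
    return buckets[-1][0]

# Example: delivered-mail counts per latency bucket (illustrative)
latency_buckets = [(1, 700), (5, 200), (30, 80), (300, 20)]
p95 = percentile_from_buckets(latency_buckets, 95)  # -> 30 (seconds)
```

Working from buckets keeps the aggregation cheap at billions of events, at the cost of percentile resolution limited by the bucket boundaries.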
9. Architecture
[Architecture diagram: production hosts (ProdHosts) emit app logs and
unstructured/semi-formatted data, collected via Kafka/Flume and a
fetch-data tool, then pushed through HDFS Proxy to the grid
(Bassnium-Tan) using GDM pull, SSH keys, YCA auth, and DS Kerberos.
Oozie jobs run Pig to turn unstructured data into structured data,
which lands in HBase, Hive, SHARK, and MySQL; the aggregator produces
formatted data/graphs, served by a launcher/UI on Apache/PHP.]
10. Technology Stack
Data aggregation framework: started with custom scripts based on
logtail and parallel processing. Data is currently collected on a
time-interval basis, in 30-60 minute spans.
Evaluated Scribe, Fluentd, and Flume.
Hadoop: for raw data storage and processing, and for relating
user events.
Oozie: data aggregation, relation/processing, and data
management are all controlled in scheduled workflows.
HBase: for storing processed data in HTable format, so results
can be retrieved effectively.
Hive/SHARK: evaluating Shark to store the data for faster retrieval,
even at seconds-level intervals, with an in-memory store.
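The interval-based collection described above can be sketched as bucketing raw log records into fixed time windows; the 30-minute window size matches the slide, but the record format and function names are illustrative assumptions.

```python
# Illustrative sketch: grouping log records into fixed time windows,
# as in interval-based collection (30-60 minute spans).
from collections import defaultdict

WINDOW_SECONDS = 30 * 60  # assumed 30-minute aggregation interval

def window_start(epoch_seconds):
    """Round a timestamp down to the start of its window."""
    return epoch_seconds - (epoch_seconds % WINDOW_SECONDS)

def aggregate(records):
    """records: iterable of (epoch_seconds, event) pairs.
    Returns {window_start: event_count}."""
    counts = defaultdict(int)
    for ts, _event in records:
        counts[window_start(ts)] += 1
    return dict(counts)

logs = [(1000, "sent"), (1700, "bounced"), (2000, "sent")]
result = aggregate(logs)  # 1000 and 1700 share a window; 2000 does not
```

Each window's counts can then be flushed to the grid as one batch, which is what keeps the aggregation cadence at the 30-60 minute (later <10 minute) interval.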
14. Why Not Splunk
Splunk is commercial and lacks versatility.
Splunk is not very customizable;
we would have to depend on their tools and system.
We are using Splunk:
since a Splunk license is available at Yahoo, we intend to use it only
for data visualization, via the Hadoop Connect interface.
HUNK is a virtual indexer service, and would be the right choice for us.
15. Challenges & Learning
Hadoop data access is sequential.
Many small output files hit the namespace quota.
HBase is good for storing output files.
HBase schema decisions (rowkey design, reducing the number of
region servers).
SHARK evaluated: good, but not feasible to implement.
Flume: durable and flexible.
The Hive layer helped with data visualization.
Using the available Splunk license made visualization simpler.
Manage delayed-mail info in the delivery pipeline.
Consistency to plot 300-500 million events per day.
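One common answer to the rowkey-design challenge noted above is salting: prefixing keys with a stable hash bucket so monotonically increasing keys (e.g. timestamps) spread across regions instead of hotspotting one region server. The bucket count and key layout below are illustrative assumptions, not the deck's actual schema.

```python
# Illustrative sketch: salted rowkey design to avoid region-server
# hotspotting when the natural key is monotonically increasing.
import hashlib

NUM_SALT_BUCKETS = 16  # assumed; typically tuned to the region count

def salted_rowkey(user_id, epoch_seconds):
    """Prefix the key with a stable hash bucket so writes for
    consecutive timestamps land on different regions."""
    salt = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % NUM_SALT_BUCKETS
    return f"{salt:02d}|{user_id}|{epoch_seconds:010d}"

key = salted_rowkey("alice", 1365000000)
```

The trade-off is that range scans must now fan out over all salt buckets, which is why rowkey design was called out as a schema decision rather than a default.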
17. Demo
Outputs, with the Hadoop system:
http://trackoutbound.mail.yahoo.com:9999/trackoutboundmail/
Demo back-end systems:
HBase analysis:
http://twiki.corp.yahoo.com/view/Mail/SELogsOnHBASE
TechPulse2013: http://techpulse-submission.corp.yahoo.com/paper?p=7&ls=1
Achievements:
Runs on Hadoop infrastructure.
Data aggregation at <10-minute intervals.
Sequential reads eliminated, with the HBase store.
Flexible and scalable aggregation framework.
Feature-rich mechanism for data visualization with Hive or Splunk.
19. HBase Overview
Apache HBase is an open-source, Bigtable-like, distributed, scalable, consistent,
random-access, columnar key-value store built on Apache Hadoop.
Column Family: Info
Rowkey | Email                  | Age | Password
Alice  | alice@wonderland.com   | 23  | trickedyou (ts1=1), newpassword (ts2=2)
Bob    | bob@myworld.com        | 25  | Iambob
Eve    | hithere@getintouch.com | 30  | nice1pass
The table is lexicographically sorted on rowkeys.
Each cell can hold multiple versions, represented by timestamps, where ts2 > ts1.
Identify your data (cell value) in the HBase table by
[1] rowkey, [2] column family, [3] column qualifier, [4] timestamp/version.
HBase Data Model
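The addressing scheme above (rowkey, then column family, then column qualifier, then timestamp) can be sketched as nested maps. This is a conceptual model of the data model, not HBase's actual storage format, and the sample data mirrors the table.

```python
# Conceptual model of HBase cell addressing: a value is located by
# rowkey -> column family -> column qualifier -> timestamp (version).
table = {
    "Alice": {"Info": {"Email":    {1: "alice@wonderland.com"},
                       "Age":      {1: "23"},
                       "Password": {1: "trickedyou", 2: "newpassword"}}},
    "Bob":   {"Info": {"Email":    {1: "bob@myworld.com"},
                       "Age":      {1: "25"},
                       "Password": {1: "Iambob"}}},
}

def get_cell(table, rowkey, cf, qualifier, ts=None):
    """Return the requested version, or the newest one if ts is None."""
    versions = table[rowkey][cf][qualifier]
    if ts is None:
        ts = max(versions)  # highest timestamp = newest version
    return versions[ts]

latest = get_cell(table, "Alice", "Info", "Password")     # newest version
old = get_cell(table, "Alice", "Info", "Password", ts=1)  # older version
rows_in_order = sorted(table)  # lexicographic rowkey order, as in HBase
```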
20. HBase Operations
get(<ROW>)
put(<ROW>, Map<KEY,VALUE>)
scan(<TABLE>)
checkAndDelete()
checkAndPut()
increment()
…check the HTable class for further details on operations.
Caution:
No queries
No secondary indexes
Billions of rows × millions of columns × thousands of versions
ZooKeeper as a distributed coordination service
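The semantics of the operations listed above can be mimicked with a tiny in-memory stand-in; this is Python rather than the real Java HTable client, and it ignores column families, versions, and atomicity, so treat it as a sketch only.

```python
# Minimal in-memory stand-in illustrating HBase-style operations:
# get, put, scan, checkAndPut, increment. Not the real HTable API.
class MiniTable:
    def __init__(self):
        self.rows = {}  # rowkey -> {qualifier: value}

    def put(self, row, kv):
        self.rows.setdefault(row, {}).update(kv)

    def get(self, row):
        return self.rows.get(row, {})

    def scan(self):
        # HBase scans return rows in lexicographic rowkey order
        return [(k, self.rows[k]) for k in sorted(self.rows)]

    def check_and_put(self, row, qualifier, expected, kv):
        # Atomic compare-and-set in real HBase; sequential here
        if self.get(row).get(qualifier) == expected:
            self.put(row, kv)
            return True
        return False

    def increment(self, row, qualifier, amount=1):
        current = int(self.get(row).get(qualifier, 0))
        self.put(row, {qualifier: current + amount})
        return current + amount

t = MiniTable()
t.put("mail#001", {"status": "queued"})
t.check_and_put("mail#001", "status", "queued", {"status": "sent"})
count = t.increment("stats", "sent_total")
```

Note how everything is keyed access: there is no query language or secondary index, which is exactly the caution above.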