2. Overview
Build a system/framework to aggregate and visualize user/system
insights from production application data sets.
Motivation: production servers generate huge volumes of logs, and
we realize that capturing all of our data is now economical and valuable.
Implementation
Hadoop & HBase/Hive
Aggregators
Scribe / Flume
Visualization
Hive with MySQL, or Splunk
Evaluate methodologies like
SHARK (suitable for the scale of a million events per hour and
terabytes of data store)
3. Use Cases
All about end-user-level metrics/events in production.
Track outbound/inbound mails, which number in the billions.
Consolidate scattered data sets spread across multiple application
servers.
Mail delivery percentiles, based on latency buckets.
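The latency-bucket percentile idea above can be sketched as follows; the bucket boundaries, counts, and function name are illustrative assumptions, not the actual pipeline's values.

```python
# Hypothetical sketch: estimating a delivery-latency percentile from
# pre-aggregated latency buckets rather than raw per-mail latencies.

def percentile_from_buckets(buckets, pct):
    """buckets: list of (upper_bound_seconds, count), sorted by bound.
    Returns the first bucket bound at or above the requested percentile."""
    total = sum(count for _, count in buckets)
    threshold = total * pct / 100.0
    running = 0
    for bound, count in buckets:
        running += count
        if running >= threshold:
            return bound
    return buckets[-1][0]

# Example: delivered-mail counts per latency bucket (illustrative)
latency_buckets = [(1, 700), (5, 200), (30, 80), (300, 20)]
p95 = percentile_from_buckets(latency_buckets, 95)  # -> 30 (seconds)
```

Working from buckets keeps the aggregation cheap at billions of events, at the cost of percentile resolution limited by the bucket boundaries.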
9. Architecture
[Architecture diagram: production hosts (ProdHosts) emit app logs and
unstructured/semi-formatted data, collected via Kafka/Flume and a
fetch-data tool, then pushed through HDFS Proxy to the grid
(Bassnium-Tan) using GDM pull, SSH keys, YCA auth, and DS Kerberos.
Oozie jobs run Pig to turn unstructured data into structured data,
which lands in HBase, Hive, SHARK, and MySQL; the aggregator produces
formatted data/graphs, served by a launcher/UI on Apache/PHP.]
10. Technology Stack
Data aggregation framework: started with custom scripts based on
logtail and parallel processing. Data is currently collected on a
time-interval basis, in 30-60 minute spans.
Evaluated Scribe, Fluentd, and Flume.
Hadoop: for raw data storage and processing, and for relating
user events.
Oozie: data aggregation, relation/processing, and data
management are all controlled in scheduled workflows.
HBase: for storing processed data in HTable format, so results
can be retrieved effectively.
Hive/SHARK: evaluating Shark to store the data for faster retrieval,
even at seconds-level intervals, with an in-memory store.
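The interval-based collection described above can be sketched as bucketing raw log records into fixed time windows; the 30-minute window size matches the slide, but the record format and function names are illustrative assumptions.

```python
# Illustrative sketch: grouping log records into fixed time windows,
# as in interval-based collection (30-60 minute spans).
from collections import defaultdict

WINDOW_SECONDS = 30 * 60  # assumed 30-minute aggregation interval

def window_start(epoch_seconds):
    """Round a timestamp down to the start of its window."""
    return epoch_seconds - (epoch_seconds % WINDOW_SECONDS)

def aggregate(records):
    """records: iterable of (epoch_seconds, event) pairs.
    Returns {window_start: event_count}."""
    counts = defaultdict(int)
    for ts, _event in records:
        counts[window_start(ts)] += 1
    return dict(counts)

logs = [(1000, "sent"), (1700, "bounced"), (2000, "sent")]
result = aggregate(logs)  # 1000 and 1700 share a window; 2000 does not
```

Each window's counts can then be flushed to the grid as one batch, which is what keeps the aggregation cadence at the 30-60 minute (later <10 minute) interval.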
14. Why Not Splunk
Splunk is commercial and lacks versatility.
Splunk is not very customizable;
we would have to depend on their tools and system.
We are using Splunk:
since a Splunk license is available at Yahoo, we intend to use it only
for data visualization, via the Hadoop Connect interface.
HUNK is a virtual indexer service, and would be the right choice for us.
15. Challenges & Learning
Hadoop data access is sequential.
Many small output files hit the namespace quota.
HBase is good for storing output files.
HBase schema decisions (rowkey design, reducing the number of
region servers).
SHARK evaluated: good, but not feasible to implement.
Flume: durable and flexible.
The Hive layer helped with data visualization.
Using the available Splunk license made visualization simpler.
Manage delayed-mail info in the delivery pipeline.
Consistency to plot 300-500 million events per day.
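One common answer to the rowkey-design challenge noted above is salting: prefixing keys with a stable hash bucket so monotonically increasing keys (e.g. timestamps) spread across regions instead of hotspotting one region server. The bucket count and key layout below are illustrative assumptions, not the deck's actual schema.

```python
# Illustrative sketch: salted rowkey design to avoid region-server
# hotspotting when the natural key is monotonically increasing.
import hashlib

NUM_SALT_BUCKETS = 16  # assumed; typically tuned to the region count

def salted_rowkey(user_id, epoch_seconds):
    """Prefix the key with a stable hash bucket so writes for
    consecutive timestamps land on different regions."""
    salt = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % NUM_SALT_BUCKETS
    return f"{salt:02d}|{user_id}|{epoch_seconds:010d}"

key = salted_rowkey("alice", 1365000000)
```

The trade-off is that range scans must now fan out over all salt buckets, which is why rowkey design was called out as a schema decision rather than a default.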
17. Demo
Outputs, with the Hadoop system:
http://trackoutbound.mail.yahoo.com:9999/trackoutboundmail/
Demo back-end systems:
HBase analysis:
http://twiki.corp.yahoo.com/view/Mail/SELogsOnHBASE
TechPulse2013: http://techpulse-submission.corp.yahoo.com/paper?p=7&ls=1
Achievements:
Runs on Hadoop infrastructure.
Data aggregation at <10-minute intervals.
Sequential reads eliminated, with the HBase store.
Flexible and scalable aggregation framework.
Feature-rich mechanism for data visualization with Hive or Splunk.
19. HBase Overview
Apache HBase is an open-source, Bigtable-like, distributed, scalable, consistent,
random-access, columnar key-value store built on Apache Hadoop.
Column Family: Info
Rowkey | Email                  | Age | Password
Alice  | alice@wonderland.com   | 23  | trickedyou (ts1=1), newpassword (ts2=2)
Bob    | bob@myworld.com        | 25  | Iambob
Eve    | hithere@getintouch.com | 30  | nice1pass
The table is lexicographically sorted on rowkeys.
Each cell can hold multiple versions, represented by timestamps, where ts2 > ts1.
Identify your data (cell value) in the HBase table by
[1] rowkey, [2] column family, [3] column qualifier, [4] timestamp/version.
HBase Data Model
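The addressing scheme above (rowkey, then column family, then column qualifier, then timestamp) can be sketched as nested maps. This is a conceptual model of the data model, not HBase's actual storage format, and the sample data mirrors the table.

```python
# Conceptual model of HBase cell addressing: a value is located by
# rowkey -> column family -> column qualifier -> timestamp (version).
table = {
    "Alice": {"Info": {"Email":    {1: "alice@wonderland.com"},
                       "Age":      {1: "23"},
                       "Password": {1: "trickedyou", 2: "newpassword"}}},
    "Bob":   {"Info": {"Email":    {1: "bob@myworld.com"},
                       "Age":      {1: "25"},
                       "Password": {1: "Iambob"}}},
}

def get_cell(table, rowkey, cf, qualifier, ts=None):
    """Return the requested version, or the newest one if ts is None."""
    versions = table[rowkey][cf][qualifier]
    if ts is None:
        ts = max(versions)  # highest timestamp = newest version
    return versions[ts]

latest = get_cell(table, "Alice", "Info", "Password")     # newest version
old = get_cell(table, "Alice", "Info", "Password", ts=1)  # older version
rows_in_order = sorted(table)  # lexicographic rowkey order, as in HBase
```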
20. HBase Operations
get(<ROW>)
put(<ROW>, Map<KEY,VALUE>)
scan(<TABLE>)
checkAndDelete()
checkAndPut()
increment()
…check the HTable class for further details on operations.
Caution:
No queries
No secondary indexes
Billions of rows × millions of columns × thousands of versions
ZooKeeper as a distributed coordination service
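The semantics of the operations listed above can be mimicked with a tiny in-memory stand-in; this is Python rather than the real Java HTable client, and it ignores column families, versions, and atomicity, so treat it as a sketch only.

```python
# Minimal in-memory stand-in illustrating HBase-style operations:
# get, put, scan, checkAndPut, increment. Not the real HTable API.
class MiniTable:
    def __init__(self):
        self.rows = {}  # rowkey -> {qualifier: value}

    def put(self, row, kv):
        self.rows.setdefault(row, {}).update(kv)

    def get(self, row):
        return self.rows.get(row, {})

    def scan(self):
        # HBase scans return rows in lexicographic rowkey order
        return [(k, self.rows[k]) for k in sorted(self.rows)]

    def check_and_put(self, row, qualifier, expected, kv):
        # Atomic compare-and-set in real HBase; sequential here
        if self.get(row).get(qualifier) == expected:
            self.put(row, kv)
            return True
        return False

    def increment(self, row, qualifier, amount=1):
        current = int(self.get(row).get(qualifier, 0))
        self.put(row, {qualifier: current + amount})
        return current + amount

t = MiniTable()
t.put("mail#001", {"status": "queued"})
t.check_and_put("mail#001", "status", "queued", {"status": "sent"})
count = t.increment("stats", "sent_total")
```

Note how everything is keyed access: there is no query language or secondary index, which is exactly the caution above.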