Hyperion: Real Time Analytics at flipkart.com

Hyperion
Santanu Sinha
Architect,
flipkart.com

What?
Event processing system
Used by dev teams at Flipkart for system-level insights and for debugging
problems in customer environments
Used by on-call for debugging CS issues
Used by business team for several use-cases

Why?
How many books were downloaded today?
How many people read books on web, Android, iOS, Windows?
What was the device/app condition before a crash?
eBook download failed... why?
JavaScript error on webreader... how to debug?
Supplier uploaded a book.. why didn’t it become live?
Why and when was a notification sent to the user?

Problem Statement
Persisting events to the system should not slow down the client
Should work with all types of clients written in various languages
Should be dead simple to onboard
Has to guarantee that once an event is accepted, it will be persisted
Storage system will go down
Should not lead to message loss
Should be able to catch up quickly
Critical functionality should not be broken

Problem Statement
Should provide the most commonly used functionality in real time
over a pre-specified window of data
Search on fields
Nested group counts on fields
Event generation and ingestion histograms
Trending
Should provide APIs for custom consoles
Backend storage should allow for easy consumption later
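
As a rough illustration of two of the query types listed above, here is a minimal in-memory sketch of nested group counts and an ingestion histogram. The function names `nested_group_counts` and `histogram` and the sample events are hypothetical, and the bucket width is assumed to be in the same unit as the event timestamp. This only shows the shape of the results, not how Hyperion computes them.

```python
from collections import Counter, defaultdict


def nested_group_counts(events, outer, inner):
    """Count events grouped by header[outer], then by header[inner]."""
    result = defaultdict(Counter)
    for event in events:
        header = event["header"]
        result[header[outer]][header[inner]] += 1
    return {key: dict(counts) for key, counts in result.items()}


def histogram(events, bucket_width):
    """Count events per fixed-width time bucket, keyed by bucket start."""
    counts = Counter()
    for event in events:
        ts = event["header"]["timestamp"]
        counts[ts - ts % bucket_width] += 1
    return dict(sorted(counts.items()))


# Hypothetical sample data, for illustration only.
sample = [
    {"header": {"eventType": "BOOK_DOWNLOAD_SUCCESS", "platform": "android", "timestamp": 1000}},
    {"header": {"eventType": "BOOK_DOWNLOAD_SUCCESS", "platform": "ios", "timestamp": 1500}},
    {"header": {"eventType": "BOOK_DOWNLOAD_FAILED", "platform": "android", "timestamp": 2100}},
]
print(nested_group_counts(sample, "eventType", "platform"))
print(histogram(sample, 1000))
```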

Messaging
Needed a highly scalable messaging system
Choice: Apache Kafka
Open Source
Extremely fast
Simple architecture
Replication (0.8)
Provides offset based reads to be used for retries
Problems:
It was in beta when we started
Where to store the offsets?
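
To make the offset-based retry idea concrete, here is a minimal sketch that uses an in-memory list as a stand-in for one Kafka partition and a plain dict as a stand-in for the offset store. `Log`, `consume`, and `offset_store` are hypothetical names and this is not the Kafka client API. The point is only that the offset is committed after a batch succeeds, so a failed batch is read again on the next run.

```python
class Log:
    """Append-only in-memory log standing in for one partition."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1  # offset of the new message

    def read(self, offset, max_messages):
        return self._messages[offset:offset + max_messages]


def consume(log, offset_store, handler, batch_size=2):
    """Process batches from the last committed offset onward.

    The offset is committed only after handler() returns, so if the
    handler raises, the same batch is redelivered on the next call.
    """
    offset = offset_store.get("offset", 0)
    while True:
        batch = log.read(offset, batch_size)
        if not batch:
            return offset
        handler(batch)  # may raise; the committed offset is then unchanged
        offset += len(batch)
        offset_store["offset"] = offset
```

Note that this gives at-least-once delivery: a handler that fails halfway through a batch will see the whole batch again.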

Ingestion API
Decided to go with HTTP
Framework should add as little overhead as possible
on top of HTTP latency
Choice: RestExpress
Open Source
Fast
Extremely lightweight

Storage
Broken up into two parts
Short term store
Used for search, group, histogram, trends etc
Messages will be automatically deleted using TTL
Break up the query space
Long term store
To be used with custom offline jobs
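
The TTL behaviour of the short-term store can be sketched as below. This is an in-memory illustration only: `TTLStore` is a hypothetical class, the clock is passed in explicitly to keep the example deterministic, and a real document store would typically expire documents on its own schedule.

```python
class TTLStore:
    """Keeps documents for ttl_seconds, then drops them on purge()."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self._docs = {}  # event_id -> (stored_at, document)

    def put(self, event_id, document, now):
        self._docs[event_id] = (now, document)

    def get(self, event_id):
        entry = self._docs.get(event_id)
        return entry[1] if entry else None

    def purge(self, now):
        """Delete every document at least ttl_seconds old; return the count."""
        expired = [
            event_id
            for event_id, (stored_at, _) in self._docs.items()
            if now - stored_at >= self.ttl_seconds
        ]
        for event_id in expired:
            del self._docs[event_id]
        return len(expired)
```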

Storage
Choices:
Short Term: MongoDB
Open Source
Provides a stable document store
Provides good search and grouping capabilities
Easy to maintain
Problems
Indexes need to be present for searching to work properly
Large aggregations will take time
Long Term: HBase

Processing pipeline
Needs to be very fast, with an emphasis on retries
Choice: Storm
Open Source
Proven speed in production
Handling system failure is at the heart of design
Problems
No spout for Kafka 0.8 (yay!!)

How to process...
Store offsets in HBase
Fast reads/writes for row get/put
Don’t bug Mongo for frequently used stuff like number of events getting ingested
Precompute in Storm, store specialized documents containing counts at different
granularity
Allow only specific search operations on fields
Detect fields from messages in Storm
Detect field metadata in Storm
Write message to both stores or fail
Use Transactional Topology
Write the spout
Spread out query space
Divide events into apps
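
The "precompute counts at different granularity" step can be sketched as follows. The granularity names, the key layout, and the assumption that `timestamp` is in epoch seconds are all hypothetical. The idea shown is that a batch is reduced to one count delta per (app, event type, granularity, bucket) key, so the store receives one update per key rather than one per event.

```python
from collections import Counter

# Assumed granularities, expressed as bucket widths in seconds.
GRANULARITIES = {"minute": 60, "hour": 3600, "day": 86400}


def rollup_keys(app, event_type, epoch_seconds):
    """Return one counter key per granularity for a single event."""
    keys = []
    for name, width in GRANULARITIES.items():
        bucket_start = epoch_seconds - epoch_seconds % width
        keys.append((app, event_type, name, bucket_start))
    return keys


def precompute(batch):
    """Reduce a batch of events to count deltas per rollup key."""
    counts = Counter()
    for event in batch:
        header = event["header"]
        for key in rollup_keys(header["app"], header["eventType"], header["timestamp"]):
            counts[key] += 1
    return counts
```

Each resulting key/value pair could then be applied to a counter document as a single increment.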

Event/Message
{
"header" : {
"app" : "appName",
"eventType" : "BOOK_DOWNLOAD_SUCCESS",
"platform" : "android",
"timestamp" : 137292987252,
"instance" : "blah-blah",
"eventId" : "sd2131as-2131asa-3214asda2-dffsd223"
},
"data" : {
"bookId" : "SDF123121231",
"timeTakenSecs" : 30
}
}
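
A minimal sketch of the kind of basic envelope check an ingestion endpoint might run on a message shaped like the one above. Which header fields are mandatory is an assumption here (`REQUIRED_HEADER_FIELDS` is hypothetical), as is the function name `validate_event`.

```python
import json

# Assumption: these header fields are treated as mandatory.
REQUIRED_HEADER_FIELDS = ("app", "eventType", "timestamp", "eventId")


def validate_event(raw):
    """Parse a raw JSON event and check its envelope; raise ValueError if bad."""
    try:
        event = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(event, dict):
        raise ValueError("event must be a JSON object")
    header = event.get("header")
    if not isinstance(header, dict):
        raise ValueError("missing header object")
    missing = [field for field in REQUIRED_HEADER_FIELDS if field not in header]
    if missing:
        raise ValueError("missing header fields: " + ", ".join(missing))
    if not isinstance(event.get("data"), dict):
        raise ValueError("missing data object")
    return event
```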

Build on top
Messages relayed out to another Kafka cluster
Simple predicate based subscription from
clients
Publish from a separate topology
Build alerting etc on top of this
Custom consoles can be built on query APIs
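
Predicate-based subscription could look roughly like this. The predicate format (a dict of header field to required value) and the `Relay` class are hypothetical; the slides do not specify how predicates are expressed.

```python
def matches(predicate, event):
    """True if every field named in the predicate equals the header value."""
    header = event["header"]
    return all(header.get(field) == value for field, value in predicate.items())


class Relay:
    """Delivers each published event to the subscribers whose predicate matches."""

    def __init__(self):
        self._subscriptions = []

    def subscribe(self, predicate, callback):
        self._subscriptions.append((predicate, callback))

    def publish(self, event):
        delivered = 0
        for predicate, callback in self._subscriptions:
            if matches(predicate, event):
                callback(event)
                delivered += 1
        return delivered
```

An alerting client, for example, might subscribe with `{"eventType": "BOOK_DOWNLOAD_FAILED"}` and count what it receives.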

Status
Processes around 35 million events/day
3 node Kafka cluster
2 worker nodes on Storm
3 node MongoDB cluster
~900GB of data on query cluster
More than 5TB on HBase
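
For a sense of scale, the daily volume above converts to an average rate as follows. This is an average only; the slides give no peak figure.

```python
EVENTS_PER_DAY = 35_000_000      # figure from the slide
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400

average_events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY
print(f"{average_events_per_second:.0f} events/s on average")  # roughly 405
```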

Speed Zen...
Don’t do a lot in the ingestion API
Do basic checking and forward, handle in
downstream system
Use batching wherever possible
A batch write to Kafka, for reasonably sized batches, costs
almost the same as a single write
Batch writes to HBase are very fast
Do not call update at event level on MongoDB

Speed Zen...
Storm has a neat trick up its sleeve... use it!!
We were getting bogged down by speed of writer bolts
Problem: Connectivity to DB for every batch
Use TopologyContext
Save connection and other repeat use objects in context
Reuse
Connection is closed when the bolt dies (generally never)
MASSIVE speed gain...
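
The connection-reuse pattern described above can be sketched in a language-neutral way as below. Here a plain dict stands in for the shared context object and `FakeConnection` stands in for a real database client; both are hypothetical and this is not Storm's API. The connection is created on first use and then reused for every later batch.

```python
class FakeConnection:
    """Stand-in for a database client; counts how many were opened."""

    opened = 0

    def __init__(self):
        FakeConnection.opened += 1
        self.writes = []

    def write_batch(self, batch):
        self.writes.append(list(batch))


def get_connection(context, factory=FakeConnection):
    """Return the connection cached in the context, creating it on first use."""
    if "db_connection" not in context:
        context["db_connection"] = factory()
    return context["db_connection"]


def write_batch(context, batch):
    """Write a batch using the cached connection instead of reconnecting."""
    get_connection(context).write_batch(batch)
```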

Keep it real!!
System is part of the problem
Fix an app-level schema containing all the data fields
Only consider a small set of analytics for real-time
Fancy stuff can be done offline
Go with the simplest solution
Keep a focus on speed, everything else can wait
Consider scale from start

Upstream
Precog
A flexible dynamic event processing system
Build and spawn computations on the fly based on simple processing
primitives
Already in production, storing around 100 million filtered and enriched
log lines per day
Foxtrot
A flexible, specialized storage layer for effectively querying stored events
Extremely extensible allowing for addition of new analytics features very
easily
Allow for distributed and async query execution for long running queries
Intelligent distributed caching for query responses

Resources
Kafka: http://kafka.apache.org/
Storm: http://storm.incubator.apache.org/
RestExpress: https://github.com/RestExpress/RestExpress
MongoDB: https://www.mongodb.org/
HBase: https://hbase.apache.org/
