A presentation on Hyperion, the in-house real-time analytics platform at Flipkart.com. The backend has changed now to use Foxtrot, available for free at: https://github.com/flipkart-incubator/foxtrot
2. What?
Event processing system
Used by dev teams at Flipkart for system
level insights and for debugging problems in
customer environments
Used by on-call for debugging CS issues
Used by business team for several use-cases
5. Why?
How many books were downloaded today?
How many people read books on web, android, iOS,
windows?
What was the device/app condition before a crash?
eBook download failed .. why?
JavaScript error on webreader... how to debug?
Supplier uploaded a book.. why didn't it become live?
Why and when was a notification sent to the user?
6. Problem Statement
Persisting events to the system should not slow down the client
Should work with all types of clients written in various languages
Should be dead simple to onboard
Has to guarantee that once an event is accepted it will be
persisted
Storage system will go down
Should not lead to message loss
Should be able to catch up quickly
Critical functionality should not be broken
7. Problem Statement
Should provide the most-used functionality in real time
over a pre-specified period of data
Search on fields
Nested group counts on fields
Event generation and ingestion histograms
Trending
Should provide APIs for custom consoles
Backend storage should allow for easy consumption later
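A minimal sketch of what "nested group counts on fields" could look like, written against plain in-memory dictionaries rather than the actual store. The function name `nested_group_counts` and the sample event fields (`app`, `platform`, `type`) are assumptions made for illustration only.

```python
from collections import defaultdict

def nested_group_counts(events, fields):
    """Count events grouped by each field in `fields`, nested in the given order."""
    if not fields:
        return len(events)
    head, rest = fields[0], fields[1:]
    buckets = defaultdict(list)
    for event in events:
        # Events missing the field are grouped under None.
        buckets[event.get(head)].append(event)
    return {key: nested_group_counts(group, rest) for key, group in buckets.items()}

# Hypothetical events, loosely modelled on the questions in the "Why?" slide.
events = [
    {"app": "reader", "platform": "android", "type": "download"},
    {"app": "reader", "platform": "android", "type": "crash"},
    {"app": "reader", "platform": "ios", "type": "download"},
]
counts = nested_group_counts(events, ["platform", "type"])
```

Grouping by `platform` and then `type` answers questions such as "how many downloads happened on each platform" in a single pass over the events for the chosen period.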
10. Messaging
Needed a highly scalable messaging system
Choice: Apache Kafka
Open Source
Extremely fast
Simple architecture
Replication (0.8)
Provides offset based reads to be used for retries
Problems:
It was in beta when we started
Where to store the offsets?
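The idea behind offset-based reads for retries can be sketched with an in-memory log. This is not the Kafka client API: `InMemoryLog`, `consume`, and the dict used as an offset store are hypothetical stand-ins. The point is only that the stored offset moves forward after a batch has been handled successfully, so a failed batch is read again on the next attempt.

```python
class InMemoryLog:
    """Hypothetical stand-in for one partition of an append-only log."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1  # offset of the appended message

    def read(self, offset, max_messages):
        return self._messages[offset:offset + max_messages]

def consume(log, offset_store, key, handler, batch_size=2):
    """Process one batch; advance the stored offset only if the handler succeeds."""
    offset = offset_store.get(key, 0)
    batch = log.read(offset, batch_size)
    if not batch:
        return 0
    handler(batch)  # may raise; the offset then stays put and the batch is re-read
    offset_store[key] = offset + len(batch)
    return len(batch)

log = InMemoryLog()
for i in range(3):
    log.append(f"event-{i}")

offsets = {}   # assumed external offset store, modelled as a dict
seen = []
attempts = {"n": 0}

def flaky_handler(batch):
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise IOError("storage is down")  # simulate an outage on the first try
    seen.extend(batch)

try:
    consume(log, offsets, "topic-0", flaky_handler)
except IOError:
    pass  # nothing was committed, so the same batch is retried below
consume(log, offsets, "topic-0", flaky_handler)
consume(log, offsets, "topic-0", flaky_handler)
```

Because the offset lives outside the consumer, a restarted consumer can resume from the last committed position, which is why the choice of offset store matters.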
11. Ingestion API
Decided to go with HTTP
Framework should add as little as possible over
HTTP delay
Choice: RestExpress
Open Source
Fast
Extremely lightweight
12. Storage
Broken up into two parts
Short term store
Used for search, group, histogram, trends etc
Messages will be automatically deleted using TTL
Break up the query space
Long term store
To be used with custom offline jobs
13. Storage
Choices:
Short Term: MongoDB
Open Source
Provides a stable document store
Provides good search and grouping capabilities
Easy to maintain
Problems
Indexes need to be present for searching to work properly
Large aggregations will take time
Long Term: HBase
14. Processing pipeline
Needs to be very fast, with an emphasis on retries
Choice: Storm
Open Source
Proven speed in production
Handling system failure is at the heart of design
Problems
No spout for Kafka 0.8 (yay!!)
15. How to process...
Store offsets in HBase
Fast reads/writes for row get/put
Don't bug Mongo for frequently used stuff like number of events getting ingested
Precompute in Storm, store specialized documents containing counts at different
granularity
Allow only specific search operations on fields
Detect fields from messages in Storm
Detect field metadata in Storm
Write message to both stores or fail
Use Transactional Topology
Write the spout
Spread out query space
Divide events into apps
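The precompute step above (counts stored at different granularities so that the query store is not asked for them repeatedly) might look roughly like the sketch below. It is plain Python, not Storm code; the `rollup` function, the three granularities, and the `app`/`timestamp` field names are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime, timezone

# Assumed granularities; each maps to the strftime pattern used as its bucket key.
GRANULARITIES = {
    "minute": "%Y-%m-%dT%H:%M",
    "hour": "%Y-%m-%dT%H",
    "day": "%Y-%m-%d",
}

def rollup(events):
    """Return a Counter keyed by (app, granularity, bucket) for a batch of events."""
    counts = Counter()
    for event in events:
        ts = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
        for name, pattern in GRANULARITIES.items():
            counts[(event["app"], name, ts.strftime(pattern))] += 1
    return counts

# Hypothetical batch: timestamps are seconds since the Unix epoch.
batch = [
    {"app": "reader", "timestamp": 0},
    {"app": "reader", "timestamp": 30},
    {"app": "reader", "timestamp": 3600},
]
counts = rollup(batch)
```

Each key in `counts` would correspond to one specialized count document, incremented once per batch rather than once per event.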
17. Build on top
Messages relayed out to another Kafka cluster
Simple predicate based subscription from
clients
Publish from a separate topology
Build alerting etc on top of this
Custom consoles can be built on query APIs
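One way to picture simple predicate-based subscription is as exact-match filters on event fields. The slides do not describe the actual predicate format, so the dict-based predicates, the `matches` and `route` helpers, and the subscriber names below are all hypothetical.

```python
def matches(predicate, event):
    """A predicate is a dict of field -> required value; all entries must match."""
    return all(event.get(field) == value for field, value in predicate.items())

def route(event, subscriptions):
    """Return the names of the subscribers whose predicate matches the event."""
    return [name for name, predicate in subscriptions.items() if matches(predicate, event)]

# Hypothetical subscriptions registered by clients.
subscriptions = {
    "crash-alerts": {"type": "crash"},
    "android-downloads": {"platform": "android", "type": "download"},
}

crash = {"app": "reader", "platform": "android", "type": "crash"}
download = {"app": "reader", "platform": "android", "type": "download"}
crash_targets = route(crash, subscriptions)
download_targets = route(download, subscriptions)
```

An alerting component could then be just another subscriber with its own predicate.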
18. Status
Processes around 35 million events/day
3 node Kafka cluster
2 worker nodes on Storm
3 node MongoDB cluster
~900GB of data on query cluster
More than 5TB on HBase
19. Speed Zen...
Don't do a lot in the ingestion API
Do basic checking and forward, handle in
downstream system
Use batching wherever possible
A batch write to Kafka, for a reasonably sized batch, takes
almost the same time as a single write
Batch writes to HBase are very fast
Do not call update at event level on MongoDB
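The batching advice can be illustrated with a small buffer that hands events to a sink in groups. This is a generic sketch, not the Kafka or HBase client: `BatchingWriter` and its `sink` callable are hypothetical, and a real writer would probably also flush on a timer.

```python
class BatchingWriter:
    """Buffers events and passes them to `sink` in batches of up to `max_batch_size`."""

    def __init__(self, sink, max_batch_size):
        self._sink = sink            # callable that accepts a list of events
        self._max = max_batch_size
        self._buffer = []

    def write(self, event):
        self._buffer.append(event)
        if len(self._buffer) >= self._max:
            self.flush()

    def flush(self):
        if self._buffer:
            self._sink(list(self._buffer))  # one downstream call per batch
            self._buffer.clear()

batches = []                      # records what the sink received
writer = BatchingWriter(batches.append, max_batch_size=2)
for i in range(5):
    writer.write(i)
writer.flush()                    # push out the final partial batch
```

Five events result in three downstream calls instead of five, which is the saving the slide refers to.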
20. Speed Zen...
Storm has a neat trick up its sleeve... use it!
We were getting bogged down by speed of writer bolts
Problem: Connectivity to DB for every batch
Use TopologyContext
Save connection and other repeat use objects in context
Reuse
Connection is closed when bolt dies (generally never)
MASSIVE speed gain...
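The connection-reuse pattern described above can be sketched as follows. A plain dict stands in for Storm's TopologyContext, and `FakeConnection`, `get_connection`, and `write_batch` are hypothetical names; the sketch only shows that the connection is created once and then reused for every later batch.

```python
class FakeConnection:
    """Hypothetical database connection that counts how often one is opened."""

    opened = 0

    def __init__(self):
        FakeConnection.opened += 1
        self.writes = []

    def write_batch(self, batch):
        self.writes.append(list(batch))

def get_connection(context, factory=FakeConnection):
    """Create the connection on first use, then return the cached one."""
    if "connection" not in context:
        context["connection"] = factory()
    return context["connection"]

def write_batch(context, batch):
    """What a writer bolt would do per batch: reuse the connection, then write."""
    get_connection(context).write_batch(batch)

context = {}  # assumed stand-in for the shared context object
for batch in (["a", "b"], ["c"], ["d", "e"]):
    write_batch(context, batch)
```

Three batches are written but only one connection is opened, which removes the per-batch connection cost named as the problem.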
21. Keep it real!!
System is part of the problem
Fix app level schema containing all the data fields
Only consider a small set of analytics for real-time
Fancy stuff can be done offline
Go with the simplest solution
Keep a focus on speed, everything else can wait
Consider scale from start
22. Upstream
Precog
A flexible dynamic event processing system
Build and spawn computations on the fly based on simple processing
primitives
Already in production, storing around 100 million filtered and enriched
log lines per day
Foxtrot
A flexible, specialized storage layer for effectively querying stored events
Extremely extensible allowing for addition of new analytics features very
easily
Allows for distributed and async query execution for long running queries
Intelligent distributed caching for query responses