This is the presentation given in Fifth Elephant Conference 2013. It talks about how we've created a cloud based big data which is low on maintenance and running cost. Key technologies used here are Twitter Finagle, Apache Kafka, Apache Zookeeper, Amazon S3 and Amazon EMR.
4. Use case : List products based on CTR
● Take all impressions of a product and action
performed
● Some products are more attractive than
others
● Give benefit to such products
5. Use case : List products based on CTR
● select product_id, sum(clicked)/sum
(appeared) as ctr from tbl_prod_log group
by product_id order by ctr desc
● >100K products, > 500 million impressions
a day --- DIFFICULT TO SCALE
6. Use case : User segmentation
● Different users have
different browsing
patterns
● Segment them based
on their history
● Provide them different
experience
7. Use case : User segmentation
● select depth, count
(cookie_id), group by
depth from user_log
● > 1m users daily,
multiple browsers,
devices
● DIFFICULT TO SCALE
8. Use case : Recommend similar products
● Compute score of
products based on
various attributes
● Compute score of a
user based on
products (s)he
browses
● Recommend similar
products
9. Use case : Recommend similar products
● select id, (w1.att1 + w2.
att2 + ... wN.attN) as
score from products
● select userid, (v1.
score1 + v2.score2 + ...
+ vN.scoreN)
● >1m user >100K
products DIFFICULT TO
COMPUTE
11. Design goals
● Solution should be able to scale up and
down
● Record data now, ask questions later
● Generic data model
● Segregate reads from writes
● Low running cost
● Low maintenance overhead
12. Cloud computing
Pros
● No setup cost
● Pay as you use
● Scaling is a breeze
● Managed services
Cons
● Performance
● Reliability
● Data security
● Control
13. A very basic Big Data system
Highly available
Very low latency
Initial filtering
Storage agnostic
Scale up and down
easily
Essentially distributed
Very easy to use
Highly reliable
Huge capacity
Cater to any data model
Cheap
15. Architecture Diagram Hadoop on cloud
Easy to scale up
and down
Pay as you use
Infinite capacity
11 nines of
durability
Flat file storage
Cheap
Persistent
distributed Q
100K msg/sec
Events can be
played back
Highly concurrent
server
Very easy to use
Flexible
Much easier to
introduce HA,
reliability etc
Both server and
client side data
Segregate and
upload events to
S3
Scales horizontally
Distributed
config mgmt
Fault tolerant
16. Some numbers
● ~20 million events getting logged daily
● Corresponds to ~800 million data points
● & ~25GB
● Close to a 100 jobs a day
● The biggest job has footprints of ~2
billion events
● Platform costs ~20$ daily; jobs ~15$ daily
17. ● One can code in english (Finagle)
myService = handleExceptions andThen recordInKafka andThen respond
● Need not be in C or Erlang to be performant (Kafka)
● Can search without index
s3://<BUCKET>/addToCart/y=2013/m=06/d=14/h=13/min=30
s3://<BUCKET>/orderConfirmation/y=2013/m=06/d=14/h=13/min=30
● Spot EMR clusters effeciently
● m1.small are not small
● awk + grep = awesome
● Apache mailing lists SUCK!!!
Some key learnings