Myntra.com's Big Data Platform


This presentation was given at the Fifth Elephant Conference 2013. It describes how we created a cloud-based big data platform with low maintenance and running costs. The key technologies used are Twitter Finagle, Apache Kafka, Apache ZooKeeper, Amazon S3 and Amazon EMR.


Cloud-based, low-cost, low-maintenance, scalable data platform
Apoorva Gaurav

Why hunt an elephant to sell shoes?
WHOM
HOW
WHAT

Use case: List products based on CTR
● Take all impressions of a product and the action performed on each
● Some products are more attractive than others
● Give benefit to such products

Use case: List products based on CTR
● select product_id, sum(clicked)/sum(appeared) as ctr from tbl_prod_log group by product_id order by ctr desc
● >100K products, >500 million impressions a day: DIFFICULT TO SCALE
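The CTR query above can be sketched outside SQL as a single pass over the impression log. This is a minimal illustration, not the platform's actual job: the row layout `(product_id, appeared, clicked)` mirrors the column names in the query, and the sample rows are made up.

```python
from collections import defaultdict


def ctr_by_product(rows):
    """Return (product_id, ctr) pairs sorted by CTR, highest first.

    rows: iterable of (product_id, appeared, clicked) tuples, an assumed
    layout that mirrors the columns used in the slide's query.
    """
    appeared = defaultdict(int)
    clicked = defaultdict(int)
    for product_id, n_appeared, n_clicked in rows:
        appeared[product_id] += n_appeared
        clicked[product_id] += n_clicked
    # Skip products with no impressions to avoid dividing by zero.
    ctr = {p: clicked[p] / appeared[p] for p in appeared if appeared[p] > 0}
    return sorted(ctr.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical sample log: one row per impression.
sample_log = [
    ("shoe-1", 1, 0),
    ("shoe-1", 1, 1),
    ("shirt-7", 1, 1),
    ("shirt-7", 1, 1),
    ("bag-3", 1, 0),
]
ranking = ctr_by_product(sample_log)
print(ranking)
```

Because the aggregation only keeps two counters per product, it splits naturally into per-partition partial sums that are merged afterwards, which is the shape a MapReduce job would take.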

Use case: User segmentation
● Different users have different browsing patterns
● Segment them based on their history
● Provide them a different experience

Use case: User segmentation
● select depth, count(cookie_id) from user_log group by depth
● >1m users daily, multiple browsers, devices
● DIFFICULT TO SCALE
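The segmentation query counts how many cookies fall into each browsing depth. A minimal sketch of the same grouping, assuming `user_log` holds one `(cookie_id, depth)` row per cookie; the sample rows are hypothetical.

```python
from collections import Counter


def users_per_depth(user_log):
    """Count rows per depth, like count(cookie_id) ... group by depth.

    user_log: iterable of (cookie_id, depth) tuples (assumed layout).
    """
    return Counter(depth for _cookie_id, depth in user_log)


# Hypothetical sample: six cookies with browsing depths 1, 3 and 7.
sample_user_log = [
    ("c1", 1), ("c2", 3), ("c3", 1),
    ("c4", 7), ("c5", 3), ("c6", 1),
]
segments = users_per_depth(sample_user_log)
print(dict(segments))
```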

Use case: Recommend similar products
● Compute score of products based on various attributes
● Compute score of a user based on products (s)he browses
● Recommend similar products

Use case: Recommend similar products
● select id, (w1.att1 + w2.att2 + ... + wN.attN) as score from products
● select userid, (v1.score1 + v2.score2 + ... + vN.scoreN)
● >1m users, >100K products: DIFFICULT TO COMPUTE
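Reading `w1.att1` as "weight w1 times attribute att1" (an assumption; the slide does not spell out the operator), both scores are weighted sums. The sketch below uses made-up attribute values and weights purely for illustration.

```python
def weighted_score(values, weights):
    """Return sum(w_i * x_i); the two sequences must have equal length."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for w, x in zip(weights, values))


# Hypothetical product attributes (att1..att3) and weights (w1..w3).
product_attributes = {"p1": [1.0, 0.0, 2.0], "p2": [0.5, 1.0, 0.0]}
attribute_weights = [2.0, 1.0, 0.5]

product_scores = {
    pid: weighted_score(attrs, attribute_weights)
    for pid, attrs in product_attributes.items()
}

# A user's score is a weighted sum over the scores of browsed products.
browsed = ["p1", "p2"]
browse_weights = [0.5, 0.5]  # hypothetical v1, v2
user_score = weighted_score([product_scores[p] for p in browsed], browse_weights)
print(product_scores, user_score)
```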

Constraints
● Fast paced
● Tangible results
● Limited budget
● Low engineering bandwidth

Design goals
● Solution should be able to scale up and down
● Record data now, ask questions later
● Generic data model
● Segregate reads from writes
● Low running cost
● Low maintenance overhead

Cloud computing
Pros
● No setup cost
● Pay as you use
● Scaling is a breeze
● Managed services
Cons
● Performance
● Reliability
● Data security
● Control

A very basic Big Data system
Highly available
Very low latency
Initial filtering
Storage agnostic
Scale up and down easily
Essentially distributed
Very easy to use
Highly reliable
Huge capacity
Cater to any data model
Cheap

Architecture Diagram
Hadoop on cloud
Easy to scale up and down
Pay as you use
Infinite capacity
11 nines of durability
Flat file storage
Cheap
Persistent distributed queue
100K msg/sec
Events can be played back
Highly concurrent server
Very easy to use
Flexible
Much easier to introduce HA, reliability etc.
Both server and client side data
Segregate and upload events to S3
Scales horizontally
Distributed config mgmt
Fault tolerant

Some numbers
● ~20 million events logged daily
● Corresponds to ~800 million data points
● and ~25 GB
● Close to 100 jobs a day
● The biggest job has a footprint of ~2 billion events
● Platform costs ~$20 daily; jobs ~$15 daily
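The figures above imply a few per-event ratios. The arithmetic below is only a back-of-the-envelope check: it treats the approximate slide numbers as exact and assumes decimal gigabytes.

```python
# Approximate daily figures taken from the slide.
events_per_day = 20_000_000
data_points_per_day = 800_000_000
bytes_per_day = 25 * 10**9          # ~25 GB, assuming 1 GB = 10**9 bytes
platform_cost_usd = 20
jobs_cost_usd = 15

points_per_event = data_points_per_day / events_per_day
bytes_per_event = bytes_per_day / events_per_day
daily_cost_usd = platform_cost_usd + jobs_cost_usd
cost_per_million_events = daily_cost_usd / (events_per_day / 1_000_000)

print(f"data points per event: {points_per_event:.0f}")
print(f"bytes per event: {bytes_per_event:.0f}")
print(f"total daily cost: ${daily_cost_usd}")
print(f"cost per million events: ${cost_per_million_events:.2f}")
```

Under these assumptions each event carries about 40 data points and about 1.25 KB, and the platform plus jobs cost about $1.75 per million events.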

Some key learnings
● One can code in English (Finagle)
myService = handleExceptions andThen recordInKafka andThen respond
● Need not be in C or Erlang to be performant (Kafka)
● Can search without an index
s3://<BUCKET>/addToCart/y=2013/m=06/d=14/h=13/min=30
s3://<BUCKET>/orderConfirmation/y=2013/m=06/d=14/h=13/min=30
● Spot EMR clusters efficiently
● m1.small instances are not small
● awk + grep = awesome
● Apache mailing lists SUCK!!!
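The "search without an index" point relies on the key layout: because event type and time are encoded in the S3 key, a job can select its input by building prefixes instead of querying anything. A minimal sketch of that idea follows. The 30-minute partition size is an assumption inferred from `min=30` in the example paths, and `example-bucket` is a placeholder name.

```python
from datetime import datetime, timedelta


def s3_prefixes(bucket, event, start, end, step_minutes=30):
    """List key prefixes for `event` covering the half-open range [start, end).

    The key layout copies the slide's example paths; step_minutes=30 is an
    assumed partition size, not something the slides state.
    """
    prefixes = []
    current = start
    while current < end:
        prefixes.append(
            f"s3://{bucket}/{event}"
            f"/y={current:%Y}/m={current:%m}/d={current:%d}"
            f"/h={current:%H}/min={current:%M}"
        )
        current += timedelta(minutes=step_minutes)
    return prefixes


# One hour of addToCart events on 14 June 2013, 13:00 to 14:00.
paths = s3_prefixes(
    "example-bucket",
    "addToCart",
    datetime(2013, 6, 14, 13, 0),
    datetime(2013, 6, 14, 14, 0),
)
for path in paths:
    print(path)
```

The resulting prefixes can be passed as input paths to a batch job, so narrowing a job to one event type and time window needs no lookup beyond the path itself.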

Thank you!!
& we are hiring
apoorva.gaurav@myntra.com
