2. eHarmony
• Online subscription-based matchmaking
service
• Available in the United States, Canada, Australia, and the United Kingdom.
• On average, 236 US members marry every day.
• More than 20 million registered users.
3. Why Cloud?
• Problem exceeds the limits of the data
center and data warehouse environment.
• Leverage EC2 and Hadoop to scale data processing
7. Requirement
• All the matches, scores, and user
information should be archived daily
• Ready for 10X growth
• Possible O(n²) problem
• Need to support a set of models that is becoming more complex
8. Challenge
• Current architecture is multi-tiered with a
relational back-end
• Scoring is DB join intensive
• Data needs constant archiving
– Matches, match scores, user attributes at
time of match creation
– Model validation is done at a later time
across many days
• Need a non-DB solution
9. Solution
• Hadoop: an open-source Java implementation of Google’s MapReduce framework
– Distributes work across vast amounts of data
– Hadoop Distributed File System (HDFS)
provides reliability through replication
– Automatic re-execution and redistribution of failed tasks
– Scale horizontally on commodity hardware
10. Amazon Web Services
• Simple Storage Service (S3) provides cheap, effectively unlimited storage.
• Elastic Compute Cloud (EC2) enables horizontal scaling by adding servers on demand.
11. MapReduce
• A large server farm can use MapReduce to process huge datasets.
• Map step
– Master node takes the input
– Chops it up into smaller sub-problems
– Distributes those to worker nodes.
• Reduce step
– Master node collects the answers to all the sub-problems
– Combines them to produce the final output
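The map, shuffle, and reduce steps above can be sketched in a few lines of Python — a single-process illustration of the data flow, not Hadoop's API (in a real cluster the master distributes each phase across worker nodes):

```python
from collections import defaultdict

def map_step(document):
    # Map: emit an intermediate (key, value) pair for every word.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle/sort: group all values that share the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_step(key, values):
    # Reduce: combine the values for one key into a final result.
    return (key, sum(values))

documents = ["the quick brown fox", "the lazy dog"]
pairs = [p for doc in documents for p in map_step(doc)]
counts = dict(reduce_step(k, v) for k, v in shuffle(pairs).items())
print(counts["the"])  # → 2
```

The same word-count structure scales to the match-scoring case because each sub-problem (one document, one key group) is independent of the others.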
12. Why Hadoop
• Mapper and Reducer are written by you
• Hadoop provides
– Parallelization
– Shuffle and sort
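In the Hadoop Streaming style, for example, the user-written part really is just a mapper and a reducer over tab-separated key/value lines; everything between them (parallelization, shuffle, sort) is the framework's job. A minimal sketch, with the framework's shuffle-and-sort simulated by a plain `sorted()`:

```python
import itertools

# You write only these two functions; Hadoop supplies the
# parallelization and the shuffle-and-sort between them.
def mapper(lines):
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    # Hadoop guarantees the reducer sees its input grouped/sorted by key.
    keyed = (line.split("\t") for line in sorted_lines)
    for word, group in itertools.groupby(keyed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(v) for _, v in group)}"

# Simulate the framework: map, then sort (the shuffle), then reduce.
intermediate = sorted(mapper(["to be or not to be"]))
result = dict(line.split("\t") for line in reducer(intermediate))
print(result["to"])  # → 2
```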
15. Amazon Elastic MapReduce
• It is a web service
• EC2 cluster is managed for you behind the
scenes
• Starts a Hadoop implementation of the MapReduce framework on Amazon EC2
• Each step can read and write data directly
from and to S3
• Based on Hadoop 0.18.3
16. Elastic MapReduce
• No need to explicitly allocate, start and
shutdown EC2 instances
• Individual jobs no longer need to be managed by a remote script running on the master node
• Jobs are arranged into a job flow, created
with a single command
• The status of a job flow and all its steps is accessible via a REST service
17. Before Elastic MapReduce
• Allocate/Verify cluster
• Push application to cluster
• Run a control script on the master
• Kick off each job step on the master
• Create and detect a job completion token
• Shut the cluster down
18. After Elastic MapReduce
• With Elastic MapReduce we can do all this
with a single local command
• Uses jar and conf files stored on S3
• Various monitoring tools for EC2 and S3
are provided
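That single local command might look like the following sketch of Amazon's `elastic-mapreduce` command-line client — the bucket names, flow name, and step arguments here are hypothetical, not taken from eHarmony's setup:

```shell
# Create a job flow: EMR allocates the EC2 cluster, runs each step
# (reading the jar from S3), and tears the cluster down at the end.
elastic-mapreduce --create --name "nightly-match-scoring" \
  --num-instances 4 \
  --jar s3://my-bucket/app/scoring.jar \
  --arg s3://my-bucket/input/ \
  --arg s3://my-bucket/output/
```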
19. Development Environment
• Cheap to set up on Amazon
• Quick setup: the number of servers is controlled by a config variable
• Identical to production
• Separate development account
recommended
20. Cost comparison
• Average EC2 and S3 Cost
– Each run is 2 to 3 hours
– $1200/month for EC2
– $100/month for S3
• Projected in-house cost
– $5000/month for a local cluster of 50 nodes
running 24/7
– A new company would also need to add data-center and operations-personnel expenses
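A quick back-of-the-envelope check using only the figures quoted above — the cloud total comes to well under a third of the in-house cluster cost, before the extra data-center and staffing expenses are even counted:

```python
# Figures from the slide (USD per month).
ec2_cost = 1200
s3_cost = 100
cloud_total = ec2_cost + s3_cost       # $1,300/month on AWS

in_house_cluster = 5000                # 50-node local cluster, 24/7
savings = in_house_cluster - cloud_total
print(cloud_total, savings)  # → 1300 3700
```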
21. Summary
• Dev tools are easy to work with and work right out of the box
• Standard Hadoop AMI worked great
• Easy to write unit tests for MapReduce
• Hadoop community support is great.
• EC2/S3/EMR are cost effective