Hadoop at Lookout
Aug 13, 2014
Yash Ranadive
@yashranadive
Thursday, August 14, 14
BIO
• Data Engineer
• From Mumbai, India
• Lived in 7 different cities in the US
• @yashranadive
• etl.svbtle.com
AGENDA
• What we do @Lookout
• Data warehouse
• Evolution from monolithic to microservices
• Protocol Buffers
• Areas we are exploring
WHAT WE DO @LOOKOUT
Over 50 million registered users
DATA TEAM
• 3 Data Engineers
• 6 data analysts
• Hadoop
• 64 hosts
• 300 TB capacity
DATA WAREHOUSE
[Diagram: data warehouse overview]
• Sources: internal and external data sources
• Tools shown: Chunker, Mudskipper
• Warehouse: MySQL star schema warehouse; HDFS with Hive, HBase, Impala
• Access: R, Hue, Shiny, Tableau, custom apps
FROM MONOLITHIC TO MICROSERVICES
MONOLITHIC APPLICATION
[Diagram: mobile/web clients -> HTTP -> Rails application (routing, controller, views, ORM) -> database (tables)]
DATA INGESTION - MONOLITHIC
[Diagram: application -> master_db -> MySQL replication -> slave_db -> ETL / ELT -> data warehouse (MySQL, Hive) -> reporting; external sources also feed the data warehouse]
• Ingestion is batch-oriented
PROBLEM
• Rails offers fast time to market (TTM) but is hard to scale
• One code base
• Slower deployments
• Too complex and large to manage
• Solution
  • Microservices / service-oriented architecture
  • Break the app out into smaller services
MICROSERVICES ARCHITECTURE
[Diagram: the Rails application as before (mobile/web clients, HTTP, routing, controller, views, ORM, database tables), now alongside separate services such as the Settings Service and Photo Backup]
• We frequently add new services
DATA INGESTION - MICROSERVICES
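[Diagram: the batch path from the monolithic setup remains (application -> master_db -> MySQL replication -> slave_db -> ETL / ELT -> data warehouse (MySQL, Hive) -> reporting, plus external sources); in addition, the Settings Service, Backup Service, and Locate Service publish to a messaging layer, and a consumer loads those messages into the data warehouse]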
DATA INGESTION - MONOLITHIC VS MICROSERVICES
Monolithic - snapshot of a point in time
select * from user_settings;
id | setting_id  | user_id | modified_at
=============================================
1  | backup      | 2629    | 20140709T0400Z
3  | locate      | 2682    | 20140709T0402Z
8  | wipe        | 2629    | 20140709T0403Z
9  | theft_alert | 2629    | 20140709T0407Z
Microservices - events
{guid: 1, event_type: "modify_setting", setting_id: "backup", setting_status: "ON", user_id: "2629", timestamp: "20140709T0400Z"}
{guid: 3, event_type: "start_backup", user_id: "2629", timestamp: "20140709T0400Z"}
...
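The contrast on this slide can be sketched in code: a snapshot table already holds the latest value per setting, while an event stream has to be replayed to reach the same view. The following is a minimal Python sketch, assuming events shaped like the `modify_setting` example above. The function name `replay_settings` and the third event (guid 7) are hypothetical and not from the talk.

```python
from typing import Dict, Iterable, Tuple

# Events shaped like the slide's examples. The last event (guid 7, "OFF")
# is made up for illustration.
EVENTS = [
    {"guid": 1, "event_type": "modify_setting", "setting_id": "backup",
     "setting_status": "ON", "user_id": "2629", "timestamp": "20140709T0400Z"},
    {"guid": 3, "event_type": "start_backup",
     "user_id": "2629", "timestamp": "20140709T0400Z"},
    {"guid": 7, "event_type": "modify_setting", "setting_id": "backup",
     "setting_status": "OFF", "user_id": "2629", "timestamp": "20140709T0405Z"},
]


def replay_settings(events: Iterable[dict]) -> Dict[Tuple[str, str], str]:
    """Fold modify_setting events into a point-in-time view per (user, setting)."""
    state: Dict[Tuple[str, str], str] = {}
    # Timestamps in this fixed-width format sort correctly as plain strings.
    for event in sorted(events, key=lambda e: e["timestamp"]):
        if event["event_type"] != "modify_setting":
            continue  # other event types do not change settings
        state[(event["user_id"], event["setting_id"])] = event["setting_status"]
    return state


snapshot = replay_settings(EVENTS)
print(snapshot)
```

The event log keeps the full history (the setting was ON, then OFF), which a snapshot row cannot show; the cost is that the current state has to be derived.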
DESIGN
• We wanted to create an always-on event ingestion framework that:
  • Would scale workers on demand
  • Would be easy to monitor
FIRST STAB - WORKER
[Diagram: Service -> ActiveMQ -> Ruby worker -> Hive]
• Upstart script that daemonized the Ruby process
• Monitoring using Zenoss
• Very easy to set up
• Mapping files for JSON -> CSV
• Ruby is terse and clean
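The talk does not show what the mapping files looked like. As a rough illustration of the JSON -> CSV idea only, here is a Python sketch (the original worker was written in Ruby) in which a mapping lists, per event type, which JSON keys become which CSV columns. The `MAPPING` structure and the function name `json_to_csv_row` are assumptions.

```python
import csv
import io
import json

# Hypothetical mapping: event_type -> ordered list of JSON keys to emit as columns.
MAPPING = {
    "modify_setting": ["guid", "user_id", "setting_id", "setting_status", "timestamp"],
    "start_backup": ["guid", "user_id", "timestamp"],
}


def json_to_csv_row(raw: str) -> str:
    """Convert one JSON event to a CSV line using the mapping for its event type."""
    event = json.loads(raw)
    columns = MAPPING[event["event_type"]]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Missing keys become empty strings so the column count stays fixed.
    writer.writerow([event.get(name, "") for name in columns])
    return buffer.getvalue()


line = json_to_csv_row(
    '{"guid": 3, "event_type": "start_backup", '
    '"user_id": "2629", "timestamp": "20140709T0400Z"}'
)
print(line, end="")
```

Keeping the column order in data rather than in code means a new event type needs only a new mapping entry, which may be part of why this approach was easy to set up.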
PROBLEMS
• ActiveMQ
  • ActiveMQ did not scale well, even with multiple machines in the AMQ cluster
  • ActiveMQ creates a separate queue for every consumer of the topic
• Monitoring using Zenoss is not ideal, especially for multi-process consumers
• The worker ran on a single machine, so it was not fault tolerant
CURRENT ARCHITECTURE - WORKER
[Diagram: Service -> Kafka -> Storm -> Hive; a Service -> ActiveMQ path also appears]
• Monitoring using Storm's Thrift API
• Scaling the number of workers is easy
• Kafka scales better than ActiveMQ
STORM TOPOLOGY
[Diagram: Service -> Kafka -> Storm -> HDFS. Inside Storm: Kafka spout and ActiveMQ spout -> processing bolt -> storm-hdfs bolt. In HDFS: landing directory -> Hive directory]
JSON PROBLEMS
• Problems with JSON
• No predefined schema
• No enforcement of backward compatibility
• Solution
• Protocol Buffers (also Avro/Thrift)
PROTOBUFS
• What?
• Way of encoding structured data
• Binary
• Why?
• Schema
• Backward compatibility
• Smaller in size than JSON
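To make the size claim concrete, here is a minimal Python sketch that hand-encodes two fields in the Protocol Buffers wire format (varint keys, varint integers, length-delimited strings) and compares the result with the same data as JSON. It deliberately avoids the real `protobuf` library. The schema (field 1 = `user_id`, field 2 = `setting_id`) is an assumption made for illustration, not Lookout's actual schema.

```python
import json


def encode_varint(value: int) -> bytes:
    """Base-128 varint: 7 payload bits per byte, high bit set on all but the last."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)  # more bytes follow
        else:
            out.append(low_bits)
            return bytes(out)


def encode_key(field_number: int, wire_type: int) -> bytes:
    """A field key is the field number shifted left 3 bits, OR-ed with the wire type."""
    return encode_varint((field_number << 3) | wire_type)


def encode_uint(field_number: int, value: int) -> bytes:
    return encode_key(field_number, 0) + encode_varint(value)  # wire type 0 = varint


def encode_string(field_number: int, text: str) -> bytes:
    data = text.encode("utf-8")
    # wire type 2 = length-delimited: key, byte length, then the bytes themselves
    return encode_key(field_number, 2) + encode_varint(len(data)) + data


# Hypothetical schema: field 1 = user_id (uint64), field 2 = setting_id (string)
binary = encode_uint(1, 2629) + encode_string(2, "backup")
as_json = json.dumps({"user_id": 2629, "setting_id": "backup"}).encode("utf-8")

print(binary.hex(), len(binary), len(as_json))
```

The binary form is shorter mainly because field names are not transmitted: only small field numbers go over the wire, and both sides look the names up in the shared schema.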
VERSIONING
• Backward-compatible changes only
[Diagram: Producer (.proto version 1.4) -> Queue -> Consumer (.proto version 1.1)]
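One reason a newer producer and an older consumer can coexist is that the Protocol Buffers wire format tags every field with its number and wire type, so a reader can step over fields it does not know. The Python sketch below hand-decodes such a message without the real `protobuf` library. The field numbers, the names in `KNOWN_FIELDS_V1_1`, and the example payload are assumptions for illustration.

```python
from typing import Dict, Tuple

# Hypothetical "version 1.1" schema known to the consumer: field number -> name.
KNOWN_FIELDS_V1_1 = {1: "user_id", 2: "setting_id"}


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode a base-128 varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode(buf: bytes, known: Dict[int, str]) -> Dict[str, object]:
    """Decode varint and length-delimited fields, skipping unknown field numbers."""
    message: Dict[str, object] = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:  # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited (strings, bytes, nested messages)
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if field_number in known:
            if isinstance(value, bytes):
                # Assumption: known length-delimited fields are UTF-8 strings.
                value = value.decode("utf-8")
            message[known[field_number]] = value
        # Unknown field numbers have been read past and are simply dropped.
    return message


# Bytes a hypothetical "version 1.4" producer might send: fields 1 and 2 plus a
# new field 3 (setting_status = "ON") that the 1.1 schema does not define.
payload_v1_4 = bytes.fromhex("08c51412066261636b75701a024f4e")
print(decode(payload_v1_4, KNOWN_FIELDS_V1_1))
```

This only works if changes stay backward compatible: adding a field under a new number is safe, while reusing or retyping an existing number would make the old consumer misread the data.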
SHARING PROTOBUF SCHEMAS
[Diagram: producers push Java jars and Ruby gems to Artifactory (the schema repo); the data team's Storm project pulls Java jars from it]
BUT HOW DO YOU STORE PROTOBUFS IN HDFS?
HOW WE STORE PROTOBUFS
• Store raw version
  • Raw dump of Kafka topic into HDFS
• Convert them to a tuple using Storm
  • Inflate, then convert to TSV
• Can query raw protobufs directly from Hive, but we don't yet
  • elephant-bird (difficult to get working)
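The "convert to TSV" step can be sketched as follows. This is a Python illustration only, since the talk does this inside a Storm bolt and does not show the code. The helper names and the escaping rules are assumptions. `\N` is Hive's default text representation of NULL, and whether Hive un-escapes `\t` and `\n` depends on how the table is declared.

```python
from typing import Iterable, Optional


def escape_field(value: Optional[object]) -> str:
    """Render one value so it cannot break the tab/newline structure of a TSV row."""
    if value is None:
        return r"\N"  # Hive's default NULL marker for text tables
    text = str(value)
    # Escape the backslash first so later replacements stay unambiguous.
    return (text.replace("\\", "\\\\")
                .replace("\t", "\\t")
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def to_tsv_line(values: Iterable[Optional[object]]) -> str:
    """Join already-decoded tuple values into one tab-separated line."""
    return "\t".join(escape_field(v) for v in values) + "\n"


# A decoded tuple as it might come out of the deserializing step (made-up values).
line = to_tsv_line([1, "modify_setting", "backup", None, "note with\ttab"])
print(line, end="")
```

Escaping matters because a stray tab or newline inside a field would otherwise shift columns or split one event into two rows when the file is read as a delimited text table.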
STORM TOPOLOGY
[Diagram: Service -> Kafka -> Storm -> HDFS. Inside Storm: Kafka spout and ActiveMQ spout -> deserialize protobuf bolt -> storm-hdfs bolt. In HDFS: landing directory -> Hive directory]
AREAS WE ARE EXPLORING
SPARK
• ETL
  • Word count: ~5 lines of Scala code vs. 58 lines of Java MapReduce code
  • Spark Streaming can achieve results similar to Storm's through micro-batching
    http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
• Machine learning
  • Online learning using MLlib
  • Logistic regression and SVM
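Micro-batching means cutting a continuous stream into small time windows and running an ordinary batch computation on each window. The sketch below shows the idea in plain Python. It does not use the Spark Streaming API; the `micro_batches` helper, the one-second interval, and the sample stream are made up for illustration.

```python
from collections import Counter
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[float, str]  # (arrival time in seconds, event type)


def micro_batches(events: Iterable[Event], interval: float) -> Iterator[List[Event]]:
    """Group time-ordered events into consecutive windows of `interval` seconds."""
    for _, group in groupby(events, key=lambda e: int(e[0] // interval)):
        yield list(group)


stream = [(0.2, "modify_setting"), (0.9, "start_backup"),
          (1.1, "modify_setting"), (2.5, "start_backup"), (2.7, "start_backup")]

# Each batch is processed as a small bounded job; here, a count per event type.
for batch in micro_batches(stream, interval=1.0):
    print(dict(Counter(event_type for _, event_type in batch)))
```

The trade-off compared with per-event processing is latency: a result for an event is available only after its window closes, in exchange for being able to reuse batch-style code.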
H2O
• In-memory machine learning
• Tight integration with R
• Preferred by Data Scientists
OPEN SOURCE PROJECTS
• Currently open sourced
  • Pipefish - write from MySQL to HDFS
    github.com/lookout/pipefish
• Future
  • Mudskipper - capture change-data events from MySQL binlogs
  • Chunker - download MySQL table data in chunks
Questions

SF Hadoop Users Group August 2014 Meetup Slides
