Gobblin @ NerdWallet
By Akshay Nanavati and Eric Ogren
akshay@nerdwallet.com eric@nerdwallet.com
Agenda
● Introduction to NerdWallet
● Gobblin @ NerdWallet Today
● Initial Pain Points & Learnings
● Contributions (Present and Future)
● Future Use Cases & Requests
What Is NerdWallet?
● Started in 2009. 275+ employees
● Highly profitable. Series A funding Feb 2015.
● We want to bring clarity to life’s financial decisions.
NerdWallet Tech Stack
● Front-End
● Services Tier
● Data Analytics
● Data Systems & Platforms
Data Types @ NerdWallet
● Partner Offer Data (MySQL & ElasticSearch: heavy reads, rare writes)
○ Synced to Redshift periodically
● Consumer Identity Data (Postgres: medium reads, medium writes)
● Site Generated Tracking Data (Redshift: heavy reads, heavy writes)
● Operational Data (e.g. Nginx logs) (Redshift: low reads, heavy writes) ?
● Internal Business Data (e.g. Salesforce) (Redshift: medium reads, rare writes)
● External 3rd Party Analytics Data (Redshift: medium reads, batch import)
Gobblin @ NW Today
● Running in standalone mode
● Ingests user tracking and operational log data
● Tracking Data:
○ ~10 Kafka topics - 1 per event & schema type
○ Hourly Gobblin jobs pull from Kafka and write to a date-partitioned directory in S3
○ Events are already serialized as protobuf in each Kafka topic
○ Around 100 events/second
● Log Ingestion (Operational Data):
○ Extracts data from AWS logs sitting in S3
○ Parses log lines and serializes them to protobuf
○ Writes the serialized protobuf files back to S3 and eventually into Redshift
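The hourly, date-partitioned layout for tracking data could be sketched roughly as follows. The bucket name, topic name, and the `topic/YYYY/MM/DD/HH` directory scheme are hypothetical; the slides do not state the actual layout.

```python
from datetime import datetime, timezone


def partition_prefix(bucket: str, topic: str, run_time: datetime) -> str:
    """Build the S3 prefix for one hourly run of one Kafka topic.

    The bucket/topic/year/month/day/hour scheme is an assumption made
    for illustration, not NerdWallet's actual layout.
    """
    # Normalize to UTC so every run maps to exactly one hourly directory.
    utc = run_time.astimezone(timezone.utc)
    return f"s3://{bucket}/{topic}/{utc:%Y/%m/%d/%H}/"


if __name__ == "__main__":
    run = datetime(2015, 11, 5, 14, 30, tzinfo=timezone.utc)
    print(partition_prefix("example-bucket", "page_view", run))
```

Keeping one directory per topic and hour means a downstream load step can pick up a single hour of a single event type without listing the whole bucket.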
Tracking Pipeline
Learnings: Deploying Gobblin w/Internal Code
● Have a repo of internal Gobblin modules (this is where we compile everything)
● Modified the build script to link the Gobblin project to our gobblin-modules project
● Use Jenkins to compile Gobblin on the remote machine
● Maintain a separate repository with .pull files that we can sync with our stage and production environments
Current Contributions
● Simple Data Writer
○ class gobblin.writer.SimpleDataWriter
○ Writes binary record as bytes with no regard to encoding
○ Optionally prepends each record with its size, or appends a single-character delimiter to each record (e.g. \n for string data)
● Kafka Simple Extractor
○ class gobblin.source.extractor.extract.kafka.KafkaSimpleExtractor
○ class gobblin.source.extractor.extract.kafka.KafkaSimpleSource
○ Extracts binary data from Kafka as an array of bytes without any serde
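The two framing options of the Simple Data Writer can be illustrated with a minimal sketch. This is not the `SimpleDataWriter` implementation: the 4-byte big-endian length prefix and all function names below are assumptions chosen for illustration, and the actual on-disk format may differ.

```python
import struct
from typing import Iterator, List


def frame_with_size(records: List[bytes]) -> bytes:
    """Prefix each record with its length as a 4-byte big-endian integer."""
    return b"".join(struct.pack(">I", len(r)) + r for r in records)


def read_sized(blob: bytes) -> Iterator[bytes]:
    """Yield the records back out of a size-prefixed blob."""
    offset = 0
    while offset < len(blob):
        (size,) = struct.unpack_from(">I", blob, offset)
        offset += 4
        yield blob[offset:offset + size]
        offset += size


def frame_with_delimiter(records: List[bytes], delimiter: bytes = b"\n") -> bytes:
    """Append a delimiter after each record.

    Only safe when no record can contain the delimiter byte.
    """
    return b"".join(r + delimiter for r in records)


if __name__ == "__main__":
    records = [b"\x08\x01", b"a\nb", b""]
    assert list(read_sized(frame_with_size(records))) == records
```

Size prefixes suit binary payloads such as serialized protobuf, which may contain any byte value including the newline; a delimiter is simpler but only works for data that never contains it.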
Future Contributions
● Gobblin Dashboards
● S3 Source & Extractor
○ Given an S3 bucket, extract all files matching a regex
■ Leverages FileBasedExtractor
■ We would also like to modify this to have similar functionality to DatePartitionedDailyAvroSource
● S3 Publisher
○ Publishes files to S3
○ Currently there is an issue where the AWS S3 Java API doesn’t work correctly with HDFS; since we are running in standalone mode, this is not an issue for us
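The "extract all files matching a regex" idea behind the planned S3 source can be sketched as a pure filtering step. The function name and the example keys are hypothetical, and the sketch works on an already-listed set of object keys rather than calling S3, so it says nothing about how the real extractor lists the bucket.

```python
import re
from typing import Iterable, List


def matching_keys(keys: Iterable[str], pattern: str) -> List[str]:
    """Return the object keys whose full name matches the regular expression."""
    compiled = re.compile(pattern)
    return [key for key in keys if compiled.fullmatch(key)]


if __name__ == "__main__":
    keys = [
        "logs/2015/11/05/elb-0001.log",
        "logs/2015/11/05/readme.txt",
        "other/elb-0002.log",
    ]
    print(matching_keys(keys, r"logs/\d{4}/\d{2}/\d{2}/.*\.log"))
```

Matching against the full key rather than a substring keeps a pattern written for one directory from accidentally selecting files elsewhere in the bucket.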
Future: Dashboards
Gobblin @ NW tomorrow
● More data types
○ Offer data from partners: JSON/CSV/XML over {HTTP, FTP} => S3
○ Offer data from our site: MySQL => S3 (batch and incremental)
○ Identity data from our site: Postgres => S3 (batch and incremental, data hiding)
○ Salesforce Data
● Integration with Airflow DAGs
● Integration with data cleansing & entity matching frameworks
Early Adoption Pain Points & Solutions
● Best practices around ingestion w/ transformation steps
● Initial problems integrating NW-specific code (especially extractors & converters) into Gobblin’s build process
● Best practices around scheduler integration - Quartz (built-in) vs ETL schedulers
● Backwards-incompatible changes forced us to do migrations when upgrading versions
● No changelogs or tagged releases
Things we would like to see/add in future
● Abstract out Avro specific code
● Best practices for scheduler integration (can contribute for Airflow)
● Clustering without requiring Hadoop & YARN
● Metadata support (job X produced files Y,Z)
● Release notes & tags :)
● The build & unit test process is very bloated
○ Hard to differentiate warnings/stack traces vs legitimate build issues
○ Opens ports, creates temporary DBs, etc., which makes it difficult to test on arbitrary machines (port collisions)
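One common way to avoid port collisions in tests is to bind to port 0 and let the operating system pick a free port. The sketch below shows that general technique only; it is not taken from Gobblin's build or test code, and the function name is hypothetical.

```python
import socket


def bind_ephemeral() -> socket.socket:
    """Bind a listening socket to port 0 so the OS assigns an unused port."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    return server


if __name__ == "__main__":
    server = bind_ephemeral()
    # The test reads the assigned port instead of hard-coding one.
    print(server.getsockname()[1])
    server.close()
```

Returning the bound socket, rather than just the port number, keeps the port reserved until the test is done with it.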
Thanks!
Questions??

Gobblin @ NerdWallet (Nov 2015)
