Democratizing Data at Go-jek
OSDC 2019 | Maulik Soneji
What do we mean by Democratizing Data?
- Self-serve: users depend on tools, not on people
- Abstraction: people only need to know
  - the data they are publishing
  - the insights they want
Agenda
- About Go-jek
- Core components of Data Platform
- Building Data platform around core components
- Whole picture of how components interact with each other
- References
- QnA
About Go-jek
- A mobile app for daily needs
  - Transport
  - Food Delivery
  - Payments
  - Logistics
- 18+ products
- Operates internationally
  - Indonesia
  - Singapore
  - Vietnam
  - Thailand
- 2010: Call center for ojek* services
- 2015: App launched with 3 services
- 2016: Expansion and new services
*ojek is the Indonesian term for motorcycle ride-hailing
Uses of Data
- Application Monitoring
- Fraud Detection
- Pricing
- Report Generation
- User segmentation
Let's talk about the scale of Data
Growth of Data
- 6666x growth in 18 months
- Expansion to 3 countries (Singapore, Vietnam, Thailand)
Type of Data
- 18+ products ranging from:
  - Ride
  - Food Delivery
  - Payments
  - Shipments
Given the volume and variety of data at Go-jek, it is essential to build a data platform that scales and supports automation.
Core Components
- Components that serve as the backbone of the data platform
- We have several of these core components, each for a different use case
- Any core component can be swapped for another well-suited option
Core Components
- Main components of our data platform:
  - Kafka => message broker for producing and consuming data
  - Flink => real-time stream processing
  - Kubernetes => deployment and management of containers
  - Airflow => workflow management
  - Dataproc => managed Spark cluster for batch processing
Data Platform
Typical day for a Data Engineer at Go-jek
- Tackling the following aspects of data:
  - Ingestion
  - Consumption
  - Aggregation
  - Cold Storage
  - Visualization (currently in early stages)
- Automation of resource creation:
  - A resource is any tool used for data ingestion, consumption, etc.
Focus areas for scaling
- Aspects to take care of while scaling the data platform:
  - Reliability => no data loss
  - Abstraction => downtime caused by one team shouldn't impact other teams
  - Automation => infrastructure and deployment
  - Monitoring => monitoring and alerting for all resources
  - Self-serve => anyone can use the platform without intervention from the data team
Data Ingestion
- Uniformity through Protobufs
  - All Kafka messages are serialized and deserialized using protobuf
  - Teams maintain the protobuf schema
  - Client libraries in various languages to push messages
- Helps in terms of:
  - Defining a schema for writing to and reading from Kafka
  - No breaking changes: a newer version of the schema shouldn't break readers of the older version
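The "no breaking change" rule can be checked mechanically. The sketch below is plain Python and does not use the protobuf library; the schema representation (field number mapped to name and type) and the function name `breaking_changes` are assumptions made for illustration, not Go-jek tooling. It applies a conservative policy: removing a field number or changing the type of an existing one counts as breaking, while adding a new field number does not.

```python
from typing import Dict, List, Tuple

# Hypothetical schema representation: field number -> (field name, field type).
Schema = Dict[int, Tuple[str, str]]


def breaking_changes(old: Schema, new: Schema) -> List[str]:
    """List every change in `new` that could break readers built against `old`."""
    problems = []
    for number, (name, field_type) in sorted(old.items()):
        if number not in new:
            problems.append(f"field {number} ({name}) was removed")
            continue
        new_type = new[number][1]
        if new_type != field_type:
            problems.append(
                f"field {number} ({name}) changed type {field_type} -> {new_type}"
            )
    return problems


booking_v1: Schema = {1: ("order_id", "string"), 2: ("amount", "int64")}
# Adding a new field number keeps old readers working.
booking_v2: Schema = {**booking_v1, 3: ("currency", "string")}
# Changing the type of an existing field number does not.
booking_bad: Schema = {1: ("order_id", "string"), 2: ("amount", "string")}

print(breaking_changes(booking_v1, booking_v2))
print(breaking_changes(booking_v1, booking_bad))
```

A check like this could run in CI on the schema repository so that an incompatible change is rejected before any producer starts publishing it.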
Why Protobufs?
- All protobufs can be imported like a client library
- Use the protoc command to generate libraries for different languages
- Easy to import and use
Create Golang library from Java library
Data Ingestion - Reliability
- Reliability through fronting systems
  - Fronting serves as a reliable way to push data to Kafka
  - It first pushes to an internal Kafka and then pushes batched data to the main Kafka
  - Internal Redis store to hold failed messages
  - Individual frontings for each team to isolate issues
  - High availability with HAProxy
- Can be replaced with any system that enables reliable push to Kafka:
  - Sidekiq jobs
  - Application-level retry mechanism
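The buffering and retry behaviour described above can be sketched in a few lines. This is a simplified, in-memory illustration and not the actual fronting service: the class name `Fronting`, the `push_batch` callable, and the use of a list and a deque in place of the internal Kafka and the Redis store are all assumptions.

```python
from collections import deque
from typing import Callable, Deque, List


class Fronting:
    """Buffers messages, pushes them in batches, and parks failed batches for retry."""

    def __init__(self, push_batch: Callable[[List[bytes]], None], batch_size: int = 2):
        self.push_batch = push_batch                 # stand-in for a push to the main Kafka
        self.batch_size = batch_size
        self.buffer: List[bytes] = []                # stand-in for the internal Kafka
        self.failed: Deque[List[bytes]] = deque()    # stand-in for the Redis store

    def publish(self, message: bytes) -> None:
        self.buffer.append(message)
        if len(self.buffer) >= self.batch_size:
            batch, self.buffer = self.buffer, []
            self._try_push(batch)

    def _try_push(self, batch: List[bytes]) -> bool:
        try:
            self.push_batch(batch)
            return True
        except ConnectionError:
            self.failed.append(batch)                # keep the batch instead of losing it
            return False

    def retry_failed(self) -> None:
        # Attempt each parked batch once; batches that fail again are parked again.
        for _ in range(len(self.failed)):
            self._try_push(self.failed.popleft())


delivered: List[bytes] = []
state = {"kafka_up": False}


def push(batch: List[bytes]) -> None:
    if not state["kafka_up"]:
        raise ConnectionError("main Kafka unavailable")
    delivered.extend(batch)


fronting = Fronting(push)
fronting.publish(b"a")
fronting.publish(b"b")        # batch is full, push fails, batch is parked
state["kafka_up"] = True
fronting.retry_failed()       # parked batch is delivered once Kafka is back
```

Running one such instance per team, as the slide describes, keeps a misbehaving producer from filling another team's failed-message store.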
Fronting Architecture
Data Consumption
- Firehose: custom Kafka consumer that pushes to sinks:
  - Database Sink
  - HTTP Sink
  - Influx Sink
  - Elasticsearch Sink
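The consumer-plus-sink split can be illustrated with a small interface. The sketch below is not Firehose code: the `Sink` protocol, the two toy sinks, and `run_consumer` are hypothetical names, and the messages are plain dictionaries rather than protobuf-encoded Kafka records. The slides do not say whether one consumer serves one sink or several; this sketch assumes one sink per consumer.

```python
import io
import json
from typing import Dict, Iterable, List, Protocol

Record = Dict[str, object]


class Sink(Protocol):
    def write(self, record: Record) -> None: ...


class ListSink:
    """Toy sink that keeps records in memory (stands in for a database sink)."""

    def __init__(self) -> None:
        self.records: List[Record] = []

    def write(self, record: Record) -> None:
        self.records.append(record)


class JsonLinesSink:
    """Toy sink that writes one JSON document per line (stands in for an HTTP sink)."""

    def __init__(self, stream: io.TextIOBase) -> None:
        self.stream = stream

    def write(self, record: Record) -> None:
        self.stream.write(json.dumps(record, sort_keys=True) + "\n")


def run_consumer(messages: Iterable[Record], sink: Sink) -> int:
    """Push every consumed message to the configured sink; return the count."""
    count = 0
    for message in messages:
        sink.write(message)
        count += 1
    return count


messages = [
    {"booking_id": 1, "status": "done"},
    {"booking_id": 2, "status": "cancelled"},
]
list_sink = ListSink()
out = io.StringIO()
run_consumer(messages, list_sink)
run_consumer(messages, JsonLinesSink(out))
```

Because the consumer loop only depends on `write`, adding a new destination means adding a sink class, not changing the consumer.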
Data Cold Storage
- BigQuery (data warehouse) => Beast
- GCS (cold storage) => Sakaar
- Beast and Sakaar push raw data from Kafka to BigQuery and GCS, respectively
- Data governance through service accounts
- Metabase and Zeppelin to query data
Kafka consumers architecture (downstream)
Kafka - Abstraction
- To scale Kafka, formalize abstractions across separate Kafka clusters
- Different Kafka clusters for application use cases:
  - Mainstream for all booking data
  - Appstream for mobile application data
  - Locstream for location data
- Mirrored Kafka clusters for data use cases:
  - Data clusters are used for aggregation, auditing, and cold storage of data
Data Aggregation
- Flink (for real-time aggregation)
- Dataproc (for batch aggregation)
- SQL interface following Apache Calcite to create aggregation jobs
- User Defined Functions to support complex functionality
- Different sinks to write aggregated data to Elasticsearch, Database, Influx and others
- Airflow to schedule jobs
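To make "real-time aggregation" concrete, the sketch below counts events per service in fixed one-minute windows, which is roughly what a SQL aggregation grouped by a tumbling window and a service column would compute. It is plain Python rather than Flink, and the event shape and the function name `tumbling_window_counts` are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Event = Tuple[int, str]  # (event time in epoch seconds, service name)


def tumbling_window_counts(
    events: Iterable[Event], window_seconds: int = 60
) -> Dict[Tuple[int, str], int]:
    """Count events per (window start, service) using fixed, non-overlapping windows."""
    counts: Counter = Counter()
    for timestamp, service in events:
        # Every timestamp belongs to exactly one window, identified by its start time.
        window_start = timestamp - (timestamp % window_seconds)
        counts[(window_start, service)] += 1
    return dict(counts)


events = [(0, "ride"), (30, "ride"), (59, "food"), (60, "ride"), (125, "food")]
result = tumbling_window_counts(events)
print(result)
```

A stream processor additionally has to handle late and out-of-order events and emit each window as it closes; this batch-style sketch leaves those concerns out.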
Data Visualization
- Geo visualization platform to explore location data
- D3 and chart libraries to explore data
- Deck.GL for building heatmaps, 2D/3D visualization layers
- Booking, payments heatmaps
Automation
- Terraform for infrastructure as code (IaC), following shared conventions
- Creating all aspects of the data platform through Terraform:
  - Kubernetes clusters
  - VMs
  - Kafka and other core components
  - Grants to service accounts
Monitoring
- TICKscript-based monitoring and alerting setup
- TICK (Telegraf, InfluxDB, Chronograf, Kapacitor)
- StatsD client to collect metrics and write data to InfluxDB
- Kapacitor scripts to create alerts
- Integrations with Slack and PagerDuty for alerting
- Grafana for monitoring
- Create generic TICK templates, then use a template to create alerts
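On the collection side, a StatsD client boils down to formatting a small text datagram of the form `name:value|type` and sending it over UDP. The sketch below shows that wire format; the metric names, the helper names, and the default host and port (8125 is the conventional StatsD port) are assumptions, and a production setup would normally use an existing client library.

```python
import socket
from typing import Union

# StatsD type codes: counter, gauge, timer (milliseconds).
_TYPE_CODES = {"counter": "c", "gauge": "g", "timer": "ms"}


def statsd_line(name: str, value: Union[int, float], metric_type: str) -> bytes:
    """Format one metric in the StatsD line protocol."""
    return f"{name}:{value}|{_TYPE_CODES[metric_type]}".encode("ascii")


def send_metric(line: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send; defined here but not called in this sketch."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line, (host, port))


# Hypothetical metric names.
pushed = statsd_line("firehose.messages.pushed", 1, "counter")
latency = statsd_line("firehose.push.latency", 12.5, "timer")
print(pushed, latency)
```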
TICK setup example
In this template, warn_threshold and crit_threshold are supplied when creating the alert.
Create the task through a curl call
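The curl call from the slide is not preserved in this text. As a rough equivalent, the sketch below builds the HTTP request that creates a task from a template through Kapacitor's task endpoint, passing the two thresholds as template variables. The task ID, template ID, database, retention policy, and base URL are all placeholder assumptions; check the payload shape against the Kapacitor version in use before relying on it. The request is built but not sent, so the sketch runs without a Kapacitor server.

```python
import json
import urllib.request


def build_task_request(
    base_url: str,
    task_id: str,
    template_id: str,
    warn_threshold: float,
    crit_threshold: float,
) -> urllib.request.Request:
    """Build (but do not send) a request that creates a task from a template."""
    payload = {
        "id": task_id,
        "type": "stream",
        "template_id": template_id,
        # Placeholder database and retention policy.
        "dbrps": [{"db": "telegraf", "rp": "autogen"}],
        "vars": {
            "warn_threshold": {"type": "float", "value": warn_threshold},
            "crit_threshold": {"type": "float", "value": crit_threshold},
        },
        "status": "enabled",
    }
    return urllib.request.Request(
        url=f"{base_url}/kapacitor/v1/tasks",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_task_request(
    "http://localhost:9092", "cpu_alert_team_a", "generic_threshold", 80.0, 95.0
)
# urllib.request.urlopen(request) would send it.
```

Keeping the thresholds out of the template is what lets one generic template serve many teams: each alert is just a new task with different variable values.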
Self Serve
- Data platform for do-it-yourself provisioning of data products
- Web interface to collect information from the user
- Helm charts for all data products
- Kubernetes client on the backend to provision resources
- Resource-to-team mapping for authentication and authorization
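The resource-to-team mapping is what makes self-serve safe: a user may only create or manage resources that belong to a team they are a member of. The sketch below models just that check in memory. The class name, team names, and release naming scheme are hypothetical, and the step where a real backend would install a Helm chart through the Kubernetes client is reduced to returning a release name.

```python
from typing import Dict, Set


class SelfServePlatform:
    """In-memory model of team membership and resource ownership."""

    def __init__(self, team_members: Dict[str, Set[str]]) -> None:
        self.team_members = team_members
        self.resource_owner: Dict[str, str] = {}   # resource name -> owning team

    def provision(self, user: str, team: str, resource: str) -> str:
        if user not in self.team_members.get(team, set()):
            raise PermissionError(f"{user} is not a member of team {team}")
        if resource in self.resource_owner:
            raise ValueError(f"resource {resource} already exists")
        self.resource_owner[resource] = team
        # A real backend would install the data product's Helm chart here.
        return f"{team}-{resource}"

    def can_manage(self, user: str, resource: str) -> bool:
        team = self.resource_owner.get(resource)
        return team is not None and user in self.team_members.get(team, set())


platform = SelfServePlatform({"pricing": {"alice"}, "fraud": {"bob"}})
release = platform.provision("alice", "pricing", "firehose-bookings")
print(release, platform.can_manage("alice", "firehose-bookings"))
```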
The whole picture
References
- Data Infra blog: https://blog.gojekengineering.com/data-infrastructure-at-go-jek-cd4dc8cbd929
- Fronting blog: https://blog.gojekengineering.com/kafka-4066a4ea8d0d
- Aggregation blog: https://blog.gojekengineering.com/daggers-data-aggregation-in-real-time-4a32eb9ad2d1
- Sakaar blog: https://blog.gojekengineering.com/sakaar-taking-kafka-data-to-cloud-storage-at-go-jek-7839da20b5f3
- Data visualization blog: https://blog.gojekengineering.com/atlas-go-jeks-real-time-geospatial-visualization-platform-1cf5e16814c5
- Open sourced repos:
  - https://github.com/gojek/beast
  - https://github.com/gojekfarm/stencil
- Helm charts: https://github.com/gojektech/charts
Questions?

Understanding Plagiarism: Causes, Consequences and Prevention.pptx
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptx
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
 
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
 
Zer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfZer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdf
 
Mastering Project Planning with Microsoft Project 2016.pptx
Mastering Project Planning with Microsoft Project 2016.pptxMastering Project Planning with Microsoft Project 2016.pptx
Mastering Project Planning with Microsoft Project 2016.pptx
 
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdfPros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository world
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 Updates
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh IT
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryError
 

OSDC 2019 | Democratizing Data at Go-JEK by Maulik Soneji

  • 2. What do we mean by Democratizing Data?
  • 3. What do we mean by Democratizing Data?
      - Self-serve, so that you depend on tools and not on people
      - Abstract things so that people only need to know:
        - the data they are publishing
        - the insights they want
  • 4. Agenda
      - About Go-jek
      - Core components of the data platform
      - Building the data platform around core components
      - Whole picture of how components interact with each other
      - References
      - Q&A
  • 5. About Go-jek
      - A mobile app for daily needs
        - Transport
        - Food Delivery
        - Payments
        - Logistics
      - 18+ products
      - Operates internationally
        - Indonesia
        - Singapore
        - Vietnam
        - Thailand
      - 2010: Call center for ojek* services
      - 2015: App launched with 3 services
      - 2016: Expansion and new services
      - *ojek is an Indonesian term for motorcycle ride hailing
  • 6. Uses of Data
      - Application Monitoring
      - Fraud Detection
      - Pricing
      - Report Generation
      - User segmentation
    Let's talk about the scale of Data
      - Growth of Data
        - 6666x growth in 18 months
        - Expansion to 3 countries (Singapore, Vietnam, Thailand)
      - Type of Data
        - 18+ products ranging from: Ride, Food Delivery, Payments, Shipments
      - Given the volume and variety of data at Go-jek, it becomes essential to build a data platform that scales and supports automation.
  • 8. Core Components
      - Components that serve as the backbone of the data platform
      - We have several of these core components for different use cases
      - Any core component can be swapped for another well-suited option; multiple options are available
  • 13. Core Components
      - Main components of our data platform:
        - Kafka => message broker for producing and consuming data
        - Flink => real-time stream processing
        - Kubernetes => deployment and management of containers
        - Airflow => workflow management
        - Dataproc => managed Spark cluster for batch processing
  • 15. Typical day for a Data Engineer at Gojek
      - Tackling the following aspects of data:
        - Ingestion
        - Consumption
        - Aggregation
        - Cold Storage
        - Visualization (currently in early stages)
      - Automation for resource creation:
        - A resource is any tool used for data ingestion, consumption, etc.
  • 16. Focus areas for scaling
      - How to take care of these aspects while scaling the data platform:
        - Reliability => no data loss
        - Abstraction => downtime caused by one team shouldn't impact others
        - Automation => infrastructure, deployment
        - Monitoring => monitoring and alerting for all resources
        - Self Serve => anyone can access the platform without intervention
  • 17. Data Ingestion - Uniformity by Protobufs
      - All Kafka messages are serialized/deserialized using protobuf
      - Teams maintain the protobuf schema
      - Client libraries in various languages to push messages
      - Helps in terms of:
        - Defining a schema for pushing to and reading from Kafka
        - No breaking changes: a newer version of the schema shouldn't break consumers of the older version
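The "no breaking changes" rule can be illustrated with a small compatibility check. This is a minimal sketch, assuming a schema is represented as a plain dict from field number to (name, type); that representation and the `breaking_changes` helper are hypothetical, not Go-jek's tooling or part of the protobuf library. It is deliberately conservative: it flags every removal and type change, which is stricter than the protobuf wire format requires in some cases.

```python
from typing import Dict, List, Tuple

# Hypothetical schema representation: field number -> (field name, field type).
Schema = Dict[int, Tuple[str, str]]


def breaking_changes(old: Schema, new: Schema) -> List[str]:
    """List changes that could break consumers still using the old schema."""
    problems = []
    for number, (name, ftype) in sorted(old.items()):
        if number not in new:
            problems.append(f"field {number} ({name}) was removed")
        elif new[number][1] != ftype:
            problems.append(
                f"field {number} ({name}) changed type {ftype} -> {new[number][1]}"
            )
    return problems


booking_v1: Schema = {1: ("order_id", "string"), 2: ("amount", "int64")}
# Adding a field under a new number is safe: old readers skip unknown fields.
booking_v2: Schema = {**booking_v1, 3: ("currency", "string")}
# Changing the type behind an existing field number is flagged.
booking_bad: Schema = {1: ("order_id", "string"), 2: ("amount", "string")}

print(breaking_changes(booking_v1, booking_v2))   # nothing flagged
print(breaking_changes(booking_v1, booking_bad))  # flags the type change on field 2
```

A check like this could run in CI on the shared schema repository, so that a breaking change is caught before any producer ships it.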
  • 18. Why Protobufs?
      - All protobufs can be imported like a client library
      - Use the protoc command to generate libraries for different languages
      - Easy to import and use
      - Create Golang library from Java library
  • 19. Data Ingestion - Reliability
      - Reliability through fronting systems
      - Fronting serves as a reliable way to push data to Kafka
        - First pushes to an internal Kafka, then pushes batched data to the main Kafka
        - Internal Redis store for failed messages
        - Individual frontings for each team to isolate issues
        - High availability with HAProxy
      - Can be replaced with any system that enables a reliable push to Kafka:
        - Sidekiq jobs
        - Application-level retry mechanism
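The buffer, batch, and retry behaviour described for fronting can be sketched as follows. This is an illustration under stated assumptions, not Go-jek's implementation: the `Fronting` class is hypothetical, and in-memory structures stand in for the internal Kafka, the Redis store, and the producer to the main Kafka.

```python
from collections import deque
from typing import Callable, Deque, List


class Fronting:
    """Buffers messages, forwards them in batches, keeps failed batches for retry."""

    def __init__(self, send_batch: Callable[[List[bytes]], None], batch_size: int = 3):
        self.send_batch = send_batch          # stand-in for a producer to the main Kafka
        self.batch_size = batch_size
        self.buffer: Deque[bytes] = deque()   # stand-in for the internal Kafka
        self.failed: List[List[bytes]] = []   # stand-in for the internal Redis store

    def push(self, message: bytes) -> None:
        self.buffer.append(message)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        batch = list(self.buffer)
        self.buffer.clear()
        self._deliver(batch)

    def retry_failed(self) -> None:
        pending, self.failed = self.failed, []
        for batch in pending:
            self._deliver(batch)

    def _deliver(self, batch: List[bytes]) -> None:
        try:
            self.send_batch(batch)
        except Exception:
            # Keep the batch instead of dropping it, so no data is lost.
            self.failed.append(batch)


# Usage: simulate the main Kafka being down, then recovering.
state = {"up": False}
delivered: List[List[bytes]] = []


def send(batch: List[bytes]) -> None:
    if not state["up"]:
        raise ConnectionError("main Kafka unavailable")
    delivered.append(batch)


fronting = Fronting(send, batch_size=2)
fronting.push(b"a")
fronting.push(b"b")       # batch is full, delivery fails, batch is parked
state["up"] = True
fronting.retry_failed()   # parked batch is delivered once the broker is back
```

Running one such instance per team, as the slide describes, keeps a misbehaving producer from filling another team's buffer.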
  • 21. Data Consumption
      - Firehose: custom Kafka consumer that pushes to sinks:
        - Database Sink
        - HTTP Sink
        - Influx Sink
        - Elasticsearch Sink
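The idea of one consumer with interchangeable sinks can be sketched like this. It is a minimal illustration, not Firehose's actual code: the `Sink` protocol, `ListSink`, and `run_consumer` are hypothetical names, and an in-memory iterable stands in for a Kafka topic.

```python
from typing import Iterable, List, Protocol


class Sink(Protocol):
    def write(self, messages: List[dict]) -> None: ...


class ListSink:
    """In-memory stand-in for a database, HTTP, Influx, or Elasticsearch sink."""

    def __init__(self) -> None:
        self.rows: List[dict] = []

    def write(self, messages: List[dict]) -> None:
        self.rows.extend(messages)


def run_consumer(messages: Iterable[dict], sink: Sink, batch_size: int = 2) -> int:
    """Read messages, hand them to the sink in batches, return the count written."""
    batch: List[dict] = []
    written = 0
    for message in messages:
        batch.append(message)
        if len(batch) == batch_size:
            sink.write(batch)
            written += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        sink.write(batch)
        written += len(batch)
    return written


sink = ListSink()
count = run_consumer(({"id": i} for i in range(5)), sink)
```

Because the consumer only depends on `write`, supporting a new destination means adding one sink class rather than a new consumer.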
  • 22. Data Cold Storage
      - BigQuery (data warehouse) => Beast
      - GCS (cold storage) => Sakaar
      - These push raw data from Kafka to BigQuery and GCS
      - Governance of data through service accounts
      - Metabase and Zeppelin to query data
  • 24. Kafka - Abstraction
      - To scale Kafka, formalize abstractions for the different Kafka clusters
      - Different Kafka clusters for application use cases:
        - Mainstream for all booking data
        - Appstream for mobile application data
        - Locstream for location data
      - Mirror Kafka data for data use cases:
        - Data clusters used for aggregation, auditing, and cold storage of data
  • 25. Data Aggregation
      - Flink (for real-time aggregation)
      - Dataproc (for batch aggregation)
      - SQL interface following Apache Calcite to create aggregation jobs
      - User-defined functions to support complex functionality
      - Different sinks to write aggregated data to Elasticsearch, databases, Influx, and others
      - Airflow to schedule jobs
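An aggregation job of the kind described, for example "count bookings per service per minute", boils down to a windowed group-by. The sketch below shows that computation in plain Python as a tumbling-window count. It only illustrates what such a job computes; it is not Flink or Calcite code, and the event fields `service` and `ts` (seconds) are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def tumbling_counts(
    events: Iterable[dict], window_seconds: int = 60
) -> Dict[Tuple[int, str], int]:
    """Count events per (window start, service) using fixed, non-overlapping windows."""
    counts: Counter = Counter()
    for event in events:
        # Align the timestamp down to the start of its window.
        window_start = event["ts"] - event["ts"] % window_seconds
        counts[(window_start, event["service"])] += 1
    return dict(counts)


events = [
    {"service": "ride", "ts": 5},
    {"service": "ride", "ts": 59},
    {"service": "food", "ts": 30},
    {"service": "ride", "ts": 61},
]
result = tumbling_counts(events)
# Two ride events and one food event fall in the window starting at 0;
# the ride event at ts=61 falls in the window starting at 60.
```

A stream processor adds what this sketch leaves out: unbounded input, late events, and state that survives restarts.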
  • 26. Data Visualization
      - Geo visualization platform to explore location data
      - D3 and chart libraries to explore data
      - Deck.GL for building heatmaps and 2D/3D visualization layers
      - Booking and payments heatmaps
  • 27. Automation
      - Terraform for infrastructure as code (IaC) and for enforcing conventions
      - Creating all aspects of the data platform through Terraform:
        - Kubernetes cluster
        - VMs
        - Kafka/core components
        - Grants to service accounts
  • 28. Monitoring
      - TICK-script-based monitoring/alerting setup
        - TICK (Telegraf, InfluxDB, Chronograf, Kapacitor)
        - StatsD client to collect metrics and write data to Influx
        - Kapacitor scripts to create alerts
        - Integrations with Slack and PagerDuty for alerting
      - Grafana for monitoring
      - Create generic TICK templates, then use a template to create alerts
  • 29. TICK setup example
      - In this template, warn_threshold and crit_threshold are supplied when creating the alert.
      - Create the task through a curl call
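The template and curl command shown on the slide are not reproduced in this transcript. As a rough sketch of what such a call carries, the snippet below builds the JSON body for creating a task from an existing template through Kapacitor's HTTP API (`POST /kapacitor/v1/tasks`). The task ID, template ID, database, and retention policy names are hypothetical, and the request shape is an assumption based on the Kapacitor API documentation; only the two variable names come from the slide.

```python
import json


def template_task_payload(
    task_id: str,
    template_id: str,
    db: str,
    rp: str,
    warn_threshold: float,
    crit_threshold: float,
) -> dict:
    """Build the JSON body for creating a Kapacitor task from a template."""
    return {
        "id": task_id,
        "template_id": template_id,
        "dbrps": [{"db": db, "rp": rp}],
        # Template variables are passed as typed values.
        "vars": {
            "warn_threshold": {"type": "float", "value": warn_threshold},
            "crit_threshold": {"type": "float", "value": crit_threshold},
        },
        "status": "enabled",
    }


# Hypothetical names throughout; only the two threshold variables are from the slide.
payload = template_task_payload(
    task_id="cpu_alert_booking_service",
    template_id="generic_threshold_alert",
    db="telegraf",
    rp="autogen",
    warn_threshold=80.0,
    crit_threshold=95.0,
)
body = json.dumps(payload, indent=2)
```

The resulting `body` is what a curl call would send with `-d`; keeping the thresholds as variables is what lets one generic template serve many alerts.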
  • 30. Self Serve
      - Data platform to DIY-provision data products
      - Web interface to collect information from the user
      - Helm chart for all data products
      - Kubernetes client on the backend to provision resources
      - Resource-to-team mapping for authentication and authorization
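The resource-to-team mapping can be pictured as a simple ownership lookup. This is a minimal sketch with hypothetical names and data; the transcript does not describe how the real platform stores or checks the mapping.

```python
from typing import Dict, Set

# Hypothetical mappings: which team owns a resource, and which teams a user is in.
resource_owner: Dict[str, str] = {"firehose-booking-log": "transport"}
user_teams: Dict[str, Set[str]] = {"alice": {"transport"}, "bob": {"payments"}}


def can_manage(user: str, resource: str) -> bool:
    """Allow an action only if the user belongs to the team that owns the resource."""
    owner = resource_owner.get(resource)
    return owner is not None and owner in user_teams.get(user, set())


print(can_manage("alice", "firehose-booking-log"))
print(can_manage("bob", "firehose-booking-log"))
```

Unknown users and unknown resources are both denied, so a missing mapping fails closed rather than open.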
  • 33. References
      - Data Infra blog: https://blog.gojekengineering.com/data-infrastructure-at-go-jek-cd4dc8cbd929
      - Fronting blog: https://blog.gojekengineering.com/kafka-4066a4ea8d0d
      - Aggregation blog: https://blog.gojekengineering.com/daggers-data-aggregation-in-real-time-4a32eb9ad2d1
      - Sakaar blog: https://blog.gojekengineering.com/sakaar-taking-kafka-data-to-cloud-storage-at-go-jek-7839da20b5f3
      - Data visualization blog: https://blog.gojekengineering.com/atlas-go-jeks-real-time-geospatial-visualization-platform-1cf5e16814c5
      - Open sourced repos:
        - https://github.com/gojek/beast
        - https://github.com/gojekfarm/stencil
      - Helm charts: https://github.com/gojektech/charts