REAL-TIME DATA PROCESSING AT RTB HOUSE
ARCHITECTURE & LESSONS LEARNED
BARTOSZ ŁOŚ
BIG DATA TECHNOLOGY MOSCOW 2018
OCTOBER 10-11, 2018
TABLE OF CONTENTS
Agenda:
- our RTB platform
- the first iteration: mutable structures
- the second iteration: data-flow
- the third iteration: immutable streams of events
- the fourth iteration: multi-DC architecture
- the current iteration: Kafka Workers
- summary
02/30
OUR RTB PLATFORM
OUR RTB PLATFORM: THE CONTEXT 04/30
Bid requests:
2M/s (peak)
~30 SSP networks
<50-100ms
User events:
1.5B tags/day
350M impressions/day
3.5M clicks/day
1.5M conversions/day
Other events:
bidlogs, accesslogs, domain events etc.
OUR RTB PLATFORM: DATA PROCESSING NUMBERS
Kafka:
- up to 250K+ messages per second
- 50TB+ processed data every day
- 6 clusters in 4 datacenters
- 26 Kafka brokers
- 85 topics, 5000+ partitions
Docker (processing components only):
- 44 engines
- 1408 CPU cores, 5.5TB RAM
- 800+ containers
HDFS:
- 2PB+ data, up to 10GB/s
BigQuery:
- 1PB+ data, up to 10GB/min
Elasticsearch:
- 40TB data, up to 50K events/s
Aerospike (processing only):
- 80TB data, up to 8K events/s
05/30
THE FIRST ITERATION
THE 1ST ITERATION: MUTABLE IMPRESSIONS 07/30
THE 1ST ITERATION: DRAWBACKS
Issues:
- long-running data migrations that overloaded the system (30 days back)
- complex servlet logic, no ability to reprocess
- inflexible, varied schemas
- single-DC
- inconsistencies
08/30
THE SECOND ITERATION:
DATA-FLOW
THE 2ND ITERATION: THE 1ST DATA-FLOW ARCHITECTURE 10/30
THE 2ND ITERATION: DISTRIBUTED LOG
Why Apache Kafka:
- distributed log
- topics partitioning
- partition replication
- log retention
- stateless
- efficient data consumption
11/30
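The log properties listed above can be illustrated with a toy in-memory model. This is only a sketch, not Kafka's API: `PartitionedLog` and its methods are hypothetical names. It shows key-based partitioning, append-only partitions, and a log that keeps no per-consumer state (each reader remembers its own offset). Replication and retention are not modelled.

```python
import zlib


class PartitionedLog:
    """Toy append-only log split into partitions (hypothetical, not Kafka's API)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # Records with the same key always land in the same partition,
        # so per-key ordering is preserved.
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def read(self, partition, offset, max_records=100):
        # The log keeps no consumer state: the caller supplies the offset.
        return self.partitions[partition][offset:offset + max_records]


log = PartitionedLog(num_partitions=4)
p1, o1 = log.append("user-1", "impression")
p2, o2 = log.append("user-1", "click")

# A consumer is just an offset it remembers; reading from 0 again replays the data.
first_pass = log.read(p1, 0)
replay = log.read(p1, 0)
```

Because the reader owns the offset, reprocessing is a matter of resetting it, which is what the first iteration could not do.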
THE 2ND ITERATION: BATCH LOADING
Why Camus (a LinkedIn project):
- "Kafka to HDFS" pipeline
- batch tool
- map-reduce jobs
- storing offsets in log files
- data partitioning
12/30
THE 2ND ITERATION: AVRO & SCHEMA VERSIONING
Why Apache Avro:
- compact, efficient format
- schema: JSON format, payload: binary format
- self-describing container files
- rich data structures
- schema changes support, reader & writer schemas
Our approach:
- Kafka's messages and HDFS files
- schema registry
- avro-fastserde (github.com/RTBHOUSE/avro-fastserde)
13/30
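The "reader & writer schemas" point can be sketched in plain Python. This is a simplified illustration of Avro's field-level resolution rules, not the Avro library and not avro-fastserde: the `resolve` helper and the `Click` schemas are hypothetical, and type promotion, aliases and nested records are ignored.

```python
def resolve(record, writer_schema, reader_schema):
    """Project a record written with writer_schema onto reader_schema."""
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    out = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            out[name] = record[name]          # field known to both schemas
        elif "default" in field:
            out[name] = field["default"]      # added in the reader schema
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return out                                # writer-only fields are dropped


# Hypothetical schema versions, written in Avro's JSON schema style.
writer_v1 = {"type": "record", "name": "Click", "fields": [
    {"name": "time", "type": "long"},
    {"name": "impression_id", "type": "string"},
    {"name": "legacy_flag", "type": "boolean"},
]}
reader_v2 = {"type": "record", "name": "Click", "fields": [
    {"name": "time", "type": "long"},
    {"name": "impression_id", "type": "string"},
    {"name": "campaign", "type": "string", "default": "unknown"},
]}

old_record = {"time": 1539160000, "impression_id": "imp-1", "legacy_flag": True}
resolved = resolve(old_record, writer_v1, reader_v2)
```

The old record stays readable after the schema change: the new field takes its default and the removed field is skipped.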
THE 2ND ITERATION: ACCURATE STATISTICS
Why Apache Storm:
- real-time processing
- streams of tuples, topologies
- fault-tolerance
Why Trident:
- transactions, exactly-once processing
- microbatches (latency & throughput)
14/30
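How microbatches and transactions give exactly-once counts can be sketched as follows. This is a minimal model of the idea behind Trident's transactional state, not Storm's API: `TransactionalCounter` is a hypothetical name, and the sketch assumes that a replayed batch carries the same transaction id and the same events, and that batches are applied in order.

```python
class TransactionalCounter:
    """Stores the count together with the id of the last applied batch."""

    def __init__(self):
        self.count = 0
        self.last_txid = None

    def apply_batch(self, txid, events):
        # A batch whose txid is already stored was applied before the failure,
        # so a replay must not be counted again.
        if txid == self.last_txid:
            return False
        # In a real store the count and the txid must be written atomically.
        self.count += len(events)
        self.last_txid = txid
        return True


counter = TransactionalCounter()
counter.apply_batch(1, ["click", "click"])
counter.apply_batch(2, ["click"])
replayed = counter.apply_batch(2, ["click"])  # batch 2 retried after a failure
```

Larger batches amortize the cost of the state update but add latency, which is the latency and throughput trade-off noted above.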
THE 2ND ITERATION: STATS-COUNTER TOPOLOGY 15/30
THE 2ND ITERATION: DRAWBACKS
Hybrid architecture:
- aggregates (real-time)
- raw events (2-hour batches)
- joined events (end-of-day batch jobs)
Other issues:
- Hive joins
- mutable events
- servlets' complex logic
16/30
THE THIRD ITERATION:
NEW APPROACH
THE 3RD ITERATION: NEW APPROACH
{ "IMPRESSION":
"URL",
"TIME",
"CREATIVE",
...
"CLICKS",
"CONVERSIONS"
}
{ "CLICK":
"TIME",
"IMPRESSION_ID",
...
"IMPRESSION"
}
{ "CONVERSION":
"TIME",
"CLICK_ID",
...
"IMPRESSION",
"CLICK"
}
New approach:
- real-time processing
- publishing light events
- immutable streams of events
18/30
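One possible reading of the event shapes above: a light click event carries only an impression id, and the processing step publishes a new, enriched click that embeds the impression instead of updating the impression record. The sketch below assumes that reading; the handler names, field names and in-memory `impressions` store are hypothetical.

```python
from types import MappingProxyType

impressions = {}  # impression_id -> read-only impression event


def on_impression(event):
    # Store a read-only view; later events never modify it.
    impressions[event["id"]] = MappingProxyType(dict(event))


def on_click(light_click):
    # Join the light click with its impression and emit a NEW event.
    impression = impressions[light_click["impression_id"]]
    return MappingProxyType({**light_click, "impression": impression})


on_impression({"id": "imp-1", "url": "https://example.com",
               "time": 100, "creative": "banner-7"})
click = on_click({"id": "clk-1", "time": 130, "impression_id": "imp-1"})
```

Since nothing is updated in place, the enriched stream can be rebuilt at any time by replaying the light events.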
THE 3RD ITERATION: HIGH-LEVEL ARCHITECTURE 19/30
THE 3RD ITERATION: DATA-FLOW TOPOLOGY 20/30
THE FOURTH ITERATION:
MULTI-DC
THE 4TH ITERATION: NEW REQUIREMENTS
Main changes:
- 5-6x larger scale:
> from 350K to 2M bid requests/s within 1.5 years
- full multi-dc architecture:
> merging streams of events
> synchronization of user profiles
- end-to-end exactly-once processing:
> at-least-once output semantics + deduplication
- a few better components:
> merger
> new stats-counter, new data-flow
> dispatcher & loader
> logstash
22/30
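The "at-least-once output semantics + deduplication" point can be sketched as below. This is an illustration under stated assumptions, not the platform's code: it assumes every event carries a unique `id`, that duplicates arrive within a bounded window, and the `Deduplicator` class is a hypothetical name.

```python
from collections import OrderedDict


class Deduplicator:
    """Remembers the most recent event ids, evicting the oldest first."""

    def __init__(self, max_ids=100_000):
        self.seen = OrderedDict()
        self.max_ids = max_ids

    def is_new(self, event_id):
        if event_id in self.seen:
            return False
        self.seen[event_id] = True
        if len(self.seen) > self.max_ids:
            self.seen.popitem(last=False)  # drop the oldest remembered id
        return True


def process(events, dedup, sink):
    # Upstream may redeliver events (at-least-once); the sink sees each id once.
    for event in events:
        if dedup.is_new(event["id"]):
            sink.append(event)


delivered = [{"id": "a"}, {"id": "b"}, {"id": "b"}, {"id": "a"}, {"id": "c"}]
sink = []
process(delivered, Deduplicator(), sink)
```

The window size bounds memory; a duplicate arriving after its id has been evicted would pass through, so the window has to cover the longest expected redelivery delay.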
THE 4TH ITERATION: MULTI-DC ARCHITECTURE 23/30
THE 4TH ITERATION: NEW DATA-FLOW ON KAFKA STREAMS 24/30
(picture from kafka.apache.org)
Why Kafka Streams:
- fully embedded library with no stream processing cluster
- no external dependencies
- Kafka's parallelism model and group membership mechanism
- event-at-a-time processing (not microbatch)
- exactly-once processing semantics (but at-least-once was good enough)
THE 4TH ITERATION: MERGER ON KAFKA CONSUMER API 25/30
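The slide does not say how the merger orders events. As a loudly labelled assumption, the sketch below merges per-datacenter streams by event timestamp using a k-way merge; the tuples and datacenter names are made up, and each input is assumed to be already ordered by time.

```python
import heapq

# (timestamp, datacenter, event type); hypothetical sample data.
dc_a = [(100, "dc-a", "impression"), (130, "dc-a", "click")]
dc_b = [(110, "dc-b", "impression"), (120, "dc-b", "impression")]

# heapq.merge reads its inputs lazily, so it also works on unbounded iterators.
merged = list(heapq.merge(dc_a, dc_b, key=lambda event: event[0]))
```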
THE CURRENT ITERATION:
KAFKA WORKERS
THE 5TH ITERATION: KAFKA WORKERS
Main features:
- higher level of distribution
- possibility to pause and resume processing for a given partition
- asynchronous processing
- tighter control of offset commits
- backpressure
- at-least-once semantics
- processing timeouts
- handling failures
- multiple consumers (in progress)
- kafka-to-kafka, hdfs, bigquery, elasticsearch connectors (in progress)
(github.com/RTBHOUSE/kafka-workers)
27/30
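Asynchronous processing, offset-commit control, backpressure and at-least-once semantics interact: records may finish out of order, but only a contiguous prefix of offsets can be committed safely. The sketch below illustrates that idea for a single partition. It is not the kafka-workers API: `OffsetTracker`, its methods and the `max_in_flight` limit are hypothetical.

```python
class OffsetTracker:
    """Tracks asynchronously processed offsets for one partition."""

    def __init__(self, start_offset=0, max_in_flight=3):
        self.next_to_commit = start_offset  # Kafka convention: next offset to read
        self.done = set()                   # finished, but not yet committable
        self.in_flight = 0
        self.max_in_flight = max_in_flight

    def should_pause(self):
        # Backpressure: stop polling this partition while too much is pending.
        return self.in_flight >= self.max_in_flight

    def start(self, offset):
        self.in_flight += 1

    def complete(self, offset):
        self.in_flight -= 1
        self.done.add(offset)
        # Advance only over a gap-free run of finished offsets.
        while self.next_to_commit in self.done:
            self.done.remove(self.next_to_commit)
            self.next_to_commit += 1

    def committable(self):
        return self.next_to_commit


tracker = OffsetTracker()
for offset in (0, 1, 2):
    tracker.start(offset)
paused = tracker.should_pause()

tracker.complete(2)
after_2 = tracker.committable()   # offset 0 still pending, nothing to commit
tracker.complete(0)
after_0 = tracker.committable()
tracker.complete(1)
after_1 = tracker.committable()
```

If the worker crashes before offset 1 completes, processing restarts from the committed position and offset 2 is handled again, which is why the guarantee is at-least-once.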
THE 5TH ITERATION: KAFKA WORKERS ARCHITECTURE 28/30
SUMMARY
What we have achieved:
- platform monitoring
- much more stable platform
- higher quality of data processing
- HDFS & BigQuery & Elasticsearch streaming
- multi-DC architecture and data synchronization
- high scalability
- better data-flow monitoring, deployment & maintenance
29/30
THANK YOU FOR YOUR ATTENTION
