From 6 hours to 1 minute... in 2 days!
How we managed to stream our (long) Hadoop batches
Sofian DJAMAA - Software engineer
@sdjamaa
Phase 1 - Buy displays
Phase 2 - Sell clicks
Phase 3 - ???
Phase 4 - Profit
Criteo
How does it work?
Advertiser side
user: "Let’s go on Amazon!!"
eCommerce website: the website contains a « pixel » used to put information on the user’s cookie
We gather the information for the retargeting process
Publisher side
user: "What’s on the Bild website today?"
publisher website: using the information we have on the user, tagged by a cookie, we display the right ad
HOW DO THEY
KNOOOOOOOOWW????
Our constraints
6 datacenters
3 billion events a day
+50 PB of data in our Hadoop cluster
800K HTTP requests/second
JIRA ticket generation
How do we use the data?
Where’s my money?
WHERE’S MY F@*# MONEY?!?!
Where’s my client’s money?
finance
sales
business dev
internal reports
client reports
billing
(heavy) data transformation
clicks, displays, purchases…
client
YAY!! MAKIN’ MONEY!!
client dashboard (web)
But this takes time…
Where does the data go?
There’s an issue… let’s investigate
WTF?!?!?
business escalation
release management
Data not aggregated cross-DC
Granularity limited due to the volume of data
Load time can be huge even if we bulk insert (or move files)
clicks, displays, purchases…
IIS web servers
SQL Server DBs (multiple instances per DC)
Graphite monitoring
Scaling is limited, therefore only the most aggregated level is kept in Graphite
production alert
Up to 6 hours for some metrics
The volume being huge, processing and storage take time (SQL Server replication hell…)
Multiple datacenters containing data with latency issues
6 hours to get business data
+ 1 hour to check data/raise alerts
+ 1 hour to find root cause
- Some big money (up to a million €)
finance
sales
release management
(heavy) data transformation
PBs of data to replay
client
A lot of people need the same aggregated data but all with their own constraints…
Consistency required
Quick feedback
Batching on Hadoop or SQL Server doesn’t meet the requirement
We need our checks to run as soon as something goes wrong
We need to handle both real-time and batch modes
But who can do it?
Big Data Berlin - Criteo
Some amazing projects such as:
- Ads in GIF format
- Embedded ad banners in movies
- Streaming service stopping a movie several times to display an ad
- Ponies
- And something about Chuck Norris (’cause he’s awesome)
Internal hackathon
No team wanted to take responsibility for the project, so we built our own:
- 3 1/2 developers
- 1 business release manager (as a product owner)
- 2 business intelligence engineers (the guys who write SQL queries all day long)
- 1 business developer (doesn’t code a business layer, obviously)
- 1 creative
- 1 release manager
- 1 technical solution guy (the guy who helps in putting the pixels)
Turbo
What people think a hackathon is…
What a hackathon really is…
SummingBird computations
Metric aggregations at banner/zone level
- Clicks, displays, sales, revenue…
Real-time part
Aggregates are updated and sent after each event
30K messages/second
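The real-time part above can be pictured with a small Python sketch. It is only an illustration, not the SummingBird job from the hackathon: the `Event` fields, the `RealTimeAggregator` class and the sample identifiers are all assumptions made for the example. It keeps one running total per (banner, zone) key and returns the updated aggregate after every event, which is the "updated and sent after each event" behaviour.

```python
from collections import defaultdict
from dataclasses import dataclass

# Metric names taken from the slide; everything else here is hypothetical.
METRICS = ("clicks", "displays", "sales", "revenue")


@dataclass(frozen=True)
class Event:
    banner_id: str
    zone_id: str
    metric: str
    value: float = 1.0


class RealTimeAggregator:
    """Keeps running totals per (banner, zone) and emits them on every event."""

    def __init__(self):
        self._totals = defaultdict(lambda: dict.fromkeys(METRICS, 0.0))

    def update(self, event):
        if event.metric not in METRICS:
            raise ValueError(f"unknown metric: {event.metric}")
        key = (event.banner_id, event.zone_id)
        self._totals[key][event.metric] += event.value
        # Return a copy so callers can forward it without sharing state.
        return key, dict(self._totals[key])


agg = RealTimeAggregator()
agg.update(Event("banner-1", "zone-A", "displays"))
agg.update(Event("banner-1", "zone-A", "displays"))
key, snapshot = agg.update(Event("banner-1", "zone-A", "clicks"))
print(key, snapshot)
```

In a real deployment the returned snapshot would be written to a store or pushed downstream; here it is simply printed.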
Batch part
Computes the expected trend of data for each period (reference data)
Using lasso: sum of squared errors, with a bound on the sum of the absolute
values of the coefficients
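Written out, the lasso described above is the usual constrained least-squares problem. The symbols are assumptions introduced for illustration, since the slides do not name them: y_i is the observed metric for period i, x_ij are the explanatory features, β are the coefficients and t is the bound on their absolute sum.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}
  \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^{2}
\quad \text{subject to} \quad
  \sum_{j=1}^{p} \lvert \beta_j \rvert \le t
```

A smaller t forces more coefficients to exactly zero, which keeps the reference trend simple.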
Data processing is done in batches on each side
When an offline batch is ready, it becomes the truth for the whole system
Online batches are computed in streaming
batch #1 (e.g. 1H), batch #2, batch #3
Online: insert AND update aggregated results for each event
Offline: insert aggregated results for 1 hour of events
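One way to read the online/offline rule above is as a small serving view. The Python below is a sketch under assumptions: the `LambdaView` class, its method names and the period labels are invented for the example. The online side inserts and updates a running total per event, and once the offline batch for a period arrives its value replaces the streaming estimate.

```python
class LambdaView:
    """Serves the offline result when it exists, otherwise the online estimate."""

    def __init__(self):
        self.offline = {}  # period -> total computed by a finished offline batch
        self.online = {}   # period -> running total maintained from the stream

    def on_event(self, period, value):
        # Online side: insert AND update the aggregate for each event.
        self.online[period] = self.online.get(period, 0.0) + value

    def on_offline_batch(self, period, total):
        # Offline side: one insert per finished batch; it becomes the truth.
        self.offline[period] = total
        self.online.pop(period, None)

    def read(self, period):
        if period in self.offline:
            return self.offline[period]
        return self.online.get(period, 0.0)


view = LambdaView()
view.on_event("10:00", 1.0)
view.on_event("10:00", 1.0)
before = view.read("10:00")          # streaming estimate only
view.on_offline_batch("10:00", 3.0)  # e.g. the batch saw a late event
after = view.read("10:00")           # offline value now wins
print(before, after)
```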
λ =
http://github.com/twitter/summingbird
Platform[T]
def job = source.map {
  /* your job here */
}.store()
P#Source: redirects input to the job
P#Store: the job sends its results to the store
The job is executed on the platform
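The point of the platform abstraction is that one job definition runs unchanged in batch and in streaming. The sketch below mimics that idea in plain Python; it does not use or reproduce the Summingbird API, and `BatchPlatform`, `StreamingPlatform`, `job` and the event fields are hypothetical names chosen for the example.

```python
class BatchPlatform:
    """Runs a job over a finite source in one pass."""

    def __init__(self, source):
        self.source = list(source)
        self.store = []

    def run(self, job):
        self.store.extend(job(event) for event in self.source)


class StreamingPlatform:
    """Runs the same job one event at a time as events arrive."""

    def __init__(self):
        self.store = []
        self._job = None

    def run(self, job):
        self._job = job

    def push(self, event):
        if self._job is None:
            raise RuntimeError("no job registered")
        self.store.append(self._job(event))


def job(event):
    # The job itself knows nothing about the platform it runs on.
    return {"banner": event["banner"], "revenue_eur": event["revenue_cents"] / 100}


events = [
    {"banner": "b1", "revenue_cents": 250},
    {"banner": "b2", "revenue_cents": 100},
]

batch = BatchPlatform(events)
batch.run(job)

stream = StreamingPlatform()
stream.run(job)
for e in events:
    stream.push(e)

print(batch.store == stream.store)
```

Both stores end up with the same records, which is what makes it possible to compare the batch and streaming sides of the pipeline.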
Why not stream everything, then?
Streaming costs a lot of infrastructure
Sometimes we need to replay (backfill) events from the past to correct a bug or adjust a formula
With PBs of data generated per day, a streaming architecture needs to be massively parallelized (much more than a batch architecture) for replays
The Lambda Architecture is a good way to move towards a full streaming architecture
Rule engine
Developed in Prolog
10K decisions/minute
Linked to the real-time flow to compute the discrepancies with expected values and tag abnormal data
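The kind of decision the rule engine takes can be sketched as follows. This is Python, not the Prolog rules from the hackathon, and the relative-deviation rule, the 20% tolerance and the tag names are assumptions chosen only to make the idea concrete.

```python
def tag_discrepancy(observed, expected, tolerance=0.2):
    """Tag a value as abnormal when it deviates too far from the expected one.

    The 20% relative tolerance is a hypothetical default, not a value
    taken from the slides.
    """
    if expected == 0:
        return "abnormal" if observed != 0 else "normal"
    deviation = abs(observed - expected) / abs(expected)
    return "abnormal" if deviation > tolerance else "normal"


tags = {
    "clicks": tag_discrepancy(observed=95, expected=100),
    "displays": tag_discrepancy(observed=400, expected=1000),
}
print(tags)
```

The expected values would come from the batch-side reference data, and the resulting tags are what a monitoring system can alert on.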
Vizatra
In-house analytic visualization stack: world map, graphs, real-time curves…
AngularJS, Bootstrap, Scala, Finagle, any DB supporting SQL
Web-component oriented: easy customization
Query deconstruction and NOT query building :-)
Open-source release coming soon
Riemann
Monitoring system with a « powerful stream processing language » (nah, just Clojure configuration files)
Sends alerts based on tags sent by the rule engine
Scoped alerts
Alerts are emails and SMS to on-duty people
JIRA ticket generation
Awesomeness
Before
✗ Data granularity
  - Checks only at publisher website level (only on RTB)
  - No checks on the client side
✗ Latency up to 6 hours
✗ Data aggregated hourly
✗ Money in the bank: $
After
✓ Data granularity
  - Checks at banner/zone levels for better investigations
✓ Latency of 1 minute
✓ Data aggregated in 5-min periods
✓ Money in the bank: $$$$$
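Aggregating in 5-minute periods comes down to flooring each event timestamp to the start of its window. The Python sketch below shows that step only; the function names and the sample timestamps (an arbitrary date) are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime, timezone

WINDOW_SECONDS = 5 * 60


def window_start(ts):
    """Floor a timezone-aware timestamp to the start of its 5-minute window."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % WINDOW_SECONDS, tz=timezone.utc)


def aggregate(events):
    """Count events per (window start, metric)."""
    counts = Counter()
    for ts, metric in events:
        counts[(window_start(ts), metric)] += 1
    return counts


# Arbitrary sample timestamps, chosen only to straddle a window boundary.
events = [
    (datetime(2014, 1, 1, 10, 1, 30, tzinfo=timezone.utc), "clicks"),
    (datetime(2014, 1, 1, 10, 4, 59, tzinfo=timezone.utc), "clicks"),
    (datetime(2014, 1, 1, 10, 5, 0, tzinfo=timezone.utc), "clicks"),
]
counts = aggregate(events)
for (start, metric), n in sorted(counts.items()):
    print(start.isoformat(), metric, n)
```

The first two events land in the 10:00 window and the third opens the 10:05 window.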
Even more…
Thanks to the hackathon, we are now able to provide real-time feedback to sales, business developers, MDs and VPs, which led us to:
- Getting more clients, as they love having quick feedback on their campaigns
- Adjusting CPC in real time
  - For special occasions like sales
  - To aggressively test our prediction models in an A/B test
- And more…
Some feedback
✗ Exponential learning curve with all frameworks
✗ A lot of features are missing (e.g. stores)
✗ Very limited documentation and tutorials
✗ Testing the error rate between Hadoop and Storm takes far too long for a 2-day development period
✗ Cassandra was a bad choice because of the data model needed for the visualization part (a lot of joins)
#Paris, #BigData, #MachineLearning, #NerfGuns, #Hadoop, #Storm, #Spark,
#Cassandra, #MongoDB, #Riemann, #Scala, #C# (?!), #Java
Sofian DJAMAA - Software engineer
@sdjamaa
WE RECRUIT!!!!
We have many open positions in R&D:
•	Data Processing Systems Manager
•	Senior Software Engineer (Grenoble)
•	Software Development lead/Manager
•	SRE OPS Manager
•	Senior Software Engineer – Palo Alto, CA.
•	Python Software Lead Engineer - Paris
•	Software Development Engineer – Paris
•	Machine Learning Scientist

Big Data Berlin - Criteo

  • 1. From 6 hours to 1 minute... in 2 days! How we managed to stream our (long) Hadoop batches. Sofian DJAMAA - Software engineer, @sdjamaa
  • 2. Criteo: Phase 1 - Buy displays; Phase 2 - Sell clicks; Phase 3 - ???; Phase 4 - Profit
  • 3. How does it work? Advertiser side: the user goes to an eCommerce website ("Let’s go on Amazon!!"). The website contains a « pixel » used to put information on the user cookie, and we gather the information for the retargeting process. Publisher side: the user goes to a publisher website ("What’s on Bild website today?"). Using the information we have on the user, tagged by a cookie, we display the right ad.
  • 5. Our constraints: 6 datacenters; 3 billion events a day; +50 PB of data in our Hadoop cluster; 800K HTTP requests/second; JIRA ticket generation
  • 6. How do we use the data? Clicks, displays, purchases… go through a (heavy) data transformation that feeds internal reports, client reports, billing and the client dashboard (web). Finance, sales and business dev ask: "Where’s my money?", "WHERE’S MY F@*# MONEY?!?!", "Where’s my client’s money?" Client: "YAY!! MAKIN’ MONEY!!"
  • 7. But this takes time…
  • 8. Where does the data go? Clicks, displays, purchases… → IIS web servers → SQL Server DBs (multiple instances per DC) → Graphite monitoring → production alert. Scaling is limited, therefore only the most aggregated level is kept in Graphite. Data is not aggregated cross-DC. Granularity is limited due to the volume of data. Load time can be huge even if we bulk insert (or move files). Business escalation to release management: "WTF?!?!?" "There’s an issue… let’s investigate."
  • 9. Up to 6 hours for some metrics. The volume being huge, processing and storage take time (SQL Server replication hell…). Multiple datacenters contain data with latency issues.
  • 10. 6 hours to get business data + 1 hour to check data/raise alerts + 1 hour to find root cause. Result: - Some big money (up to million €)
  • 11. A lot of people (finance, sales, release management, client) need the same aggregated data, but all with their own constraints: consistency required, quick feedback, (heavy) data transformation, PBs of data to replay.
  • 12. Batching on Hadoop or SQL Server doesn’t meet the requirement. We need our checks to run as soon as something goes wrong. We need to handle both real-time and batch mode.
  • 13. But who can do it?
  • 15. Internal hackathon. Some amazing projects such as: ads in GIF format; embedded ad banners in movies; a streaming service stopping a movie several times to display an ad; ponies; and something about Chuck Norris (‘cause he’s awesome).
  • 16. Turbo. No team wanted to take responsibility for the project, so we built our own: 3 1/2 developers; 1 business release manager (as a product owner); 2 business intelligence engineers (the guys that write SQL queries all day long); 1 business developer (doesn’t code a business layer obviously); 1 creative; 1 release manager; 1 technical solution guy (the guy who helps in putting the pixels).
  • 17. What people think a hackathon is…
  • 18. What a hackathon really is…
  • 20. SummingBird computations. Metric aggregations at banner/zone level: clicks, displays, sales, revenue… Real-time part: aggregates are updated and sent after each event (30K messages/second). Batch part: computes the expected trend of data for each period (reference data) using lasso, i.e. minimizing the sum of squared errors with a bound on the sum of the absolute values of the coefficients.
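The lasso criterion named on this slide can be written out as below. The notation is an assumption made here for illustration ($y_i$ is the observed metric for period $i$, $x_i$ its feature vector, $\beta$ the coefficients, $t$ the bound); the slides do not say which features or which bound were used.

```latex
\hat{\beta} \;=\; \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2
\quad \text{subject to} \quad \sum_{j=1}^{p} \lvert \beta_j \rvert \le t
```

The bound on the absolute values pushes some coefficients to exactly zero, which keeps the reference trend simple.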
  • 21. Data processing is done in batches on each side (batch #1 (e.g. 1H), batch #2, batch #3…). Online batches are computed in streaming: insert AND update aggregated results for each event. Offline batches: insert aggregated results for 1 hour of events. When an offline batch is ready, it becomes the truth for the whole system.
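A minimal sketch of the "offline batch becomes the truth" rule, in plain Python rather than the actual Summingbird/Storm/Hadoop stack. The class `MetricStore`, its method names, and the period and key formats are all hypothetical and only illustrate the idea.

```python
from collections import defaultdict


class MetricStore:
    """Toy serving store: online aggregates, overridden by offline batches."""

    def __init__(self):
        # period -> key -> running aggregate, updated event by event
        self.online = defaultdict(lambda: defaultdict(int))
        # period -> {key: aggregate}, written once per finished offline batch
        self.offline = {}

    def on_event(self, period, key, value=1):
        # Streaming side: insert AND update the aggregate for each event.
        self.online[period][key] += value

    def load_offline_batch(self, period, aggregates):
        # Batch side: insert the aggregated results for a whole period.
        self.offline[period] = dict(aggregates)

    def read(self, period, key):
        # Once an offline batch exists for the period, it is the truth.
        if period in self.offline:
            return self.offline[period].get(key, 0)
        return self.online.get(period, {}).get(key, 0)
```

Until the hourly batch lands, readers see the streaming value; afterwards they see the batch value, even if the two disagree.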
  • 23. Platform[T]: the job is executed on it. def job = source.map { /* your job here */ }.store() P#Source redirects input to the job; the job sends its results to P#Store.
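The idea of writing the job once and running it on either platform can be sketched in plain Python. This is not the Summingbird API; `count_clicks`, `BatchPlatform`, `StreamPlatform` and the event fields are hypothetical names used only to illustrate the pattern.

```python
from collections import Counter


def count_clicks(event):
    # The job logic, written once: map a raw event to (key, value) pairs.
    if event["type"] == "click":
        yield (event["banner"], event["zone"]), 1


class BatchPlatform:
    def run(self, job, source, store):
        # Aggregate the whole input first, then write to the store once.
        partial = Counter()
        for event in source:
            for key, value in job(event):
                partial[key] += value
        store.update(partial)  # Counter.update adds counts


class StreamPlatform:
    def run(self, job, source, store):
        # Update the store after every single event.
        for event in source:
            for key, value in job(event):
                store[key] += value


events = [
    {"type": "click", "banner": "b1", "zone": "z1"},
    {"type": "display", "banner": "b1", "zone": "z1"},
    {"type": "click", "banner": "b1", "zone": "z1"},
    {"type": "click", "banner": "b2", "zone": "z1"},
]
batch_store, stream_store = Counter(), Counter()
BatchPlatform().run(count_clicks, events, batch_store)
StreamPlatform().run(count_clicks, events, stream_store)
```

Both stores end up with the same aggregates; only the moment at which results become visible differs.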
  • 24. Why not stream everything then? Streaming costs a lot of infrastructure. Sometimes we need to replay (backfill) events from the past to correct a bug or adjust a formula. With PBs of data generated per day, a streaming architecture needs to be massively parallelized (much more than a batching architecture) for replays. Lambda Architecture is a good way to move towards a full streaming architecture.
  • 25. Rule engine: developed in Prolog; 10K decisions/minute; linked to the real-time flow to compute the discrepancies with expected values and tag abnormal data.
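The kind of decision the rule engine makes can be sketched as follows. The original is written in Prolog and its rules are not shown in the slides, so this Python version, the function `tag_metric`, the relative-discrepancy rule and the 30% tolerance are all assumptions for illustration.

```python
def tag_metric(observed, expected, tolerance=0.3):
    """Tag a real-time value by its relative discrepancy with the expected one.

    `tolerance` is a hypothetical threshold, not a value from the talk.
    """
    if expected == 0:
        # No reference activity expected: any observed activity is suspicious.
        return "ok" if observed == 0 else "abnormal"
    discrepancy = abs(observed - expected) / abs(expected)
    return "abnormal" if discrepancy > tolerance else "ok"
```

For example, `tag_metric(50, 100)` yields `"abnormal"` (50% off the expected value), while `tag_metric(80, 100)` yields `"ok"`.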
  • 26. Vizatra: in-house analytic visualization stack (world map, graphs, real-time curves…). AngularJS, Bootstrap, Scala, Finagle, any DB supporting SQL. Web-component oriented: easy customization. Query deconstruction and NOT query building :-) Open-source release coming soon.
  • 27. Riemann: monitoring system with a « powerful stream processing language » (nah, just Clojure configuration files). Sends alerts based on tags sent by the rule engine. Scoped alerts. Alerts are emails and SMS to on-duty people. JIRA ticket generation.
  • 28. Awesomeness. Before (✗) vs. after (✓): ✗ Data granularity: checks only at publisher website level (only on RTB), no checks on the client side; ✓ Data granularity: checks at banner/zone levels for better investigations. ✗ Latency up to 6 hours; ✓ Latency of 1 minute. ✗ Data aggregated hourly; ✓ Data aggregated in 5-min periods. ✗ Money in the bank: $; ✓ Money in the bank: $$$$$.
  • 29. Even more… Thanks to the hackathon, we are now able to provide real-time feedback to sales, business developers, MDs and VPs, which led to: getting more clients, as they love having quick feedback on their campaigns; adjusting CPC in real time, for special occasions like sales and to aggressively test our prediction models in an A/B test; and more…
  • 30. Some feedback: ✗ Exponential learning curve with all frameworks. ✗ A lot of features are missing (e.g. stores). ✗ Very limited documentation and tutorials. ✗ Testing the error rate between Hadoop and Storm takes far too long for a 2-day development period. ✗ Cassandra was a bad choice because of the data model needed for the visualization part (lots of joins).
  • 31. #Paris, #BigData, #MachineLearning, #NerfGuns, #Hadoop, #Storm, #Spark, #Cassandra, #MongoDB, #Riemann, #Scala, #C# (?!), #Java. Sofian DJAMAA - Software engineer, @sdjamaa. WE RECRUIT!!!! We have many open positions in R&D: Data Processing Systems Manager; Senior Software Engineer (Grenoble); Software Development Lead/Manager; SRE OPS Manager; Senior Software Engineer - Palo Alto, CA; Python Software Lead Engineer - Paris; Software Development Engineer - Paris; Machine Learning Scientist.