Presentations from Criteo Labs’ Infrastructure team with a guest speaker from Yandex.
• FastTrack: scaling customer integration
• Evolution of data structures in Yandex.Metrica
• Don't take your software for granted
• Evolution of analytics at Criteo
The solution builds on the existing processing of all user events. Each time a user views a product or puts a product in their basket, Criteo receives an event and stores it in order to display a relevant ad later on. For performance reasons, tracker servers call a “lazy refreshing” memcache rather than the real product store, which lives on a Couchbase cluster.
We needed to plug our solution just after Criteo received the event.
Step 1: Audit the tracker events
Check the event for mandatory parameters and correct parameter formats, check whether the event relates to one or several products, and check whether these products actually exist in our system (they can be missing because of an incomplete product feed, or simply because the advertiser passed us a wrong product id)…
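The step-1 audit can be sketched as a validation function that accumulates errors rather than failing fast, so one event can report every problem at once. The field names and the in-memory “catalog” are hypothetical stand-ins for the real tracker schema and product store.

```python
# Illustrative mandatory parameters; the real event schema differs.
MANDATORY_FIELDS = ("advertiser_id", "event_type", "timestamp")

# Stand-in for the real product catalog (Couchbase-backed in production).
KNOWN_PRODUCTS = {"p1", "p2"}

def audit_event(event):
    """Check mandatory parameters and referenced products; collect all errors."""
    errors = []
    for field in MANDATORY_FIELDS:
        if field not in event:
            errors.append(f"missing mandatory parameter: {field}")
    for pid in event.get("product_ids", []):
        if pid not in KNOWN_PRODUCTS:
            errors.append(f"unknown product id: {pid}")
    return {"event": event, "errors": errors, "valid": not errors}
```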
Step 2: Send this audit event to Kafka, the well-known Apache messaging system, where a global-scale mirroring system has been set up, allowing us to aggregate data from all around the world.
Step 3: Consume Kafka from Druid, a column-oriented distributed data store built on a delta architecture, allowing us to run sub-second queries on the huge number of metrics we needed to compute.
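Once the audit events land in Druid, step 3 amounts to issuing queries against the pre-aggregated segments. A hedged sketch of what such a query might look like, using Druid's native JSON query format: a `timeseries` query summing audit errors per minute. The datasource, metric, and interval names are illustrative assumptions.

```python
import json

# Hypothetical Druid native "timeseries" query: errors per minute over one day.
query = {
    "queryType": "timeseries",
    "dataSource": "tracker_audit",          # illustrative datasource name
    "granularity": "minute",
    "intervals": ["2016-11-01T00:00/2016-11-02T00:00"],
    "aggregations": [
        {"type": "longSum", "name": "error_count", "fieldName": "errors"}
    ],
}

# This JSON body would be POSTed to the Druid broker's query endpoint.
payload = json.dumps(query)
```

Because Druid stores data column-oriented and pre-aggregated, a scan like this over a full day of events typically returns in well under a second, which is what makes the dashboards interactive.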
WebScale writes code to ensure the sustainability and maintainability of the Criteo real-time platform.
We are a team of 12, and we spend a good chunk of our time looking at performance problems.
Some of these performance problems come from changes in your traffic pattern,
and there is no bigger stress test than a giant planetary sale; Kevin will talk about the way we prepare for that.
Significant increase of traffic over a few days
Release freeze
Teams rush to release features before the freeze, so the platform actually becomes less stable than usual. It is critical to find and fix the issues before Black Friday.
Monitoring deviant machines across the datacenters
Spotting isolated abnormally behaving servers
Proactively diagnose and fix the issues before they spread to the DC
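One simple way to spot an isolated, abnormally behaving server is to compare each machine's metric against the rest of the fleet and flag statistical outliers. A minimal sketch, assuming a per-server latency metric and a z-score threshold; both the metric and the threshold are illustrative, not Criteo's actual monitoring logic.

```python
import statistics

def deviant_servers(latencies_ms, z_threshold=3.0):
    """Return names of servers whose metric deviates more than
    z_threshold standard deviations from the fleet mean."""
    values = list(latencies_ms.values())
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # perfectly uniform fleet: nothing to flag
    return [
        name for name, value in latencies_ms.items()
        if abs(value - mean) / stdev > z_threshold
    ]
```

Running this periodically per datacenter is enough to surface a single misbehaving machine early, before its behavior (retries, rebalancing, cascading timeouts) spreads to its neighbors.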