WHOAMI
> Ruben Berenguel (@berenguel)
> PhD in Mathematics
> Lead Data Engineer at Hybrid Theory
> Preferred stack is Python, Go and Scala
Part 1 Set up
Part 2 The identity graph
Part 3 Speed up and improvements
Part 1: Set up
Adtech
What are cookies, really?
What is cookie mapping?
The identity problem
PROGRAMMATIC ADTECH
FIND USERS SATISFYING SOME CRITERIA
> Visited pages of category ABC
> Are interested in concept XYZ
> Are likely to want to buy from our client RST
TO FIND THEM WE NEED THEIR BROWSE AND/OR BEHAVIOUR DATA
TO DELIVER FOR OUR CLIENTS WE NEED A WAY TO SHOW THEM ADS
COOKIES ARE USED TO HELP WEBSITES TRACK EVENTS AND STATE AS USERS BROWSE
THERE ARE TWO KINDS OF COOKIES
FIRST PARTY (SESSION, STATE…)
THIRD PARTY (EVENT TRACKING…)
WE GET BROWSE DATA FROM USERS ON THE WEB FROM DATA PROVIDERS [A]
[A] Event logs with cookies provided in batch by data providers
WE GET BROWSE DATA FROM USERS BROWSING OUR CLIENT WEBSITE [B]
[B] Event logs with cookies generated from our servers, via our pixels
HOW DO WE CONNECT BOTH DATA SOURCES?
THE IDENTIFIERS WE GET FROM BOTH SIDES ARE UNRELATED!
MAPPING SERVERS AND THE MAPPING CHAIN
THE IDENTITY PROBLEM
BASIC SOLUTION
> Coalesce (merge on nulls) chains based on one id
> Is not as complete as the graph approach because…
> Requires one stable identifier
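The coalescing idea can be sketched in plain Python (not Spark). The row layout, the column names `our_id`, `partner_1`, `partner_2` and the function `coalesce_chains` are assumptions made for illustration; the slides do not show the actual implementation.

```python
from collections import defaultdict


def coalesce_chains(rows, key):
    """Merge partial mapping rows that share the same stable id `key`.

    For every column, keep the first non-null value seen, which is the
    same idea as SQL COALESCE applied across rows.
    """
    merged = defaultdict(dict)
    for row in rows:
        stable_id = row.get(key)
        if stable_id is None:
            # Rows lacking the stable identifier cannot be merged at all:
            # this is the limitation named in the last bullet.
            continue
        target = merged[stable_id]
        for column, value in row.items():
            if value is not None and target.get(column) is None:
                target[column] = value
    return dict(merged)


# Hypothetical mapping rows, each one a partial chain.
rows = [
    {"our_id": "u1", "partner_1": "p1_a", "partner_2": None},
    {"our_id": "u1", "partner_1": None, "partner_2": "p2_x"},
    {"our_id": None, "partner_1": "p1_a", "partner_2": "p2_y"},  # lost
]
chains = coalesce_chains(rows, key="our_id")
```

The third row links `p1_a` to `p2_y`, but it is dropped because it has no `our_id`. A graph formulation would keep that link.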
Part 2: The identity graph
Rethink the problem as a graph
Connected components in big data
ENTER GRAPHFRAMES
BASIC SPARK GRAPH FRAMEWORK: GRAPHX
IT IS MESSAGE-PROPAGATION (LIKE THE PREGEL API), GRAPH-PARALLEL, LOW LEVEL
GRAPHFRAMES ARE TO DATAFRAMES AS GRAPHX IS TO RDDS
ALTERNATIVES CONSIDERED…
> Apache Giraph: harder maintenance
> Neo4J: harder scalability
> AWS Neptune: too new
INPUT SHOULD BE FORMATTED AS A DATAFRAME OF EDGES
src            dst            (…)
partner_1_!    partner_2_⍺    1617963647…
partner_1_2    partner_3_⭘    1617963647…
partner_2_𝛄    partner_3_     1617963654…
⁞              ⁞              ⁞
CONNECTED COMPONENTS IN BIG DATA
THE LARGE STAR - SMALL STAR ALGORITHM
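A minimal single-machine sketch of the large star - small star iteration, following the description in "Connected Components in MapReduce and Beyond" (listed in the references). It is plain Python rather than Spark or GraphFrames code. The function names are hypothetical, and node ids are assumed to be comparable values such as strings.

```python
from collections import defaultdict


def _normalise(edges):
    # Store each undirected edge once as (larger, smaller); drop self-loops.
    return {(max(u, v), min(u, v)) for u, v in edges if u != v}


def large_star(edges):
    """Link every strictly larger neighbour of u to the minimum of u's neighbourhood."""
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    out = set()
    for u, ns in neighbours.items():
        m = min(ns | {u})
        out.update((v, m) for v in ns if v > u)
    return out


def small_star(edges):
    """Link u and its smaller neighbours to the minimum among them."""
    smaller = defaultdict(set)
    for u, v in edges:
        smaller[max(u, v)].add(min(u, v))
    out = set()
    for u, ns in smaller.items():
        m = min(ns)
        out.update((v, m) for v in ns | {u} if v != m)
    return out


def connected_components(edges):
    """Return {node: component label}; the label is the smallest node id."""
    current = _normalise(edges)
    while True:
        updated = small_star(large_star(current))
        if updated == current:  # every component is now a star
            break
        current = updated
    labels = {}
    for node, root in current:
        labels[node] = root
        labels[root] = root
    return labels


# Hypothetical edges in the src/dst layout shown earlier.
edges = [
    ("partner_1_a", "partner_2_b"),
    ("partner_2_b", "partner_3_c"),
    ("partner_1_z", "partner_3_y"),
]
labels = connected_components(edges)
```

In a distributed setting each star operation would be one shuffle (group by node, then emit new edges), which is why the process is shuffle hungry. Nodes that only appear in self-loops are dropped by this sketch.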
OUTPUT LAYOUT
Component Id    Partner / Cookie Id    Timestamp
10234           partner_1_!            1617963647
10234           partner_2_⍺            1617963647
5534            partner_1_2            1617963654
⁞               ⁞                      ⁞
To map from Partner A to Partner B
> Given an id Partner_A_X,
> we find the connected component id for the node Partner_A_X,
> we find all the nodes of the form Partner_B_* for the component above
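The lookup steps above can be sketched over rows in the output layout (component id, partner / cookie id, timestamp). This is plain Python with hypothetical ids; the function name `map_partner` and the prefix convention `partner_B_` are assumptions for illustration.

```python
def map_partner(rows, source_id, target_prefix):
    """Return all ids with `target_prefix` in the same component as `source_id`."""
    # Step 1: find the connected component id for the source node.
    component_of = {node: component for component, node, _ts in rows}
    component = component_of.get(source_id)
    if component is None:
        return []  # the source id is not in the graph
    # Step 2: collect the nodes of the target partner in that component.
    return sorted(
        node
        for comp, node, _ts in rows
        if comp == component and node.startswith(target_prefix)
    )


# Hypothetical output rows: (component id, partner / cookie id, timestamp).
rows = [
    (10234, "partner_A_x", 1617963647),
    (10234, "partner_B_7", 1617963647),
    (10234, "partner_B_9", 1617963650),
    (5534, "partner_B_3", 1617963654),
]
mapped = map_partner(rows, "partner_A_x", "partner_B_")
```

On a DataFrame the same lookup would likely be a self-join on the component id column, filtered by partner.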
IMPACT OF MOVING FROM AN AD HOC PROCESS TO A GRAPH PROCESS
> Partner integration: from 2 months to 1 week
> Users mapped uplift: around 20%
> Mapping "quality": competitive (within 5%) with industry leaders
Part 3: Speed up and improvements
Data cleanup
Cheap refresh
Machine tuning
Potential improvement
DATA CLEANUP
INVALID IDENTIFIERS
LIKE NA OR 0 OR XYZ
(OR FRAUDULENT CALLS TO A MAPPING SERVER)
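A minimal sketch of filtering placeholder identifiers before building edges, in plain Python. The blocklist only contains the examples from the slide plus the empty string (an assumption). The edge layout `(src, dst, timestamp)` with raw, unprefixed ids and the function names are hypothetical. Detecting fraudulent mapping calls is not covered here.

```python
# Placeholder values seen as ids; extend with whatever the data shows.
INVALID_IDS = {"", "na", "0", "xyz"}


def is_valid_id(raw_id):
    """Reject missing ids and known placeholder values (case-insensitive)."""
    if raw_id is None:
        return False
    return raw_id.strip().lower() not in INVALID_IDS


def clean_edges(edges):
    """Keep only edges whose two endpoints are both valid identifiers."""
    return [
        (src, dst, ts)
        for src, dst, ts in edges
        if is_valid_id(src) and is_valid_id(dst)
    ]


raw_edges = [
    ("abc", "def", 1617963647),
    ("NA", "def", 1617963647),   # placeholder source id
    ("abc", "0", 1617963654),    # placeholder destination id
    (None, "ghi", 1617963654),   # missing source id
]
cleaned = clean_edges(raw_edges)
```

A single placeholder such as `NA` shared by many users would otherwise join all of them into one huge component.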
NODE PRUNING
TO PREVENT HUGE COMPONENTS
IN THE COOKIE CASE, BY EXPIRING COOKIES NOT SEEN IN N DAYS
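A minimal sketch of expiring nodes not seen in N days, in plain Python. It assumes edges of the form `(src, dst, timestamp)` with Unix timestamps in seconds and treats a node's last-seen time as the newest timestamp among its edges; both are assumptions, as is the function name `prune_expired`.

```python
SECONDS_PER_DAY = 86_400


def prune_expired(edges, now, max_age_days):
    """Drop every edge that touches a node not seen in `max_age_days` days."""
    cutoff = now - max_age_days * SECONDS_PER_DAY
    last_seen = {}
    for src, dst, ts in edges:
        for node in (src, dst):
            last_seen[node] = max(ts, last_seen.get(node, ts))
    return [
        (src, dst, ts)
        for src, dst, ts in edges
        if last_seen[src] >= cutoff and last_seen[dst] >= cutoff
    ]


now = 1617963647
old = now - 40 * SECONDS_PER_DAY
recent = now - 1 * SECONDS_PER_DAY
edges_with_age = [
    ("a", "b", old),      # "a" was last seen 40 days ago: expired
    ("b", "c", recent),   # both ends seen yesterday: kept
    ("d", "e", old),      # both ends expired
]
pruned = prune_expired(edges_with_age, now=now, max_age_days=30)
```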
COMPONENT DESTRUCTION
TO LIMIT COMPONENT SIZE ARTIFICIALLY
IF THE DATA IS FULLY CLEAN WE CAN ASSUME NO USER HAS MORE THAN M IDENTIFIERS
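A minimal sketch of limiting component size, in plain Python. The slide does not say what "destruction" does to an oversized component; this sketch assumes the whole component is discarded. The function name and the `{node: component id}` input layout are hypothetical.

```python
from collections import Counter


def drop_oversized_components(component_of, max_size):
    """Remove every node whose component has more than `max_size` members."""
    sizes = Counter(component_of.values())
    return {
        node: component
        for node, component in component_of.items()
        if sizes[component] <= max_size
    }


# Hypothetical assignment: component 1 has three ids, component 2 has two.
component_of = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2}
kept = drop_oversized_components(component_of, max_size=2)
```

The threshold plays the role of M: a component with more identifiers than any single user plausibly has is treated as an artefact of dirty data.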
WHAT IS THE FASTEST WAY TO BUILD A 2 BILLION NODE GRAPH DAILY?
NOT DOING IT
THE EASY WAY
MACHINE TUNING FOR LARGE GRAPHS
GO LARGE AND TUNE UP
> the process is memory hungry
> the process is shuffle hungry
BETTER TO HAVE A FEW LARGE MACHINES
AND GIVE EXECUTORS MORE MEMORY THAN YOU'D THINK
IMPACT OF ADAPTIVE QUERY EXECUTION (AQE)
AQE USES RUNTIME STATISTICS TO HELP THE COST BASED OPTIMIZER (CBO) AND SPEED UP SPARK
USING SPARK 3.X WITH AQE ACTIVE GIVES A 30-40% SPEED-UP
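A minimal sketch of turning AQE on for a job. The three keys are standard Spark 3.x SQL settings, and AQE is already on by default from Spark 3.2. The helper `to_submit_flags` is hypothetical and only renders the settings as `spark-submit` arguments; the slides do not show which settings were actually used.

```python
# Spark 3.x settings that enable AQE and two of its optimisations.
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def to_submit_flags(settings):
    """Render a settings dict as a list of `--conf key=value` arguments."""
    flags = []
    for key, value in sorted(settings.items()):
        flags += ["--conf", f"{key}={value}"]
    return flags


flags = to_submit_flags(AQE_SETTINGS)
command_line = " ".join(["spark-submit"] + flags)
```

The same keys can be set on the session builder or in `spark-defaults.conf` instead of on the command line.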
FURTHER IMPROVEMENTS
> Easy: Move storage to Delta Lake
> Hard: implement union-find-shuffle instead of large star - small star
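For context, a minimal sketch of the classic single-machine union-find structure, in plain Python. This is only the local building block: the distributed "shuffle" part of union-find-shuffle (running it per partition and merging results across partitions) is not shown, and the class below is an illustration, not the method from the referenced article.

```python
class UnionFind:
    """Disjoint sets with path compression; the smallest id becomes the root."""

    def __init__(self):
        self.parent = {}

    def find(self, node):
        self.parent.setdefault(node, node)
        root = node
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path straight at the root.
        while self.parent[node] != root:
            self.parent[node], node = root, self.parent[node]
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            low, high = (root_a, root_b) if root_a < root_b else (root_b, root_a)
            self.parent[high] = low


uf = UnionFind()
for src, dst in [("a", "b"), ("b", "c"), ("x", "y")]:
    uf.union(src, dst)
component_labels = {node: uf.find(node) for node in ["a", "b", "c", "x", "y"]}
```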
THANKS!
Get the slides from my GitHub: github.com/rberenguel/
The repository is identity-graphs
References
Connected Components in MapReduce and Beyond (ACM)
Connected Components in MapReduce and Beyond (slides)
Partition Aware Connected Component Computation in Distributed Systems
Building Graphs at a Large Scale: Union Find Shuffle
Adaptive Query Execution: Speeding up SparkSQL at runtime
Pregel: A System for Large-Scale Graph Processing
GraphX
GraphFrames
Apache Giraph
Neo4J
AWS Neptune
Databricks' Delta Lake: high on ACID
Related talks
Massive-Scale Entity Resolution Using the Power of Apache Spark and Graph
Maps and Meaning: Graph-based Entity Resolution in Apache Spark & GraphX
Building Identity Graph at Scale for Programmatic Media Buying Using Apache Spark and Delta Lake
Building Identity Graphs over Heterogeneous Data
Optimize the Large Scale Graph Applications by using Apache Spark with 4-5x Performance Improvements
GraphFrames: Graph Queries In Spark SQL
Using GraphX/Pregel on Browsing History to Discover Purchase Intent
Reference Image attribution
Graphs Ruben Berenguel (Generative art with p5js)
Bulb Alessandro Bianchi (Unsplash)
Bubbles Marko Blažević (Unsplash)
Chair Volodymyr Tokar (Unsplash)
Cookie Dex Ezekiel (Unsplash)
Loupe Agence Olloweb (Unsplash)
Map Timo Wielink (Unsplash)
Mask Adnan Khan (Unsplash)
Newspaper Rishabh Sharma (Unsplash)
Party Adi Goldstein (Unsplash)
Socket Kelly Sikkema (Unsplash)
Spray JESHOOTS.COM (Unsplash)
Tuning gustavo Campos (Unsplash)
Web Shannon Potter (Unsplash)
Resources
Unicode table
EOF

Keeping Identity Graphs In Sync With Apache Spark

  • 1.
  • 2.
  • 3. WHOAMI > Ruben Berenguel (@berenguel) > PhD in Mathematics > Lead Data Engineer at Hybrid Theory > Preferred stack is Python, Go and Scala
  • 4. Part 1 Set up Part 2 The identity graph Part 3 Speed up and improvements
  • 5. Part 1: Set up Adtech What are cookies, really? What is cookie mapping? The identity problem
  • 6. PROGRAMMATIC ADTECH FIND USERS SATISFYING SOME CRITERIA
  • 7. PROGRAMMATIC ADTECH FIND USERS SATISFYING SOME CRITERIA > Visited pages of category ABC
  • 8. PROGRAMMATIC ADTECH FIND USERS SATISFYING SOME CRITERIA > Visited pages of category ABC > Are interested in concept XYZ
  • 9. PROGRAMMATIC ADTECH FIND USERS SATISFYING SOME CRITERIA > Visited pages of category ABC > Are interested in concept XYZ > Are likely to want to buy from our client RST
  • 10. TO FIND THEM WE NEED THEIR BROWSE AND/OR BEHAVIOUR DATA ! "
  • 11. TO FIND THEM WE NEED THEIR BROWSE AND/OR BEHAVIOUR DATA ! " TO DELIVER FOR OUR CLIENTS WE NEED A WAY TO SHOW THEM ADS !
  • 12. COOKIES ARE USED TO HELP WEBSITES TRACK EVENTS AND STATE AS USERS BROWSE
  • 13. THERE ARE TWO KIND OF COOKIES FIRST PARTY (SESSION, STATE…) THIRD PARTY (EVENT TRACKING…)
  • 14.
  • 15.
  • 16.
  • 17.
  • 18.
  • 19.
  • 20.
  • 21.
  • 22.
  • 23.
  • 24. WE GET BROWSE DATA FROM USERS ON THE WEB FROM DATA PROVIDERSA A Event logs with cookies provided in batch by data providers
  • 25. WE GET BROWSE DATA FROM USERS IN THE WEB FROM DATA PROVIDERS WE GET BROWSE DATA FROM USERS BROWSING OUR CLIENT WEBSITEB B Event logs with cookies generated from our servers, via our pixels
  • 26. HOW DO WE CONNECT BOTH DATA SOURCES? THE IDENTIFIERS WE GET FROM BOTH SIDES ARE UNRELATED! !
  • 28.
  • 29.
  • 30.
  • 31.
  • 32.
  • 33.
  • 34.
  • 35.
  • 36.
  • 37.
  • 38.
  • 39.
  • 40.
  • 42.
  • 44. BASIC SOLUTION > Coalesce (merge on nulls) chains based on one id
  • 45. BASIC SOLUTION > Coalesce (merge on nulls) chains based on one id > Is not as complete as the graph approach because…
  • 46. BASIC SOLUTION > Coalesce (merge on nulls) chains based on one id > Is not as complete as the graph approach because… > Requires one stable identifier
  • 47.
  • 48. Part 2: The identity graph Rethink the problem as a graph Connected components in big data
  • 49.
  • 50.
  • 51.
  • 52.
  • 53.
  • 54.
  • 55.
  • 56.
  • 57.
  • 58.
  • 60. BASIC SPARK GRAPH FRAMEWORK: GRAPHX IT IS MESSAGE-PROPAGATIONC , GRAPH-PARALLEL, LOW LEVEL C Like the Pregel API
  • 61. BASIC SPARK GRAPH FRAMEWORK: GRAPHX IT IS MESSAGE-PROPAGATION (PREGEL API) GRAPH-PARALLEL, LOW LEVEL GRAPHFRAMES ARE TO DATAFRAMES AS GRAPHX IS TO RDDS
  • 62. ALTERNATIVES CONSIDERED… Apache Giraph harder maintenance Neo4J harder scalability AWS Neptune too new
  • 63. INPUT SHOULD BE FORMATTED AS A DATAFRAME OF EDGES src dst (…) partner_1_! partner_2_⍺ 1617963647… partner_1_2 partner_3_⭘ 1617963647… partner_2_𝛄 partner_3_ 1617963654… ⁞ ⁞ ⁞
  • 64. CONNECTED COMPONENTS IN BIG DATA THE LARGE STAR - SMALL STAR ALGORITHM
  • 65.
  • 66.
  • 67.
  • 68.
  • 69.
  • 70.
  • 71.
  • 72.
  • 73.
  • 74.
  • 75.
  • 76.
  • 77.
  • 78.
  • 79. OUTPUT LAYOUT Component Id Partner / Cookie Id Timestamp 10234 partner_1_! 1617963647 10234 partner_2_⍺ 1617963647 5534 partner_1_2 1617963654 ⁞ ⁞ ⁞
  • 83. To map from Partner A to Partner B
      > Given an id Partner_A_X,
      > we find the connected component id for the node Partner_A_X,
      > we find all the nodes of the form Partner_B_* for the component above
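
The three steps above can be sketched against rows in the output layout (component id, partner / cookie id, timestamp). The function name `map_ids` and the sample rows are hypothetical; at scale this would more likely be a join on the component id than an in-memory scan.

```python
def map_ids(component_rows, source_id, target_partner):
    """Return the ids of target_partner that share a component with source_id."""
    # Step 2: find the connected component id for the source node.
    component_of = {node: comp for comp, node, _ts in component_rows}
    comp = component_of.get(source_id)
    if comp is None:
        return []  # unknown id: nothing to map
    # Step 3: keep the nodes of that component that belong to the target partner.
    prefix = f"{target_partner}_"
    return sorted(
        node for c, node, _ts in component_rows
        if c == comp and node.startswith(prefix)
    )


rows = [
    (10234, "partner_1_x", 1617963647),
    (10234, "partner_2_a", 1617963647),
    (5534, "partner_1_2", 1617963654),
]
mapped = map_ids(rows, "partner_1_x", "partner_2")
```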
  • 87. IMPACT OF MOVING FROM AN AD HOC PROCESS TO A GRAPH PROCESS
      > Partner integration: from 2 months to 1 week
      > Users mapped uplift: around 20%
      > Mapping "quality": competitive (within 5%) with industry leaders
  • 88. Part 3: Speed up and improvements
      Data cleanup
      Cheap refresh
      Machine tuning
      Potential improvement
  • 91. INVALID IDENTIFIERS LIKE NA OR 0 OR XYZ (OR FRAUDULENT CALLS TO A MAPPING SERVER)
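
A placeholder id such as NA shared by many unrelated users would link all of them into one component, so such values need to be dropped before building edges. A minimal sketch, assuming edges are (src, dst) pairs of raw ids; the denylist below is hypothetical and only seeded with the examples from the slide.

```python
# Hypothetical denylist; a real one would be built from the observed data.
INVALID_IDS = {"", "na", "n/a", "null", "none", "0", "xyz"}


def is_valid_id(raw_id):
    """Reject missing ids and known placeholder values (case-insensitive)."""
    return raw_id is not None and raw_id.strip().lower() not in INVALID_IDS


def clean_edges(edges):
    """Keep only edges whose two endpoints are both valid ids."""
    return [(src, dst) for src, dst in edges if is_valid_id(src) and is_valid_id(dst)]


raw_edges = [("a1", "b7"), ("NA", "b8"), ("a2", "0"), (None, "b9"), ("a3", " XYZ ")]
kept = clean_edges(raw_edges)
```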
  • 93. NODE PRUNING TO PREVENT HUGE COMPONENTS: IN THE COOKIE CASE, BY EXPIRING COOKIES NOT SEEN IN N DAYS
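
Expiry by last-seen time can be sketched as below, assuming each node carries the Unix timestamp (in seconds) of its most recent event. The function names and the 30-day value are illustrative only; the talk does not state N.

```python
SECONDS_PER_DAY = 86_400


def prune_stale_nodes(last_seen, now, max_age_days):
    """Keep only the nodes seen within the last max_age_days."""
    cutoff = now - max_age_days * SECONDS_PER_DAY
    return {node for node, ts in last_seen.items() if ts >= cutoff}


def prune_edges(edges, live_nodes):
    """Drop every edge that touches an expired node."""
    return [(s, d) for s, d in edges if s in live_nodes and d in live_nodes]


now = 1617963647
last_seen = {
    "partner_1_a": now - 10 * SECONDS_PER_DAY,
    "partner_2_x": now - 2 * SECONDS_PER_DAY,
    "partner_3_q": now - 40 * SECONDS_PER_DAY,  # older than the cutoff
}
live = prune_stale_nodes(last_seen, now, max_age_days=30)
remaining = prune_edges(
    [("partner_1_a", "partner_2_x"), ("partner_2_x", "partner_3_q")], live
)
```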
  • 95. COMPONENT DESTRUCTION TO LIMIT COMPONENT SIZE ARTIFICIALLY: IF THE DATA IS FULLY CLEAN WE CAN ASSUME NO USER HAS MORE THAN M IDENTIFIERS
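
One way to apply the size cap, assuming a node-to-component mapping is already available, is to discard every component with more than M nodes. The slide does not say whether oversized components are dropped or split, so dropping them is an assumption of this sketch.

```python
from collections import Counter


def destroy_large_components(component_of, max_size):
    """Remove all nodes that belong to a component with more than max_size nodes."""
    sizes = Counter(component_of.values())
    return {
        node: comp for node, comp in component_of.items()
        if sizes[comp] <= max_size
    }


component_of = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2}
capped = destroy_large_components(component_of, max_size=2)
```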
  • 97. WHAT IS THE FASTEST WAY TO BUILD A 2-BILLION-NODE GRAPH DAILY? NOT DOING IT
  • 112. GO LARGE AND TUNE UP
      > the process is memory hungry
      > the process is shuffle hungry
      BETTER TO HAVE FEW, LARGE, MACHINES
      AND GIVE EXECUTORS MORE MEMORY THAN YOU'D THINK
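
The Spark settings most directly related to this advice are executor memory, memory overhead, cores per executor and shuffle partitions. The keys below are standard Spark configuration properties, but every value is a placeholder: the talk gives no numbers, and suitable ones depend on the cluster.

```python
# Placeholder values, for illustration only; not taken from the talk.
EXECUTOR_CONF = {
    "spark.executor.memory": "48g",
    "spark.executor.memoryOverhead": "8g",
    "spark.executor.cores": "5",
    "spark.sql.shuffle.partitions": "4000",
}


def to_submit_args(conf):
    """Render a settings dict as spark-submit `--conf key=value` arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = to_submit_args(EXECUTOR_CONF)
```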
  • 114. IMPACT OF ADAPTIVE QUERY EXECUTION (AQE)
      AQE USES RUNTIME STATISTICS TO HELP THE COST BASED OPTIMIZER (CBO) AND SPEED UP SPARK
      USING SPARK 3.X WITH AQE ACTIVE GIVES A 30-40% SPEED-UP
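
AQE is controlled through Spark SQL configuration. The keys below are standard Spark 3.x properties. In Spark 3.0 and 3.1 AQE is off by default, so it has to be enabled explicitly there. Which of the sub-features the talk relied on is not stated, so enabling all three is an assumption, and `apply_conf` is a hypothetical helper.

```python
AQE_CONF = {
    "spark.sql.adaptive.enabled": "true",
    # Merge small shuffle partitions at runtime.
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    # Split skewed partitions in sort-merge joins at runtime.
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def apply_conf(builder, conf):
    """Apply each setting through the builder's config(key, value) method."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# With PySpark installed, this could be used as:
#   from pyspark.sql import SparkSession
#   spark = apply_conf(SparkSession.builder, AQE_CONF).getOrCreate()
```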
  • 117. FURTHER IMPROVEMENTS
      > Easy: Move storage to Delta Lake
      > Hard: implement union-find-shuffle instead of large star - small star
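
For orientation, the sketch below shows only the classic single-machine union-find (disjoint-set) structure that union-find-shuffle takes its name from. It does not attempt the distributed shuffle step, and choosing the smaller id as root is my own convention to mirror the minimum-label output of large star - small star.

```python
def find(parent, x):
    """Return the root of x's set, compressing the path along the way."""
    parent.setdefault(x, x)
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        # Point x at the root, then continue with its former parent.
        parent[x], x = root, parent[x]
    return root


def union(parent, a, b):
    """Merge the sets containing a and b; the smaller root id wins."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        low, high = sorted((root_a, root_b))
        parent[high] = low


def components_by_union_find(edges):
    """Label every node with the smallest id in its connected component."""
    parent = {}
    for a, b in edges:
        union(parent, a, b)
    return {node: find(parent, node) for node in list(parent)}


uf_labels = components_by_union_find([(1, 2), (2, 3), (3, 4), (10, 11)])
```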
  • 119. Get the slides from my GitHub: github.com/rberenguel/ (the repository is identity-graphs)
  • 121. References
      > Connected Components in MapReduce and Beyond (ACM)
      > Connected Components in MapReduce and Beyond (slides)
      > Partition Aware Connected Component Computation in Distributed Systems
      > Building Graphs at a Large Scale: Union Find Shuffle
      > Adaptive Query Execution: Speeding up SparkSQL at runtime
      > Pregel: A System for Large-Scale Graph Processing
      > GraphX
      > GraphFrames
      > Apache Giraph
      > Neo4J
      > AWS Neptune
      > Databricks' Delta Lake: high on ACID
  • 122. Related talks
      > Massive-Scale Entity Resolution Using the Power of Apache Spark and Graph
      > Maps and Meaning: Graph-based Entity Resolution in Apache Spark & GraphX
      > Building Identity Graph at Scale for Programmatic Media Buying Using Apache Spark and Delta Lake
      > Building Identity Graphs over Heterogeneous Data
      > Optimize the Large Scale Graph Applications by using Apache Spark with 4-5x Performance Improvements
      > GraphFrames: Graph Queries In Spark SQL
      > Using GraphX/Pregel on Browsing History to Discover Purchase Intent
  • 123. Image attribution
      > Graphs: Ruben Berenguel (generative art with p5js)
      > Bulb: Alessandro Bianchi (Unsplash)
      > Bubbles: Marko Blažević (Unsplash)
      > Chair: Volodymyr Tokar (Unsplash)
      > Cookie: Dex Ezekiel (Unsplash)
      > Loupe: Agence Olloweb (Unsplash)
      > Map: Timo Wielink (Unsplash)
      > Mask: Adnan Khan (Unsplash)
      > Newspaper: Rishabh Sharma (Unsplash)
      > Party: Adi Goldstein (Unsplash)
      > Socket: Kelly Sikkema (Unsplash)
      > Spray: JESHOOTS.COM (Unsplash)
      > Tuning: gustavo Campos (Unsplash)
      > Web: Shannon Potter (Unsplash)
  • 125. EOF