The online advertising industry is built on identifying users with cookies and showing relevant ads to interested users. But there are many data providers, many places to target ads, and many people browsing online. How can we identify users across data providers? The first step is cookie mapping: a chain of server calls that passes identifiers across providers. Sadly, chains break, servers break, providers can be flaky or use caching, and you may never see the whole chain. The solution is to construct an identity graph from the data we do see: in our case, cookie ids are nodes, edges are relations between ids, and connected components of the graph are users.
In this talk I will explain how Hybrid Theory leverages Spark and GraphFrames to construct and maintain a 2,000-million-node identity graph with minimal computational cost.
9. PROGRAMMATIC ADTECH
FIND USERS SATISFYING SOME CRITERIA
> Visited pages of category ABC
> Are interested in concept XYZ
> Are likely to want to buy from our client RST
11. TO FIND THEM WE NEED THEIR BROWSE AND/OR BEHAVIOUR DATA
TO DELIVER FOR OUR CLIENTS WE NEED A WAY TO SHOW THEM ADS
13. THERE ARE TWO KINDS OF COOKIES
FIRST PARTY (SESSION, STATE…)
THIRD PARTY (EVENT TRACKING…)
24. WE GET BROWSE DATA FROM USERS ON THE WEB FROM DATA PROVIDERS (A)
(A) Event logs with cookies provided in batch by data providers
25. WE GET BROWSE DATA FROM USERS ON THE WEB FROM DATA PROVIDERS
WE GET BROWSE DATA FROM USERS BROWSING OUR CLIENT WEBSITE (B)
(B) Event logs with cookies generated from our servers, via our pixels
26. HOW DO WE CONNECT BOTH DATA SOURCES?
THE IDENTIFIERS WE GET FROM BOTH SIDES ARE UNRELATED!
46. BASIC SOLUTION
> Coalesce (merge on nulls) chains based on one id
> Is not as complete as the graph approach because…
> Requires one stable identifier
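As a rough illustration of the basic solution, the sketch below coalesces partial mapping rows that share one key, filling nulls from whichever row has a value. The column names (`our_id`, `partner_a`, `partner_b`) and the sample rows are hypothetical and are not taken from the talk.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def coalesce_on_key(rows: List[Row], key: str) -> Dict[str, Row]:
    """Merge partial rows sharing `key`, keeping the first non-null value per column."""
    merged: Dict[str, Row] = {}
    for row in rows:
        k = row.get(key)
        if k is None:
            # Rows without the stable identifier cannot be merged at all.
            continue
        target = merged.setdefault(k, {})
        for col, val in row.items():
            if target.get(col) is None:
                target[col] = val
    return merged


# Hypothetical partial mappings observed from broken chains.
rows = [
    {"our_id": "u1", "partner_a": "a9", "partner_b": None},
    {"our_id": "u1", "partner_a": None, "partner_b": "b4"},
    {"our_id": None, "partner_a": "a9", "partner_b": "b7"},  # no key: dropped
]
mapping = coalesce_on_key(rows, key="our_id")
```

In this toy data the third row links `a9` to `b7`, but it is lost because it lacks the key column. That is the kind of link the slide's "requires one stable identifier" caveat points at.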
48. Part 2: The identity graph
Rethink the problem as a graph
Connected components in big data
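To make the graph framing concrete, here is a minimal single-machine sketch of connected components using union-find. It only illustrates the idea that ids are nodes, observed relations are edges, and each component is one user. The pipeline described in the talk runs on Spark and GraphFrames, and the function name and sample ids below are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def connected_components(edges: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Group node ids into connected components with union-find."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path straight at the root.
        while parent[x] != root:
            nxt = parent[x]
            parent[x] = root
            x = nxt
        return root

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    groups: Dict[str, Set[str]] = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


# Hypothetical cookie-sync observations between three partners.
edges = [
    ("partner_1_a", "partner_2_x"),
    ("partner_2_x", "partner_3_q"),
    ("partner_1_b", "partner_3_r"),
]
users = connected_components(edges)
```

Each returned set stands for one user: here `partner_1_a`, `partner_2_x` and `partner_3_q` end up together even though no single observation contains all three ids.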
61. BASIC SPARK GRAPH FRAMEWORK: GRAPHX
IT IS MESSAGE-PROPAGATION (PREGEL API), GRAPH-PARALLEL, LOW LEVEL
GRAPHFRAMES ARE TO DATAFRAMES AS GRAPHX IS TO RDDS
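To give a feel for what "message propagation" means, the sketch below runs Pregel-style supersteps in plain Python: in each step every node sends its current label to its neighbours and keeps the smallest label it has seen. This is a toy model of the style of computation, not the GraphX or GraphFrames implementation, and all names in it are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def propagate_min_labels(edges: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Label every node with the smallest node id reachable from it."""
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    labels = {node: node for node in neighbours}
    changed = True
    while changed:
        changed = False
        # One superstep: every node sends its current label to its neighbours.
        inbox: Dict[str, List[str]] = defaultdict(list)
        for node, label in labels.items():
            for nb in neighbours[node]:
                inbox[nb].append(label)
        # Each node keeps the smallest label received, if it improves on its own.
        for node, messages in inbox.items():
            best = min(messages)
            if best < labels[node]:
                labels[node] = best
                changed = True
    return labels


labels = propagate_min_labels([("a", "b"), ("b", "c"), ("d", "e")])
```

Nodes that end with the same label belong to the same component. The number of supersteps grows with the length of the longest chain, which is one reason long chains are expensive in this style of algorithm.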
63. INPUT SHOULD BE FORMATTED AS A DATAFRAME OF EDGES
src            dst            (…)
partner_1_!    partner_2_⍺    1617963647…
partner_1_2    partner_3_⭘    1617963647…
partner_2_𝛄    partner_3_     1617963654…
⁞              ⁞              ⁞
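A minimal sketch of how edge rows in this shape could be produced: each observed sync chain is turned into `src`/`dst` pairs with the partner name as a prefix of the node id. Linking consecutive hops of a chain, the `ts` column name and the sample values are assumptions for illustration. Only the `src` and `dst` column names come from the slide (they are also the edge columns GraphFrames expects).

```python
from typing import Dict, List, Tuple, Union

EdgeRow = Dict[str, Union[str, int]]


def node_id(partner: str, cookie: str) -> str:
    """Prefix the cookie with its partner so ids from different partners never collide."""
    return f"{partner}_{cookie}"


def chain_to_edges(chain: List[Tuple[str, str]], timestamp: int) -> List[EdgeRow]:
    """Turn one observed sync chain into edge rows linking consecutive hops."""
    nodes = [node_id(partner, cookie) for partner, cookie in chain]
    return [
        {"src": src, "dst": dst, "ts": timestamp}
        for src, dst in zip(nodes, nodes[1:])
    ]


# Hypothetical chain seen across three partners at one point in time.
edges = chain_to_edges(
    [("partner_1", "x1"), ("partner_2", "y7"), ("partner_3", "z3")],
    timestamp=1617963647,
)
```

A list of dictionaries like this can be loaded into a DataFrame. A broken chain simply yields fewer edges, and the graph still connects whatever was observed.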
83. To map from Partner A to Partner B
> Given an id Partner_A_X,
> we find the connected component id for the node Partner_A_X,
> we find all the nodes of the form Partner_B_* for the component above
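The lookup above can be sketched in a few lines, assuming the connected-components step has already produced a node-to-component mapping. The dictionary, the component ids and the function name are hypothetical. At the scale described in the talk this would more plausibly be a join on the component column than an in-memory scan.

```python
from typing import Dict, List


def map_partner(node: str, component_of: Dict[str, int], target_prefix: str) -> List[str]:
    """Return all nodes of the target partner in the same component as `node`."""
    component = component_of.get(node)
    if component is None:
        return []  # the id was never seen in the graph
    return sorted(
        other
        for other, comp in component_of.items()
        if comp == component and other.startswith(target_prefix)
    )


# Hypothetical output of a connected-components run.
component_of = {
    "Partner_A_X": 1,
    "Partner_B_7": 1,
    "Partner_C_2": 1,
    "Partner_A_Y": 2,
    "Partner_B_9": 2,
}
matches = map_partner("Partner_A_X", component_of, "Partner_B_")
```

The same function maps in any direction between any pair of partners, since it depends only on the component id and the id prefix.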
87. IMPACT OF MOVING FROM AN AD HOC PROCESS TO A GRAPH PROCESS
> Partner integration: from 2 months to 1 week
> Users mapped uplift: around 20%
> Mapping "quality": competitive (within 5%) with industry leaders
88. Part 3: Speed up and improvements
Data cleanup
Cheap refresh
Machine tuning
Potential improvement
112. GO LARGE AND TUNE UP
> the process is memory hungry
> the process is shuffle hungry
BETTER TO HAVE A FEW LARGE MACHINES
AND GIVE EXECUTORS MORE MEMORY THAN YOU'D THINK
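As a hedged sketch of what "few, large machines" could look like in configuration terms, the snippet below collects standard Spark executor settings and renders them as `spark-submit` arguments. The configuration keys are standard Spark properties, but every value is a made-up placeholder rather than a setting from the talk. Real values depend on the cluster and the size of the graph.

```python
from typing import Dict, List

# Placeholder values only: pick them from your own cluster and workload.
TUNING_CONF: Dict[str, str] = {
    "spark.executor.cores": "8",
    "spark.executor.memory": "48g",
    "spark.executor.memoryOverhead": "8g",
}


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render a configuration mapping as `--conf key=value` arguments."""
    args: List[str] = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


submit_args = to_submit_args(TUNING_CONF)
```

Keeping the settings in one mapping makes it easy to record which configuration a given run used and to compare runs while tuning.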
114. IMPACT OF ADAPTIVE QUERY EXECUTION (AQE)
AQE USES RUNTIME STATISTICS TO HELP THE COST BASED OPTIMIZER (CBO) AND SPEED UP SPARK
USING SPARK 3.X WITH AQE ACTIVE GIVES A 30-40% SPEED-UP
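A minimal sketch of switching AQE on, assuming a Spark 3.x session. The three keys are standard Spark SQL settings. Whether the talk's pipeline used exactly these is not stated, and depending on the Spark 3.x version AQE may already be enabled by default. The helper `apply_conf` is hypothetical and works with any builder exposing a `config(key, value)` method, such as PySpark's `SparkSession.builder`.

```python
from typing import Any, Dict

AQE_CONF: Dict[str, str] = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def apply_conf(builder: Any, conf: Dict[str, str]) -> Any:
    """Apply each setting through builder.config(key, value) and return the builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# With PySpark installed, usage would look like:
#   from pyspark.sql import SparkSession
#   spark = apply_conf(SparkSession.builder, AQE_CONF).getOrCreate()
```

The 30-40% figure on the slide is the speaker's measurement for this workload. The effect of AQE on other jobs depends on how much they shuffle and how skewed their data is.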
119. Get the slides from my GitHub: github.com/rberenguel/
The repository is identity-graphs
121. References
Connected Components in MapReduce and Beyond (ACM)
Connected Components in MapReduce and Beyond (slides)
Partition Aware Connected Component Computation in Distributed Systems
Building Graphs at a Large Scale: Union Find Shuffle
Adaptive Query Execution: Speeding up SparkSQL at runtime
Pregel: A System for Large-Scale Graph Processing
GraphX
GraphFrames
Apache Giraph
Neo4J
AWS Neptune
Databricks' Delta Lake: high on ACID
122. Related talks
Massive-Scale Entity Resolution Using the Power of Apache Spark and Graph
Maps and Meaning: Graph-based Entity Resolution in Apache Spark & GraphX
Building Identity Graph at Scale for Programmatic Media Buying Using Apache Spark and Delta Lake
Building Identity Graphs over Heterogeneous Data
Optimize the Large Scale Graph Applications by using Apache Spark with 4-5x Performance Improvements
GraphFrames: Graph Queries In Spark SQL
Using GraphX/Pregel on Browsing History to Discover Purchase Intent
123. Reference Image attribution
Graphs Ruben Berenguel (Generative art with p5js)
Bulb Alessandro Bianchi (Unsplash)
Bubbles Marko Blažević (Unsplash)
Chair Volodymyr Tokar (Unsplash)
Cookie Dex Ezekiel (Unsplash)
Loupe Agence Olloweb (Unsplash)
Map Timo Wielink (Unsplash)
Mask Adnan Khan (Unsplash)
Newspaper Rishabh Sharma (Unsplash)
Party Adi Goldstein (Unsplash)
Socket Kelly Sikkema (Unsplash)
Spray JESHOOTS.COM (Unsplash)
Tuning Gustavo Campos (Unsplash)
Web Shannon Potter (Unsplash)