Keeping Identity Graphs In Sync With Apache Spark

WHOAMI
> Ruben Berenguel (@berenguel)
> PhD in Mathematics
> Lead Data Engineer at Hybrid Theory
> Preferred stack is Python, Go and Scala

Part 1: Set up
Part 2: The identity graph
Part 3: Speed up and improvements

Part 1: Set up
Adtech
What are cookies, really?
What is cookie mapping?
The identity problem

PROGRAMMATIC ADTECH
FIND USERS SATISFYING SOME CRITERIA
> Visited pages of category ABC
> Are interested in concept XYZ
> Are likely to want to buy from our client RST

TO FIND THEM WE NEED
THEIR BROWSE AND/OR BEHAVIOUR DATA
TO DELIVER FOR OUR CLIENTS WE NEED
A WAY TO SHOW THEM ADS

COOKIES
ARE USED TO HELP
WEBSITES
TRACK EVENTS
AND STATE
AS USERS BROWSE

THERE ARE TWO KINDS OF
COOKIES
FIRST PARTY (SESSION, STATE…)
THIRD PARTY (EVENT TRACKING…)

WE GET BROWSE DATA FROM USERS ON
THE WEB FROM DATA PROVIDERS [A]
WE GET BROWSE DATA FROM USERS
BROWSING OUR CLIENT WEBSITE [B]
[A] Event logs with cookies provided in batch by data providers
[B] Event logs with cookies generated from our servers, via our pixels

HOW DO WE CONNECT
BOTH DATA SOURCES?
THE IDENTIFIERS WE GET
FROM BOTH SIDES ARE
UNRELATED!

MAPPING SERVERS
AND
THE MAPPING CHAIN

BASIC SOLUTION
> Coalesce (merge on nulls) chains based on one id
> Is not as complete as the graph approach because…
> Requires one stable identifier
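
The coalescing baseline above can be sketched in plain Python. This is only an illustration under assumptions: each mapping record is taken to be a dict of partner name to id, and `stable_key` is a hypothetical name for the one identifier the chains are merged on.

```python
from collections import defaultdict


def coalesce_chains(records, stable_key):
    """Merge partial mapping rows that share the same stable id, filling nulls."""
    merged = defaultdict(dict)
    for row in records:
        anchor = row.get(stable_key)
        if anchor is None:
            # A row without the stable id cannot be attached to any chain.
            continue
        chain = merged[anchor]
        chain[stable_key] = anchor
        for partner, value in row.items():
            # Keep the first non-null value seen for each partner column.
            if value is not None and chain.get(partner) is None:
                chain[partner] = value
    return dict(merged)


# Hypothetical sample rows; "ours" plays the role of the stable identifier.
rows = [
    {"ours": "u1", "partner_1": "a", "partner_2": None},
    {"ours": "u1", "partner_1": None, "partner_2": "x"},
    {"ours": None, "partner_1": "a", "partner_2": "y"},
]
chains = coalesce_chains(rows, "ours")
```

The third row is dropped even though it shares `partner_1 = "a"` with the first one, which is the kind of link a graph approach would keep.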

Part 2: The identity graph
Rethink the problem as a graph
Connected components in big data

BASIC SPARK GRAPH
FRAMEWORK: GRAPHX
IT IS MESSAGE-PROPAGATION (PREGEL
API), GRAPH-PARALLEL, LOW LEVEL
GRAPHFRAMES ARE TO DATAFRAMES
AS GRAPHX IS TO RDDS

ALTERNATIVES CONSIDERED…
Apache Giraph: harder maintenance
Neo4J: harder scalability
AWS Neptune: too new

INPUT SHOULD BE FORMATTED AS A DATAFRAME OF EDGES
src            dst            (…)
partner_1_!    partner_2_⍺    1617963647…
partner_1_2    partner_3_⭘    1617963647…
partner_2_𝛄    partner_3_     1617963654…
⁞              ⁞              ⁞
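
A minimal sketch of how rows in that shape could be produced from raw mapping events. The event field names (`partner_a`, `id_a`, `partner_b`, `id_b`, `ts`) are hypothetical; the only thing taken from the slide is the `src` / `dst` layout with partner-prefixed node ids. In Spark the same rows would live in a DataFrame, and GraphFrames expects the edge columns to be named `src` and `dst`.

```python
def to_edges(events):
    """Turn mapping events into (src, dst, ts) rows with partner-prefixed node ids."""
    edges = []
    for ev in events:
        # Prefixing with the partner name keeps ids from different partners apart.
        src = f"{ev['partner_a']}_{ev['id_a']}"
        dst = f"{ev['partner_b']}_{ev['id_b']}"
        edges.append((src, dst, ev["ts"]))
    return edges


# Hypothetical mapping events, one per call to a mapping server.
events = [
    {"partner_a": "partner_1", "id_a": "a", "partner_b": "partner_2", "id_b": "x", "ts": 1617963647},
    {"partner_a": "partner_2", "id_a": "x", "partner_b": "partner_3", "id_b": "k", "ts": 1617963654},
]
edge_rows = to_edges(events)
```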

CONNECTED
COMPONENTS IN BIG
DATA
THE LARGE STAR - SMALL
STAR ALGORITHM
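
To make the two operations concrete, here is a single-machine sketch of alternating large-star and small-star steps, following the description in "Connected Components in MapReduce and Beyond". It is an illustration only, not the distributed Spark job from the talk; the function names are made up for this example and node ids are assumed to be comparable (strings or integers).

```python
from collections import defaultdict


def _adjacency(edges):
    """Build an undirected adjacency map, ignoring self-loops."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def large_star(edges):
    """Connect every strictly larger neighbour of u to the minimum of u's neighbourhood."""
    out = set()
    for u, nbrs in _adjacency(edges).items():
        m = min(nbrs | {u})
        out.update((v, m) for v in nbrs if v > u)
    return out


def small_star(edges):
    """Connect u and its smaller neighbours to the smallest of them."""
    out = set()
    for u, nbrs in _adjacency(edges).items():
        smaller = {v for v in nbrs if v < u}
        m = min(smaller | {u})
        out.update((v, m) for v in smaller | {u} if v != m)
    return out


def connected_components(edges):
    """Alternate both steps until the edge set stops changing (a forest of stars)."""
    current = {(max(u, v), min(u, v)) for u, v in edges if u != v}
    while True:
        updated = small_star(large_star(current))
        if updated == current:
            break
        current = updated
    labels = {}
    for node, centre in current:
        labels[node] = centre    # every leaf points at the centre of its star
        labels[centre] = centre  # the centre is the smallest id of the component
    return labels


sample = [("p1_a", "p2_x"), ("p2_x", "p3_k"), ("p1_b", "p3_m")]
labels = connected_components(sample)
```

Here the component id is simply the smallest node id in the component. Nodes that never appear in an edge get no label in this sketch.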

OUTPUT LAYOUT
Component Id    Partner / Cookie Id    Timestamp
10234           partner_1_!            1617963647
10234           partner_2_⍺            1617963647
5534            partner_1_2            1617963654
⁞               ⁞                      ⁞

To map from Partner A to Partner B
> Given an id Partner_A_X,
> we find the connected component id for the node Partner_A_X,
> we find all the nodes of the form Partner_B_* for the component above
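
The lookup steps above can be sketched as follows, assuming the output table has been loaded into a dict of node id to component id. `map_id` and the sample ids are hypothetical; at scale this would be a join on the component id rather than a scan.

```python
def map_id(components, node, target_partner):
    """Return all ids of target_partner that share a component with node."""
    component = components.get(node)
    if component is None:
        return []  # the node is not in the graph, so nothing can be mapped
    prefix = target_partner + "_"
    return sorted(
        other
        for other, other_component in components.items()
        if other_component == component and other.startswith(prefix)
    )


# Hypothetical slice of the output layout: node id -> component id.
components = {
    "partner_1_a": 10234,
    "partner_2_x": 10234,
    "partner_2_y": 10234,
    "partner_1_b": 5534,
}
mapped = map_id(components, "partner_1_a", "partner_2")
```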

IMPACT OF MOVING FROM AN AD HOC
PROCESS TO A GRAPH PROCESS
> Partner integration: from 2 months to 1 week
> Users mapped uplift: around 20%
> Mapping "quality": competitive (within 5%) with industry leaders

Part 3: Speed up and improvements
Data cleanup
Cheap refresh
Machine tuning
Potential improvement

INVALID IDENTIFIERS
LIKE NA OR 0 OR XYZ
(OR FRAUDULENT CALLS TO A MAPPING SERVER)
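
A small sketch of the identifier cleanup. The blocklist contains the three examples from the slide plus the empty string, which is an assumption; a real list would be built from the data. Detecting fraudulent mapping calls is not covered here.

```python
# Lower-cased placeholder values that should never become graph nodes.
INVALID_IDS = {"", "na", "0", "xyz"}


def is_valid_id(raw_id):
    """True when the id is present and is not a known placeholder value."""
    return raw_id is not None and raw_id.strip().lower() not in INVALID_IDS


def drop_invalid(pairs):
    """Keep only (id_a, id_b) pairs where both sides are valid ids."""
    return [(a, b) for a, b in pairs if is_valid_id(a) and is_valid_id(b)]


raw_pairs = [("abc", "def"), ("NA", "def"), ("abc", "0"), (None, "x1"), (" xyz ", "q")]
clean_pairs = drop_invalid(raw_pairs)
```

Without this step a placeholder such as `NA` would act as a hub joining unrelated users into one component.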

NODE PRUNING
TO PREVENT HUGE COMPONENTS
IN THE COOKIE CASE, BY EXPIRING COOKIES NOT SEEN IN N DAYS
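
One way to express the expiry rule, as a sketch: an edge survives only if both of its nodes were seen within the last N days. Timestamps are assumed to be Unix seconds, as in the edge table; `prune_expired` and its parameters are illustrative names.

```python
SECONDS_PER_DAY = 86_400


def prune_expired(edges, now_ts, max_age_days):
    """Drop edges touching a node whose latest sighting is older than max_age_days."""
    cutoff = now_ts - max_age_days * SECONDS_PER_DAY
    last_seen = {}
    for src, dst, ts in edges:
        for node in (src, dst):
            last_seen[node] = max(ts, last_seen.get(node, ts))
    return [
        (src, dst, ts)
        for src, dst, ts in edges
        if last_seen[src] >= cutoff and last_seen[dst] >= cutoff
    ]


# Hypothetical edges: only "a" and "b" were seen within the last day.
edges = [("a", "b", 990_000), ("b", "c", 100_000), ("c", "d", 200_000)]
fresh = prune_expired(edges, now_ts=1_000_000, max_age_days=1)
```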

COMPONENT DESTRUCTION
TO LIMIT COMPONENT SIZE
ARTIFICIALLY
IF THE DATA IS FULLY CLEAN WE CAN ASSUME NO USER HAS MORE THAN M IDENTIFIERS
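
A sketch of the size cap, assuming the component assignment is available as a dict of node id to component id. Dropping the whole oversized component is one possible policy; `max_size` stands in for the M on the slide.

```python
from collections import Counter


def drop_oversized(components, max_size):
    """Remove every node that belongs to a component with more than max_size nodes."""
    sizes = Counter(components.values())
    return {
        node: component
        for node, component in components.items()
        if sizes[component] <= max_size
    }


# Hypothetical assignment: component 1 has three nodes, component 2 has one.
assignment = {"a": 1, "b": 1, "c": 1, "d": 2}
kept = drop_oversized(assignment, max_size=2)
```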

WHAT IS THE FASTEST WAY TO BUILD A
2-BILLION-NODE GRAPH DAILY?
NOT DOING IT

MACHINE
TUNING
FOR LARGE
GRAPHS

GO LARGE AND TUNE UP
> the process is memory hungry
> the process is shuffle hungry

BETTER TO HAVE FEW, LARGE
MACHINES
AND GIVE EXECUTORS MORE
MEMORY THAN YOU'D THINK

IMPACT OF ADAPTIVE
QUERY EXECUTION
(AQE)
AQE USES RUNTIME STATISTICS TO HELP
THE COST BASED OPTIMIZER (CBO) AND
SPEED UP SPARK
USING SPARK 3.X WITH AQE ACTIVE
GIVES A 30-40% SPEED-UP
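
For reference, a sketch of the settings involved, written as a plain Python dict that is rendered into `spark-submit` flags. The configuration keys are standard Spark ones; the memory values are placeholders and not the figures used in the talk. AQE is controlled by `spark.sql.adaptive.enabled`, which is on by default from Spark 3.2.0.

```python
SPARK_CONF = {
    # Adaptive Query Execution and two of its sub-features.
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    # Placeholder sizes: tune to the machines actually used.
    "spark.executor.memory": "48g",
    "spark.executor.memoryOverhead": "8g",
}


def as_submit_flags(conf):
    """Render a config dict as a flat list of spark-submit --conf arguments."""
    flags = []
    for key, value in sorted(conf.items()):
        flags += ["--conf", f"{key}={value}"]
    return flags


flags = as_submit_flags(SPARK_CONF)
```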

FURTHER IMPROVEMENTS
> Easy: Move storage to Delta Lake
> Hard: implement union-find-shuffle instead of large star - small star
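
As background for the "hard" item, this is the plain, single-machine union-find (disjoint set) structure. It is not the distributed union-find-shuffle algorithm itself, only the building block its name refers to; all function names are illustrative.

```python
def find(parent, x):
    """Return the root of x, compressing the path on the way."""
    parent.setdefault(x, x)
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        # Point x straight at the root, then move on to its old parent.
        parent[x], x = root, parent[x]
    return root


def union(parent, a, b):
    """Merge the sets of a and b, keeping the smaller root as representative."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        low, high = sorted((root_a, root_b))
        parent[high] = low


def components_from_edges(edges):
    """Label every node with the smallest node id of its component."""
    parent = {}
    for src, dst in edges:
        union(parent, src, dst)
    return {node: find(parent, node) for node in list(parent)}


uf_labels = components_from_edges([("p1_a", "p2_x"), ("p2_x", "p3_k"), ("p1_b", "p3_m")])
```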

Get the slides from my GitHub:
github.com/rberenguel/
The repository is identity-graphs

References
Connected Components in MapReduce and Beyond (ACM)
Connected Components in MapReduce and Beyond (slides)
Partition Aware Connected Component Computation in Distributed Systems
Building Graphs at a Large Scale: Union Find Shuffle
Adaptive Query Execution: Speeding up SparkSQL at runtime
Pregel: A System for Large-Scale Graph Processing
GraphX
GraphFrames
Apache Giraph
Neo4J
AWS Neptune
Databricks' Delta Lake: high on ACID

Related talks
Massive-Scale Entity Resolution Using the Power of Apache Spark and Graph
Maps and Meaning: Graph-based Entity Resolution in Apache Spark & GraphX
Building Identity Graph at Scale for Programmatic Media Buying Using Apache Spark and Delta Lake
Building Identity Graphs over Heterogeneous Data
Optimize the Large Scale Graph Applications by using Apache Spark with 4-5x Performance Improvements
GraphFrames: Graph Queries In Spark SQL
Using GraphX/Pregel on Browsing History to Discover Purchase Intent

Reference Image attribution
Graphs Ruben Berenguel (Generative art with p5js)
Bulb Alessandro Bianchi (Unsplash)
Bubbles Marko Blažević (Unsplash)
Chair Volodymyr Tokar (Unsplash)
Cookie Dex Ezekiel (Unsplash)
Loupe Agence Olloweb (Unsplash)
Map Timo Wielink (Unsplash)
Mask Adnan Khan (Unsplash)
Newspaper Rishabh Sharma (Unsplash)
Party Adi Goldstein (Unsplash)
Socket Kelly Sikkema (Unsplash)
Spray JESHOOTS.COM (Unsplash)
Tuning Gustavo Campos (Unsplash)
Web Shannon Potter (Unsplash)
