This presentation deals with Big Data usage in the advertising industry, specifically in the ad exchange business. It contains an overview of Big Data tools, a corresponding architectural approach, and a Java application for MapReduce.
This presentation was delivered by Oleksandr Fedirko (Lead Software Engineer, Consultant, GlobalLogic) and Danylo Stepanchuk (Lead Software Engineer, Consultant, GlobalLogic) at GlobalLogic Kyiv Java Career Day on August 11, 2018.
Learn more: https://www.globallogic.com/ua/events/globallogic-kyiv-java-career-day
Confidential
Agenda
- Intro into Ad Exchange business area
- Big Data tools overview
- Architectural approach
- JVM-based processing in Big Data analytics
Ad Evolution
A timeline from the 1990s to now, moving from Direct Sold / Guaranteed / Reserved, through Indirect / Programmatic / Unreserved, to Programmatic Premium:
- Reservation Buying: ads sold via direct transactions between advertisers/agencies and publishers
- Ad Networks: ad networks aggregate inventory and sell it to advertisers; they helped publishers by selling inventory the publishers could not sell themselves
- Ad Exchanges & SSPs: real-time marketplaces with large pools of liquid inventory not sold in direct buys; SSPs give publishers more controls to optimize yield
- DSPs: bidding technology designed to help advertisers/agencies target and optimize their buys across multiple ad exchanges/publisher inventory pools
- Private Exchanges & Automated Guaranteed: exclusive advertiser-to-publisher inventory relationships for programmatic purchasing in brand-safe environments
What is Big Data?
We’ve all heard the term “big data,” but you may not know exactly what it means. Most experts agree the term describes information that shares three attributes, commonly known as the three Vs: volume, velocity, and variety.
Typical Big Data pipeline
- Data Sources: structured and unstructured
- Data Ingestion: batch layer and stream layer
- Storage
- Processing Layer: data mining, machine learning
- BI / Data Warehouse
- Visualization and Reporting Tools
Cross-cutting concerns: governance and privacy, security, quality management — all at high scale and low cost.
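The stage ordering above can be sketched in plain Java. This is a conceptual toy, not any production framework: the `MiniPipeline` class name, the `user,eventType` record layout, and the sample events are all illustrative assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Conceptual batch pipeline: ingest -> process -> report (illustrative only). */
public class MiniPipeline {

    /** Ingestion: parse raw "user,eventType" lines into records. */
    static List<String[]> ingest(List<String> rawLines) {
        return rawLines.stream()
                .map(line -> line.split(","))
                .collect(Collectors.toList());
    }

    /** Processing layer: aggregate event counts per event type. */
    static Map<String, Long> process(List<String[]> records) {
        return records.stream()
                .collect(Collectors.groupingBy(r -> r[1], Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> raw = List.of("u1,click", "u2,view", "u3,click");
        // Reporting: the aggregated result would feed a BI / visualization tool
        Map<String, Long> report = process(ingest(raw));
        System.out.println(report); // report.get("click") == 2
    }
}
```

In a real deployment each arrow would cross a distributed boundary (e.g. message queue to HDFS to a processing cluster), but the data flow is the same.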
MapReduce vs Spark: Which one to pick?
MapReduce
● Good old, slow, and reliable
● Written in Java
● Natively supports Java, though all JVM-compatible languages are adaptable
● Easy to learn and tune
● Batch processing only
● Hard to implement complex pipelines
● Unit testing
Spark
● “Brand-new”, fast, and flexible
● Written in Scala
● Natively supports Scala and Java (plus R and Python)
● Provides a rich set of functionality
● Batch and micro-batch processing
● Support for complex pipelines is its strong suit
● Unit testing
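Whichever engine you pick, the underlying model is the same map → shuffle → reduce flow. Here is a minimal word-count sketch of that flow in plain Java streams — deliberately free of Hadoop or Spark dependencies, so it only illustrates the model, not either framework's API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Plain-Java simulation of the map -> shuffle -> reduce flow (no framework). */
public class WordCount {

    static Map<String, Long> mapReduce(List<String> lines) {
        return lines.stream()
                // Map phase: emit one key (word) per token
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                // Shuffle + reduce: group identical keys and sum their counts
                .collect(Collectors.groupingBy(word -> word, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = mapReduce(List.of("Big data", "big deal"));
        System.out.println(counts); // counts.get("big") == 2
    }
}
```

In Hadoop MapReduce the map and reduce steps become `Mapper` and `Reducer` classes and the shuffle happens across the cluster; in Spark the same logic is a `flatMap` followed by `reduceByKey` on an RDD or Dataset.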
Big Data analytics: What’s the challenge?
Daily:
● 65B raw ad and bid events
● over 100 TB of serialized and compressed raw input data
● around 150K analytic queries over 110 dimensions in an analytic data store
● 4 s 98th-percentile query latency and 1 s average query latency
Let’s solve a problem: Keywords
Seller:
“I want the ability to get performance reports beyond the standard account, site, zone, size, geography, etc.”
Ad Exchange Company:
“I want to satisfy the high demand for this functionality (let’s name it Keywords), but I also want to reduce processing and retention costs by servicing only sellers with a limited number of distinct keywords.”
Engineering:
“There are two steps to solving the Keywords problem: first, we need to identify sellers that comply with a threshold; second, we need to prepare reports only for them.”
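The two engineering steps can be sketched as a pair of aggregations. This is a minimal in-memory illustration, not the production job: the `Event` record, the `KeywordsJob` class name, and the threshold semantics (at most N distinct keywords per seller) are assumptions for the sketch.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

/** Keywords job sketch: (1) find sellers whose distinct-keyword count stays
 *  within a threshold, (2) build keyword reports only for those sellers. */
public class KeywordsJob {

    /** Illustrative event schema: one ad event tagged with a seller keyword. */
    record Event(String seller, String keyword) {}

    /** Step 1: sellers with at most maxKeywords distinct keywords. */
    static Set<String> eligibleSellers(List<Event> events, int maxKeywords) {
        return events.stream()
                .collect(Collectors.groupingBy(Event::seller,
                        Collectors.mapping(Event::keyword, Collectors.toSet())))
                .entrySet().stream()
                .filter(e -> e.getValue().size() <= maxKeywords)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }

    /** Step 2: per-seller, per-keyword event counts, eligible sellers only. */
    static Map<String, Map<String, Long>> report(List<Event> events, int maxKeywords) {
        Set<String> eligible = eligibleSellers(events, maxKeywords);
        return events.stream()
                .filter(e -> eligible.contains(e.seller()))
                .collect(Collectors.groupingBy(Event::seller,
                        Collectors.groupingBy(Event::keyword, Collectors.counting())));
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
                new Event("sellerA", "cars"), new Event("sellerA", "travel"),
                new Event("sellerB", "cars"));
        // With a threshold of 1, only sellerB qualifies for Keywords reports
        System.out.println(report(events, 1));
    }
}
```

At scale each step maps naturally onto a distributed job: step 1 is a distinct-count per seller key, step 2 a filtered aggregation, which is why the problem splits cleanly into two passes.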
Is this all about writing clean code?
Nope!
Any of these may be a bottleneck:
- CPU
- RAM
- Network bandwidth
- Storage I/O
And these may help to overcome the bottleneck:
- Compression algorithms
- MapReduce & Spark job tuning
- Storage formats
- Data access patterns
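As a quick illustration of the first remedy: repetitive event logs compress extremely well, trading some CPU for large savings in network bandwidth and storage I/O. The sketch below uses only stock `java.util.zip`; the payload format is invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

/** Rough demo: gzip shrinks repetitive event data, trading CPU for I/O. */
public class CompressionDemo {

    static byte[] gzip(byte[] input) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // in-memory streams shouldn't fail
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        // Highly repetitive payload, loosely resembling serialized ad/bid events
        byte[] raw = "impression,site=42,zone=7;".repeat(10_000)
                .getBytes(StandardCharsets.UTF_8);
        byte[] packed = gzip(raw);
        // Compressed size is a small fraction of the raw size for data like this
        System.out.printf("raw=%d bytes, gzip=%d bytes%n", raw.length, packed.length);
    }
}
```

Columnar storage formats such as Parquet or ORC push the same idea further by compressing per column, which also improves data access patterns for analytic queries that touch only a few of the 110 dimensions.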