This presentation deals with Big Data usage in the advertising industry, specifically in the ad exchange business. It contains an overview of Big Data tools, a corresponding architectural approach, and a Java application for MapReduce.
This presentation was delivered by Oleksandr Fedirko (Lead Software Engineer, Consultant, GlobalLogic) and Danylo Stepanchuk (Lead Software Engineer, Consultant, GlobalLogic) at GlobalLogic Kyiv Java Career Day on August 11, 2018.
Learn more: https://www.globallogic.com/ua/events/globallogic-kyiv-java-career-day
Confidential
Agenda
- Intro into Ad Exchange business area
- Big Data tools overview
- Architectural approach
- JVM-based processing in Big Data analytics
Ad Evolution
A timeline from the 1990s to now, moving from Direct Sold / Guaranteed / Reserved, through Indirect / Programmatic / Unreserved, to Programmatic Premium:
- Reservation Buying: ads sold via direct transactions between advertisers/agencies and publishers
- Ad Networks: ad networks aggregate inventory and sell it to advertisers; they helped publishers by selling inventory the publishers could not sell themselves
- Ad Exchanges & SSPs: real-time marketplaces with large pools of liquid inventory not sold in direct buys; SSPs give publishers more controls to optimize yield
- DSPs: bidding technology designed to help advertisers/agencies target and optimize their buys across multiple ad exchanges/publisher inventory pools
- Private Exchanges & Automated Guaranteed: exclusive advertiser-to-publisher inventory relationships for programmatic purchasing in brand-safe environments
What is Big Data?
We’ve all heard the term “big data,” but you may not know exactly what it means. Most experts agree the term describes information that shares three attributes, commonly known as the three Vs: volume, velocity, and variety.
Typical Big Data pipeline
- Data Sources: structured and unstructured
- Data Ingestion: batch layer and stream layer
- Storage
- Processing Layer: data mining, machine learning
- BI / Data Warehouse
- Visualization and Reporting Tools
Cross-cutting concerns: governance and privacy, security, quality management — all at high scale and low cost.
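The stage ordering above can be sketched in plain Java. This is a conceptual toy, not any production framework: the `MiniPipeline` class name, the `user,eventType` record layout, and the sample events are all illustrative assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Conceptual batch pipeline: ingest -> process -> report (illustrative only). */
public class MiniPipeline {

    /** Ingestion: parse raw "user,eventType" lines into records. */
    static List<String[]> ingest(List<String> rawLines) {
        return rawLines.stream()
                .map(line -> line.split(","))
                .collect(Collectors.toList());
    }

    /** Processing layer: aggregate event counts per event type. */
    static Map<String, Long> process(List<String[]> records) {
        return records.stream()
                .collect(Collectors.groupingBy(r -> r[1], Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> raw = List.of("u1,click", "u2,view", "u3,click");
        // Reporting: the aggregated result would feed a BI / visualization tool
        Map<String, Long> report = process(ingest(raw));
        System.out.println(report); // report.get("click") == 2
    }
}
```

In a real deployment each arrow would cross a distributed boundary (e.g. message queue to HDFS to a processing cluster), but the data flow is the same.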
MapReduce vs Spark: Which one to pick?
MapReduce
● Good old, slow, and reliable
● Written in Java
● Natively supports Java, though all JVM-compatible languages are adaptable
● Easy to learn and tune
● Batch processing only
● Hard to implement complex pipelines
● Unit testing
Spark
● “Brand-new”, fast, and flexible
● Written in Scala
● Natively supports Scala and Java (plus R and Python)
● Provides a rich set of functionality
● Batch and micro-batch processing
● Support for complex pipelines is its strong suit
● Unit testing
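Whichever engine you pick, the underlying model is the same map → shuffle → reduce flow. Here is a minimal word-count sketch of that flow in plain Java streams — deliberately free of Hadoop or Spark dependencies, so it only illustrates the model, not either framework's API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Plain-Java simulation of the map -> shuffle -> reduce flow (no framework). */
public class WordCount {

    static Map<String, Long> mapReduce(List<String> lines) {
        return lines.stream()
                // Map phase: emit one key (word) per token
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                // Shuffle + reduce: group identical keys and sum their counts
                .collect(Collectors.groupingBy(word -> word, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = mapReduce(List.of("Big data", "big deal"));
        System.out.println(counts); // counts.get("big") == 2
    }
}
```

In Hadoop MapReduce the map and reduce steps become `Mapper` and `Reducer` classes and the shuffle happens across the cluster; in Spark the same logic is a `flatMap` followed by `reduceByKey` on an RDD or Dataset.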
Big Data analytics: What’s the challenge?
Daily:
● 65B raw ad and bid events
● over 100 TB of serialized and compressed raw input data
● around 150K analytic queries over 110 dimensions in an analytic data store
● 4 s 98th-percentile query latency and 1 s average query latency
Let’s solve a problem: Keywords
Seller:
“I want the ability to get performance reports beyond the standard account, site, zone, size, geography, etc.”
Ad Exchange Company:
“I want to satisfy the high demand for this functionality (let’s name it Keywords), but I also want to reduce processing and retention costs by servicing only sellers with a limited number of distinct keywords.”
Engineering:
“There are two steps to solving the Keywords problem: first, we need to identify sellers that comply with a threshold; second, we need to prepare reports only for them.”
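The two engineering steps can be sketched as a pair of aggregations. This is a minimal in-memory illustration, not the production job: the `Event` record, the `KeywordsJob` class name, and the threshold semantics (at most N distinct keywords per seller) are assumptions for the sketch.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

/** Keywords job sketch: (1) find sellers whose distinct-keyword count stays
 *  within a threshold, (2) build keyword reports only for those sellers. */
public class KeywordsJob {

    /** Illustrative event schema: one ad event tagged with a seller keyword. */
    record Event(String seller, String keyword) {}

    /** Step 1: sellers with at most maxKeywords distinct keywords. */
    static Set<String> eligibleSellers(List<Event> events, int maxKeywords) {
        return events.stream()
                .collect(Collectors.groupingBy(Event::seller,
                        Collectors.mapping(Event::keyword, Collectors.toSet())))
                .entrySet().stream()
                .filter(e -> e.getValue().size() <= maxKeywords)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }

    /** Step 2: per-seller, per-keyword event counts, eligible sellers only. */
    static Map<String, Map<String, Long>> report(List<Event> events, int maxKeywords) {
        Set<String> eligible = eligibleSellers(events, maxKeywords);
        return events.stream()
                .filter(e -> eligible.contains(e.seller()))
                .collect(Collectors.groupingBy(Event::seller,
                        Collectors.groupingBy(Event::keyword, Collectors.counting())));
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
                new Event("sellerA", "cars"), new Event("sellerA", "travel"),
                new Event("sellerB", "cars"));
        // With a threshold of 1, only sellerB qualifies for Keywords reports
        System.out.println(report(events, 1));
    }
}
```

At scale each step maps naturally onto a distributed job: step 1 is a distinct-count per seller key, step 2 a filtered aggregation, which is why the problem splits cleanly into two passes.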
Is this all about writing clean code?
Nope!
Any of these may be a bottleneck:
- CPU
- RAM
- Network bandwidth
- Storage I/O
And these may help to overcome the bottleneck:
- Compression algorithms
- MapReduce & Spark job tuning
- Storage formats
- Data access patterns
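As a quick illustration of the first remedy: repetitive event logs compress extremely well, trading some CPU for large savings in network bandwidth and storage I/O. The sketch below uses only stock `java.util.zip`; the payload format is invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

/** Rough demo: gzip shrinks repetitive event data, trading CPU for I/O. */
public class CompressionDemo {

    static byte[] gzip(byte[] input) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // in-memory streams shouldn't fail
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        // Highly repetitive payload, loosely resembling serialized ad/bid events
        byte[] raw = "impression,site=42,zone=7;".repeat(10_000)
                .getBytes(StandardCharsets.UTF_8);
        byte[] packed = gzip(raw);
        // Compressed size is a small fraction of the raw size for data like this
        System.out.printf("raw=%d bytes, gzip=%d bytes%n", raw.length, packed.length);
    }
}
```

Columnar storage formats such as Parquet or ORC push the same idea further by compressing per column, which also improves data access patterns for analytic queries that touch only a few of the 110 dimensions.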