AUDIENCE COUNTING @ SCALE
Boris Trofimoff
Sigma Software & Collective
@b0ris_1
  1. AUDIENCE COUNTING @ SCALE
     Boris Trofimoff, Sigma Software & Collective, @b0ris_1
  2. AGENDA
     1. About our customer
     2. Motivation, or why and what we are counting
     3. Counting fundamentals
     4. Counting with Spark
     5. Spark road notes
  3. - A place where advertisers and end users meet each other
     - Collective Media is a full-stack cookie-serving company
     MODELING AT SCALE
     - USERS: ~1B active user profiles
     - MODELS: 1000s of models built weekly
     - PREDICTIONS: 100s of billions of predictions daily
     - VOLUME: petabytes of data used
     - VARIETY: profiles, formats, screens
     - VELOCITY: 100k+ requests per second, 20 billion events per day
     - VERACITY: robust measurements
  4. MOTIVATION
  5. HOW AUDIENCE IS CREATED
  6. IMPRESSION LOG
     AD             SITE          COOKIE           IMPRESSIONS  CLICKS  SEGMENTS
     bmw_X5         forbes.com    13e835610ff0d95  10           1       [a.m, b.rk, c.rh, d.sn, ...]
     mercedes_2015  forbes.com    13e8360c8e1233d  5            0       [a.f, b.rk, c.hs, d.mr, ...]
     nokia          gizmodo.com   13e3c97d526839c  8            0       [a.m, b.tk, c.hs, d.sn, ...]
     apple_music    reddit.com    1357a253f00c0ac  3            1       [a.m, b.rk, d.sn, e.gh, ...]
     nokia          cnn.com       13b23555294aced  2            1       [a.f, b.tk, c.rh, d.sn, ...]
     apple_music    facebook.com  13e8333d16d723d  9            1       [a.m, d.sn, g.gh, s.hr, ...]
  7. SEGMENT EXAMPLES
     SEGMENT  DESCRIPTION
     a.m      Male
     a.f      Female
     b.tk     $75k-$100k annual income
     b.rk     $100k-$150k annual income
     c.hs     High School
     c.rh     College
     d.sn     Single
     d.mr     Married
  8. BUILDING AUDIENCE PROFILE
  9. WHAT WE CAN DO WITH DATA
     - What is the male/female ratio among people who have seen the bmw_X5 ad on forbes.com?
     - Income distribution for people who have seen the Apple Music ad
     - Nokia click distribution across different education levels
  10. COUNTING FUNDAMENTALS
  11. SQL?
      SELECT count(distinct cookie_id)
      FROM impressions
      WHERE site = 'forbes.com' AND ad = 'bmw_X5' AND segment contains 'a.m'
      - Infinite combinations
      - Big Data => Big Latency for Hive, Impala and Druid
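For comparison with the query on slide 11, here is a minimal sketch of exact distinct counting over in-memory Scala collections. The `Impression` case class and the `exactReach` helper are hypothetical names introduced for illustration. The rows are the six sample rows from the impression log slide, with the elided segments left out. The intermediate set of cookies grows with the audience, which is the cost that cardinality estimators avoid.

```scala
// Hypothetical row type mirroring the impression log columns.
final case class Impression(
  ad: String, site: String, cookie: String,
  impressions: Int, clicks: Int, segments: Set[String])

object ExactCount {
  // Sample rows from the impression log slide (elided segments omitted).
  val log: Seq[Impression] = Seq(
    Impression("bmw_X5", "forbes.com", "13e835610ff0d95", 10, 1, Set("a.m", "b.rk", "c.rh", "d.sn")),
    Impression("mercedes_2015", "forbes.com", "13e8360c8e1233d", 5, 0, Set("a.f", "b.rk", "c.hs", "d.mr")),
    Impression("nokia", "gizmodo.com", "13e3c97d526839c", 8, 0, Set("a.m", "b.tk", "c.hs", "d.sn")),
    Impression("apple_music", "reddit.com", "1357a253f00c0ac", 3, 1, Set("a.m", "b.rk", "d.sn", "e.gh")),
    Impression("nokia", "cnn.com", "13b23555294aced", 2, 1, Set("a.f", "b.tk", "c.rh", "d.sn")),
    Impression("apple_music", "facebook.com", "13e8333d16d723d", 9, 1, Set("a.m", "d.sn", "g.gh", "s.hr"))
  )

  // Exact COUNT(DISTINCT cookie): the intermediate Set holds every matching cookie,
  // so memory is proportional to the audience size.
  def exactReach(ad: String, site: String, segment: String): Int =
    log.iterator
      .filter(r => r.ad == ad && r.site == site && r.segments.contains(segment))
      .map(_.cookie)
      .toSet
      .size
}
```

On the six sample rows, `ExactCount.exactReach("bmw_X5", "forbes.com", "a.m")` returns 1.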
  12. CARDINALITY ESTIMATION ALGORITHMS
      - ACCURACY: For a fixed amount of memory, the algorithm should provide as accurate an estimate as possible. Especially for small cardinalities, the results should be near exact.
      - MEMORY EFFICIENCY: The algorithm should use the available memory efficiently and adapt its memory usage to the cardinality.
      - ESTIMATE LARGE CARDINALITIES: Multisets with cardinalities well beyond 1 billion occur on a daily basis, and it is important that such large cardinalities can be estimated with reasonable accuracy.
      - PRACTICALITY: The algorithm should be implementable and maintainable.
  13. HYPERLOGLOG AND OTHERS
  14. AUDIENCE CARDINALITY APPROXIMATION WITH HLL
      - Create an audience of people addressed by unique identifiers (cookies)
      - Create an audience "hash sum" file with a fixed size regardless of audience size
      - Cardinalities of ~10^9 with a typical accuracy of 2% using 1.5 KB of memory
      [Diagram: create audience, create hash, 1.5 KB result]
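The 2% figure on slide 14 is consistent with the standard error of HyperLogLog, which depends only on the number of registers m. The slide does not state its parameters, so m = 2^11 registers of 6 bits each is an assumption chosen here because it reproduces the 1.5 KB size:

```latex
\sigma \approx \frac{1.04}{\sqrt{m}}, \qquad
m = 2^{11} = 2048 \;\Rightarrow\; \sigma \approx \frac{1.04}{45.25} \approx 0.023 \;(\text{about } 2\%), \qquad
\text{memory} = 2048 \times 6\ \text{bits} = 1536\ \text{bytes} = 1.5\ \text{KB}
```

The error does not depend on the audience size, which is why the sketch can stay at a fixed size.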
  15. HYPERLOGLOG OPERATIONS
      trait HyperLogLog {
        def add(cookieId: String): Unit
        // |A|
        def cardinality(): Long
        // |A ∪ B|
        def merge(other: HyperLogLog): HyperLogLog
        // |A ∩ B| = |A| + |B| - |A ∪ B|
        def intersect(other: HyperLogLog): Long
      }
      [Diagram: merging (∪) two 1.5 KB sketches yields another 1.5 KB sketch; intersecting (∩) two 1.5 KB sketches yields a cardinality.]
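The four operations on slide 15 can be sketched in plain Scala as follows. This is a minimal illustration, not the implementation used at Collective: the class name `SimpleHLL`, the precision parameter `p`, the one-byte-per-register layout, and the way the 64-bit hash is built from two seeded `MurmurHash3` values are all assumptions made for this example.

```scala
import scala.util.hashing.MurmurHash3

// Minimal HyperLogLog with m = 2^p registers (one byte per register for simplicity).
final class SimpleHLL(val p: Int = 11) {
  require(p >= 7 && p <= 16, "p must be in [7, 16]")
  private val m = 1 << p
  private val registers = new Array[Byte](m)

  // 64-bit hash: two seeded 32-bit MurmurHash3 values, mixed with the fmix64 finalizer.
  private def hash64(s: String): Long = {
    var h = (MurmurHash3.stringHash(s, 17).toLong << 32) |
      (MurmurHash3.stringHash(s, 31).toLong & 0xffffffffL)
    h ^= h >>> 33; h *= 0xff51afd7ed558ccdL
    h ^= h >>> 33; h *= 0xc4ceb9fe1a85ec53L
    h ^ (h >>> 33)
  }

  def add(cookieId: String): Unit = {
    val h = hash64(cookieId)
    val idx = (h >>> (64 - p)).toInt // first p bits select a register
    // Rank = position of the first 1-bit in the remaining 64 - p bits.
    val rank = math.min(java.lang.Long.numberOfLeadingZeros(h << p), 64 - p) + 1
    if (rank > registers(idx)) registers(idx) = rank.toByte
  }

  // |A|: harmonic-mean estimate, with linear counting for small cardinalities.
  def cardinality(): Long = {
    val alpha = 0.7213 / (1.0 + 1.079 / m) // bias constant, valid for m >= 128
    val raw = alpha * m * m / registers.map(r => math.pow(2.0, -r.toDouble)).sum
    val zeros = registers.count(_ == 0)
    if (raw <= 2.5 * m && zeros > 0) math.round(m * math.log(m.toDouble / zeros))
    else math.round(raw)
  }

  // |A ∪ B|: register-wise maximum; the result has the same fixed size.
  def merge(other: SimpleHLL): SimpleHLL = {
    require(other.p == p, "precision must match")
    val out = new SimpleHLL(p)
    var i = 0
    while (i < m) {
      out.registers(i) = math.max(registers(i), other.registers(i)).toByte
      i += 1
    }
    out
  }

  // |A ∩ B| = |A| + |B| - |A ∪ B|, clamped at zero since the terms are estimates.
  def intersect(other: SimpleHLL): Long =
    math.max(0L, cardinality() + other.cardinality() - merge(other).cardinality())
}
```

Because `intersect` subtracts three estimates, its relative error is noticeably larger than that of `cardinality()`, especially when the intersection is small compared to the two sets.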
  16. COUNTING WITH SPARK
  17. IMPRESSION LOG TRANSFORMATION
      (Same impression log table as on slide 6.)
      Splitting the original impression log table into two separate aggregated tables:
      - by campaign
      - by segments
  18. FROM COOKIES TO HYPERLOGLOG
      AD      SITE        COOKIES HLL         IMPRESSIONS  CLICKS
      bmw_X5  forbes.com  HyperLogLog@23sdg4  5468         35
      bmw_X5  cnn.com     HyperLogLog@84jdg4  8943         29

      SEGMENT      COOKIES HLL         IMPRESSIONS  CLICKS
      Male         HyperLogLog@65xzx2  235468       335
      $100k-$150k  HyperLogLog@12das1  569473       194
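The split into a per-campaign table and a per-segment table (slides 17 and 18) can be sketched with plain Scala collections. This is an illustration under stated assumptions: `LogRow`, `Agg`, and `SplitLog` are hypothetical names, and an exact `Set[String]` of cookies stands in for the HyperLogLog column so the example stays self-contained. The sample rows come from the impression log slide, with elided segments left out.

```scala
final case class LogRow(ad: String, site: String, cookie: String,
                        impressions: Long, clicks: Long, segments: Seq[String])

// Aggregated row; `cookies` is an exact Set standing in for the HLL column.
final case class Agg(cookies: Set[String], impressions: Long, clicks: Long) {
  def +(o: Agg): Agg =
    Agg(cookies ++ o.cookies, impressions + o.impressions, clicks + o.clicks)
}

object SplitLog {
  private def toAgg(r: LogRow): Agg = Agg(Set(r.cookie), r.impressions, r.clicks)

  // Table 1: one aggregated row per (ad, site) campaign key.
  def byCampaign(log: Seq[LogRow]): Map[(String, String), Agg] =
    log.groupBy(r => (r.ad, r.site))
      .map { case (key, rows) => key -> rows.map(toAgg).reduce(_ + _) }

  // Table 2: one aggregated row per segment; a log row counts toward each of its segments.
  def bySegment(log: Seq[LogRow]): Map[String, Agg] =
    log.flatMap(r => r.segments.map(s => s -> toAgg(r)))
      .groupBy(_._1)
      .map { case (segment, pairs) => segment -> pairs.map(_._2).reduce(_ + _) }

  val sample: Seq[LogRow] = Seq(
    LogRow("bmw_X5", "forbes.com", "13e835610ff0d95", 10, 1, Seq("a.m", "b.rk", "c.rh", "d.sn")),
    LogRow("mercedes_2015", "forbes.com", "13e8360c8e1233d", 5, 0, Seq("a.f", "b.rk", "c.hs", "d.mr")),
    LogRow("nokia", "gizmodo.com", "13e3c97d526839c", 8, 0, Seq("a.m", "b.tk", "c.hs", "d.sn")),
    LogRow("apple_music", "reddit.com", "1357a253f00c0ac", 3, 1, Seq("a.m", "b.rk", "d.sn", "e.gh")),
    LogRow("nokia", "cnn.com", "13b23555294aced", 2, 1, Seq("a.f", "b.tk", "c.rh", "d.sn")),
    LogRow("apple_music", "facebook.com", "13e8333d16d723d", 9, 1, Seq("a.m", "d.sn", "g.gh", "s.hr"))
  )
}
```

Replacing the `Set` with a mergeable sketch keeps the same shape: `+` becomes a sketch merge, and each aggregated row stays at a fixed size no matter how many cookies it covers.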
  19. DATA FRAMES
      > val adImpressions: DataFrame = sqlContext.parquetFile("/aa/${yy-mm-dd}/${hh}/audience")
      > adImpressions.printSchema()
      // root
      //  |-- ad: string (nullable = true)
      //  |-- site: string (nullable = true)
      //  |-- hll: binary (nullable = true)
      //  |-- impressions: long (nullable = true)
      //  |-- clicks: long (nullable = true)
      > val segmentImpressions: DataFrame = sqlContext.parquetFile("/aa/${yy-mm-dd}/${hh}/segments")
      > segmentImpressions.printSchema()
      // root
      //  |-- segment: string (nullable = true)
      //  |-- hll: binary (nullable = true)
      //  |-- impressions: long (nullable = true)
      //  |-- clicks: long (nullable = true)
  20. LET'S COUNT SOMETHING
      Percent of college and high school education in the BMW campaign?

      import org.apache.spark.sql.functions._
      import org.apache.spark.sql.HLLFunctions._

      val bmwCookies: HyperLogLog = adImpressions
        .filter(col("ad") === "bmw_X5")
        .select(mergeHLL(col("hll"))).first() // -- sum(clicks)

      val educatedCookies: HyperLogLog = segmentImpressions
        .filter(col("segment") in Seq("College", "High School"))
        .select(mergeHLL(col("hll"))).first()

      // intersect returns a Long, so convert before dividing to avoid integer division
      val p = (bmwCookies intersect educatedCookies).toDouble / bmwCookies.cardinality()
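The last step on slide 20 is plain arithmetic on three estimates, so it can be checked in isolation. The sketch below uses hypothetical names (`Overlap`, `intersection`, `share`) and made-up cardinalities. It shows the inclusion-exclusion step and why the division needs a `Double`: dividing one `Long` by another would truncate the ratio to 0.

```scala
object Overlap {
  // |A ∩ B| = |A| + |B| - |A ∪ B|, clamped at zero because estimation error
  // can push the difference slightly negative.
  def intersection(a: Long, b: Long, union: Long): Long =
    math.max(0L, a + b - union)

  // Fraction of A that also belongs to B, clamped to at most 1.0.
  def share(a: Long, b: Long, union: Long): Double =
    if (a == 0L) 0.0
    else math.min(1.0, intersection(a, b, union).toDouble / a)
}
```

With hypothetical estimates |A| = 1000, |B| = 4000 and |A ∪ B| = 4600, the intersection is 400 and `Overlap.share(1000L, 4000L, 4600L)` is 0.4, whereas the integer expression `400L / 1000L` evaluates to 0.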
  21. SPARK ROAD NOTES
  22. WRITING OWN SPARK AGGREGATION FUNCTIONS
      // Define the function that will be applied to each row in an RDD partition
      case class MergeHLLPartition(child: Expression)
        extends AggregateExpression with trees.UnaryNode[Expression] { ... }

      // Define the function that will take results from different partitions and merge them together
      case class MergeHLLMerge(child: Expression)
        extends AggregateExpression with trees.UnaryNode[Expression] { ... }

      // Tell Spark how you want it to split your computation across the RDD
      case class MergeHLL(child: Expression)
        extends PartialAggregate with trees.UnaryNode[Expression] {
        override def asPartial: SplitEvaluation = {
          val partial = Alias(MergeHLLPartition(child), "PartialMergeHLL")()
          SplitEvaluation(
            MergeHLLMerge(partial.toAttribute),
            partial :: Nil
          )
        }
      }

      def mergeHLL(e: Column): Column = MergeHLL(e.expr)
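The two-phase shape on slide 22 (fold rows within a partition, then merge the per-partition results) can be illustrated without Spark. In this sketch, partitions are modelled as plain Scala sequences and each HLL as a byte array of registers. `PartialMerge`, `mergePartition`, and `mergeFinal` are hypothetical names, not Spark or spark-hyperloglog APIs.

```scala
object PartialMerge {
  type Registers = Array[Byte]

  // Phase 1: runs once per partition and folds every row into one mutable buffer,
  // so no intermediate object is allocated per row.
  def mergePartition(rows: Iterator[Registers], size: Int): Registers = {
    val buf = new Array[Byte](size)
    rows.foreach { r =>
      require(r.length == size, "all register arrays must have the same length")
      var i = 0
      while (i < size) {
        if (r(i) > buf(i)) buf(i) = r(i) // register-wise maximum
        i += 1
      }
    }
    buf
  }

  // Phase 2: combines the per-partition buffers; merging is the same operation.
  def mergeFinal(partials: Seq[Registers], size: Int): Registers =
    mergePartition(partials.iterator, size)

  // Convenience wrapper that simulates the whole aggregation over a list of partitions.
  def run(partitions: Seq[Seq[Registers]], size: Int): Registers =
    mergeFinal(partitions.map(p => mergePartition(p.iterator, size)), size)
}
```

The split works because the register-wise maximum is associative and commutative, so the result does not depend on how rows are distributed across partitions.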
  23. AGGREGATION FUNCTIONS: PROS & CONS
      Pros:
      - Simple DSL; the functions look like native DataFrame functions
      - Much faster than solving this problem with Scala transformations on top of RDD[Row]
      - Dramatic performance speed-up (10x) via mutable state control
      Cons:
      - The UDF has to be part of a private Spark package, so there is a risk that these interfaces will be changed or abandoned in the future
  24. SPARK AS IN-MEMORY SQL DATABASE
      Change: from a batch-driven app to a long-running app
      [Diagram: the batch-driven app creates a SparkContext, runs calculations, shows the result, and destroys the SparkContext. The long-running app creates a SparkContext, loads data and caches it in memory, then receives requests, runs calculations, and shows results before the SparkContext is destroyed.]
      - ~500 GB (1 year history)
      - ~40N occupied from ~200N cluster
      - Response time 1-2 seconds
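The difference between the two application styles on slide 24 comes down to when the data is loaded. Below is a minimal, Spark-free sketch of the long-running style. `AudienceService`, its `load` function, and the `reach` method are hypothetical names, and a `Map` stands in for a DataFrame cached in memory.

```scala
// Long-running service: data is loaded once and reused by every request.
final class AudienceService(load: () => Map[String, Long]) {
  // Evaluated on first access only; stands in for caching a DataFrame in memory.
  private lazy val cache: Map[String, Long] = load()

  // Handles one request against the cached data.
  def reach(ad: String): Long = cache.getOrElse(ad, 0L)
}
```

A batch-driven app would call `load()` on every run. Here the loading cost is paid once, and later requests only pay for the calculation.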
  25. REFERENCES
      - http://eugenezhulenev.com/blog/2015/07/15/interactive-audience-analytics-with-spark-and-hyperloglog/
        (Special thanks to Eugene Zhulenev for his brilliant blog post)
      - https://github.com/collectivemedia/spark-hyperloglog
      - http://research.google.com/pubs/pub40671.html
      - https://github.com/AdRoll/cantor
      - http://tech.adroll.com/blog/data/2013/07/10/hll-minhash.html
  26. THANK YOU!
How we count big audiences with HLL algorithm at Collective
