Traditional market research is generally conducted through questionnaires or other forms of explicit feedback, posed directly to an ad hoc panel of individuals who, in aggregate, are representative of a larger group of people. Unfortunately, these traditional approaches are often invasive, nonscalable, and biased. Indirect approaches based on sparse and implicit consumer feedback (e.g., social network interactions, web browsing, or online purchases) are more scalable, more authentic, and better suited to real-time consumer insights.
Although those sources of implicit consumer feedback provide relevant and detailed pictures of the population, they individually provide only a limited set of observable behaviors.
The Holy Grail of market research is the ability to merge different sources of consumer interests into an augmented view that connects all the dots across multiple domains.
Unfortunately, user-centric "fusion" algorithms have many limitations when the datasets are heterogeneous and differ strongly in size and density, and when the number of sources to merge increases.
We propose a novel Audience Projection approach that defines a target audience as a subset of the population in a source domain and projects this target onto a set of users in a destination dataset.
We will show how libraries such as spaCy provide deep learning implementations of Named Entity Recognition (NER) to match related brands, and we will use Bayesian inference to transfer knowledge from the source domain. This way, we can estimate the probability that a user belongs to the target, using the source distribution of the volume of interests on common entities as model evidence and the source target size as the prior probability.
Bio:
Gianmario Spacagna is the chief scientist and head of AI at Helixa. His team’s mission is building the next generation of behavior algorithms and models of human decision making with careful attention to their potential and effects on society. His experience covers a diverse portfolio of machine learning algorithms and data products across different industries. Previously, he worked as a data scientist in IoT automotive (Pirelli Cyber Technology), retail and business banking (Barclays Analytics Centre of Excellence), threat intelligence (Cisco Talos), predictive marketing (AgilOne), plus some occasional freelancing. He’s a co-author of the book Python Deep Learning, contributor to the “Professional Manifesto for Data Science,” and founder of the Data Science Milan community. Gianmario holds a master’s degree in telematics (Polytechnic of Turin) and software engineering of distributed systems (KTH of Stockholm). After having spent half of his career abroad, he now lives in Milan. His favorite hobbies include home cooking, hiking, and exploring the surrounding nature on his motorcycle.
Audience Projection of Target Consumers over Multiple Domains: a NER and Bayesian approach, Gianmario Spacagna, Alberto Pirovano
1. Helixa
Audience Projection of Target Consumers over
Multiple Domains: a NER and Bayesian approach
Gianmario Spacagna
Chief Scientist @ Helixa
O’Reilly AI Conference
London, 16th October 2019
2. About Me
7+ years experience in Data Science and Machine Learning
Currently leading a team of ML Scientists and ML Engineers
Background in Telematics and Software Engineering of Distributed Systems
Ongoing MBA Student
Co-author of Python Deep Learning
Contributor of the Professional Data Science Manifesto
Blogger of Data Science Vademecum
Founder of the Data Science Milan community (1.4k members)
Stockholm, London, Milan
Gianmario Spacagna
Chief Scientist, Helixa
gspacagna@helixa.ai
3. DEMOGRAPHICS
HHI < 40K
Female
18 - 24
INFLUENCERS
ODESZA Cardi B
Shane Dawson James Charles
Helixa is a market research platform that uses AI to integrate disparate data sources into an enriched view of the consumers who matter to your business.
INTERESTS
Listen to Podcasts Kylie Cosmetics
Fan
Starbucks
Chipotle
PSYCHOGRAPHICS
Fast Food Fans
Fashion Enthusiasts
Entertainment Junkies
4. In the next 40 minutes...
OUR GOAL:
Discuss some of the current challenges of traditional market
research and propose a novel solution based on Named Entity
Recognition (NER) and Bayesian Inference.
6. Applied Social Science
What is Market Research?
Gain Insights for Strategic Decisions
Information about individuals and organizations
Statistical Inference
7. Why Does Market Research Matter?
Brand Perceptions
Consumer Preferences and Behaviors
Buyer Personas
Market Segmentation
Identify Opportunities
Market Trends
8. Approaches to Market Research
Qualitative:
Opinions and individual experiences
In-depth interviews
Smaller sample
Quantitative:
Numbers and Data
Statistics
Larger sample
24. Look-alike Fusions Don’t Scale Well
Differences in feature space
Craftsmanship required at each change of data
Universal objective function to optimize
25. Is there a more scalable way to “fuse” datasets?
27. Audience Projection defined as “User Binary Classification”
Source: Social Network Panel, 70M social accounts
Destination: Consumptions Survey Panel, 200M U.S. consumers
Target Audience = TRUE / FALSE
1.6M / 26M
PROJECTION
Ben & Jerry’s: bought in last 6 months? Affinity: 1.80x
Venmo: paid in last 30 days? Affinity: 1.6x
Angry Orchard: drunk in last 6 months? Affinity: 1.50x
28. Solution = Named Entity Recognition (NER) + Bayesian Model
Source: Social Network Panel → Social Pages → NER
Destination: Consumptions Survey Panel → Consumption Questions → NER
ENTITY LINKING (NEL)
BAYESIAN MODEL
Target Audience → Projected Users Probabilities
29. Entities Represent a Universal Feature Space
Social Pages → NER
Consumption Questions → NER
Listed Products → NER
30. Named Entity Recognition (NER) in each Domain
Social Pages / Consumption Questions / Listed Products
The Coca-Cola Company is a total beverage company, offering over 500 brands in more than 200 countries and territories.
Adidas Originals Men's Relaxed Strapback Cap
Coca-Cola KWC-4 6-Can Personal Mini 12V DC Car and 110V AC Cooler, Red
37. Stacked Heterogeneous Feature Space
[Figure: user-by-entity matrix. Rows: Source Users, Destination Users. Columns: source-only entities, common entities, destination-only entities. X marks an observed interest, ? marks a latent interest. The Target Audience is defined on the Source Users.]
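The three column groups of the stacked feature space can be sketched with plain set operations. This is a minimal illustration only: the entity names are hypothetical placeholders for what NER and entity linking might return in each domain, and `partition_entities` is a helper introduced here, not part of the talk.

```python
# Hypothetical entity sets, standing in for the linked NER output of each domain.
source_entities = {"Game Informer", "Ben & Jerry's", "ODESZA", "Cardi B"}
destination_entities = {"Game Informer", "Ben & Jerry's", "Venmo", "Angry Orchard"}


def partition_entities(source, destination):
    """Split entities into source-only, common, and destination-only groups."""
    common = source & destination
    return {
        "source_only": source - common,
        "common": common,
        "destination_only": destination - common,
    }


groups = partition_entities(source_entities, destination_entities)
for name, entities in groups.items():
    print(name, sorted(entities))
```

Only the `common` group is observed on both sides, which is why it is the part of the feature space used to carry information from source to destination.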
38. Common Entities translate Source to Destination
Source: Social Network Panel
Destination: Consumptions Survey Panel
Target Audience = ?
Common Entities
Bayesian Model
Source Target Size: 1.6M / 70M = 2.3%
Share of Interests
39. “Share of interests” encodes the DNA of the Target Audience
Global share of interests: 100%
Common Entities
Target audience share of interests: 50%, 17%, 50%
Target Audience slice
40. Bayesian Model
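Posterior: probability of a user belonging to the projected target given the Share of Interests on common entities.
$$P(u \in T \mid S) = \frac{P(S \mid u \in T) \cdot P(u \in T)}{P(S)}$$
Here $u$ denotes a user, $T$ the projected target, and $S$ the Share of Interests on common entities.
Likelihood: $P(S \mid u \in T)$
Prior: $P(u \in T)$, Source Target Size = 2.3%
Evidence: $P(S)$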
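A minimal sketch of how the posterior could be computed for a single common entity. It assumes the evidence is expanded with the law of total probability over "user in target" and "user not in target", which the slides do not spell out. The function name `posterior_probability` is introduced here for illustration; the prior is the source target size from the slides, and the two log-likelihoods are the values quoted on the binomial likelihood slides. How evidence from several entities is combined is not covered.

```python
import math


def posterior_probability(prior, log_lik_pos, log_lik_neg):
    """Bayes' rule for a binary target.

    Assumption: evidence = P(S | in T) * prior + P(S | not in T) * (1 - prior).
    """
    # Use the log-likelihood difference so tiny likelihoods do not underflow.
    ratio = math.exp(log_lik_neg - log_lik_pos)  # P(S | not in T) / P(S | in T)
    return prior / (prior + (1.0 - prior) * ratio)


prior = 0.023           # source target size, 1.6M / 70M, rounded as on the slides
log_lik_pos = -5.56323  # log P(S | user in target)
log_lik_neg = -5.53942  # log P(S | user not in target)

post = posterior_probability(prior, log_lik_pos, log_lik_neg)
print(f"prior = {prior:.4f}, posterior = {post:.4f}")
```

With these inputs the negative likelihood is the larger of the two, so the posterior lands slightly below the prior.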
49. Validate via Common Entities
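[Figure: Source Users and Destination Users over the common entities. The Target Audience is defined by an OR query in the source; the Projected Audience is compared with the Ground Truth given by the Exact Query Replica (the same OR query) in the destination.]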
50. Validate via Self Reconstruction Within the Same Domain
[Figure: Source Users and Destination Users over source-only entities, common entities, and destination-only entities, with X marking observed interests. Target Audience = Ground Truth.]
51. Validate via Double-step Reconstruction
PROJECTION PROJECTION
Predicted
probabilities
Ground
Truth
57. Multiple Perspectives Reinforce Reliability
Social Panel Target Audience =
Interacted with Game Informer social page, Affinity: 2.17x
Have you read any Game Informer issue? Affinity: 1.73x
Game Informer Single Issue Magazine purchased online, Affinity: 2.51x
68. The spaCy NER Model Overview
EMBED
ENCODE
ATTEND
PREDICT
69. Embedding Words
Features:
token            lower            prefix   suffix   shape
Apple            apple            app      ple      Wwwww
U.K.             uk               uk       uk       W.W.
Fahrenheit 451   fahrenheit 451   fah      451      Wwwwwwwwww ddd
Each word (token) is represented by concatenating the embeddings of all 4 features in order to generalize the context for unknown words.
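The four features in the table can be approximated in a few lines of plain Python. This is a sketch that follows the slide's notation (W, w, d for the shape and 3-character affixes), not spaCy's own implementation; `shape` and `token_features` are hypothetical helpers, and the embedding lookup and concatenation step is not shown.

```python
def shape(token):
    """Map uppercase letters to 'W', lowercase to 'w', digits to 'd'; keep the rest."""
    out = []
    for ch in token:
        if ch.isupper():
            out.append("W")
        elif ch.islower():
            out.append("w")
        elif ch.isdigit():
            out.append("d")
        else:
            out.append(ch)
    return "".join(out)


def token_features(token, affix_len=3):
    """Return the lower, prefix, suffix, and shape features of a token."""
    lower = token.lower()
    return {
        "lower": lower,
        "prefix": lower[:affix_len],
        "suffix": lower[-affix_len:],
        "shape": shape(token),
    }


for tok in ["Apple", "U.K.", "Fahrenheit 451"]:
    print(tok, token_features(tok))
```

Unlike the table, this sketch does not strip punctuation, so "U.K." keeps its dots in the lower, prefix, and suffix features.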
71. Encoding Sequences of Words
Residual Convolutional Neural Networks encode context-independent word vectors into a context-sensitive sentence matrix.
“Mark Watney visited Mars”
Raw tri-gram chunk (Mark, Watney, visited) → Enriched tri-gram matrix
72. Crafting the Attention Vector
The attention vector of the trigram includes
information on the encountered entities.
“Mark Watney visited Mars”
Attention vector
Tri-gram matrix
Enriched
tri-gram vector
73. Predicting the Recognized Entities
Actions:
SHIFT
OUT
REDUCE (Entity Tagging)
Stack Buffer Segment
“Mark Watney visited Mars”
Actions:
1. SHIFT
2. SHIFT
3. REDUCE (PER)
4. OUT
5. SHIFT
6. REDUCE (LOC)
Mark Watney, Mars
Mark, Watney, visited, Mars
Enriched tri-gram vector
Update attention
Attention vector
Tri-gram matrix
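The action sequence for “Mark Watney visited Mars” can be replayed with a small stack-and-buffer simulation. This sketch only shows what the three actions do to the state; it assumes SHIFT moves the next token onto the stack, OUT drops the next token as a non-entity, and REDUCE turns the stack into one labelled segment. In the real model the next action is predicted from the attention and tri-gram vectors, which is not reproduced here, and `apply_actions` is a hypothetical helper.

```python
def apply_actions(tokens, actions):
    """Replay SHIFT / OUT / REDUCE(label) actions over a token buffer."""
    buffer = list(tokens)
    stack, segments = [], []
    for action in actions:
        if action == "SHIFT":
            # Move the next buffer token onto the stack.
            stack.append(buffer.pop(0))
        elif action == "OUT":
            # Discard the next buffer token: it is not part of an entity.
            buffer.pop(0)
        elif action.startswith("REDUCE"):
            # Pop the whole stack into a single labelled entity segment.
            label = action[action.index("(") + 1 : action.index(")")]
            segments.append((" ".join(stack), label))
            stack.clear()
        else:
            raise ValueError(f"unknown action: {action}")
    return segments


tokens = "Mark Watney visited Mars".split()
actions = ["SHIFT", "SHIFT", "REDUCE(PER)", "OUT", "SHIFT", "REDUCE(LOC)"]
entities = apply_actions(tokens, actions)
print(entities)
```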
76. Projecting the Share of Interests on Common Entities
Target
Audience
Projection
50%
17%
50%
Share of Interests:
SIZE: 60M
SIZE: 200M
SIZE: ?
SIZE: 40M
Global Audience
(average American)
=
Target
Audience evidence
prior
77. Evidence Statistics on Share of Interests
N = 180M users in U.S. population
sampling rate = 1 : 10k
n = 18k users in sample panel
p = 17% of market penetration
x = 3k expected projected users
SIZE: 200M
SIZE: 40M
statistics:
evidence
78. Binomial Positive Likelihood
$$P(S \mid u \in T)$$
where $u$ is the user, $T$ the target, and $S$ the observed Share of Interests.
Probability of selecting 3000 / 18000 McDonald’s panel users given that the user IS part of the target (p = 17%):
n = 17999, x = 2999, log(p) = -5.56323
is smaller than
n = 18000, x = 3000, log(p) = -5.54342
79. Binomial Negative Likelihood
$$P(S \mid u \notin T)$$
where $u$ is the user, $T$ the target, and $S$ the observed Share of Interests.
Probability of selecting 3000 / 18000 McDonald’s panel users given that the user IS NOT part of the target (p = 17%):
n = 17999, x = 3000, log(p) = -5.53942
is greater than
n = 18000, x = 3000, log(p) = -5.54342
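The log-probabilities on the two likelihood slides can be reproduced with a binomial log-pmf written with the standard library only. This is a sketch under one assumption about how the slides were built: the user being scored is left out of the panel, so the positive case counts 2999 of 17999 and the negative case counts 3000 of 17999 (the quoted value -5.53942 is reproduced with x = 3000). `binom_log_pmf` is a helper introduced here, and logarithms are natural.

```python
import math


def binom_log_pmf(x, n, p):
    """Natural log of the binomial probability of x successes in n trials."""
    log_coeff = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
    return log_coeff + x * math.log(p) + (n - x) * math.log(1.0 - p)


N = 180_000_000          # users in the U.S. population
sampling_rate = 10_000   # 1 panel user every 10k people
n = N // sampling_rate   # 18000 users in the sample panel
p = 0.17                 # market penetration
x = 3000                 # selected McDonald's panel users

log_all = binom_log_pmf(x, n, p)          # full panel
log_pos = binom_log_pmf(x - 1, n - 1, p)  # user left out, assumed IN the target
log_neg = binom_log_pmf(x, n - 1, p)      # user left out, assumed NOT in the target

print(f"all: {log_all:.5f}  positive: {log_pos:.5f}  negative: {log_neg:.5f}")
```

Since 3000 / 18000 is a little below the 17% penetration, the positive likelihood comes out below the full-panel value and the negative likelihood above it.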