Algorithms and machine learning models can unintentionally learn to classify and control people based on their data. A case study shows how optimizing for click-through rates can lead users to be clustered into "filter bubbles" and have their opinions steered over time without feedback. It is important to be aware of these risks and regulate algorithms' use of personal data to avoid unfairly profiling or manipulating individuals.
3. Outline
Theory
1. Algorithms
2. Machine Learning
3. Big Data & Consequences for Machine Learning
4. Use of Algorithms Today and in the Future
Experiments
1. Discriminating people with machine learning & algorithms
2. Creating persistent user identities by (accidental) de-anonymization
Summary & Outlook
1. Strategies for Handling Data Responsibly
5. Algorithms
An algorithm is a "recipe" that gives a computer (or a
human) step-by-step instructions in order to achieve a
certain goal.
[Flowchart: Start → doorbell ringing → "Andreas stands on trapdoor?" → yes: open trapdoor / no: wait, our time will come]
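As a minimal sketch, the same flowchart can be written as code for a computer instead of a human (the function and its parameter names are invented for this illustration):

```python
# The doorbell flowchart above as step-by-step instructions for a computer.
# Names are made up for this sketch.
def handle_doorbell(andreas_stands_on_trapdoor: bool) -> str:
    if andreas_stands_on_trapdoor:
        return "Open trapdoor"
    return "Wait. Our time will come."

print(handle_doorbell(True))   # -> Open trapdoor
print(handle_doorbell(False))  # -> Wait. Our time will come.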
6. Machine Learning
A machine learning algorithm automatically generates
models and checks them against the training data we
provide, trying to find a model that explains the data well
and can predict unknown data.
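A minimal sketch of this loop with scikit-learn (toy data, assumed setup): the algorithm fits model parameters to the training data we provide and then predicts previously unseen inputs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(100, 1))               # known variables x
y_train = 3.0 * X_train[:, 0] + rng.normal(0, 1.0, 100)   # observations y with noise

model = LinearRegression().fit(X_train, y_train)  # search for parameters that explain the data
print(model.predict(np.array([[4.2]])))           # predict a previously unseen x
```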
7. Data vs. Model
y = m(x, p) + ε
see e.g. "Machine Learning" by Tom Mitchell (McGraw Hill, 1997).
[Plot: y vs. x1]
8. Data vs. Model
y = m(x, p) + ε
see e.g. "Machine Learning" by Tom Mitchell (McGraw Hill, 1997).
[Plot: y vs. x1]
9. Sources of Error
ε = ε_sys + ε_noise + ε_hidden
ε_sys: systematic errors arise due to imperfect measurements of known variables
ε_noise: noise is present due to the nature of the process or our measurement apparatus
ε_hidden: many variables are usually unknown to us
10. Big Data & Machine Learning
2000 → 2015:
more data sources
high data volume
higher density
higher frequency
longer retention
13. Exploiting New Sources of Data
y = m(x, p) + ε_hidden + ⋯
incorporate variables that were hidden into the model, reducing the error
14. Understanding Results
Models can be easy or very difficult to interpret
Parameter space is often huge and can't be
explored entirely
[Figure: decision tree classifier (easy to interpret), with yes/no branches such as "age > 37?", "height < 1.78?", "projects > 19?", vs. neural network classifier (hard to interpret)]
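To make the contrast concrete, here is a small assumed sketch: a fitted decision tree can be printed as human-readable rules via scikit-learn's export_text, while a neural network offers no comparable view (data and feature names are invented for illustration).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(200, 3))        # toy features: "age", "height", "projects" (scaled)
y = (X[:, 0] > 0.5).astype(int)             # toy label that depends only on "age"

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["age", "height", "projects"]))  # readable if/else rules
```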
15. Example: Deep Learning for Image
Recognition
http://googleresearch.blogspot.com/2015/06/inceptionism-going-deeper-into-neural.html
16. Classifying Use of Algorithms
low risk: mildly annoying in case of failure / misbehaviour
medium risk: large impact on our life in case of failure / misbehaviour
high risk: critical impact on our life in case of failure / misbehaviour
17. low risk
personalization of services (e.g. recommendation engines for web shops, video-on-demand, content, ...)
individualized ad targeting
customer rating / profiling
consumer demand prediction
19. high risk
military intelligence / intervention
political oppression
critical infrastructure services (e.g. electricity)
life-changing decisions (e.g. about health)
23. Discrimination
Discrimination is treatment or consideration of, or making a distinction in favor of or against, a person or thing based on the group, class, or category to which that person or thing is perceived to belong, rather than on individual merit.
Wikipedia
Protected attributes (examples):
Ethnicity, Gender, Sexual Orientation, ...
24. When is a process discriminating?
Disparate Impact: adverse impact of a process C on a given group X

Outcome   | X = 0              | X = 1
C = NO    | P(C = NO, X = 0)   | P(C = NO, X = 1)
C = YES   | P(C = YES, X = 0)  | P(C = YES, X = 1)

P(C = YES | X = 0) / P(C = YES | X = 1) < τ

see e.g. "Certifying and Removing Disparate Impact", M. Feldman et al.
25. When is a process discriminating?
Estimating with real-world data
Outcome   | X = 0 | X = 1
C = NO    | a     | b
C = YES   | c     | d

(c / (a + c)) / (d / (b + d)) < τ
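As a minimal sketch, the estimate above is a one-liner over the table counts (the counts and the threshold are illustrative; τ = 0.8 corresponds to the often-cited "80 % rule"):

```python
def disparate_impact(a: int, b: int, c: int, d: int) -> float:
    """Ratio of positive-outcome rates: P(C = YES | X = 0) / P(C = YES | X = 1)."""
    return (c / (a + c)) / (d / (b + d))

tau = 0.8                                         # illustrative threshold ("80 % rule")
ratio = disparate_impact(a=70, b=40, c=30, d=60)  # made-up counts
print(ratio, "disparate impact" if ratio < tau else "ok")
```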
26. Discrimination through Data Analysis
Replacing a manual hiring process with
an automated one.
Benefits:
Save time screening CVs by hand
Improve candidate choice
28. The Setup
Use submitted information (CV, work
samples) along with publicly available /
external information to predict candidate
success.
Use data from the manual process (invite/ no
invite) to train the classifier
Provide it with as much data as possible to maximize predictive power
29. Our decision model
S = m(Y) + d(X) + ε

S: score of the candidate
m(Y): merit function
d(X): discrimination malus / bonus
ε: hidden variables & luck (if you believe in it)

C = YES if S > t, NO otherwise

[Plot: score distributions over candidate merit and luck, without discrimination vs. with discrimination]
30. Training a predictor for C
C = C(Y, Z)

Y: information about the unprotected attributes
Z: additional information we give to the algorithm

Z ∝ X + ε
i.e. we can predict the value of X from Z with fidelity γ
31. A Simulation
• Generate 10,000 samples of C with disparate impact
• Train a classifier (e.g. a Support Vector Machine) on the training data
• Provide it with (noisy) information about X
• Measure the algorithm-based disparate impact on the test data (sketched in code below)
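A minimal sketch of this simulation (the data-generating process, constants, and feature names are assumptions based on the decision model above): the classifier never sees X itself, only the noisy proxy Z, yet its decisions reproduce the disparate impact baked into the training labels.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 10_000
X = rng.integers(0, 2, n)                       # protected attribute (0 = disadvantaged group)
merit = rng.normal(0, 1, n)                     # candidate merit m(Y)
S = merit + 0.8 * X + rng.normal(0, 0.3, n)     # score with a discrimination bonus/malus d(X)
C = (S > 0.5).astype(int)                       # historical decisions (disparately impacted)
Z = X + rng.normal(0, 0.5, n)                   # noisy proxy that leaks X

features = np.column_stack([merit, Z])          # note: X itself is NOT a feature
f_tr, f_te, c_tr, c_te, x_tr, x_te = train_test_split(
    features, C, X, test_size=0.25, random_state=0)

pred = SVC().fit(f_tr, c_tr).predict(f_te)
rate_x0 = pred[x_te == 0].mean()                # P(C = YES | X = 0) of the learned classifier
rate_x1 = pred[x_te == 1].mean()                # P(C = YES | X = 1)
print("disparate impact ratio:", rate_x0 / rate_x1)
```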
38. Why give that information to the
algorithm?
We don't! But it leaks through anyway: Z carries information about X.
39. But can it be done?
Discrimination through information
leakage is possible, but how likely is it in
practice?
Let's try!
We use publicly available data to predict
the gender of Github users (protected
attribute X).
40. Basic Information
Manually classify users as men/women (by looking at profile pictures, names) → 5,000 training samples with small error
Use the Github API to retrieve information about users
(followers, repositories, stargazers, contributions, ...)
We only use data that is easy to get and likely to be used in a real-world setting for classification
We only use a limited dataset (proof of concept, not a full study)
44. Commit Message Analysis
Use the commit messages (as obtained from the event
data) to predict gender by training a Support Vector
Machine (SVM) classifier on the word frequency data.
[Word cloud of predictive terms: lol, emoji, wtf, seriously, rtfm, dude, fuck, git]
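A minimal sketch of such a classifier (the commit messages and labels below are invented placeholders): word counts as features, a linear SVM as the predictor of the protected attribute.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

commit_messages = ["fix lol typo", "refactor module, add tests", "wtf git rebase again"]
gender_labels = [1, 0, 1]                              # manually assigned labels (placeholder)

model = make_pipeline(CountVectorizer(), LinearSVC())  # word frequencies -> linear SVM
model.fit(commit_messages, gender_labels)
print(model.predict(["seriously, rtfm dude"]))
```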
45. Predictive Power of Model
Classifier error: 35 % vs. the 50 % random baseline (15 percentage points better), i.e. about 30 % information leakage fidelity (with a very simple data set)
46. Takeaways
Algorithms will readily "learn"
discrimination from us if we provide
them with contaminated training
data.
Information leakage of protected
attributes can happen easily.
47. How we can fix this
Harder than you might think! We need to know X to
measure disparate impact and remove it
Incorporate a penalty for discrimination into the target function
Remove information about X from dataset by
performing a suitable transformation (reduces
fidelity of model)
see e.g. "Certifying and Removing Disparate Impact", M. Feldman et al.
49. What is de-anonymization?
Use data recorded about individuals / entities
to identify those same individuals / entities in
another set of data (exactly or with high
likelihood).
De-anonymization becomes an increasing risk as datasets about individual entities become larger and more detailed.
50. "Buckets of Truth"
N boolean attributes per entity; on average M < N of them are set

P_col = P(M_1^(1) = M_1^(2), ⋯, M_N^(1) = M_N^(2))
(the probability that two entities agree on all N attributes, i.e. collide)
fun with deanonymization: http://en.akinato
51. Examples
uniform distribution: P_col = (1 − 2p(1 − p))^N
long-tailed distribution: P_col = ?
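A quick numerical check of the uniform-case formula (assuming N independent boolean attributes, each set with probability p):

```python
import numpy as np

N, p, trials = 20, 0.1, 100_000
rng = np.random.default_rng(0)

a = rng.random((trials, N)) < p                  # fingerprints of entity 1
b = rng.random((trials, N)) < p                  # fingerprints of entity 2
p_collision_mc = (a == b).all(axis=1).mean()     # fraction of exact fingerprint collisions
p_collision_formula = (1 - 2 * p * (1 - p)) ** N

print(p_collision_mc, p_collision_formula)       # both around 0.019 for these values
```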
61. Testing De-Anonymization
Use 75 % of the trajectories as prior data set
Predict the user ID belonging to the remaining
25 %
Measure the average success probability and the identification rank (i.e. at which position the correct user appears)
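A minimal sketch of this evaluation (the fingerprint representation is an assumption: each user reduced to a visit-count vector over location buckets): match each held-out fingerprint to the most similar known user and record the rank of the correct one.

```python
import numpy as np

def identification_ranks(prior, held_out, true_ids):
    """Rank of the true user for each held-out fingerprint, by cosine similarity."""
    prior_n = prior / np.linalg.norm(prior, axis=1, keepdims=True)
    held_n = held_out / np.linalg.norm(held_out, axis=1, keepdims=True)
    order = np.argsort(-(held_n @ prior_n.T), axis=1)          # best-matching users first
    return np.array([np.where(order[i] == true_ids[i])[0][0] + 1
                     for i in range(len(true_ids))])

# toy data: 100 users, 50 location buckets; held-out data is a noisy copy of the prior
rng = np.random.default_rng(3)
prior = rng.poisson(2.0, size=(100, 50)).astype(float)
held_out = prior + rng.poisson(1.0, size=prior.shape)
ranks = identification_ranks(prior, held_out, true_ids=np.arange(100))
print("top-1 accuracy:", (ranks == 1).mean(), "mean rank:", ranks.mean())
```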
64. Possible Improvements
Use Temporal / Sequence Information
Use speed of movement / mode of transportation
Improve choice of buckets for fingerprinting
Interesting review article: "Life in the network: the coming age of computational social science", D. Lazer et al.
65. Summary
The more data we have, the more difficult it is
to keep algorithms from directly learning and
using object identities instead of attributes.
Our data follows us around!
67. As Data Scientists / Analysts /
Programmers
Consume data responsibly: Don't include everything
under the sun just because it increases fidelity by a
slim margin
Check for disparate impact and remove it from the
input data
Test anonymization safety by using machine learning
68. As Citizens / Hackers / Users
Do not blindly trust decisions made by algorithms
Test them if possible (using different input values)
Reverse-engineer them (using e.g. active learning)
Fight back with data: Collect and analyze
algorithm-based decisions using collaborative
approaches
69. As a Society
Create better regulations for algorithms and their
use
Force companies / organizations to open up black
boxes
Make access to data easier, also for small organizations
76. Case Study: Click Rate Optimization
Simple but common use case for big data: Collaborative
filtering
• Users have an opinion on a given topic A (between 0 and 1)
• They are more likely to like articles that confirm their opinion
• Our algorithm knows nothing about A, it just tries to optimize the click rate
• User opinion may change over time according to the content they are exposed to (2 % change per exposure); see the simulation sketch below
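A minimal simulation sketch of this setup (the update rule and constants are assumptions drawn from the description above):

```python
import numpy as np

rng = np.random.default_rng(7)
n_users, n_steps = 1_000, 200
opinion = rng.uniform(0, 1, n_users)          # opinion on topic A, between 0 and 1

for _ in range(n_steps):
    # greedy click-rate optimization: serve each user the article (at 0 or 1)
    # closest to their opinion, since that is what they are most likely to click
    article = np.round(opinion)
    clicked = rng.random(n_users) < 1 - np.abs(opinion - article)  # confirmation -> clicks
    opinion += clicked * 0.02 * (article - opinion)                # 2 % drift per exposure

print("users pushed to the extremes:", ((opinion < 0.1) | (opinion > 0.9)).mean())
```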
81. Clustering users into groups
Similarity measure: number of articles that both users like or dislike
Clustering: K-Means (minimize distance within clusters, maximize distance between clusters)
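A minimal sketch of this clustering step (toy data, assumed setup). On binary like/dislike vectors, Euclidean distance is monotone in the number of disagreements, so K-Means effectively groups users who liked and disliked the same articles:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(11)
likes = rng.integers(0, 2, size=(500, 40))       # 500 users x 40 articles (1 = liked)
bubbles = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(likes)
print(np.bincount(bubbles))                      # users per "filter bubble"
```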
82. Like Rate vs. Articles Viewed
[Plot: like rate vs. number of articles viewed, with click-rate optimization]