Say "Hi!" to Your New Boss
How algorithms might soon control our lives
(and why we should be careful with them)
Motivation
no alternatives, Google?
Outline
Theory
1. Algorithms
2. Machine Learning
3. Big Data & Consequences for Machine Learning
4. Use of Algorithms Today and in the Future
Experiments
1. Discriminating people with machine learning & algorithms
2. Creating persistent user identities by (accidental) de-anonymization
Summary & Outlook
1. Strategies for Handling Data Responsibly
Algorithms, Machine Learning & Big Data
Algorithms
An algorithm is a "recipe" that gives a computer (or a
human) step-by-step instructions in order to achieve a
certain goal.
[Flowchart: Start -> Doorbell ringing -> Andreas stands on trapdoor? -> yes: Open trapdoor / no: Wait. Our time will come.]
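The flowchart above can be written as a tiny function. This is a minimal sketch; the function and parameter names are made up for illustration:

```python
def on_doorbell(andreas_on_trapdoor: bool) -> str:
    """Return the next action once the doorbell rings."""
    if andreas_on_trapdoor:
        return "Open trapdoor"
    return "Wait. Our time will come."


action = on_doorbell(andreas_on_trapdoor=False)
```

Every branch of the flowchart becomes one `if`, which is all "step-by-step instructions" means here.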
Machine Learning
A machine learning algorithm automatically generates
models and checks them against the training data we
provide, trying to find a model that explains the data well
and can predict unknown data.
Data vs. Model
Data: $\mathbf{x}$; model: $y = m(\mathbf{x}, \mathbf{p}) + \varepsilon$
see e.g. "Machine Learning" by Tom Mitchell (McGraw Hill, 1997).
[Plot: y against x1]
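As a minimal sketch of fitting a model to data, assume that m is a straight line with parameters p = (slope, intercept) and that the error is Gaussian noise. Both choices, and the "true" values 2 and 1, are illustrative assumptions and not taken from the slides:

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical process: y = 2 x + 1 + noise
rng = random.Random(0)
xs = [i / 10 for i in range(200)]
ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.5) for x in xs]

slope, intercept = fit_line(xs, ys)

# What the fitted model cannot explain is the error term.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
```

The fitted parameters land close to the values used to generate the data, and the residuals are an estimate of the error term.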
Sources of Error
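$\varepsilon = \varepsilon_{sys} + \varepsilon_{noise} + \varepsilon_{hidden}$
$\varepsilon_{sys}$: systematic errors arise due to imperfect measurements of known variables
$\varepsilon_{noise}$: noise is present due to the nature of the process or our measurement apparatus
$\varepsilon_{hidden}$: many variables are usually unknown to us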
Big Data & Machine Learning
2000 2015
more data sources
high data volume
higher density
higher frequency
longer retention
Data Volume: More is (usually) better
Exploiting New Sources of Data
$y = m(\mathbf{x}, \mathbf{p}) + \varepsilon_{hidden} + \cdots$
incorporate variables that were hidden into the model, reducing error
Understanding Results
Models can be easy or very difficult to interpret
Parameter space is often huge and can't be
explored entirely
[Figure: decision tree classifier (easy to interpret) with yes/no branches on "age > 37 ?", "height < 1.78" and "projects > 19 ?", next to a neural network classifier (hard to interpret)]
Example: Deep Learning for Image
Recognition
http://googleresearch.blogspot.com/2015/06/inceptionism-going-deeper-into-neural.html
Classifying Use of Algorithms
low risk
mildly annoying in case of failure /
misbehaviour
medium risk
large impact on our life in
case of failure / misbehaviour
high risk
critical impact on our
life in case of failure /
misbehaviour
low risk
personalization of services
(e.g. recommendation engines for webs
video-on-demand, content, ...)
individualized ad targeting
customer rating / profiling
consumer demand prediction
medium risk
personalized health
person classification (e.g. crime,
terrorism)
autonomous cars/ planes/ machines
...
automated trading
military intelligence / intervention
political oppression
critical infrastructure services (e.g. elect
life-changing decisions (e.g. about healt
high risk
Big Data & Advances in Machine
Learning
Data
"Mishaps"
Two Experiments
Discriminating People
With Algorithms
Humans can be prejudiced.
Are algorithms better?
Discrimination
Discrimination is treatment or consideration of, or making
a distinction in favor of or against, a person or thing based
on the group, class, or category to which that person or
thing is perceived to belong to rather than on individual
merit.
Wikipedia
Protected attributes (examples):
Ethnicity, Gender, Sexual Orientation, ...
When is a process discriminating?
Disparate Impact: Adverse impact of a process C on a given
group X
Outcome    X = 0               X = 1
C = NO     P(C = NO, X = 0)    P(C = NO, X = 1)
C = YES    P(C = YES, X = 0)   P(C = YES, X = 1)
$\frac{P(C = YES \mid X = 0)}{P(C = YES \mid X = 1)} < \tau$
see e.g. "Certifying and Removing Disparate Impact", M. Feldman et al.
When is a process discriminating?
Estimating τ with real-world data
Outcome    X = 0   X = 1
C = NO     a       b
C = YES    c       d
$\frac{c / (a + c)}{d / (b + d)} < \tau$
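The estimate above can be computed directly from the four counts. A minimal sketch: the counts are invented for illustration, and the threshold of 0.8 is an assumption (it mirrors the "four-fifths rule" discussed by Feldman et al.; the slides do not state a value):

```python
def disparate_impact(a, b, c, d):
    """Ratio of positive-outcome rates, P(C=YES | X=0) / P(C=YES | X=1).

    a, b: number of C = NO outcomes for X = 0 and X = 1
    c, d: number of C = YES outcomes for X = 0 and X = 1
    """
    rate_x0 = c / (a + c)
    rate_x1 = d / (b + d)
    return rate_x0 / rate_x1


# Hypothetical counts, chosen only for illustration.
a, b, c, d = 80, 60, 20, 40
ratio = disparate_impact(a, b, c, d)  # (20 / 100) / (40 / 100)

tau = 0.8  # assumed threshold, not given in the slides
is_disparate = ratio < tau
```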
Discrimination through Data Analysis
Replacing a manual hiring process with
an automated one.
Benefits:
Save time screening CVs by hand
Improve candidate choice
The Setup
[Diagram labels: CV, human, algorithm, C, Training Data]
The Setup
Use submitted information (CV, work
samples) along with publicly available /
external information to predict candidate
success.
Use data from the manual process (invite/ no
invite) to train the classifier
Provide it with as much data as possible to
Our decision model
$S = m(Y) + d(X) + \varepsilon$
$m(Y)$: score of candidate (merit function)
$d(X)$: discrimination malus/bonus
$\varepsilon$: hidden variables & luck (if you believe in it)
$C = \begin{cases} YES, & S > t \\ NO, & \text{otherwise} \end{cases}$
[Plots: luck against candidate merit, without discrimination and with discrimination]
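The decision model can be written down in a few lines. A minimal sketch: the concrete form of d(X) (a fixed malus for group X = 0), the numbers and the function names are assumptions made for illustration:

```python
def score(merit, x, malus=1.0, luck=0.0):
    """S = m(Y) + d(X) + eps.

    merit: m(Y), the candidate's merit score
    x:     protected attribute (0 or 1)
    malus: assumed form of d(X): group X = 0 loses `malus` points
    luck:  eps, hidden variables and luck
    """
    d = -malus if x == 0 else 0.0
    return merit + d + luck


def decide(s, t):
    """C = YES if S > t, NO otherwise."""
    return "YES" if s > t else "NO"


# Two candidates with identical merit and luck, differing only in X.
t = 0.5
c_x0 = decide(score(merit=1.0, x=0), t)  # S = 1.0 - 1.0 = 0.0, not above t
c_x1 = decide(score(merit=1.0, x=1), t)  # S = 1.0 + 0.0 = 1.0, above t
```

Identical merit leads to different decisions as soon as d(X) is non-zero, which is the "with discrimination" case.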
Training a predictor for C
$C(Y, \mathbf{Z})$
Y: information about Y (unprotected attributes)
Z: additional information we give to the algorithm
$\mathbf{Z} \propto X + \varepsilon_{\gamma}$
We can predict the value of X from Z with fidelity γ.
A Simulation
• Generate 10,000 samples of C with disparate impact τ
• Train a classifier (e.g. Support-Vector-Machine) on the training data
• Provide it with (noisy) information about X
• Measure the algorithm-based τ on the test data
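The simulation steps above can be sketched in plain Python. This is a minimal sketch under loud assumptions: a logistic regression trained by stochastic gradient descent stands in for the Support-Vector-Machine, and the malus of 1.0, the noise levels, the threshold of 0 and the 7,000/3,000 split are made-up values, not the ones used in the talk.

```python
import math
import random

rng = random.Random(42)


def make_sample(malus=1.0, leak_noise=0.1):
    """One candidate: protected x, merit y, leaky feature z, decision c."""
    x = rng.randint(0, 1)
    y = rng.gauss(0.0, 1.0)                      # merit, independent of x
    luck = rng.gauss(0.0, 0.3)
    s = y + (-malus if x == 0 else 0.0) + luck   # biased score
    c = 1 if s > 0.0 else 0
    z = x + rng.gauss(0.0, leak_noise)           # noisy information about x
    return x, y, z, c


def sigmoid(v):
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def train_logreg(rows, labels, epochs=5, lr=0.1):
    """Stochastic gradient descent on the logistic loss (stand-in for an SVM)."""
    w = [0.0] * len(rows[0])
    b = 0.0
    for _ in range(epochs):
        for feats, label in zip(rows, labels):
            err = sigmoid(sum(wi * f for wi, f in zip(w, feats)) + b) - label
            w = [wi - lr * err * f for wi, f in zip(w, feats)]
            b -= lr * err
    return w, b


def predict(model, feats):
    w, b = model
    return 1 if sum(wi * f for wi, f in zip(w, feats)) + b > 0 else 0


def disparate_impact(decisions, xs):
    """P(C = YES | X = 0) / P(C = YES | X = 1), estimated from counts."""
    yes0 = sum(1 for c, x in zip(decisions, xs) if c == 1 and x == 0)
    yes1 = sum(1 for c, x in zip(decisions, xs) if c == 1 and x == 1)
    n0 = sum(1 for x in xs if x == 0)
    n1 = len(xs) - n0
    return (yes0 / n0) / (yes1 / n1)


samples = [make_sample() for _ in range(10000)]
train, test = samples[:7000], samples[7000:]
train_c = [c for _, _, _, c in train]
test_x = [x for x, _, _, _ in test]

# Disparate impact already present in the generated decisions.
tau_data = disparate_impact([c for _, _, _, c in samples],
                            [x for x, _, _, _ in samples])

# Classifier that sees only the merit Y.
model_y = train_logreg([[y] for _, y, _, _ in train], train_c)
tau_y = disparate_impact([predict(model_y, [y]) for _, y, _, _ in test], test_x)

# Classifier that additionally sees the leaky feature Z.
model_yz = train_logreg([[y, z] for _, y, z, _ in train], train_c)
tau_yz = disparate_impact([predict(model_yz, [y, z]) for _, y, z, _ in test],
                          test_x)
```

Comparing `tau_y` with `tau_yz` shows how much of the disparate impact in the training labels the classifier reproduces once Z is available. Raising `leak_noise` weakens the link between Z and X.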
Discrimination by Algorithm
[Plot axes: γ (how much information about X leaks into the data) and τ (disparate impact on protected class)]
8 % luck / noise
6-8 % discrimination
87 % merit
Why give that information to the
algorithm?
𝒁
We don't! But it leaks through anyway...
𝑋
But can it be done?
Discrimination through information
leakage is possible, but how likely is it in
practice?
Let's try!
We use publicly available data to predict
the gender of Github users (protected
attribute X).
Basic Information
Manually classify users as men/women (by looking at
profile pictures, names) -> 5,000 training samples with
small error
Use the GitHub API to retrieve information about users
(followers, repositories, stargazers, contributions, ...)
We only use data that is easy to get and likely to be used in
a real-world setting for classification
We only use a limited dataset (proof of concept, not
Stargazers, Followers, Projects, ...
No predictive power for X
Github Event Data
https://www.githubarchive.org/
PushEvent, 2015-03-17 21:21h, 3 commits, Log: "..."
PullRequestEvent, 2015-03-17 22:43
CommentEvent, 2015-03-17 23:14h, "Hi, I think we should add more cats to the landing page"
Hourly event patterns & event types
Commit Message Analysis
Use the commit messages (as obtained from the event
data) to predict gender by training a Support Vector
Machine (SVM) classifier on the word frequency data.
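A word-frequency classifier of this kind can be sketched without external libraries. This is a minimal sketch, not the setup used in the talk: a multinomial naive Bayes model stands in for the SVM, the four messages are invented, and the labels 0/1 merely stand for the two values of the protected attribute X.

```python
import math
from collections import Counter


def tokenize(message):
    return message.lower().split()


def train_naive_bayes(messages, labels):
    """Multinomial naive Bayes on word counts (stand-in for the SVM)."""
    word_counts = {label: Counter() for label in set(labels)}
    class_counts = Counter(labels)
    for message, label in zip(messages, labels):
        word_counts[label].update(tokenize(message))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab


def predict(model, message):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)
        log_score = math.log(class_counts[label] / total)
        for word in tokenize(message):
            if word in vocab:  # ignore words never seen in training
                log_score += math.log((counts[word] + 1) / denom)
        if log_score > best_score:
            best_label, best_score = label, log_score
    return best_label


# Invented toy corpus; labels 0/1 stand for the protected attribute X.
messages = ["fix typo in readme", "fix build script",
            "add tests for parser", "add docs for parser"]
labels = [0, 0, 1, 1]
model = train_naive_bayes(messages, labels)

guess = predict(model, "fix typo")
```

Nothing in the feature set names X explicitly. The classifier only picks up which words co-occur with which label, which is exactly how information about X can leak.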
[Word cloud: lol, emoji, wtf, seriously, rtfm, dude, fuck, git]
Predictive Power of Model
[Chart labels: 15 %; 35 % error; 50 % baseline; fidelity]
30 % information leakage
(with a very simple data set)
Takeaways
Algorithms will readily "learn"
discrimination from us if we provide
them with contaminated training
data.
Information leakage of protected
attributes can happen easily.
How we can fix this
Harder than you might think! We need to know X to
measure disparate impact and remove it
Incorporate penalty for discrimination into target
function
Remove information about X from dataset by
performing a suitable transformation (reduces
fidelity of model)
see e.g. "Certifying and Removing Disparate Impact", M. Feldman et al.
Oh, it's you again! De-anonymizing
data
What is de-anonymization?
Use data recorded about individuals / entities
to identify those same individuals / entities in
another set of data (exactly or with high
likelihood).
Deanonymization becomes an increasing risk as datasets
about individual entities become larger and more detailed.
"Buckets of Truth"
N boolean attributes per entity - on average M < N of them
are set
$P_{col.} = P(M_1^1 = M_1^2, \cdots, M_N^1 = M_N^2)$
fun with deanonymization: http://en.akinato
Examples
uniform distribution: $P_{col.} = (1 - 2p(1 - p))^N$
long-tailed distribution: $P_{col.} = ?$
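For the uniform case, the collision probability follows from one step: two independent records agree on a single attribute if both have it set or both have it unset, i.e. with probability p² + (1 − p)² = 1 − 2p(1 − p). A minimal sketch, assuming all N attributes are independent and set with the same probability p (the function name is made up):

```python
def collision_probability(p, n):
    """P_col for N independent boolean attributes, each set with probability p.

    Two independent records collide if they agree on every attribute.
    """
    match_one = p * p + (1 - p) * (1 - p)  # equals 1 - 2 p (1 - p)
    return match_one ** n


p_col_10 = collision_probability(0.5, 10)  # 0.5 ** 10
p_col_50 = collision_probability(0.5, 50)
```

The probability shrinks exponentially with N, so even a modest number of attributes makes an accidental collision between two different entities very unlikely.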
Geolife Trajectories
http://research.microsoft.com/en-us/downloads/b16d359d-d164-469e-
Question:
How easy is it to re-identify single users through their data?
Could an algorithm build a representation of a given user?
Individual trajectories (color-coded)
http://research.microsoft.com/en-us/downloads/b16d359d-d164-469e-
How good are our buckets?
[Plot annotations: "𝑒−𝑥 𝑎 𝛾"; "here's the interesting information"]
Identifying / comparing fingerprints
$s(u_i, u_j) = \frac{f(u_i) \cdot f(u_j)}{\lVert f(u_i) \rVert \cdot \lVert f(u_j) \rVert}$
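A minimal sketch of this comparison, assuming the formula above is the cosine similarity between two fingerprint vectors f(u_i) and f(u_j), with one entry per bucket (the example vectors are invented):

```python
import math


def similarity(f_i, f_j):
    """Cosine similarity between two fingerprint vectors of equal length."""
    dot = sum(a * b for a, b in zip(f_i, f_j))
    norm_i = math.sqrt(sum(a * a for a in f_i))
    norm_j = math.sqrt(sum(b * b for b in f_j))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # an empty fingerprint matches nothing
    return dot / (norm_i * norm_j)


# Invented bucket counts for two users.
s_same = similarity([1, 0, 2], [1, 0, 2])
s_disjoint = similarity([1, 0], [0, 1])
```

Normalizing by the vector lengths makes the score independent of how much data was recorded for each user.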
Testing De-Anonymization
Use 75 % of the trajectories as prior data set
Predict the user ID belonging to the remaining
25 %
Measure average success probability and
identification rank (i.e. at which position is the
correct user)
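The evaluation described above can be sketched as follows. This is a minimal sketch with invented fingerprints for three hypothetical users; cosine similarity is assumed as the comparison measure, and the real experiment uses trajectory data instead:

```python
import math


def cosine(f, g):
    dot = sum(a * b for a, b in zip(f, g))
    norm = math.sqrt(sum(a * a for a in f)) * math.sqrt(sum(b * b for b in g))
    return dot / norm if norm else 0.0


def identification_rank(query, prior, true_user):
    """1-based position of true_user when prior users are sorted by similarity."""
    ranked = sorted(prior, key=lambda u: cosine(query, prior[u]), reverse=True)
    return ranked.index(true_user) + 1


# Fingerprints built from the prior data set (invented numbers).
prior = {"u1": [5, 0, 1, 0], "u2": [0, 4, 0, 2], "u3": [1, 1, 1, 1]}
# Fingerprints built from the held-out trajectories of the same users.
held_out = {"u1": [4, 0, 2, 0], "u2": [0, 3, 1, 1], "u3": [3, 0, 1, 0]}

ranks = {u: identification_rank(f, prior, u) for u, f in held_out.items()}
success_rate = sum(1 for r in ranks.values() if r == 1) / len(ranks)
mean_rank = sum(ranks.values()) / len(ranks)
```

In this toy data the held-out fingerprint of `u3` resembles `u1` more than its own prior fingerprint, so `u3` is found only at rank 2.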
Identification Rate
Finding Similar Users
Possible Improvements
Use Temporal / Sequence Information
Use speed of movement / mode of transportation
Improve choice of buckets for fingerprinting
Interesting Review Article: "Life in the network: the coming age of computational social science." D. Laze
Summary
The more data we have, the more difficult it is
to keep algorithms from directly learning and
using object identities instead of attributes.
Our data follows us around!
What can we do?
As Data Scientists / Analysts /
Programmers
Consume data responsibly: Don't include everything
under the sun just because it increases fidelity by a
slim margin
Check for disparate impact and remove it from the
input data
Test anonymization safety by using machine learning
As Citizens / Hackers / Users
Do not blindly trust decisions made by algorithms
Test them if possible (using different input values)
Reverse-engineer them (using e.g. active learning)
Fight back with data: Collect and analyze
algorithm-based decisions using collaborative
approaches
As a Society
Create better regulations for algorithms and their
use
Force companies / organizations to open up black
boxes
Make access to data easier, also for small
organizations
Algorithms are
like children:
Smart & eager to learn
So let's make sure
we raise them to
be responsible
adults.
Thanks!
Slides slideshare.net/japh44
Website andreas-dewes.de/en
Code (coming soon) github.com/adewes/32c3
E-Mail andreas@7scientists.com
Twitter @japh44
License Creative Commons Attribution 4.0
International
(except Google Deep Learning image)
Result
Intro
Whenever we measure user actions, we (automatically) gain
information about them that we can use to classify them.
Classifying and Controlling People
Case Study: Click Rate Optimization
Simple but common use case for big data: Collaborative
filtering
• Users have an opinion on a given topic A (between 0 and 1)
• They are more likely to like articles that confirm their
opinion
• Our algorithm knows nothing about A, just tries to
optimize click rate
• User opinion may change over time according to the
content he/she is exposed to (2 % change per exposure)
Mathematical Model
$P(Like) \propto A_{article} - A_{user} + \varepsilon_{mood}$
Like Rate vs. Articles Viewed
[Plot: only observe, don't optimize]
What have we learned?
60 observations / user
Clustering users into groups
Similarity measure: # Articles that both users like or dislike
Clustering: K-Means (minimize distance within clusters, maximize distance betw
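The similarity measure named above can be computed directly from like/dislike vectors. A minimal sketch; the encoding (1 = like, 0 = dislike, same article order for both users) and the example vectors are assumptions:

```python
def agreement(likes_a, likes_b):
    """Number of articles that both users like or both users dislike."""
    return sum(1 for a, b in zip(likes_a, likes_b) if a == b)


# Invented observations for three users over five articles.
user_1 = [1, 0, 1, 1, 0]
user_2 = [1, 0, 0, 1, 0]
user_3 = [0, 1, 0, 0, 1]

sim_12 = agreement(user_1, user_2)  # differ only on the third article
sim_13 = agreement(user_1, user_3)  # disagree on every article
```

A clustering step would then group users with high pairwise agreement.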
Like Rate vs. Articles Viewed
[Plot: with click-rate optimization]
Consequence of optimization: "Filter
Bubbles"
Switching On User Feedback
$A_{user}^{t+1} = A_{user}^{t} + \gamma \cdot \mathrm{sgn}(A_{user}^{t} - A_{article})$
User opinions with and without
feedback
the algorithm has an interest to steer opinions towards the
[Plots: no feedback vs. 2 % feedback]
Summary
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
 

Say "Hi!" to Your New Boss

  • 1. Say "Hi!" to Your New Boss: How algorithms might soon control our lives (and why we should be careful with them)
  • 3. Outline. Theory: 1. Algorithms 2. Machine Learning 3. Big Data & Consequences for Machine Learning 4. Use of Algorithms Today and in the Future. Experiments: 1. Discriminating people with machine learning & algorithms 2. Creating persistent user identities by (accidental) de-anonymization. Summary & Outlook: 1. Strategies for Handling Data Responsibly
  • 4. Algorithms, Machine Learning & Big Data
  • 5. Algorithms: An algorithm is a "recipe" that gives a computer (or a human) step-by-step instructions in order to achieve a certain goal. Flowchart example: Start → Doorbell ringing → Andreas stands on trapdoor? → yes: Open trapdoor; no: Wait. Our time will come.
  • 6. Machine Learning A machine learning algorithm automatically generates models and checks them against the training data we provide, trying to find a model that explains the data well and can predict unknown data.
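  • 7. Data vs. Model: $y = m(\mathbf{x}, \mathbf{p}) + \varepsilon$; see e.g. "Machine Learning" by Tom Mitchell (McGraw Hill, 1997). [Plot: y against x1]
  • 8. Data vs. Model: $y = m(\mathbf{x}, \mathbf{p}) + \varepsilon$; see e.g. "Machine Learning" by Tom Mitchell (McGraw Hill, 1997). [Plot: y against x1]
  • 9. Sources of Error: $\varepsilon = \varepsilon_{sys} + \varepsilon_{noise} + \varepsilon_{hidden}$. Systematic errors ($\varepsilon_{sys}$) arise due to imperfect measurements of known variables. Noise ($\varepsilon_{noise}$) is present due to the nature of the process or our measurement apparatus. Hidden variables ($\varepsilon_{hidden}$): many variables are usually unknown to us.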
  • 10. Big Data & Machine Learning, 2000 vs. 2015: more data sources, high data volume, higher density, higher frequency, longer retention
  • 11. Data Volume: More is (usually) better
  • 12. Data Volume: More is (usually) better
  • 13. Exploiting New Sources of Data: $y = m(x, p) + \varepsilon_{hidden} + \cdots$. Incorporate variables that were hidden into the model, reducing error.
  • 14. Understanding Results: Models can be easy or very difficult to interpret. Parameter space is often huge and can't be explored entirely. Example decision tree nodes (yes/no branches): age > 37? height < 1.78, projects > 19? Decision tree classifier (easy to interpret) vs. neural network classifier (hard to interpret).
  • 15. Example: Deep Learning for Image Recognition http://googleresearch.blogspot.com/2015/06/inceptionism-going-deeper-into-neural.html
  • 16. Classifying Use of Algorithms. Low risk: mildly annoying in case of failure / misbehaviour. Medium risk: large impact on our life in case of failure / misbehaviour. High risk: critical impact on our life in case of failure / misbehaviour.
  • 17. Low risk: personalization of services (e.g. recommendation engines for webs video-on-demand, content, ...); individualized ad targeting; customer rating / profiling; consumer demand prediction
  • 18. Medium risk: personalized health; person classification (e.g. crime, terrorism); autonomous cars / planes / machines ...; automated trading
  • 19. High risk: military intelligence / intervention; political oppression; critical infrastructure services (e.g. elect; life-changing decisions (e.g. about healt
  • 20. Big Data & Advances in Machine Learning
  • 22. Discriminating People With Algorithms Humans can be prejudiced. Are algorithms better?
  • 23. Discrimination Discrimination is treatment or consideration of, or making a distinction in favor of or against, a person or thing based on the group, class, or category to which that person or thing is perceived to belong to rather than on individual merit. Wikipedia Protected attributes (examples): Ethnicity, Gender, Sexual Orientation, ...
  • 24. When is a process discriminating? Disparate Impact: adverse impact of a process C on a given group X. Outcome table (columns X = 0, X = 1): row C = NO: P(C = NO, X = 0), P(C = NO, X = 1); row C = YES: P(C = YES, X = 0), P(C = YES, X = 1). Criterion: $\frac{P(C = YES \mid X = 0)}{P(C = YES \mid X = 1)} < \tau$; see e.g. "Certifying and Removing Disparate Impact", M. Feldman et al.
  • 25. When is a process discriminating? Estimating $\tau$ with real-world data. Outcome table (columns X = 0, X = 1): row C = NO: a, b; row C = YES: c, d. Criterion: $\frac{c / (a + c)}{d / (b + d)} < \tau$
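The estimate on slide 25 can be sketched in a few lines of Python. This is only an illustration: the function name, the example counts, and the threshold value 0.8 are assumptions, not values from the talk.

```python
def disparate_impact(a, b, c, d):
    """Ratio of YES rates between group X = 0 and group X = 1.

    a, b: number of C = NO outcomes for X = 0 and X = 1
    c, d: number of C = YES outcomes for X = 0 and X = 1
    """
    if a + c == 0 or b + d == 0:
        raise ValueError("each group needs at least one observation")
    rate_x0 = c / (a + c)  # estimate of P(C = YES | X = 0)
    rate_x1 = d / (b + d)  # estimate of P(C = YES | X = 1)
    if rate_x1 == 0:
        raise ValueError("YES rate of group X = 1 is zero; ratio undefined")
    return rate_x0 / rate_x1


# Hypothetical counts, chosen only for illustration.
ratio = disparate_impact(a=80, b=60, c=20, d=40)
TAU = 0.8  # assumed threshold
print(ratio, ratio < TAU)
```

With these made-up counts, group X = 0 receives a YES at half the rate of group X = 1, so the ratio falls below the assumed threshold.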
  • 26. Discrimination through Data Analysis Replacing a manual hiring process with an automated one. Benefits: Save time screening CVs by hand Improve candidate choice
  • 28. The Setup Use submitted information (CV, work samples) along with publicly available / external information to predict candidate success. Use data from the manual process (invite/ no invite) to train the classifier Provide it with as much data as possible to
  • 29. Our decision model: $S = m(Y) + d(X) + \varepsilon$, where $m(Y)$ is the score of the candidate (merit function), $d(X)$ is the discrimination malus/bonus, and $\varepsilon$ covers hidden variables & luck (if you believe in it). Decision: $C = YES$ if $S > t$, $NO$ otherwise. [Figure: luck vs. candidate merit, without and with discrimination]
  • 30. Training a predictor for C: $C(Y, Z)$, where $Y$ is information about Y (unprotected attributes) and $Z$ is additional information we give to the algorithm, with $Z \propto X + \varepsilon_{\gamma}$: we can predict the value of X from Z with fidelity $\gamma$.
  • 31. A Simulation • Generate 10,000 samples of C with disparate impact $\tau$ • Train a classifier (e.g. Support Vector Machine) on the training data • Provide it with (noisy) information about X • Measure the algorithm-based $\tau$ on the test data
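The data-generation and measurement steps of this simulation can be sketched as follows. It draws decisions from the model S = m(Y) + d(X) + ε of slide 29 and measures the resulting ratio of YES rates. The Gaussian distributions, the size of the malus, and all names are assumptions for illustration; the classifier training step is deliberately left out.

```python
import random


def simulate(n=10_000, malus=0.5, threshold=0.0, seed=0):
    """Draw n decisions C from S = m(Y) + d(X) + eps and count outcomes."""
    rng = random.Random(seed)
    counts = {(0, False): 0, (0, True): 0, (1, False): 0, (1, True): 0}
    for _ in range(n):
        x = rng.randint(0, 1)            # protected attribute X
        merit = rng.gauss(0.0, 1.0)      # m(Y), assumed standard normal
        luck = rng.gauss(0.0, 0.3)       # eps, assumed small Gaussian noise
        d = -malus if x == 0 else 0.0    # d(X): malus applied to group X = 0
        s = merit + d + luck
        counts[(x, s > threshold)] += 1  # C = YES if S > t
    return counts


def measured_tau(counts):
    """Ratio of YES rates: group X = 0 over group X = 1."""
    rate0 = counts[(0, True)] / (counts[(0, True)] + counts[(0, False)])
    rate1 = counts[(1, True)] / (counts[(1, True)] + counts[(1, False)])
    return rate0 / rate1


print(measured_tau(simulate(malus=0.5)))  # noticeably below 1
print(measured_tau(simulate(malus=0.0)))  # close to 1
```

A classifier trained on such samples, with access to a noisy proxy of X, could then be evaluated with the same `measured_tau` function on held-out data.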
  • 33. Discrimination by Algorithm: $\gamma$ (how much information about X leaks into the data)
  • 34. Discrimination by Algorithm: $\tau$ (disparate impact on protected class)
  • 35. Discrimination by Algorithm 8 % luck / noise 6-8 % discrimination 87 % merit
  • 38. Why give that information to the algorithm? We don't! But it leaks through anyway... (from $X$ into $Z$)
  • 39. But can it be done? Discrimination through information leakage is possible, but how likely is it in practice? Let's try! We use publicly available data to predict the gender of Github users (protected attribute X).
  • 40. Basic Information: Manually classify users as men/women (by looking at profile pictures, names) -> 5,000 training samples with small error. Use the Github API to retrieve information about users (followers, repositories, stargazers, contributions, ...). We only use data that is easy to get and likely to be used in a real-world setting for classification. We only use a limited dataset (proof of concept, not
  • 41. Stargazers, Followers, Projects, ... No predictive power for X
  • 42. Github Event Data https://www.githubarchive.org/ PushEvent 2015-03-17 21:21h 3 commits Log : "..." PullRequestEvent 2015-03-17 22:43 CommentEvent 2015-03-17 23:14h "Hi, I think we should add more cats to the landing page"
  • 43. Hourly event patterns & event types
  • 44. Commit Message Analysis: Use the commit messages (as obtained from the event data) to predict gender by training a Support Vector Machine (SVM) classifier on the word frequency data. Example words: lol, emoji, wtf, seriously, rtfm, dude, fuck, git
  • 45. Predictive Power of Model 15 % 35 % error50 % baseline fidelity 30 % information leakage (with a very simple data set)
  • 46. Takeaways Algorithms will readily "learn" discrimination from us if we provide them with contaminated training data. Information leakage of protected attributes can happen easily.
  • 47. How we can fix this: Harder than you might think! We need to know X to measure disparate impact and remove it. Incorporate a penalty for discrimination into the target function. Remove information about X from the dataset by performing a suitable transformation (reduces fidelity of the model); see e.g. "Certifying and Removing Disparate Impact", M. Feldman et al.
  • 48. Oh, it's you again! De-anonymizing data
  • 49. What is de-anonymization? Use data recorded about individuals / entities to identify those same individuals / entities in another set of data (exactly or with high likelihood). De-anonymization becomes an increasing risk as datasets about individual entities become larger and more detailed.
  • 50. "Buckets of Truth": N boolean attributes per entity, on average M < N of them are set. Collision probability: $P_{col.} = P(M_1^1 = M_1^2, \cdots, M_N^1 = M_N^2)$. Fun with de-anonymization: http://en.akinato
  • 51. Examples. Uniform distribution: $P_{col.} = \left(1 - 2p(1 - p)\right)^N$. Long-tailed distribution: $P_{col.} = ?$
  • 52. Geolife Trajectories http://research.microsoft.com/en-us/downloads/b16d359d-d164-469e- Question: How easy is it to re-identify single users through their data? Could an algorithm build a representation of a given user?
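A quick numerical check of the uniform case: two entities agree on a single boolean attribute with probability p² + (1 − p)² = 1 − 2p(1 − p), and on all N attributes with that value raised to the power N. The sketch below assumes the N attributes are set independently with the same probability p, which is my reading of the "uniform distribution" example; the function name is made up.

```python
def collision_probability(p, n):
    """Probability that two entities agree on all n boolean attributes.

    Assumes each attribute is set independently with probability p.
    """
    agree_once = 1 - 2 * p * (1 - p)  # both set or both unset
    return agree_once ** n


# With p = 0.5, each attribute halves the chance of a collision.
for n in (10, 20, 40):
    print(n, collision_probability(0.5, n))
```

Even a few dozen evenly distributed attributes make a collision between two entities very unlikely, which is what makes such attribute sets usable as fingerprints.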
  • 59. How good are our buckets? 𝑒−𝑥 𝑎 𝛾 here's the interesting information
  • 60. Identifying / comparing fingerprints: $s(u_i, u_j) = \frac{f(u_i) \cdot f(u_j)}{\lVert f(u_i) \rVert \cdot \lVert f(u_j) \rVert}$
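A minimal sketch of such a fingerprint comparison, assuming the formula on this slide is the usual cosine similarity (the norm bars are not preserved in the transcript) and that a fingerprint f(u) is a list of bucket counts. Both are assumptions, as are the example values.

```python
import math


def cosine_similarity(f_i, f_j):
    """Cosine similarity between two fingerprint vectors of equal length."""
    dot = sum(a * b for a, b in zip(f_i, f_j))
    norm_i = math.sqrt(sum(a * a for a in f_i))
    norm_j = math.sqrt(sum(b * b for b in f_j))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # convention chosen here for empty fingerprints
    return dot / (norm_i * norm_j)


# Hypothetical bucket counts for two users.
print(cosine_similarity([3, 0, 1], [6, 0, 2]))  # same direction
print(cosine_similarity([1, 0], [0, 1]))        # no shared buckets
```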
  • 61. Testing De-Anonymization Use 75 % of the trajectories as prior data set Predict the user ID belonging to the remaining 25 % Measure average success probability and identification rank (i.e. at which position is the correct user)
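The identification rank mentioned here can be sketched as follows: compare the fingerprint of a held-out trajectory with the fingerprints of all known users and report the position of the correct user in the sorted list. The user names, the fingerprint values, and the use of cosine similarity as the score are illustrative assumptions.

```python
import math


def cosine(f_i, f_j):
    """Cosine similarity between two fingerprint vectors."""
    dot = sum(a * b for a, b in zip(f_i, f_j))
    norms = math.sqrt(sum(a * a for a in f_i)) * math.sqrt(sum(b * b for b in f_j))
    return dot / norms if norms else 0.0


def identification_rank(query, known, true_user):
    """1-based position of true_user when known users are sorted by similarity."""
    ranked = sorted(known, key=lambda u: cosine(query, known[u]), reverse=True)
    return ranked.index(true_user) + 1


# Hypothetical fingerprints built from the 75 % "prior" part of the data.
known = {"u1": [5, 0, 1], "u2": [0, 4, 2], "u3": [1, 1, 1]}
# Fingerprint from a held-out trajectory that actually belongs to u1.
query = [4, 0, 1]
print(identification_rank(query, known, "u1"))
```

Averaging this rank over all held-out trajectories, together with the fraction of cases where the rank is 1, would give the two measures named on the slide.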
  • 64. Possible Improvements Use Temporal / Sequence Information Use speed of movement / mode of transportation Improve choice of buckets for fingerprinting Interesting Review Article: "Life in the network: the coming age of computational social science." D. Laze
  • 65. Summary The more data we have, the more difficult it is to keep algorithms from directly learning and using object identities instead of attributes. Our data follows us around!
  • 66. What can we do?
  • 67. As Data Scientists / Analysts / Programmers Consume data responsibly: Don't include everything under the sun just because it increases fidelity by a slim margin Check for disparate impact and remove it from the input data Test anonymization safety by using machine learning
  • 68. As Citizens / Hackers / Users Do not blindly trust decisions made by algorithms Test them if possible (using different input values) Reverse-engineer them (using e.g. active learning) Fight back with data: Collect and analyze algorithm-based decisions using collaborative approaches
  • 69. As a Society: Create better regulations for algorithms and their use. Force companies / organizations to open up black boxes. Make access to data easier, also for small organizations.
  • 70. Algorithms are like children: Smart & eager to learn So let's make sure we raise them to be responsible adults.
  • 71. Thanks! Slides slideshare.net/japh44 Website andreas-dewes.de/en Code (coming soon) github.com/adewes/32c3 E-Mail andreas@7scientists.com Twitter @japh44 License Creative Commons Attribution 4.0 International (except Google Deep Learning image)
  • 73. Intro Whenever we measure user actions, we (automatically) gain information about them that we can use to classify them.
  • 76. Case Study: Click Rate Optimization. Simple but common use case for big data: Collaborative filtering • Users have an opinion on a given topic A (between 0 and 1) • They are more likely to like articles that confirm their opinion • Our algorithm knows nothing about A, just tries to optimize click rate • User opinion may change over time according to the content he/she is exposed to (2 % change per exposure)
  • 77. Mathematical Model: $P(\mathit{Like}) \propto A_{article} - A_{user} + \varepsilon_{mood}$
  • 78. Like Rate vs. Articles Viewed
  • 79. Like Rate vs. Articles Viewed only observe, don't optimize
  • 80. What have we learned? 60 observations / user
  • 81. Clustering users into groups Similarity measure: # Articles that both users like or dislike Clustering: K-Means (minimize distance within clusters, maximize distance betw
  • 82. Like Rate vs. Articles Viewed with click-rate optimization
  • 83. Consequence of optimization: "Filter Bubbles"
  • 84. Switching On User Feedback: $A_{user}^{t+1} = A_{user}^{t} + \gamma \cdot \mathrm{sgn}\left(A_{user}^{t} - A_{article}\right)$
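The update rule on this slide can be written out as a small function. The sketch follows the rule as it appears in the transcript, including its sign convention, with γ = 0.02 taken from the "2 % change per exposure" mentioned earlier. Clamping the opinion to the range 0 to 1 is an added assumption, and the function name is made up.

```python
def update_opinion(a_user, a_article, gamma=0.02):
    """One feedback step: A_user(t+1) = A_user(t) + gamma * sgn(A_user(t) - A_article)."""
    diff = a_user - a_article
    sign = (diff > 0) - (diff < 0)      # sgn: +1, 0 or -1
    updated = a_user + gamma * sign
    return min(1.0, max(0.0, updated))  # assumption: opinions stay within [0, 1]


# Repeated exposure to the same article moves the opinion step by step.
opinion = 0.6
for _ in range(5):
    opinion = update_opinion(opinion, a_article=0.4)
print(opinion)
```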
  • 85. User opinions with and without feedback the algorithm has an interest to steer opinions towards the no feedback 2 % feedback