Perceived versus Actual Predictability of Personal Information in Social Networks

Eleftherios (Lefteris) Spyromitros-Xioufis¹, Georgios Petkos¹,
Symeon Papadopoulos¹, Rob Heyman², Yiannis Kompatsiaris¹
¹ Center for Research and Technology Hellas – Information Technologies Institute (CERTH-ITI)
² iMinds-SMIT, Vrije Universiteit Brussel, Brussels, Belgium
INSCI 2016, Sep 12-14, 2016, Florence, Italy

Disclosure of Personal Information in OSNs
 Online Social Networks (OSNs) have had a transformative impact!
• People use them for communication, as a news source, to do business, …
 However, participation in OSNs comes at a price!
• User-related data is shared with:
• a) other OSN users, b) the OSN itself, c) third parties (e.g. ad networks)
• Disclosure of specific types of data:
• e.g. gender, age, ethnicity, political or religious beliefs, sexual
preferences, financial status, etc.
• Has implications:
• e.g. unjustified discrimination in personnel selection / loan approval
• Information need not be explicitly disclosed!
• Several types of personal information can be accurately inferred based
on implicit cues (e.g. Facebook likes) using machine learning!

Inferring Personal Information
 Supervised learning algorithms
• Learn a mapping (model) from inputs $\mathbf{x}_i$ to outputs $y_i$ by analyzing a
set of training examples $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$
• In this case
• $y_i$ corresponds to a personal user attribute, e.g. sexual orientation
• $\mathbf{x}_i$ corresponds to a set of predictive attributes or features, e.g. user likes
• Using this mapping, inferences can be made for new users!
 Some previous results
• Kosinski et al. [1]: likes features (SVD) + logistic regression
• Highly accurate inferences of ethnicity, gender, sexual orientation, etc.
• Schwartz et al. [2]: status update features (PCA) + linear SVM
• Highly accurate inference of gender
[1] Kosinski, et al. Private traits and attributes are predictable from digital records of human
behavior. Proceedings of the National Academy of Sciences, 2013.
[2] Schwartz, et al. Personality, gender, and age in the language of social media: The open-
vocabulary approach. PloS one, 2013.
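The supervised learning setup described on this slide can be illustrated with a minimal sketch. It is not the SVD + logistic regression pipeline of [1]: it is plain L2-regularized logistic regression trained by batch gradient descent, and the four-page "likes" vectors, labels, learning rate, and function names are all made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=500, l2=0.01):
    """Fit L2-regularized logistic regression with batch gradient descent."""
    n = len(X)
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this training example.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(model, x):
    """Probability that the attribute is present for feature vector x."""
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Hypothetical training set D = {(x_i, y_i)}: each x_i is a binary "likes"
# vector over four invented pages, each y_i a binary user attribute.
X_train = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 1, 1, 1],
]
y_train = [1, 1, 1, 0, 0, 0]

model = fit_logistic(X_train, y_train)

# Inference for a new user who never disclosed the attribute.
new_user = [1, 0, 0, 1]
p_new = predict_proba(model, new_user)
```

In this toy data the first page is liked only by users with the attribute, so a new user who likes it receives a probability above 0.5 even though the attribute itself was never disclosed.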

Inferred Information & Privacy in OSNs
 The study of user awareness of inferred information has been
largely neglected by social research on OSN privacy
 Privacy usually presented as a question of giving access or
communicating personal information to a particular party
• E.g. Westin’s [1] definition of privacy:
“The claim of individuals, groups, or institutions to determine for themselves
when, how, and to what extent information about them is communicated to others.”
 However, access control is non-existent for inferred information:
a) Users are unaware of the inferences being made
b) Users have no control over the logic behind them
 Aim of our work:
• Investigate if and how users intuitively grasp what can be inferred
from their disclosed data!
[1] Alan Westin. Privacy and Freedom. Bodley Head, London, 1970.

Main Research Questions
 Our study attempts to answer the following questions:
1. Predictability
• How predictable are different types of personal information, based on
users’ OSN data?
2. Actual vs perceived predictability
• How realistic are user perceptions about predictability of their personal
information?
3. Predictability vs sensitivity
• What is the relationship between perceived sensitivity and predictability
of personal information?
 Previous work has focused mainly on Q1
 We address Q1 using a variety of data and methods and
additionally we address Q2 and Q3

What data is needed for this study?
 We collected 3 types of data about 170 Facebook users:
1. OSN data: likes, posts, images
• Collected through a test Facebook application (Databait¹, developed
within the USEMP² FP7 project)
2. Answers to questions about 96 personal attributes, organized³ into
9 categories (disclosure dimensions)
• E.g. health factors, sexual orientation, income, political attitude, etc.
3. Answers to questions related to their perceptions about
predictability and sensitivity of the 9 disclosure dimensions
 What is the purpose of each data type?
• 1 & 2 allow assessing actual predictability of personal information
• Training sets for supervised learning algorithms
• 3 facilitates a comparison between actual predictability and perceived
predictability/sensitivity of personal information
¹ https://databait.hwcomms.com
² http://www.usemp-project.eu/
³ http://usemp-mklab.iti.gr/usemp/prepilot_survey_data_statistics.pdf

Example from the questionnaire
 What is your sexual orientation?
• Ground truth!
Response        No. of participants
heterosexual    147
homosexual      14
bisexual        7
n/a             2
 Do you think the information on your Facebook
profile reveals your sexual orientation? Either
because you yourself have put it online, or it could
be inferred from a combination of posts.
• Measures perceived predictability
Response        No. of participants
yes             134
no              33
n/a             3
 How sensitive do you find the information you had
to reveal about your sexual orientation in the
previous section? (1 = not sensitive at all, 7 = very
sensitive)
• Measures perceived sensitivity

Predictive Attributes Extracted from OSN Data
 likes: binary vector denoting presence/absence of like (#3.6K)
 likesCats: histogram of like category frequencies (#191)
 likesTerms: Bag-of-Words (BoW) of terms in description, title
and about sections of likes (#62.5K)
 msgTerms: BoW vector of terms in user posts (#25K)
 lda-t: Distribution of topics in the textual contents of both
likes (description, title and about section) and posts
• Latent Dirichlet Allocation with t=20,30,50,100
 visual: concepts depicted in user images (#11.9K)
• Detected using a CNN, top 12 concepts per image, 3 variants
• visual-bin: hard 0/1 encoding
• visual-freq: concept frequency histogram
• visual-conf: sum of detection scores across all images
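The likes and visual encodings listed above can be sketched as follows. This is a minimal illustration under assumptions: the function names, the page and concept vocabularies, and the detection scores are invented, visual-freq is taken to be a raw (unnormalized) count, and each image is assumed to arrive already truncated to its top concepts.

```python
from collections import Counter

def likes_vector(user_likes, vocabulary):
    """Binary presence/absence vector over a fixed vocabulary of liked pages."""
    liked = set(user_likes)
    return [1 if page in liked else 0 for page in vocabulary]

def visual_features(image_detections, concepts):
    """Return (visual-bin, visual-freq, visual-conf) vectors for one user.

    image_detections holds one dict per image, mapping concept -> score.
    """
    freq = Counter()   # number of images in which each concept was detected
    conf = Counter()   # sum of detection scores across all images
    for detections in image_detections:
        for concept, score in detections.items():
            freq[concept] += 1
            conf[concept] += score
    visual_bin = [1 if freq[c] > 0 else 0 for c in concepts]
    visual_freq = [freq[c] for c in concepts]
    visual_conf = [conf[c] for c in concepts]
    return visual_bin, visual_freq, visual_conf

# Hypothetical vocabularies and one hypothetical user.
pages = ["page_a", "page_b", "page_c"]
concepts = ["beach", "dog", "car"]
user_images = [{"beach": 0.5, "dog": 0.25}, {"beach": 0.25}]

likes = likes_vector(["page_c", "page_a"], pages)
v_bin, v_freq, v_conf = visual_features(user_images, concepts)
```

For this user the three visual variants differ only in how much weight repeated or confident detections receive: "beach" counts once in visual-bin, twice in visual-freq, and 0.75 in visual-conf.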

Experimental Setup
 Evaluation method: repeated random sub-sampling
• Data split randomly 𝑛 = 10 times into train (67%) / test (33%)
• Model fit on train / accuracy of inferences assessed on test
• 96 questions (user attributes) were considered
 Evaluation measure: area under ROC curve (AUC)
• Appropriate for imbalanced classes
 Classification algorithms
• Baseline: 𝑘-nearest neighbors, decision tree, Naïve Bayes
• State of the art (SoA): AdaBoost, random forest, regularized logistic regression
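The evaluation protocol above can be sketched in a few lines. This is an illustration under assumptions, not the authors' code: the function names and the seed are invented, the splits are not stratified, and the class sizes in the example are borrowed from the sexual orientation responses (147 versus 14 + 7, merged purely for illustration).

```python
import random

def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def random_splits(n_items, n_repeats=10, train_fraction=0.67, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated random sub-sampling."""
    rng = random.Random(seed)
    n_train = round(n_items * train_fraction)
    for _ in range(n_repeats):
        idx = list(range(n_items))
        rng.shuffle(idx)
        yield idx[:n_train], idx[n_train:]

# Why AUC suits imbalanced classes: a model that gives every user the same
# score looks accurate but has no ranking ability.
labels = [0] * 147 + [1] * 21
constant_scores = [0.0] * len(labels)
majority_accuracy = labels.count(0) / len(labels)
constant_auc = auc(labels, constant_scores)

small_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
splits = list(random_splits(170))
```

Always predicting the majority class is right for 87.5% of these users, yet its AUC is 0.5, the same as random guessing. In a full run, a model would be fit on each train split and its AUC on the matching test split averaged over the 10 repetitions.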

Results 1: Evaluating Classifiers
[Figure: AUC (y-axis, 0.45 to 0.75) of six classifiers (tree, nb, knn, adaboost, rf, logistic) on eight attributes: bmiclass, healthstatus, smoking behavior, drinking behavior, income, cannabis, employment, sexual orientation]

Results 2: Evaluating Features
[Figure: AUC (y-axis, 0.50 to 0.58) of rf and logistic for each feature type: LDA-20, LDA-30, LDA-50, LDA-100, likesCats, msgTerms, likesTerms, likes, visual-bin, visual-conf, visual-freq]

Results 3: Combining Features
[Figure: AUC (axis, 0.53 to 0.58) for six single feature types (visual-conf, likesCats, msgTerms, likesTerms, LDA-30, likes) and their pairwise combinations: visual-conf/likesCats, likesCats/likes, visual-conf/msgTerms, likesTerms/likesCats, msgTerms/likesTerms, msgTerms/likesCats, visual-conf/likes, visual-conf/likesTerms, LDA-30/msgTerms, msgTerms/likes, likesTerms/likes, LDA-30/likesTerms, LDA-30/likesCats, visual-conf/LDA-30, LDA-30/likes. The extracted legend also lists "nolatefusion".]
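The slide does not spell out how feature types are combined; the "nolatefusion" label suggests late fusion was involved. One common form of late fusion, shown here purely as an assumption, averages the scores of classifiers trained separately on each feature type. The function name and the score values are hypothetical.

```python
def late_fusion(score_lists):
    """Average per-user scores from classifiers trained on separate feature types."""
    n_models = len(score_lists)
    n_users = len(score_lists[0])
    if any(len(scores) != n_users for scores in score_lists):
        raise ValueError("all models must score the same users")
    return [sum(scores[i] for scores in score_lists) / n_models
            for i in range(n_users)]

# Hypothetical scores for three test users from two single-feature models.
likes_scores = [0.75, 0.25, 0.5]
msg_terms_scores = [0.25, 0.25, 1.0]

fused = late_fusion([likes_scores, msg_terms_scores])
```

The fused scores can then be evaluated with AUC exactly like the scores of a single model, which makes a combination such as msgTerms/likes directly comparable with its two components.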

Results 4: Best Performance per Attribute
[Figure: best AUC (y-axis, 0.50 to 1.00) per user attribute (x-axis), with attributes grouped by disclosure dimension. The attributes shown include demographic items (e.g. degree, gender, language, nationality, residence), income and employment, living situation and relationship status, religious stance and practice, personality items (e.g. is-talkative, worries-a-lot), sexualOrientation, politicalideology, health items (e.g. bmiclass, smokingbehavior, cannabis) and leisure activities (e.g. Running, Cooking, Playing-soccer).]
Disclosure dimensions, numbered as in the figure:
1 demographics
2 employment/income
3 relationship/living
4 religion
5 personality
6 sexual orientation
7 political ideology
8 health factors
10 consumer profile

Ranking of Dimensions
Rank  Perceived predictability                  Actual predictability  Rank shift vs perceived  Actual predictability according to [1]
1     Demographics                              Demographics           -                        Demographics
2     Relationship status and living condition  Political views        +3                       Political views
3     Sexual orientation                        Sexual orientation     -                        Religious views
4     Consumer profile                          Employment/Income      +4                       Sexual orientation
5     Political views                           Consumer profile       -1                       Health status
6     Personality traits                        Relationship status    -4                       Relationship status
7     Religious views                           Religious views        -
8     Employment/Income                         Health status          +1
9     Health status                             Personality traits     -3
[1] Kosinski, et al. Private traits and attributes are predictable from digital records of human
behavior. Proceedings of the National Academy of Sciences, 2013.
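The signed numbers next to the actual ranking are consistent with "perceived rank minus actual rank" for each dimension. The sketch below recomputes them from the two rankings on this slide; the function and variable names are illustrative, and "Relationship status" is used as shorthand for "Relationship status and living condition".

```python
perceived = [
    "Demographics", "Relationship status", "Sexual orientation",
    "Consumer profile", "Political views", "Personality traits",
    "Religious views", "Employment/Income", "Health status",
]
actual = [
    "Demographics", "Political views", "Sexual orientation",
    "Employment/Income", "Consumer profile", "Relationship status",
    "Religious views", "Health status", "Personality traits",
]

def rank_shifts(perceived, actual):
    """Perceived rank minus actual rank for every dimension.

    A positive value means the dimension is more predictable than users think.
    """
    perceived_rank = {dim: r for r, dim in enumerate(perceived, start=1)}
    return {dim: perceived_rank[dim] - r for r, dim in enumerate(actual, start=1)}

shifts = rank_shifts(perceived, actual)

# Dimensions whose predictability is underestimated, largest gap first.
underestimated = sorted(
    (dim for dim, shift in shifts.items() if shift > 0),
    key=lambda dim: -shifts[dim],
)
```

Employment/Income (+4), Political views (+3) and Health status (+1) come out as the dimensions that users rank lower than their actual predictability.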

Perceived/Actual Predictability vs Sensitivity

Conclusions & Future Work
 Conclusions
• Users hold both correct and incorrect perceptions about predictability
• Predictability of sensitive information is underestimated
• Sophisticated privacy assistance tools are needed
• Support users in managing disclosure of personal information
 Databait: a privacy assistance tool (still in beta mode)

Thank you!
 Resources
• Code/models: https://github.com/MKLab-ITI/usemp-pscore
• Databait: https://databait.hwcomms.com
 Contact us
http://www.usemp-project.eu/
@espyromi espyromi@iti.gr
@sympap papadop@iti.gr
@kompats ikom@iti.gr
