Perceived versus Actual Predictability of Personal Information in Social Networks
1. Perceived versus Actual Predictability of Personal Information in Social Networks
Eleftherios (Lefteris) Spyromitros-Xioufis¹, Georgios Petkos¹,
Symeon Papadopoulos¹, Rob Heyman², Yiannis Kompatsiaris¹
¹ Center for Research and Technology Hellas – Information Technologies Institute (CERTH-ITI)
² iMinds-SMIT, Vrije Universiteit Brussel, Brussels, Belgium
INSCI 2016, Sep 12-14, 2016, Florence, Italy
2. Disclosure of Personal Information in OSNs
Online Social Networks (OSNs) have had a transformative impact!
• People use them for communication, as a news source, to do business, …
However, participation in OSNs comes at a price!
• User-related data is shared with:
• a) other OSN users, b) the OSN itself, c) third parties (e.g. ad networks)
• Disclosure of specific types of data:
• e.g. gender, age, ethnicity, political or religious beliefs, sexual
preferences, financial status, etc.
• Has implications:
• e.g. unjustified discrimination in personnel selection / loan approval
• Information need not be explicitly disclosed!
• Several types of personal information can be accurately inferred based
on implicit cues (e.g. Facebook likes) using machine learning!
3. Inferring Personal Information
Supervised learning algorithms
• Learn a mapping (model) from inputs $\mathbf{x}_i$ to outputs $y_i$ by analyzing a
set of training examples $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$
• In this case
• $y_i$ corresponds to a personal user attribute, e.g. sexual orientation
• $\mathbf{x}_i$ corresponds to a set of predictive attributes or features, e.g. user likes
• Using this mapping, inferences can be made for new users!
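The supervised-learning idea above can be sketched in a few lines. This is only an illustrative sketch, not the authors' pipeline: the four "pages", the six toy users and the hyperparameters (`lr`, `epochs`, `l2`) are all invented for the example, and a plain gradient-descent logistic regression stands in for whatever implementation was actually used.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=500, l2=0.01):
    """Fit an L2-regularized logistic regression by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x_i, y_i in zip(X, y):
            # Prediction error for this training example
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x_i)) + b) - y_i
            for j, xj in enumerate(x_i):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * (gw / n + l2 * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(model, x):
    """Estimated probability that the attribute y equals 1 for feature vector x."""
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Toy training set D = {(x_i, y_i)}: each x_i is a binary like vector over
# four hypothetical pages, each y_i a hypothetical binary user attribute.
X_train = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 1, 1, 1],
]
y_train = [1, 1, 1, 0, 0, 0]

model = fit_logistic(X_train, y_train)
new_user = [1, 0, 0, 1]          # likes of a user not seen during training
p = predict_proba(model, new_user)
print(f"Inferred probability of attribute = 1: {p:.2f}")
```

The point of the sketch is the last three lines: once the mapping is learned, an attribute the new user never disclosed is inferred from likes alone.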
Some previous results
• Kosinski et al. [1]: likes features (SVD) + logistic regression
• Highly accurate inferences of ethnicity, gender, sexual orientation, etc.
• Schwartz et al. [2]: status updates (PCA) + linear SVM
• Highly accurate inference of gender
[1] Kosinski, et al. Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 2013.
[2] Schwartz, et al. Personality, gender, and age in the language of social media: The open-vocabulary approach. PLoS ONE, 2013.
4. Inferred Information & Privacy in OSNs
Study of user awareness with regard to inferred information
largely neglected by social research on OSN privacy
Privacy usually presented as a question of giving access or
communicating personal information to a particular party
• E.g. Westin’s [1] definition of privacy:
“The claim of individuals, groups, or institutions to determine for themselves
when, how, and to what extent information about them is communicated to others.”
However, access control is non-existent for inferred information:
a) Users are unaware of the inferences being made
b) Users have no control over the logic behind those inferences
Aim of our work:
• Investigate if and how users intuitively grasp what can be inferred
from their disclosed data!
[1] Alan Westin. Privacy and Freedom. Bodley Head, London, 1970.
5. Main Research Questions
Our study attempts to answer the following questions:
1. Predictability
• How predictable are different types of personal information, based on
users’ OSN data?
2. Actual vs perceived predictability
• How realistic are user perceptions about predictability of their personal
information?
3. Predictability vs sensitivity
• What is the relationship between perceived sensitivity and predictability
of personal information?
Previous work has focused mainly on Q1
We address Q1 using a variety of data and methods and
additionally we address Q2 and Q3
6. What data is needed for this study?
We collected 3 types of data about 170 Facebook users:
1. OSN data: likes, posts, images
• Collected through a test Facebook application (Databait¹ developed
within the USEMP² FP7 project)
2. Answers to questions about 96 personal attributes, organized³ into
9 categories (disclosure dimensions)
• E.g. health factors, sexual orientation, income, political attitude, etc.
3. Answers to questions related to their perceptions about
predictability and sensitivity of the 9 disclosure dimensions
What is the purpose of each data type?
• 1 & 2 allow assessing the actual predictability of personal information
• Training sets for supervised learning algorithms
• 3 facilitates a comparison between actual predictability and perceived
predictability/sensitivity of personal information
¹ https://databait.hwcomms.com
² http://www.usemp-project.eu/
³ http://usemp-mklab.iti.gr/usemp/prepilot_survey_data_statistics.pdf
7. Example from the questionnaire
What is your sexual orientation?
• Ground truth!
Do you think the information on your Facebook
profile reveals your sexual orientation? Either
because you yourself have put it online, or it could
be inferred from a combination of posts.
• Measures perceived predictability
How sensitive do you find the information you had
to reveal about your sexual orientation in the
previous section? (1=not sensitive at all, 7= very
sensitive)
• Measures perceived sensitivity
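Responses to the sexual orientation question:
Response | No. of participants
heterosexual | 147
homosexual | 14
bisexual | 7
n/a | 2
Responses to the perceived predictability question:
Response | No. of participants
yes | 134
no | 33
n/a | 3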
8. Predictive Attributes Extracted from OSN Data
likes: binary vector denoting presence/absence of like (#3.6K)
likesCats: histogram of like category frequencies (#191)
likesTerms: Bag-of-Words (BoW) of terms in description, title
and about sections of likes (#62.5K)
msgTerms: BoW vector of terms in user posts (#25K)
lda-t: Distribution of topics in the textual contents of both
likes (description, title and about section) and posts
• Latent Dirichlet Allocation with t=20,30,50,100
visual: concepts depicted in user images (#11.9K)
• Detected using a CNN, top 12 concepts per image, 3 variants
• visual-bin: hard 0/1 encoding
• visual-freq: concept frequency histogram
• visual-conf: sum of detection scores across all images
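As a rough illustration of how three of these feature types can be derived, the sketch below builds a `likes` vector, a `likesCats` histogram and a `msgTerms` bag-of-words vector for one user. The page ids, categories, post texts and helper names are hypothetical, and whitespace tokenization is a simplifying assumption; the slides do not say how the text was preprocessed.

```python
from collections import Counter

# Hypothetical toy data for a single user (not from the study).
user_likes = ["page_a", "page_c"]
like_category = {
    "page_a": "Musician/Band",
    "page_b": "Sports",
    "page_c": "Musician/Band",
}
user_posts = ["great concert last night", "concert tickets booked"]

# Fixed feature orderings shared by all users.
page_vocab = sorted(like_category)
cat_vocab = sorted(set(like_category.values()))
term_vocab = sorted({tok for text in user_posts for tok in text.lower().split()})

def likes_vector(likes, vocab):
    """likes: 1 if the user liked the page, else 0."""
    liked = set(likes)
    return [1 if page in liked else 0 for page in vocab]

def likes_cats_histogram(likes, categories, vocab):
    """likesCats: number of liked pages in each category."""
    counts = Counter(categories[p] for p in likes if p in categories)
    return [counts[c] for c in vocab]

def bag_of_words(texts, vocab):
    """msgTerms: term counts over all of the user's posts."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    return [counts[term] for term in vocab]

likes_feat = likes_vector(user_likes, page_vocab)
cats_feat = likes_cats_histogram(user_likes, like_category, cat_vocab)
terms_feat = bag_of_words(user_posts, term_vocab)
print(likes_feat, cats_feat, terms_feat)
```

In the study the corresponding vectors are much larger (about 3.6K, 191 and 25K dimensions respectively), so a sparse representation would be the natural choice in practice.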
9. Experimental Setup
Evaluation method: repeated random sub-sampling
• Data split randomly 𝑛 = 10 times into train (67%) / test (33%)
• Model fit on train / accuracy of inferences assessed on test
• 96 questions (user attributes) were considered
Evaluation measure: area under ROC curve (AUC)
• Appropriate for imbalanced classes
Classification algorithms
• Baseline: 𝑘-nearest neighbors, decision tree, Naïve Bayes
• State of the art (SoA): AdaBoost, random forest, regularized logistic regression
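The evaluation protocol can be sketched as follows, assuming binary attributes. The AUC is computed directly from its pairwise definition, and a simple difference-of-class-means scorer (`fit_centroid_diff`, a placeholder that is not one of the classifiers listed above) stands in for the real models. The synthetic data, the seed values and the choice to redraw a split that leaves one class empty are all assumptions made for the example.

```python
import random

def auc(y_true, scores):
    """Area under the ROC curve: probability that a random positive
    example is scored above a random negative one (ties count 0.5)."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def class_mean(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def fit_centroid_diff(X, y):
    """Placeholder model: weight vector = positive mean - negative mean."""
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    return [a - b for a, b in zip(class_mean(pos), class_mean(neg))]

def repeated_subsampling_auc(X, y, n_repeats=10, test_frac=0.33, seed=0):
    """Randomly split into train/test n_repeats times; return test AUCs."""
    if len(set(y)) < 2:
        raise ValueError("need both classes in the data")
    rng = random.Random(seed)
    idx = list(range(len(X)))
    n_test = round(test_frac * len(X))
    aucs = []
    while len(aucs) < n_repeats:
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        y_train = [y[i] for i in train]
        y_test = [y[i] for i in test]
        if len(set(y_train)) < 2 or len(set(y_test)) < 2:
            continue  # assumption: redraw splits that miss a class
        w = fit_centroid_diff([X[i] for i in train], y_train)
        scores = [sum(wj * xj for wj, xj in zip(w, X[i])) for i in test]
        aucs.append(auc(y_test, scores))
    return aucs

# Synthetic data: 30 users, one informative and one noise feature.
data_rng = random.Random(42)
y = [i % 2 for i in range(30)]
X = [[label + data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for label in y]

aucs = repeated_subsampling_auc(X, y)
mean_auc = sum(aucs) / len(aucs)
print(f"mean AUC over {len(aucs)} splits: {mean_auc:.3f}")
```

Because the AUC depends only on how positives are ranked relative to negatives, a scorer that ranks at random gets about 0.5 regardless of the class ratio, which is why it suits attributes such as sexual orientation where one answer dominates.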
10. Results 1: Evaluating Classifiers
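[Figure: bar chart of classifier performance per attribute. Y-axis: 0.45 to 0.75. Attributes (x-axis): bmiclass, healthstatus, smoking behavior, drinking behavior, income, cannabis, employment, sexual orientation. Classifiers (legend): tree, nb, knn, adaboost, rf, logistic.]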
14. Ranking of Dimensions
Rank | Perceived predictability | Actual predictability | Rank shift | Actual predictability according to [1]
1 | Demographics | Demographics | - | Demographics
2 | Relationship status and living condition | Political views | +3 | Political views
3 | Sexual orientation | Sexual orientation | - | Religious views
4 | Consumer profile | Employment/Income | +4 | Sexual orientation
5 | Political views | Consumer profile | -1 | Health status
6 | Personality traits | Relationship status and living condition | -4 | Relationship status and living condition
7 | Religious views | Religious views | - |
8 | Employment/Income | Health status | +1 |
9 | Health status | Personality traits | -3 |
[1] Kosinski, et al. Private traits and attributes are predictable from digital records of human
behavior. Proceedings of the National Academy of Sciences, 2013.
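The signed numbers in the ranking table are consistent with "perceived rank minus actual rank" for each dimension. The short sketch below recomputes them from the two rankings as listed; the variable names and the reading of a positive shift as "users underestimate predictability" are interpretive assumptions rather than something stated on the slide.

```python
# Rankings copied from the table, most predictable first.
perceived = [
    "Demographics",
    "Relationship status and living condition",
    "Sexual orientation",
    "Consumer profile",
    "Political views",
    "Personality traits",
    "Religious views",
    "Employment/Income",
    "Health status",
]
actual = [
    "Demographics",
    "Political views",
    "Sexual orientation",
    "Employment/Income",
    "Consumer profile",
    "Relationship status and living condition",
    "Religious views",
    "Health status",
    "Personality traits",
]

perceived_rank = {dim: rank for rank, dim in enumerate(perceived, start=1)}

# Positive shift: the dimension ranks higher in actual predictability
# than users perceive it to.
shifts = {dim: perceived_rank[dim] - rank
          for rank, dim in enumerate(actual, start=1)}

underestimated = [dim for dim in actual if shifts[dim] > 0]

for dim in actual:
    print(f"{dim}: {shifts[dim]:+d}")
```

Under that reading, the dimensions whose predictability is underestimated are political views, employment/income and health status.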
16. Conclusions & Future Work
Conclusions
• Both correct and incorrect perceptions about predictability
• Predictability of sensitive information is underestimated
• Sophisticated privacy assistance tools are needed
• Support users in managing disclosure of personal information
Databait: a privacy assistance tool (still in beta mode)