Characterising the Emergent Semantics in Twitter Lists

Andrés García-Silva †, Jeon-Hyung Kang*, Kristina Lerman*,
Oscar Corcho †
† {hgarcia, ocorcho}@fi.upm.es
Facultad de Informática
Universidad Politécnica de Madrid, Spain

*{jeonhyuk,lerman}@isi.edu
Information Sciences Institute,
University of Southern California, USA

Introduction

Twitter Lists


Introduction

Curators and List Names


Introduction

Members and List Names


Introduction

Subscribers and List Names


Introduction

• Previous examples showed individual uses of lists
• Some list names were related to one another

• What if we group the lists?


Introduction
Lists where the Yahoo!Finance user is a member
grouped by frequency of membership

Lists where the NASDAQ user is a member
grouped by number of subscriptions


Introduction: Research questions
• Is it possible to identify related keywords from list names according to
the use given by the different user roles?
• Are two list names related if they have been used by a similar set of
curators?
• Are two list names related if a similar set of users have subscribed to the
corresponding lists?
• Are two list names related if their corresponding lists have a similar set of
members?
• Which user roles generate more related keywords?
• What types of relations between keywords can we obtain?
  • Synonyms, is-a, siblings, ...?

[Figure: lists named Stocks, Investment, PersonalBanking, and Banks, shown with Curator 1, Curator 2, Subscriber 1, and the list members]


Approach

Step 1: Elicit related keywords from Twitter lists
• Input: Twitter lists
• Schema representation of keywords
  • Based on curators
  • Based on subscribers
  • Based on members
• Model to identify similar keywords
  • Vector Space Model
  • Latent Dirichlet Allocation
• Output: pairs of related keywords per schema representation and model

Step 2: Characterise the semantics of the relations
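As a minimal sketch of the vector space model over the members schema, each keyword can be represented as a sparse vector whose dimensions are list members, and related keywords can be ranked by cosine similarity. The toy lists, the raw-count weighting, and the function names are assumptions made for illustration; the slides do not state the weighting actually used.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: (keyword from the list name, members of that list).
LISTS = [
    ("stocks", ["yahoofinance", "nasdaq", "wsj"]),
    ("investment", ["yahoofinance", "nasdaq", "ft"]),
    ("banks", ["ft", "wsj"]),
    ("comedy", ["funland"]),
]

def keyword_vectors(lists):
    """Build one sparse count vector per keyword, indexed by member."""
    vectors = defaultdict(Counter)
    for keyword, members in lists:
        vectors[keyword].update(members)
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def most_related(keyword, vectors, top_n=5):
    """Rank the other keywords by cosine similarity to `keyword`."""
    scores = [(other, cosine(vectors[keyword], vec))
              for other, vec in vectors.items() if other != keyword]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

vectors = keyword_vectors(LISTS)
ranking = most_related("stocks", vectors)
```

Swapping the member names for curator or subscriber names would give the other two schema representations under the same assumptions.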


Approach

Step 1: Elicit related keywords from Twitter lists
• Output: pairs of related keywords per schema representation and model

Step 2: Characterise the semantics of the relations
• Similarity based on WordNet
  • Path Length
  • Wu & Palmer (Hierarchical Inf.)
  • Jiang & Conrath (Distributional Inf.)
  • Identifies: synonyms, is-a, siblings, indirect is-a, and the specificity of relations
• SPARQL queries over general KBs published as Linked Data (DBpedia, OpenCyc, and UMBEL)
  • Identifies: synonyms (sameAs), binary relations (TypeOf, BT), and object properties (Occupation)
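To make the WordNet-based measures concrete, the sketch below computes path length and Wu & Palmer similarity over a tiny hand-made is-a taxonomy. The taxonomy is hypothetical and is not WordNet data. Path length is counted in nodes, which is an assumption chosen so that the same concept gives 1, a direct is-a gives 2, and siblings give 3, matching the path-length categories used in these slides.

```python
# Hypothetical toy is-a taxonomy (child -> parent); not real WordNet data.
PARENT = {
    "entity": None,
    "person": "entity",
    "communicator": "person",
    "writer": "communicator",
    "poet": "writer",
    "novelist": "writer",
}

def ancestors(concept):
    """Return the chain [concept, parent, ..., root]."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain

def depth(concept):
    """Depth counted in nodes, so the root has depth 1."""
    return len(ancestors(concept))

def lcs(a, b):
    """Least common subsumer: the deepest ancestor shared by a and b."""
    seen = set(ancestors(a))
    return next(c for c in ancestors(b) if c in seen)

def path_length(a, b):
    """Number of nodes on the shortest is-a path between a and b."""
    return depth(a) + depth(b) - 2 * depth(lcs(a, b)) + 1

def wu_palmer(a, b):
    """Wu & Palmer similarity: 2 * depth(LCS) / (depth(a) + depth(b))."""
    return 2 * depth(lcs(a, b)) / (depth(a) + depth(b))
```

In this toy tree, `path_length("poet", "writer")` is 2 (is-a) and `path_length("poet", "novelist")` is 3 (siblings), with `writer` as the least common subsumer.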


Experiment: Setup

• Data set
  • Total: 297,521 lists, 2,171,140 members, 215,599 curators, and 616,662 subscribers

• We extracted 5,932 unique keywords from list names; 55% of them were found in WordNet.
  • We use approximate matching of the list names with dictionary entries
  • The dictionary was created from Wikipedia article titles
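The slides do not say which approximate matching technique was used. One possible sketch, using Python's standard `difflib` module, normalises a list name and looks up its closest dictionary entry. The small dictionary, the normalisation rules, and the 0.8 cutoff are all assumptions for illustration.

```python
import difflib
import re

# Hypothetical stand-in for a dictionary built from Wikipedia article titles.
DICTIONARY = ["personal banking", "life science", "stock market", "comedy"]

def normalise(list_name):
    """Lower-case a list name and turn separators and camel case into spaces."""
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", list_name)
    return re.sub(r"[-_\s]+", " ", spaced).strip().lower()

def match_keyword(list_name, dictionary=DICTIONARY, cutoff=0.8):
    """Return the closest dictionary entry, or None if nothing is close enough."""
    candidates = difflib.get_close_matches(
        normalise(list_name), dictionary, n=1, cutoff=cutoff
    )
    return candidates[0] if candidates else None
```

For example, `match_keyword("PersonalBanking")` resolves to the entry `personal banking`, while a name with no close entry returns `None`.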


Experiment: Execution

Elicit related keywords from Twitter lists
• Input: data set
• Schema representation of keywords
  • Based on curators
  • Based on subscribers
  • Based on members
• Model to identify similar keywords
  • Vector Space Model
  • Latent Dirichlet Allocation
• Output: pairs of related keywords per schema representation and model (each keyword with its 5 most related keywords)

Characterise the semantics of the relations
• Similarity based on WordNet
  • Path Length
  • Wu & Palmer (Hierarchical Inf.)
  • Jiang & Conrath (Distributional Inf.)
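For reference, the two non-path measures named above are usually defined as follows, where lcs is the least common subsumer of the two concepts, depth is measured in the WordNet hierarchy, and IC is the information content estimated from a corpus probability p(c). These are the standard textbook definitions; the slides do not state which variant or corpus was used.

```latex
\mathrm{sim}_{WP}(c_1, c_2) =
  \frac{2 \cdot \mathrm{depth}(\mathrm{lcs}(c_1, c_2))}
       {\mathrm{depth}(c_1) + \mathrm{depth}(c_2)}

\mathrm{dist}_{JC}(c_1, c_2) =
  IC(c_1) + IC(c_2) - 2 \cdot IC(\mathrm{lcs}(c_1, c_2)),
\qquad IC(c) = -\log p(c)
```

A higher Wu & Palmer value and a lower Jiang & Conrath distance both indicate more closely related concepts.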


Experiment: Data Analysis
[Figure: Pearson's correlation coefficients; correlation values range from -1 to 1]

[Figure: average J&C distance and W&P similarity]
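As a reminder of how such a correlation is computed, the sketch below implements Pearson's coefficient for two equal-length score lists, for instance a model's similarity scores and a WordNet-based measure over the same keyword pairs. The sample scores are made up for illustration and are not taken from the experiment.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores for the same four keyword pairs under two measures.
model_similarity = [0.9, 0.7, 0.4, 0.1]
wordnet_similarity = [0.8, 0.6, 0.5, 0.2]
r = pearson(model_similarity, wordnet_similarity)
```

Values near 1 mean the two measures rank keyword pairs similarly, values near -1 mean they rank them in opposite order, and values near 0 mean no linear relationship.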


Path Length in WordNet

% of relations found by each schema representation and model

Path length                   Members           Subscribers       Curators
                              VSM      LDA      VSM      LDA      VSM      LDA
1 (synonyms)                  8.58%    10.87%   3.97%    3.24%    1.24%    0.50%
2 (is-a)                      3.42%    3.08%    1.93%    0.47%    0.70%    0.00%
3 (siblings, indirect is-a)   2.37%    3.77%    2.96%    2.06%    2.38%    4.03%
>3                            67.61%   65.5%    67.2%    67.5%    77.8%    75.8%

On average, 97.65% of the relations with a path length greater than 3
involve a common subsumer


Depth (LCS) and path length as indicators of specificity

[Figure: relations in WordNet, by depth of the least common subsumer]

[Figure: relations with depth(LCS) >= 5, by length of the path setting up the relation]


Experiment: Findings
Summary
• Similarity models based on members
• produce the results that are most correlated to the results of similarity measures
based on WordNet
• find more synonyms and direct is-a relations than the other
models (path length).

• The majority of relations found by any model have a path length >= 3 and
involve a common subsumer.
• Depth of LCS
• VSM based on subscribers produces the highest number of specific
relations (depth of LCS >= 5 or 6).

• Similarity models based on curators produce fewer relations.


Experiment: Execution

Elicit related keywords from Twitter lists
• Input: data set
• Schema representation of keywords
  • Based on curators
  • Based on subscribers
  • Based on members
• Model to identify similar keywords
  • Vector Space Model
  • Latent Dirichlet Allocation
• Output: pairs of related keywords per schema representation and model (each keyword with its 5 most related keywords)

Characterise the semantics of the relations
• Ontological relations between keywords
  • SPARQL queries over general KBs published as Linked Data
  • DBpedia, OpenCyc, and UMBEL


Experiment

• We anchor 63.77% of the keywords extracted from
Twitter lists to DBpedia resources


Experiment
Vector-space model based on members (direct relations)

Relation type    %      Example of keywords
Broader Term     26%    life-science, biotech
subClassOf       26%    writers, authors
developer        11%    google, google_apps
genre            11%    funland, comedy
largest city     6%     houston, texas
Others           20%    -

Vector-space model based on subscribers (relations of length 3)

Linked data pattern (54.73%): x -> object <- y

Relations                      %        Object            Keywords
type, type                     67.35%   company           nokia, intel
subClassOf, subClassOf         30.61%   activities        philanthropy, fundraising

Linked data pattern (43.49%): x <- object -> y

Relations                      %        Object            Keywords
genre, genre                   12.43%   Aesthetica        theater, film
occupation, genre              10.27%   Adam Maxwell      fiction, writer
occupation, occupation         8.11%    Alina Tugend      poet, writer
product, product               7.57%    ChenOne           clothes, fashion
industry, product              9.73%    UserLand Softw.   blogs, internet
known for, occupation          5.41%    Adeline Yen Mah   author, writing
known for, known for           3.78%    Rebecca Watson    skeptics, atheist
main interest, main interest   3.24%    Aristotle         politics, government
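The two length-3 patterns above can be expressed as SPARQL graph patterns. The sketch below only builds the query strings; it does not contact any endpoint. The function names, the `LIMIT`, the `isIRI` filter, and the choice of example resources are assumptions for illustration, not the queries used in the experiment.

```python
# Illustrative resource URIs; anchoring keywords to resources is assumed done.
NOKIA = "http://dbpedia.org/resource/Nokia"
INTEL = "http://dbpedia.org/resource/Intel"

def shared_object_query(x_uri, y_uri, limit=10):
    """Pattern x -> object <- y: both resources point to the same object."""
    return (
        "SELECT DISTINCT ?p1 ?object ?p2 WHERE {\n"
        f"  <{x_uri}> ?p1 ?object .\n"
        f"  <{y_uri}> ?p2 ?object .\n"
        "  FILTER(isIRI(?object))\n"
        f"}} LIMIT {limit}"
    )

def shared_subject_query(x_uri, y_uri, limit=10):
    """Pattern x <- object -> y: one resource points to both x and y."""
    return (
        "SELECT DISTINCT ?p1 ?object ?p2 WHERE {\n"
        f"  ?object ?p1 <{x_uri}> .\n"
        f"  ?object ?p2 <{y_uri}> .\n"
        f"}} LIMIT {limit}"
    )

query = shared_object_query(NOKIA, INTEL)
```

In the first pattern the bindings of `?p1` and `?p2` correspond to the relation pairs in the table (for example type and type), and `?object` to the shared object (for example company).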


Conclusions

• Different models to elicit related keywords from Twitter lists.
  • Curators, subscribers, and members; VSM and LDA
• Characterise the semantics of relations: WordNet-based similarity
measures and SPARQL queries over linked data sets


Conclusions

• Vector-space and LDA models based on members produce the results most
correlated with those of WordNet-based metrics.
• Shortest JC distance and highest WP similarities
• According to the path length in WordNet
• Models based on members produce more synonyms and direct is-a
• Most of the relations have path length ≥ 3 and have a common subsumer
• Depth of LCS
• Vector-space model based on subscribers finds the highest
number of specific relations (depth LCS ≥ 5 and 4 ≤ path length ≤ 0)
• We confirm these results according to linked data sets

