1. Introduction
Benchmarks and testing
Results
Extracting tag hierarchies
Gergely Tibély, Péter Pollner, Tamás Vicsek and
Gergely Palla
Statistical and Biological Physics Research Group,
HAS (Eötvös University), Hungary
KnowEscape2013 Conference
Tags and tagging
The goal
Tag hierarchy extraction methods
Tags and tagging: blogs, news portals
Tags and tagging: video and photo sharing
Tagging and folksonomies
In some cases the emerging set of free tags is called a FOLKSONOMY:
- collaborative nature,
- tags are equal,
- no hierarchy.
−→ The opposite of an ONTOLOGY.
Tagging systems
How can we search?
Tagging systems
Searching items
Tagging systems
Searching
Tagging systems
Searching tags
The goal
Extracting a tag hierarchy!
Motivation:
- Help searching: if the tags are organised into a hierarchy, broadening or narrowing the scope is simple.
- Give recommendations about yet unvisited objects.
Tag hierarchy extracting algorithms
- P. Heymann and H. Garcia-Molina, "Collaborative Creation of Communal Hierarchical Taxonomies in Social Tagging Systems", Technical Report, Stanford InfoLab (2006).
- P. Schmitz, "Inducing Ontology from Flickr Tags", Proceedings of the 15th International Conference on World Wide Web (WWW) (2006).
- C. Van Damme, M. Hepp and K. Siorpaes, "FolksOntology: An Integrated Approach for Turning Folksonomies into Ontologies", Social Networks 2, 57–70 (2007).
- A. Plangprasopchok and K. Lerman, "Constructing Folksonomies from User-specified Relations on Flickr", Proceedings of the World Wide Web conference (2009).
Our methods
Basic outline
[Figure: the three stages of the method on five example tags (TAG 1 to TAG 5): DEFINE LINKS, then THRESHOLD, then DETERMINE DIRECTION.]
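The three stages of the outline can be sketched on toy data as follows. This is a simplified illustration, not the authors' implementation: the link weight is the raw co-occurrence count, the threshold keeps only the strongest link of each tag, and the direction points from the more frequent tag to the less frequent one. The function name `outline_hierarchy` and the input format (a list of tag sets) are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def outline_hierarchy(objects):
    """Toy version of DEFINE LINKS -> THRESHOLD -> DETERMINE DIRECTION.

    `objects` is a list of tag sets, one set per tagged object.
    Returns a dict mapping child tag -> parent tag.
    """
    freq = Counter(tag for tags in objects for tag in tags)

    # DEFINE LINKS: weight = number of objects on which both tags appear.
    weight = Counter()
    for tags in objects:
        for a, b in combinations(sorted(tags), 2):
            weight[(a, b)] += 1

    # THRESHOLD: for every tag keep only its strongest link.
    best = {}
    for (a, b), w in weight.items():
        for tag, other in ((a, b), (b, a)):
            if tag not in best or w > best[tag][1]:
                best[tag] = (other, w)

    # DETERMINE DIRECTION (assumed rule): the more frequent end of a
    # kept link is taken as the parent.
    parent = {}
    for tag, (other, _) in best.items():
        if freq[other] > freq[tag]:
            parent[tag] = other
    return parent


objects = [
    {"animal"},
    {"animal", "dog"},
    {"animal", "dog"},
    {"animal", "cat"},
    {"animal", "dog", "puppy"},
    {"dog", "puppy"},
]
print(outline_hierarchy(objects))
```

On this toy input the most frequent tag, `animal`, gets no parent and acts as the root.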
Benchmarks
Quality measures
How to test the method?
Benchmarks:
Gene Ontology
- tag hierarchy: hierarchy of protein functions,
- tagged objects: proteins annotated by their known functions.
Synthetic benchmark:
- tag hierarchy: user defined,
- tagged objects: simulated tagging (random walk on the hierarchy).
- G. Tibély, P. Pollner, T. Vicsek and G. Palla, “Ontologies and tag-statistics”, New Journal of Physics 14, 053009 (2012).
How to measure the quality?
[Figure: an EXACT and a RECONSTRUCTED hierarchy drawn side by side on the same tags, labelled by level from 0A down to 3H.]
Evaluation:
- fraction of correctly identified links, fraction of acceptable links, fraction of missing links, etc.
- Normalised Mutual Information: sensitive also to the position of the non-matching links.
L. Danon et al., "Comparing community structure identification", J. Stat. Mech. P09008 (2005).
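The link-based measures can be illustrated with a short sketch. The slide does not define the terms precisely, so the definitions below are assumptions: a link is "matching" when the reconstructed parent equals the exact parent, "acceptable" when the reconstructed parent is any ancestor of the tag in the exact hierarchy (the direct parent included), and "missing" when an exact link is absent from the reconstruction. All fractions are taken relative to the number of exact links, and each hierarchy is assumed to be a tree stored as a child -> parent dict.

```python
def ancestors(parent, tag):
    """All ancestors of `tag` in a tree given as child -> parent."""
    found = set()
    while tag in parent:
        tag = parent[tag]
        found.add(tag)
    return found


def link_quality(exact, reconstructed):
    """Compare two hierarchies given as child -> parent dicts."""
    n_links = len(exact)
    matching = sum(1 for c, p in reconstructed.items() if exact.get(c) == p)
    # Assumed definition: the proposed parent is an ancestor (direct or
    # more distant) of the tag in the exact hierarchy.
    acceptable = sum(
        1 for c, p in reconstructed.items() if p in ancestors(exact, c)
    )
    return {
        "matching": matching / n_links,
        "acceptable": acceptable / n_links,
        "missing": (n_links - matching) / n_links,
    }


exact = {"1A": "0A", "1B": "0A", "2A": "1A", "2B": "1A", "3A": "2A"}
reconstructed = {"1A": "0A", "1B": "0A", "2A": "0A", "2B": "1B", "3A": "2A"}
print(link_quality(exact, reconstructed))
```

In the example, tag 2A is attached to its grandparent 0A, which counts as acceptable but not matching, while 2B is attached to an unrelated tag and counts as neither.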
Normalised Mutual Information
Mathematical formulation
The probability for picking a tag at random from the descendants of tag $i$ in the exact hierarchy, $G_e$, and in the reconstructed hierarchy, $G_r$:
$$p_e(i) = \frac{|D_e(i)|}{N-1}, \qquad p_r(i) = \frac{|D_r(i)|}{N-1}.$$
The probability for picking a tag at random from the intersection of the two sets of descendants:
$$p_{e,r}(i) = \frac{|D_e(i) \cap D_r(i)|}{N-1}.$$
Based on this, the Normalised Mutual Information between the exact- and reconstructed hierarchies:
$$I_{e,r} = -\frac{2\sum_{i=1}^{N} p_{e,r}(i) \ln \frac{p_{e,r}(i)}{p_e(i)\,p_r(i)}}{\sum_{i=1}^{N} p_e(i) \ln p_e(i) + \sum_{i=1}^{N} p_r(i) \ln p_r(i)} = -\frac{2\sum_{i=1}^{N} |D_e(i) \cap D_r(i)| \ln \frac{|D_e(i) \cap D_r(i)|\,(N-1)}{|D_e(i)| \cdot |D_r(i)|}}{\sum_{i=1}^{N} |D_e(i)| \ln \frac{|D_e(i)|}{N-1} + \sum_{i=1}^{N} |D_r(i)| \ln \frac{|D_r(i)|}{N-1}}.$$
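A minimal sketch of how this measure could be evaluated from descendant counts is given below. It assumes, for simplicity, that both hierarchies are trees stored as child -> parent dicts over the same tag set (the formula itself only needs descendant sets, so a DAG representation would work equally well). Terms with an empty set are taken as zero, following the usual convention for 0 ln 0. The function names are chosen for this example only.

```python
import math


def descendants(parent, tags):
    """Descendant sets for a tree given as child -> parent."""
    desc = {t: set() for t in tags}
    for tag in tags:
        node = tag
        while node in parent:
            node = parent[node]
            desc[node].add(tag)
    return desc


def hierarchy_nmi(exact, reconstructed, tags):
    """Normalised mutual information between two hierarchies."""
    n = len(tags)
    d_e = descendants(exact, tags)
    d_r = descendants(reconstructed, tags)
    num = 0.0
    den = 0.0
    for t in tags:
        a, b = len(d_e[t]), len(d_r[t])
        c = len(d_e[t] & d_r[t])
        # Empty sets contribute nothing (0 ln 0 is taken as 0).
        if c > 0:
            num += c * math.log(c * (n - 1) / (a * b))
        if a > 0:
            den += a * math.log(a / (n - 1))
        if b > 0:
            den += b * math.log(b / (n - 1))
    if den == 0.0:
        raise ValueError("all entropy terms vanish; the measure is undefined")
    return -2.0 * num / den


tags = ["0A", "1A", "1B", "2A", "2B", "2C"]
exact = {"1A": "0A", "1B": "0A", "2A": "1A", "2B": "1A", "2C": "1B"}
star = {t: "0A" for t in tags if t != "0A"}
print(hierarchy_nmi(exact, exact, tags))
print(hierarchy_nmi(exact, star, tags))
```

Comparing a hierarchy with itself gives 1, while comparing it with a flat star rooted at the same tag gives 0, because only the root has shared descendants and its term vanishes.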
Normalised Mutual Information
Behaviour
[Figure: a hierarchy before and after RANDOMIZATION, and a plot of I against f (both axes from 0 to 1) with three curves: bottom up, random, top down.]
Hierarchy of protein functions
Flickr and IMDb
Synthetic data
Studied systems
Tagged proteins:
- 5,913,610 proteins from GO, annotated by
- 4,181 molecular functions.
Tagged photos:
- 1,519,030 photos from Flickr, tagged by
- 25,441 free English words.
Tagged films:
- 336,223 films from IMDb, tagged by
- 6,358 English keywords.
Synthetic benchmark:
- 2,000,000 virtual objects,
- 1,023 tags.
Results
Hierarchy of protein functions
[Figure: two hierarchies of protein functions shown for comparison, with tags labelled by level from 0A down to 5C.]
Results
Hierarchy of protein functions

                    Matching   Acceptable   Normalised   Linearised
                    links      links        Mut. Info.   Mut. Info.
algorithm A         21%        66%          35%          78%
algorithm B         20%        52%          30%          75%
P. H. & H. G.-M.    19%        51%          30%          75%
P. Schmitz          18%        65%          30%          75%
Results
Tag hierarchy from Flickr data
Results
Tag hierarchy from IMDb data
Results
Synthetic benchmark: tagging by random walks
Input:
- the tag hierarchy,
- the distribution of the number of tags on the objects,
- the tag frequency distribution.
The tagging:
- the first tag at random,
- the rest:
  - with probability p: a short random walk on the DAG starting from the first tag,
  - with probability 1 − p: at random.
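The tagging procedure can be sketched as below. Several details are assumptions, since the slide does not specify them: the walk moves along hierarchy links in either direction, its length is drawn uniformly between 1 and a hypothetical `max_steps`, and random tags are drawn uniformly rather than from a prescribed tag frequency distribution. The number of tags per object is passed in directly instead of being sampled from a distribution.

```python
import random


def tag_object(neighbours, n_tags, p, rng, max_steps=3):
    """Pick up to `n_tags` distinct tags for one virtual object.

    `neighbours` maps every tag to the list of tags linked to it in the
    hierarchy (parents and children alike).
    """
    all_tags = sorted(neighbours)
    n_tags = min(n_tags, len(all_tags))
    first = rng.choice(all_tags)  # the first tag at random
    chosen = [first]
    attempts = 0
    # The attempt cap guards against walks that cannot reach new tags.
    while len(chosen) < n_tags and attempts < 100 * n_tags:
        attempts += 1
        if rng.random() < p:
            # Short random walk on the hierarchy, started from the first tag.
            tag = first
            for _ in range(rng.randint(1, max_steps)):
                if not neighbours[tag]:
                    break
                tag = rng.choice(neighbours[tag])
        else:
            # Unrelated tag picked uniformly at random.
            tag = rng.choice(all_tags)
        if tag not in chosen:
            chosen.append(tag)
    return chosen


neighbours = {
    "0A": ["1A", "1B"],
    "1A": ["0A", "2A", "2B"],
    "1B": ["0A", "2C"],
    "2A": ["1A"],
    "2B": ["1A"],
    "2C": ["1B"],
}
rng = random.Random(0)
objects = [tag_object(neighbours, 3, 0.8, rng) for _ in range(5)]
print(objects)
```

With p close to 1 the tags on an object cluster around the first tag in the hierarchy; with p close to 0 the co-occurrences carry almost no information about it.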
Results
Synthetic benchmark
[Figure: four panels, a) to d).]
Results
Synthetic benchmark

                    Matching   Acceptable   Normalised   Linearised
                    links      links        Mut. Info.   Mut. Info.
algorithm A         31%        35%          18%          66%
algorithm B         89%        91%          83%          97%
P. H. & H. G.-M.    48%        54%          29%          76%
P. Schmitz          1%         2%           1%           5%
Summary
- Tags are important in knowledge organisation.
- Tag-hierarchy extraction is an interesting problem with great potential for practical applications.
- We have set up a framework for tag-hierarchy extraction:
  - Benchmark systems for testing tag-hierarchy extraction algorithms can be found and can be created.
  - The mutual information provides a quality measure that is also sensitive to the position of the links in the hierarchy.
- G. Tibély, P. Pollner, T. Vicsek and G. Palla, "Ontologies and tag-statistics", New Journal of Physics 14, 053009 (2012).
- G. Tibély, P. Pollner, T. Vicsek and G. Palla, "Extracting tag hierarchies", accepted in PLoS ONE.
65. Tagging by random walks
Mutual information
Further results from Flickr
[Figure: part of the tag hierarchy extracted from Flickr, showing winter-related tags such as winter, snow, frost, ice, cold, ski, christmas, arctic, antarctic, blizzard and wisconsin.]
Algorithm A
Details
Pre-process: for a tag $i$ keep neighbours only if $n_{ij} \geq 0.4 \cdot \max(n_{ij})$.
Link-weight:
- random estimate for the number of co-occurrences: $\langle n_{ij} \rangle_R = \frac{n_i n_j}{n}$,
- link weight corresponds to the z-score: $z_{ij} = \frac{n_{ij} - \langle n_{ij} \rangle_R}{\sigma_R(n_{ij})}$.
Threshold:
- keep only the strongest link on every tag.
Direction:
- we assume that tag frequencies are higher close to the root,
- → for any given tag $i$, the strongest link is coming from its parent.
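The link-weight and threshold steps can be sketched as follows. The slide does not state how the standard deviation of the random co-occurrence count is obtained, so the sketch assumes a hypergeometric null model, and it takes the denominator of the random estimate to be the total number of tagged objects. The pre-processing filter and the direction step are left out. Function names are illustrative.

```python
import math
from collections import Counter
from itertools import combinations


def z_scores(objects):
    """z-score of every co-occurring tag pair.

    `objects` is a list of tag sets, one per tagged object.
    """
    n = len(objects)
    if n < 2:
        return {}
    freq = Counter(tag for tags in objects for tag in tags)
    cooc = Counter()
    for tags in objects:
        for a, b in combinations(sorted(tags), 2):
            cooc[(a, b)] += 1

    scores = {}
    for (a, b), n_ab in cooc.items():
        n_a, n_b = freq[a], freq[b]
        mean = n_a * n_b / n  # random estimate of the co-occurrences
        # Assumed null model: hypergeometric variance of the overlap.
        var = mean * (1 - n_a / n) * (n - n_b) / (n - 1)
        if var > 0:
            scores[(a, b)] = (n_ab - mean) / math.sqrt(var)
    return scores


def strongest_link(scores):
    """Threshold step: for every tag keep the neighbour with the highest z."""
    best = {}
    for (a, b), z in scores.items():
        for tag, other in ((a, b), (b, a)):
            if tag not in best or z > best[tag][1]:
                best[tag] = (other, z)
    return {tag: other for tag, (other, _) in best.items()}


objects = (
    [{"dog", "puppy"}] * 3
    + [{"cat", "kitten"}] * 3
    + [{"dog", "cat"}]
    + [{"bird"}] * 3
)
scores = z_scores(objects)
print(scores)
print(strongest_link(scores))
```

Pairs that co-occur more often than expected at random get a positive z-score, and pairs that co-occur less often than expected get a negative one.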
Algorithm A
Details
Exception:
- when the given tag is also the proposed parent of its strongest neighbour,
- → in this case we choose the 2nd strongest neighbour as parent.
Local root:
- when the given tag is also the proposed parent for all of its strong neighbours.
Global assembly:
- the local root with the largest “entropy” becomes the global root,
- the rest of the local roots are linked in the order of their entropy.
Algorithm B
Details
Link-weight:
- random estimate for the number of co-occurrences: $\langle n_{ij} \rangle_R = \frac{n_i n_j}{n}$,
- link weight corresponds to the z-score: $z_{ij} = \frac{n_{ij} - \langle n_{ij} \rangle_R}{\sigma_R(n_{ij})}$.
Threshold:
- keep only links with $z_{ij} \geq 10$.
Centrality:
- calculate the eigenvector centrality based on the remaining weighted adjacency matrix.
Build the hierarchy:
- start from the lowest centrality tags,
- choose the parent from the neighbours with higher centrality than the given tag,
- in case there are more candidates, choose the most related one (according to the descendants already under the given tag).
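The centrality and hierarchy-building steps can be sketched with a plain power iteration. This is a simplified illustration under stated assumptions: the input is the already thresholded, symmetric link-weight structure, and when several candidate parents exist the strongest link decides, which replaces the slide's relatedness criterion based on descendants. The iteration is run on A + I rather than A, which has the same eigenvectors but does not oscillate on tree-like graphs.

```python
def eigenvector_centrality(weights, iterations=200):
    """Power iteration on a symmetric weighted adjacency structure.

    `weights` maps each tag to a dict {neighbour: link weight}.
    """
    tags = sorted(weights)
    x = {t: 1.0 for t in tags}
    for _ in range(iterations):
        # Adding x[t] iterates on A + I (same eigenvectors as A).
        new = {
            t: x[t] + sum(w * x[u] for u, w in weights[t].items())
            for t in tags
        }
        norm = sum(v * v for v in new.values()) ** 0.5
        x = {t: v / norm for t, v in new.items()}
    return x


def build_hierarchy(weights):
    """Give every tag a parent among its more central neighbours."""
    c = eigenvector_centrality(weights)
    parent = {}
    for tag in sorted(weights, key=lambda t: c[t]):  # lowest centrality first
        candidates = [u for u in weights[tag] if c[u] > c[tag]]
        if candidates:
            # Simplification: the strongest link decides between candidates.
            parent[tag] = max(candidates, key=lambda u: weights[tag][u])
    return parent


# Hypothetical z-scores that survived the threshold z >= 10.
weights = {
    "animal": {"dog": 20.0, "cat": 15.0},
    "dog": {"animal": 20.0, "puppy": 12.0},
    "cat": {"animal": 15.0},
    "puppy": {"dog": 12.0},
}
print(build_hierarchy(weights))
```

The tag with the highest centrality finds no more central neighbour and is left without a parent, so it becomes the root.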
Mutual information and entropy
For discrete variables $x_i$ and $y_j$ with a joint probability distribution given by $p(x_i, y_j)$, the mutual information is defined as
$$I(x, y) \equiv \sum_i \sum_j p(x_i, y_j) \ln \frac{p(x_i, y_j)}{p(x_i)\,p(y_j)}.$$
The entropies of the variables are usually formulated as
$$H(x) = -\sum_i p(x_i) \ln p(x_i), \qquad H(y) = -\sum_j p(y_j) \ln p(y_j).$$
Thus, the mutual information can also be given as
$$I(x, y) = H(x) + H(y) - H(x, y).$$
Based on this, the normalised mutual information in general is defined as
$$I_{\mathrm{norm}}(x, y) \equiv \frac{2 I(x, y)}{H(x) + H(y)}.$$
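As a small worked check of these definitions, the sketch below evaluates the normalised mutual information for a joint distribution given as a table. It uses the entropy form of the mutual information and assumes that at least one of the two marginal entropies is non-zero. The function names are chosen for this example.

```python
import math


def entropy(probs):
    """Shannon entropy (natural logarithm) of a list of probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def normalised_mutual_information(joint):
    """Normalised mutual information of a joint table (list of rows)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    hx, hy = entropy(px), entropy(py)
    hxy = entropy([p for row in joint for p in row])
    mutual = hx + hy - hxy  # I(x, y) = H(x) + H(y) - H(x, y)
    return 2 * mutual / (hx + hy)


identical = [[0.5, 0.0], [0.0, 0.5]]
independent = [[0.25, 0.25], [0.25, 0.25]]
print(normalised_mutual_information(identical))
print(normalised_mutual_information(independent))
```

Two perfectly dependent variables give a value of 1 and two independent ones give 0, which is the range the hierarchy comparison inherits.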
Comparing DAGs with mutual information
Results
In numbers

Hierarchy (target)                         GO subset   user defined
Data (input)                               proteins    simulated tagging
Matching links          algorithm A        21%         31%
                        algorithm B        20%         89%
                        P. H. & H. G.-M.   19%         48%
                        P. Schmitz         18%         1%
Acceptable links        algorithm A        66%         35%
                        algorithm B        52%         91%
                        P. H. & H. G.-M.   51%         54%
                        P. Schmitz         65%         2%
Normalised Mut. Info.   algorithm A        35%         18%
                        algorithm B        30%         83%
                        P. H. & H. G.-M.   30%         29%
                        P. Schmitz         30%         1%
Linearised Mut. Info.   algorithm A        78%         66%
                        algorithm B        75%         97%
                        P. H. & H. G.-M.   75%         76%
                        P. Schmitz         75%         5%