Introduction
Benchmarks and testing
Results

Extracting tag hierarchies
Gergely Tibély, Péter Pollner, Tamás Vicsek and
Gergely Palla
Statistical and Biological Physics Research Group,
HAS (Eötvös University), Hungary

KnowEscape2013 Conference


Tags and tagging
The goal
Tag hierarchy extraction methods

Tags and tagging: blogs, news portals

Tags and tagging: video and photo sharing


Tagging and folksonomies

The emerging set of free tags is sometimes called a FOLKSONOMY:
- collaborative in nature,
- all tags are equal,
- no hierarchy.
→ The opposite of an ONTOLOGY.


Tagging systems
How can we search?

Searching items

Searching tags


The goal
Extracting a tag hierarchy!

Motivation:
- Help searching: if the tags are organised into a hierarchy, broadening or narrowing the scope of a search is simple.
- Give recommendations about objects the user has not yet visited.


Tag hierarchy extraction algorithms

- P. Heymann and H. Garcia-Molina, "Collaborative Creation of Communal Hierarchical Taxonomies in Social Tagging Systems", Technical Report, Stanford InfoLab (2006).
- P. Schmitz, "Inducing Ontology from Flickr Tags", Proceedings of the 15th International Conference on World Wide Web (WWW) (2006).
- C. Van Damme, M. Hepp and K. Siorpaes, "FolksOntology: An Integrated Approach for Turning Folksonomies into Ontologies", Social Networks 2, 57–70 (2007).
- A. Plangprasopchok and K. Lerman, "Constructing Folksonomies from User-specified Relations on Flickr", Proceedings of the World Wide Web conference (2009).


Our methods
Basic outline

DEFINE LINKS → THRESHOLD → DETERMINE DIRECTION

[Figure: a five-tag example network (TAG 1 to TAG 5) shown at each of the three stages.]

Benchmarks
Quality measures

How to test the method?
Benchmarks:
Gene Ontology
- tag hierarchy: hierarchy of protein functions,
- tagged objects: proteins annotated by their known functions.
Synthetic benchmark:
- tag hierarchy: user defined,
- tagged objects: simulated tagging (random walk on the hierarchy).
- G. Tibély, P. Pollner, T. Vicsek and G. Palla, "Ontologies and tag-statistics", New Journal of Physics 14, 053009 (2012).


How to measure the quality?

[Figure: an EXACT hierarchy and a RECONSTRUCTED hierarchy drawn side by side over the same tags, labelled by level from 0A to 3H.]

Evaluation:
- fraction of correctly identified links, fraction of acceptable links, fraction of missing links, etc.
- Normalised Mutual Information: also sensitive to the position of the non-matching links.
  L. Danon et al., "Comparing community structure identification", J. Stat. Mech. P09008 (2005).
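
The link-based measures listed above can be illustrated with a short Python sketch. The slides do not define the measures precisely, so everything here is an assumption made for illustration: hierarchies are given as sets of (parent, child) pairs, a reconstructed link counts as "acceptable" when its parent is any ancestor of its child in the exact hierarchy (direct parents included), matching and acceptable fractions are taken relative to the reconstructed links, and the missing fraction relative to the exact links. The names `ancestors_map` and `link_fractions` are hypothetical.

```python
def ancestors_map(links):
    """Map every tag to the set of all its ancestors in a DAG of (parent, child) links."""
    parents = {}
    for parent, child in links:
        parents.setdefault(child, set()).add(parent)
    cache = {}

    def ancestors(tag):
        if tag not in cache:
            found = set()
            for p in parents.get(tag, ()):
                found.add(p)
                found |= ancestors(p)
            cache[tag] = found
        return cache[tag]

    tags = {t for link in links for t in link}
    return {t: ancestors(t) for t in tags}


def link_fractions(exact_links, recon_links):
    """Return matching, acceptable and missing link fractions (assumed definitions).

    Both arguments must be non-empty sets of (parent, child) pairs.
    """
    anc = ancestors_map(exact_links)
    matching = sum(1 for link in recon_links if link in exact_links)
    # Assumption: "acceptable" means the proposed parent is some ancestor of the child.
    acceptable = sum(1 for p, c in recon_links if p in anc.get(c, set()))
    missing = sum(1 for link in exact_links if link not in recon_links)
    return {
        "matching": matching / len(recon_links),
        "acceptable": acceptable / len(recon_links),
        "missing": missing / len(exact_links),
    }


exact = {("a", "b"), ("b", "c"), ("a", "d")}
recon = {("a", "b"), ("a", "c"), ("d", "a")}
print(link_fractions(exact, recon))
```

In this toy example one of the three reconstructed links matches exactly, a second one ("a" to "c") skips a level but still points from an ancestor to a descendant, and the third is inverted.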

Normalised Mutual Information
Mathematical formulation
The probability of picking a tag at random from the descendants of tag $i$ in the exact hierarchy, $G_e$, and in the reconstructed hierarchy, $G_r$:
$$p_e(i) = \frac{|D_e(i)|}{N-1}, \qquad p_r(i) = \frac{|D_r(i)|}{N-1}.$$
The probability of picking a tag at random from the intersection of the two sets of descendants:
$$p_{e,r}(i) = \frac{|D_e(i) \cap D_r(i)|}{N-1}.$$
Based on this, the Normalised Mutual Information between the exact and reconstructed hierarchies:
$$I_{e,r} = -\frac{2\sum_{i=1}^{N} p_{e,r}(i) \ln \frac{p_{e,r}(i)}{p_e(i)\,p_r(i)}}{\sum_{i=1}^{N} p_e(i) \ln p_e(i) + \sum_{i=1}^{N} p_r(i) \ln p_r(i)} = -\frac{2\sum_{i=1}^{N} |D_e(i) \cap D_r(i)| \ln \frac{|D_e(i) \cap D_r(i)|\,(N-1)}{|D_e(i)| \cdot |D_r(i)|}}{\sum_{i=1}^{N} |D_e(i)| \ln \frac{|D_e(i)|}{N-1} + \sum_{i=1}^{N} |D_r(i)| \ln \frac{|D_r(i)|}{N-1}}.$$
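
As a rough illustration of the formula above, the following Python sketch evaluates the descendant-count form of $I_{e,r}$ for two small hierarchies. It assumes each hierarchy is given as a dictionary mapping a tag to the list of its children, that both hierarchies are acyclic, and that terms with an empty descendant set contribute zero (the usual $0 \ln 0 = 0$ convention). The function names are hypothetical, and returning NaN when the denominator vanishes is a choice made here, not something taken from the slides.

```python
import math


def descendants(children, node, cache):
    """Return the set of all descendants of `node` in a DAG given as a child map."""
    if node not in cache:
        found = set()
        for child in children.get(node, ()):
            found.add(child)
            found |= descendants(children, child, cache)
        cache[node] = found
    return cache[node]


def hierarchy_nmi(children_exact, children_recon, tags):
    """Normalised mutual information between two hierarchies over the same tags."""
    n = len(tags)
    cache_e, cache_r = {}, {}
    numerator = 0.0
    denominator = 0.0
    for tag in tags:
        d_e = descendants(children_exact, tag, cache_e)
        d_r = descendants(children_recon, tag, cache_r)
        common = len(d_e & d_r)
        if common:
            numerator += common * math.log(common * (n - 1) / (len(d_e) * len(d_r)))
        if d_e:
            denominator += len(d_e) * math.log(len(d_e) / (n - 1))
        if d_r:
            denominator += len(d_r) * math.log(len(d_r) / (n - 1))
    if denominator == 0.0:
        return math.nan  # degenerate case, e.g. two star-shaped hierarchies
    return -2.0 * numerator / denominator


tags = ["a", "b", "c", "d", "e"]
exact = {"a": ["b", "c"], "b": ["d", "e"]}
moved = {"a": ["b", "c"], "c": ["d", "e"]}
print(hierarchy_nmi(exact, exact, tags))
print(hierarchy_nmi(exact, moved, tags))
```

Comparing a hierarchy with itself gives the maximal value, while moving the two leaves from one branch to the other removes all the information carried by the non-root tags in this small example.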

Normalised Mutual Information
Behaviour

[Figure: an example hierarchy (tags 0A to 3H) before and after RANDOMIZATION, and a plot of the normalised mutual information I (from 0 to 1) against f (from 0 to 1), with three curves labelled bottom up, random and top down.]

Hierarchy of protein functions
Flickr and IMDb
Synthetic data

Studied systems
Tagged proteins:
- 5,913,610 proteins from GO, annotated by 4,181 molecular functions.

Tagged photos:
- 1,519,030 photos from Flickr, tagged by 25,441 free English words.

Tagged films:
- 336,223 films from IMDb, tagged by 6,358 English keywords.

Synthetic benchmark:
- 2,000,000 virtual objects, 1,023 tags.


Results
Hierarchy of protein functions

[Figure: two drawings of a hierarchy of protein functions, with tags labelled by level from 0A down to 5C.]

                        algorithm A   algorithm B   P. H. & H. G.-M.   P. Schmitz
Matching links              21%           20%             19%              18%
Acceptable links            66%           52%             51%              65%
Normalised Mut. Info.       35%           30%             30%              30%
Linearised Mut. Info.       78%           75%             75%              75%

Results
Tag hierarchy from Flickr data


Results
Tag hierarchy from IMDb data


Results
Synthetic benchmark: tagging by random walks

Input:
- the tag hierarchy,
- the distribution of the number of tags on the objects,
- the tag frequency distribution.
The tagging:
- the first tag is chosen at random,
- the rest:
  - with probability p: by a short random walk on the DAG starting from the first tag,
  - with probability 1 − p: at random.
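
The tagging procedure above might be sketched in Python as follows. This is a simplified illustration, not the benchmark used for the results: it assumes the walk moves along hierarchy links in either direction, that the walk length is drawn uniformly between 1 and a hypothetical `max_walk` parameter, that "at random" means uniformly over all tags (the tag frequency distribution listed as input is ignored here), and that an object never receives the same tag twice. The names `undirected_neighbours` and `tag_object` are made up for this example.

```python
import random


def undirected_neighbours(children):
    """Turn a child map {parent: [children]} into an undirected neighbour map."""
    neighbours = {}
    for parent, kids in children.items():
        neighbours.setdefault(parent, [])
        for kid in kids:
            neighbours[parent].append(kid)
            neighbours.setdefault(kid, []).append(parent)
    return neighbours


def tag_object(neighbours, all_tags, n_tags, p, max_walk, rng):
    """Pick up to `n_tags` distinct tags for one virtual object."""
    first = rng.choice(all_tags)
    chosen = [first]
    attempts = 0
    # The attempt cap avoids an endless loop when too few tags are reachable.
    while len(chosen) < n_tags and attempts < 100 * n_tags:
        attempts += 1
        if rng.random() < p:
            # Short random walk starting from the first tag.
            current = first
            for _ in range(rng.randint(1, max_walk)):
                if not neighbours[current]:
                    break
                current = rng.choice(neighbours[current])
            candidate = current
        else:
            # Otherwise choose a tag uniformly at random (assumption).
            candidate = rng.choice(all_tags)
        if candidate not in chosen:
            chosen.append(candidate)
    return chosen


hierarchy = {"root": ["a", "b"], "a": ["c"]}
neighbours = undirected_neighbours(hierarchy)
rng = random.Random(0)
print(tag_object(neighbours, sorted(neighbours), n_tags=3, p=0.8, max_walk=2, rng=rng))
```

With p close to 1 the tags on an object cluster around the first tag in the hierarchy, which is the signal the extraction algorithms try to recover; with p close to 0 the co-occurrences carry no information about the hierarchy.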

Results
Synthetic benchmark

                        algorithm A   algorithm B   P. H. & H. G.-M.   P. Schmitz
Matching links              31%           89%             48%               1%
Acceptable links            35%           91%             54%               2%
Normalised Mut. Info.       18%           83%             29%               1%
Linearised Mut. Info.       66%           97%             76%               5%

Summary
- Tags are important in knowledge organisation.
- Tag-hierarchy extraction is an interesting problem with great potential for practical applications.
- We have set up a framework for tag-hierarchy extraction:
  - Benchmark systems for testing tag-hierarchy extraction algorithms can be found and can be created.
  - The mutual information provides a quality measure that is also sensitive to the position of the links in the hierarchy.

- G. Tibély, P. Pollner, T. Vicsek and G. Palla, "Ontologies and tag-statistics", New Journal of Physics 14, 053009 (2012).
- G. Tibély, P. Pollner, T. Vicsek and G. Palla, "Extracting tag hierarchies", accepted in PLoS ONE.
Tagging by random walks
Mutual information

Further results from Flickr

[Figure: part of the tag hierarchy extracted from Flickr data around the tag "winter", including tags such as snow, frost, ice, cold, ski, christmas, arctic, antarctic, december, january and february.]

Algorithm A
Details

Pre-process: for a tag $i$ keep neighbours only if $n_{ij} \geq 0.4 \cdot \max(n_{ij})$.
Link weight:
- random estimate for the number of co-occurrences: $n_{ij}^{R} = \frac{n_i n_j}{n}$,
- the link weight corresponds to the z-score: $z_{ij} = \frac{n_{ij} - n_{ij}^{R}}{\sigma_R(n_{ij})}$.

Threshold:
- keep only the strongest link on every tag.

Direction:
- we assume that tag frequencies are higher close to the root,
- → for any given tag $i$, the strongest link comes from its parent.

Exception:
- when the given tag is also the proposed parent of its strongest neighbour,
- → in this case we choose the 2nd strongest neighbour as parent.

Local root:
- when the given tag is also the proposed parent for all of its strong neighbours.

Global assembly:
- the local root with the largest "entropy" becomes the global root,
- the rest of the local roots are linked in the order of their entropy.
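
The link-weight and threshold steps of Algorithm A can be sketched in Python under some stated assumptions. The slides do not spell out the symbols or $\sigma_R$, so this sketch assumes that $n$ is the number of tagged objects, $n_i$ the number of objects carrying tag $i$, and that the random null model is hypergeometric, which fixes both the mean $n_i n_j / n$ and the variance used below. The pre-processing, direction, exception and assembly steps are not covered, and the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_z_scores(objects):
    """Return {(tag_i, tag_j): z-score} for every co-occurring tag pair.

    `objects` is a list of tag collections, one per tagged object.
    """
    n = len(objects)
    if n < 2:
        return {}
    tag_count = Counter()
    pair_count = Counter()
    for tags in objects:
        unique = sorted(set(tags))
        tag_count.update(unique)
        pair_count.update(combinations(unique, 2))
    z = {}
    for (i, j), n_ij in pair_count.items():
        n_i, n_j = tag_count[i], tag_count[j]
        expected = n_i * n_j / n
        # Variance of a hypergeometric distribution (assumed null model).
        variance = n_i * n_j * (n - n_i) * (n - n_j) / (n * n * (n - 1))
        if variance > 0:
            z[(i, j)] = (n_ij - expected) / math.sqrt(variance)
    return z


def strongest_links(z):
    """For every tag keep only its strongest link, as (other tag, z-score)."""
    best = {}
    for (i, j), score in z.items():
        for tag, other in ((i, j), (j, i)):
            if tag not in best or score > best[tag][1]:
                best[tag] = (other, score)
    return best


objects = [{"winter", "snow"}, {"winter", "snow"}, {"beach"}, {"beach"}]
scores = cooccurrence_z_scores(objects)
print(scores)
print(strongest_links(scores))
```

Using a z-score rather than the raw co-occurrence count keeps very frequent tags from dominating simply because they appear on many objects.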


Algorithm B
Details

Link weight:
- random estimate for the number of co-occurrences: $n_{ij}^{R} = \frac{n_i n_j}{n}$,
- the link weight corresponds to the z-score: $z_{ij} = \frac{n_{ij} - n_{ij}^{R}}{\sigma_R(n_{ij})}$.

Threshold:
- keep links with $z_{ij} \geq 10$.

Centrality:
- calculate the eigenvector centrality based on the remaining weighted adjacency matrix.

Build the hierarchy:
- start from the lowest-centrality tags,
- choose the parent from the neighbours with higher centrality than the given tag,
- if there are several candidates, choose the most related one (according to the descendants already under the given tag).
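
A minimal Python sketch of the threshold, centrality and parent-selection steps follows, under clearly simplifying assumptions. The eigenvector centrality is approximated by power iteration on the adjacency matrix shifted by the identity (the shift avoids oscillation on bipartite graphs and does not change the leading eigenvector). The tie-breaking rule from the slide, based on the descendants already under the tag, is replaced here by simply taking the candidate with the largest link weight, which is a substitution made for brevity. Function names and the input format, a dictionary of z-scores keyed by tag pairs, are hypothetical.

```python
def eigenvector_centrality(weights, iterations=200):
    """Approximate eigenvector centrality for symmetric weights {(i, j): w}."""
    nodes = sorted({tag for pair in weights for tag in pair})
    if not nodes:
        return {}
    score = {tag: 1.0 for tag in nodes}
    for _ in range(iterations):
        new = dict(score)  # identity shift: start from the previous vector
        for (i, j), w in weights.items():
            new[i] += w * score[j]
            new[j] += w * score[i]
        norm = max(new.values())
        score = {tag: value / norm for tag, value in new.items()}
    return score


def build_hierarchy(z_scores, threshold=10.0):
    """Return {tag: parent} using thresholded links and centrality ordering."""
    kept = {pair: w for pair, w in z_scores.items() if w >= threshold}
    centrality = eigenvector_centrality(kept)
    neighbours = {}
    for (i, j), w in kept.items():
        neighbours.setdefault(i, {})[j] = w
        neighbours.setdefault(j, {})[i] = w
    parent = {}
    # Process tags from the lowest centrality upwards.
    for tag in sorted(centrality, key=centrality.get):
        candidates = {
            other: w
            for other, w in neighbours[tag].items()
            if centrality[other] > centrality[tag]
        }
        if candidates:
            # Simplification: strongest link instead of descendant-based relatedness.
            parent[tag] = max(candidates, key=candidates.get)
    return parent


z_scores = {("a", "b"): 12.0, ("a", "c"): 15.0, ("b", "c"): 3.0}
print(build_hierarchy(z_scores))
```

In this toy input the weak link between "b" and "c" is removed by the threshold, "a" ends up with the highest centrality, and both remaining tags attach to it.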

Mutual information and entropy
For discrete variables $x_i$ and $y_j$ with a joint probability distribution given by $p(x_i, y_j)$, the mutual information is defined as
$$I(x, y) \equiv \sum_i \sum_j p(x_i, y_j) \ln \frac{p(x_i, y_j)}{p(x_i)\,p(y_j)}.$$
The entropies of the variables are usually formulated as
$$H(x) = -\sum_i p(x_i) \ln p(x_i), \qquad H(y) = -\sum_j p(y_j) \ln p(y_j).$$
Thus, the mutual information can also be given as
$$I(x, y) = H(x) + H(y) - H(x, y).$$
Based on this, the normalised mutual information in general is defined as
$$I_{\rm norm}(x, y) \equiv \frac{2 I(x, y)}{H(x) + H(y)}.$$
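
The general definitions above can be checked numerically with a few lines of Python. The sketch assumes the joint distribution is given as a nested list `joint[i][j]` of probabilities summing to one, treats zero-probability cells as contributing nothing, and returns NaN when both entropies vanish; the function name is hypothetical.

```python
import math


def normalised_mutual_information(joint):
    """Compute 2 I(x, y) / (H(x) + H(y)) from a joint probability table."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(column) for column in zip(*joint)]
    mutual = sum(
        p * math.log(p / (p_x[i] * p_y[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )
    h_x = -sum(p * math.log(p) for p in p_x if p > 0)
    h_y = -sum(p * math.log(p) for p in p_y if p > 0)
    if h_x + h_y == 0:
        return math.nan  # both variables are constant
    return 2 * mutual / (h_x + h_y)


# Perfectly correlated variables versus independent variables.
print(normalised_mutual_information([[0.5, 0.0], [0.0, 0.5]]))
print(normalised_mutual_information([[0.25, 0.25], [0.25, 0.25]]))
```

The two inputs mark the extremes of the measure: the first variable pair is fully dependent, the second fully independent.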


Results
In numbers

Hierarchy (target):                          GO subset   user defined
Data (input):                                proteins    simulated tagging

Matching links          algorithm A             21%          31%
                        algorithm B             20%          89%
                        P. H. & H. G.-M.        19%          48%
                        P. Schmitz              18%           1%
Acceptable links        algorithm A             66%          35%
                        algorithm B             52%          91%
                        P. H. & H. G.-M.        51%          54%
                        P. Schmitz              65%           2%
Normalised Mut. Info.   algorithm A             35%          18%
                        algorithm B             30%          83%
                        P. H. & H. G.-M.        30%          29%
                        P. Schmitz              30%           1%
Linearised Mut. Info.   algorithm A             78%          66%
                        algorithm B             75%          97%
                        P. H. & H. G.-M.        75%          76%
                        P. Schmitz              75%           5%

Gergely Palla - Extracting tag hierarchies

  • 1. Introduction Benchmarks and testing Results Extracting tag hierarchies Gergely Tibély, Péter Pollner, Tamás Vicsek and Gergely Palla Statistical and Biological Physics Research Group, HAS (Eötvös University), Hungary KnowEscape2013 Conference G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 2. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods Tags and tagging: blogs, news portals G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 3. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods Tags and tagging: blogs, news portals G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 4. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods Tags and tagging: blogs, news portals G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 5. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods Tags and tagging: blogs, news portals G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 6. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods Tags and tagging: blogs, news portals G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 7. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods Tags and tagging: blogs, news portals G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 8. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods Tags and tagging: blogs, news portals G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 9. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods Tags and tagging: blogs, news portals G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 10. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods Tags and tagging: video and photo sharing G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 11. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods Tags and tagging: video and photo sharing G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 12. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods Tagging and folksonomies In some cases the emerging set of free tags is called as a FOLKSONOMY collaborative nature, tags are equal, no hierarchy, −→ The opposite of an ONTOLOGY. G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 13. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods Tagging and folksonomies In some cases the emerging set of free tags is called as a FOLKSONOMY collaborative nature, tags are equal, no hierarchy, −→ The opposite of an ONTOLOGY. G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 14. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods Tagging and folksonomies In some cases the emerging set of free tags is called as a FOLKSONOMY collaborative nature, tags are equal, no hierarchy, −→ The opposite of an ONTOLOGY. G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 15. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods Tagging and folksonomies In some cases the emerging set of free tags is called as a FOLKSONOMY collaborative nature, tags are equal, no hierarchy, −→ The opposite of an ONTOLOGY. G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 16. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods Tagging and folksonomies In some cases the emerging set of free tags is called as a FOLKSONOMY collaborative nature, tags are equal, no hierarchy, −→ The opposite of an ONTOLOGY. G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 17. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods Tagging systems How can we search? G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 18. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods Tagging systems How can we search? G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 19. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods Tagging systems How can we search? G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 20. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods Tagging systems How can we search? G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 21. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods Tagging systems Searching items G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 22. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods Tagging systems Searching items G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 23. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods Tagging systems Searching G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 24. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods Tagging systems Searching tags G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 25. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods Tagging systems Searching tags G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 26. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods The goal G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 27. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods The goal Extracting a tag hierarchy! G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 28. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods The goal Extracting a tag hierarchy! Motivation: G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 29. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods The goal Extracting a tag hierarchy! Motivation: Help searching: If the tags are organised into a hierarchy, broadening or narrowing the scope is simple. G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 30. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods The goal Extracting a tag hierarchy! Motivation: Help searching: If the tags are organised into a hierarchy, broadening or narrowing the scope is simple. Give recommendations about yet unvisited objects. G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 31. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods Tag hierarchy extracting algorithms - P. Heymann and H. Garcia-Molina, "Collaborative Creation of Communal Hierarchical Taxonomies in Social Tagging Systems", Technical Report, Stanford InfoLab, (2006). - P. Schmitz , "Inducing Ontology from Flickr Tags", Proceedings of the 15th International Conference on World Wide Web (WWW), (2006). - C. Van Damme, M. Hepp and K. Siorpaes, "FolksOntology: An Integrated Approach for Turning Folksonomies into Ontologies", Social Networks 2, 57–70, (2007) - A.Plangprasopchok and K. Lerman, "Constructing Folksonomies from User-specified Relations on Flickr", Proceedings of the World Wide Web conference, (2009) G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 32. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods Our methods Basic outline DEFINE LINKS TAG 1 TAG 2 TAG 3 TAG 5 TAG 4 G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 33. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods Our methods Basic outline DEFINE LINKS TAG 1 TAG 2 TAG 3 TAG 5 TAG 4 G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 34. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods Our methods Basic outline DEFINE LINKS THRESHOLD TAG 1 TAG 1 TAG 2 TAG 3 TAG 5 TAG 4 TAG 2 TAG 3 G. Tibély, P. Pollner, T. Vicsek and G. Palla TAG 5 TAG 4 Extracting tag hierarchies
  • 35. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods Our methods Basic outline DEFINE LINKS THRESHOLD DETERMINE DIRECTION TAG 1 TAG 1 TAG 1 TAG 2 TAG 3 TAG 5 TAG 4 TAG 2 TAG 3 G. Tibély, P. Pollner, T. Vicsek and G. Palla TAG 5 TAG 4 TAG 2 TAG 3 Extracting tag hierarchies TAG 5 TAG 4
  • 36. Introduction Benchmarks and testing Results Tags and tagging The goal Tag hierarchy extraction methods Our methods Basic outline DEFINE LINKS THRESHOLD TAG 1 TAG 1 TAG 2 TAG 5 TAG 2 DETERMINE DIRECTION TAG 1 TAG 5 TAG 2 TAG 3 TAG 4 TAG 3 G. Tibély, P. Pollner, T. Vicsek and G. Palla TAG 4 Extracting tag hierarchies TAG 3 TAG 4 TAG 5
  • 37. Introduction Benchmarks and testing Results Benchmarks Quality measures How to test the method? Benchmarks? Gene Ontology - tag hierarchy: hierarchy of protein functions, - tagged objects: proteins annotated by their known functions Synthetic benchmark: - tag hierarchy: user defined, - tagged objects: simulated tagging (random walk on the hierarchy). - G. Tibély, P. Pollner, T. Vicsek and G. Palla, “Ontologies and -,tag-statistics”, New Journal of Physics 14, 053009 (2012). G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 38. Introduction Benchmarks and testing Results Benchmarks Quality measures How to test the method? Benchmarks: Gene Ontology - tag hierarchy: hierarchy of protein functions, - tagged objects: proteins annotated by their known functions. Synthetic benchmark: - tag hierarchy: user defined, - tagged objects: simulated tagging (random walk on the hierarchy). - G. Tibély, P. Pollner, T. Vicsek and G. Palla, “Ontologies and -,tag-statistics”, New Journal of Physics 14, 053009 (2012). G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 39. Introduction Benchmarks and testing Results Benchmarks Quality measures How to test the method? Benchmarks: Gene Ontology - tag hierarchy: hierarchy of protein functions, - tagged objects: proteins annotated by their known functions. Synthetic benchmark: - tag hierarchy: user defined, - tagged objects: simulated tagging (random walk on the hierarchy). - G. Tibély, P. Pollner, T. Vicsek and G. Palla, “Ontologies and -,tag-statistics”, New Journal of Physics 14, 053009 (2012). G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 40. Introduction Benchmarks and testing Results Benchmarks Quality measures How to measure the quality? EXACT RECONSTRUCTED 0A 1A 2A 3A 3B 1B 2B 2C 0A 1A 1C 2D 2A 2E 3C 3D 3E 3F 3G 3H 3A 3B 1B 2B 2C 1C 2D 3C 3D 3E 3F 3G Evaluation? G. Tibély, P. Pollner, T. Vicsek and G. Palla 2E Extracting tag hierarchies 3H
  • 41. Introduction Benchmarks and testing Results Benchmarks Quality measures How to measure the quality? EXACT RECONSTRUCTED 0A 1A 2A 3A 3B 1B 2B 2C 0A 1A 1C 2D 2A 2E 3C 3D 3E 3F 3G 3H 3A 3B 1B 2B 2C 1C 2D 2E 3C 3D 3E 3F 3G Evaluation: fraction of correctly identified links, fraction of acceptable links, fraction of missing links, etc. G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies 3H
  • 42. Introduction Benchmarks and testing Results Benchmarks Quality measures How to measure the quality? EXACT RECONSTRUCTED 0A 1A 2A 3A 3B 1B 2B 2C 0A 1A 1C 2D 2A 2E 3C 3D 3E 3F 3G 3H 3A 3B 1B 2B 2C 1C 2D 2E 3C 3D 3E 3F 3G 3H Evaluation: fraction of correctly identified links, fraction of acceptable links, fraction of missing links, etc. Normalised Mutual Information: sensitive also to the position of the non-matching links. L. Danon et al., "Comparing community structure identification", J. Stat. Mech. P09008 , (2005) G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 43. Introduction Benchmarks and testing Results Benchmarks Quality measures Normalised Mutual Information Mathematical formulation The probability for picking a tag at random from the descendants of tag i in the exact hierarchy, Ge , and in the reconstructed hierarchy Gr : pe (i) = |De (i)| , N −1 pr (i) = |Dr (i)| . N −1 The probability for picking a tag at random from the intersection of the two sets of descendants: |De (i) ∩ Dr (i)| pe,r (i) = N −1 Based on this, the Normalised Mutual Information between the exact- and reconstructed hierarchies: N 2 Ie,r = − pe,r (i) ln i=1 N i=1 N |De (i)| ln pr (i) ln pr (i) i=1 G. Tibély, P. Pollner, T. Vicsek and G. Palla |De (i) ∩ Dr (i)| ln 2 = N pe (i) ln pe (i) + i=1 N pe,r (i) pe (i)pr (i) i=1 |De (i)| N−1 Extracting tag hierarchies |De (i)∩Dr (i)|(N−1) |De (i)|·|Dr (i)| . N |Dr (i)| ln + i=1 |Dr (i)| N−1
  • 44. Introduction Benchmarks and testing Results Benchmarks Quality measures Normalised Mutual Information Behaviour 0A 0A 1A 1B 2A 3A 2B 3B 3C 3E 2A 2D 2C 3D 1A RANDOMIZATION 3F 3G 3H G. Tibély, P. Pollner, T. Vicsek and G. Palla 3A 3B 1B 2B 3C 3D Extracting tag hierarchies 2D 2C 3E 3F 3G 3H
  • 45. Introduction Benchmarks and testing Results Benchmarks Quality measures Normalised Mutual Information Behaviour 0A 0A 1A 1B 2A 3A 2B 3B 3C 3E 2A 2D 2C 3D 1A RANDOMIZATION 3F 3G 3A 3H 3B 1B 2B 3C 3D 1 bottom up random top down 0.8 I 0.6 0.4 0.2 0 0 0.2 0.4 0.6 0.8 f G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies 2D 2C 3E 1 3F 3G 3H
  • 46. Introduction Benchmarks and testing Results Hierarchy of protein functions Flickr and IMDb Synthetic data Studied systems Tagged proteins: - 5,913,610 proteins from GO, annotated by - 4,181 molecular functions. Tagged photos: - 1,519,030 photos from Flickr, tagged by - 25,441 free English words. Tagged films: - 336,223 films from IMDb, tagged by - 6,358 English keywords. Synthetic benchmark: - 2,000,000 virtual objects, - 1,023 tags. G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 47. Introduction Benchmarks and testing Results Hierarchy of protein functions Flickr and IMDb Synthetic data Results Hierarchy of protein functions 0A 1A 2B 2A 2D 2C 0A 3A 3B 3C 3D 3E 3F 3G 3H 3I 3J 3K 3L 3M 3N 3O 3P 3Q 1A 4A 4B 4C 4D 4G 4H 4I 4E 4F 4J 4K 4L 4M 4N 4O 4P 4Q 4R 2B 2A 5A 5B 2D 2C 5C 3R 3S 3A 3B 3C 3D 3E 4A 4B 4C 4D 3F 3G 3H 3I 4E 4S 5A G. Tibély, P. Pollner, T. Vicsek and G. Palla 3J 3K 3L 3M 3N 3O 4G 4H 4I 5B Extracting tag hierarchies 5C 4J 4K 4L 3P 4M 4N 3Q 3T 4O 4P 4Q
  • 48. Introduction Benchmarks and testing Results Hierarchy of protein functions Flickr and IMDb Synthetic data Results Hierarchy of protein functions Matching links Acceptable links Normalised Mut. Info. Linearised Mut. Info algorithm A algorithm B P. H. & H. G.-M. P. Schmitz algorithm A algorithm B P. H. & H. G.-M. P. Schmitz algorithm A algorithm B P. H. & H. G.-M. P. Schmitz algorithm A algorithm B P. H. & H. G.-M. P. Schmitz G. Tibély, P. Pollner, T. Vicsek and G. Palla 21% 20% 19% 18% 66% 52% 51% 65% 35% 30% 30% 30% 78% 75% 75% 75% Extracting tag hierarchies
  • 49. Introduction Benchmarks and testing Results Hierarchy of protein functions Flickr and IMDb Synthetic data Results Tag hierarchy from Flickr data G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 50. Introduction Benchmarks and testing Results Hierarchy of protein functions Flickr and IMDb Synthetic data Results Tag hierarchy from IMDb data G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 51. Introduction Benchmarks and testing Results Hierarchy of protein functions Flickr and IMDb Synthetic data Results Synthetic benchmark: tagging by random walks Input: the tag hierarchy, the distribution of the number tags on the objects, the tag frequency distribution. The tagging: the first tag at random, the rest: with probability p: a short random walk on the DAG starting from the first tag, with probability 1 − p: at random. G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 52. Introduction Benchmarks and testing Results Hierarchy of protein functions Flickr and IMDb Synthetic data Results Synthetic benchmark: tagging by random walks Input: the tag hierarchy, the distribution of the number tags on the objects, the tag frequency distribution. The tagging: the first tag at random, the rest: with probability p: a short random walk on the DAG starting from the first tag, with probability 1 − p: at random. G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 53. Introduction Benchmarks and testing Results Hierarchy of protein functions Flickr and IMDb Synthetic data Results Synthetic benchmark: tagging by random walks Input: the tag hierarchy, the distribution of the number tags on the objects, the tag frequency distribution. The tagging: the first tag at random, the rest: with probability p: a short random walk on the DAG starting from the first tag, with probability 1 − p: at random. G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 54. Introduction Benchmarks and testing Results Hierarchy of protein functions Flickr and IMDb Synthetic data Results Synthetic benchmark: tagging by random walks Input: the tag hierarchy, the distribution of the number tags on the objects, the tag frequency distribution. The tagging: the first tag at random, the rest: with probability p: a short random walk on the DAG starting from the first tag, with probability 1 − p: at random. G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 55. Introduction Benchmarks and testing Results Hierarchy of protein functions Flickr and IMDb Synthetic data Results Synthetic benchmark: tagging by random walks Input: the tag hierarchy, the distribution of the number tags on the objects, the tag frequency distribution. The tagging: the first tag at random, the rest: with probability p: a short random walk on the DAG starting from the first tag, with probability 1 − p: at random. G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 56. Introduction Benchmarks and testing Results Hierarchy of protein functions Flickr and IMDb Synthetic data Results Synthetic benchmark: tagging by random walks Input: the tag hierarchy, the distribution of the number tags on the objects, the tag frequency distribution. The tagging: the first tag at random, the rest: with probability p: a short random walk on the DAG starting from the first tag, with probability 1 − p: at random. G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
  • 57. Introduction Benchmarks and testing Results Hierarchy of protein functions Flickr and IMDb Synthetic data Results Synthetic benchmark: tagging by random walks Input: the tag hierarchy, the distribution of the number tags on the objects, the tag frequency distribution. The tagging: the first tag at random, the rest: with probability p: a short random walk on the DAG starting from the first tag, with probability 1 − p: at random. G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
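The tagging procedure above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `random_walk_tagging`, the `max_steps` parameter, the choice to walk along DAG links in both directions, and drawing "at random" tags in proportion to the tag frequency distribution are all assumptions that the slides do not spell out.

```python
import random

def random_walk_tagging(children, n_tags_per_object, tag_weights, p,
                        max_steps=3, rng=None):
    """Generate one tag list per object (hypothetical helper).

    children: dict mapping a tag to the list of its child tags in the DAG.
    n_tags_per_object: iterable with the number of tags to draw per object.
    tag_weights: dict tag -> relative frequency (assumed sampling weights).
    p: probability that a further tag comes from a short random walk.
    """
    rng = rng or random.Random()
    tags = list(tag_weights)
    weights = [tag_weights[t] for t in tags]

    # Assumption: the walk may follow DAG links in both directions.
    neighbours = {t: set() for t in tags}
    for parent, kids in children.items():
        for kid in kids:
            neighbours.setdefault(parent, set()).add(kid)
            neighbours.setdefault(kid, set()).add(parent)

    objects = []
    for k in n_tags_per_object:
        first = rng.choices(tags, weights=weights)[0]
        chosen = [first]
        for _ in range(k - 1):
            if rng.random() < p:
                # Short random walk starting from the first tag.
                current = first
                for _ in range(rng.randint(1, max_steps)):
                    if not neighbours[current]:
                        break
                    current = rng.choice(sorted(neighbours[current]))
                chosen.append(current)
            else:
                chosen.append(rng.choices(tags, weights=weights)[0])
        # Drop repeated tags while keeping the first tag in front.
        objects.append(list(dict.fromkeys(chosen)))
    return objects
```

Because repeated tags are dropped, an object can end up with fewer distinct tags than requested; a stricter generator would redraw until the target count is reached.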
  • 62. Results: Synthetic benchmark [Figure: panels a) to d)]
  • 63. Results: Synthetic benchmark
      Measure                  algorithm A   algorithm B   P. H. & H. G.-M.   P. Schmitz
      Matching links           31%           89%           48%                1%
      Acceptable links         35%           91%           54%                2%
      Normalised Mut. Info.    18%           83%           29%                1%
      Linearised Mut. Info.    66%           97%           76%                5%
  • 64. Summary
      Tags are important in knowledge organisation.
      Tag-hierarchy extraction is an interesting problem with great potential for practical applications.
      We have set up a framework for tag-hierarchy extraction: benchmark systems for testing tag-hierarchy extraction algorithms can be found and can be created, and the mutual information provides a quality measure that is also sensitive to the position of the links in the hierarchy.
      - G. Tibély, P. Pollner, T. Vicsek and G. Palla, Ontologies and tag-statistics, New Journal of Physics 14, 053009 (2012).
      - G. Tibély, P. Pollner, T. Vicsek and G. Palla, Extracting tag hierarchies, accepted in PLoS ONE.
  • 65. Further results from Flickr [Figure: hierarchy of winter-related Flickr tags, including winter, snow, frost, ice, ski, christmas, arctic and antarctic]
  • 66. Algorithm A: details
      Pre-processing: for a tag $i$, keep a neighbour $j$ only if $n_{ij} \geq 0.4 \cdot \max_{k} n_{ik}$.
      Link weight: let $n_{ij}^{R}$ be the random estimate for the number of co-occurrences of tags $i$ and $j$, obtained from the tag frequencies $n_i$ and $n_j$. The link weight corresponds to the z-score $z_{ij} = \dfrac{n_{ij} - n_{ij}^{R}}{\sigma_{R}(n_{ij})}$.
      Threshold: keep only the strongest link on every tag.
      Direction: we assume that tag frequencies are higher close to the root, so for any given tag $i$ the strongest link comes from its parent.
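The z-score link weight and the "strongest link per tag" threshold can be sketched as below. The slides do not preserve the exact null model, so this sketch assumes that under random tagging the co-occurrence count of two tags is hypergeometric, with mean $n_i n_j / Q$ for $Q$ objects; that assumption, and the helper names `z_scores` and `strongest_link`, are illustrative and not taken from the source. The 0.4 pre-processing filter is left out.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def z_scores(objects):
    """Return {(i, j): z_ij} for every co-occurring tag pair (i < j).

    objects: list of tag collections, one per tagged object.
    Assumed null model: hypergeometric co-occurrence counts.
    """
    q = len(objects)
    if q < 2:
        return {}
    freq = Counter()  # n_i: number of objects carrying tag i
    co = Counter()    # n_ij: number of objects carrying both i and j
    for tags in objects:
        distinct = sorted(set(tags))
        freq.update(distinct)
        co.update(combinations(distinct, 2))

    z = {}
    for (i, j), n_ij in co.items():
        n_i, n_j = freq[i], freq[j]
        expected = n_i * n_j / q
        variance = n_i * n_j * (q - n_i) * (q - n_j) / (q * q * (q - 1))
        if variance > 0:
            z[(i, j)] = (n_ij - expected) / sqrt(variance)
    return z

def strongest_link(z):
    """For every tag, return (strongest neighbour, z-score)."""
    best = {}
    for (i, j), score in z.items():
        for tag, other in ((i, j), (j, i)):
            if tag not in best or score > best[tag][1]:
                best[tag] = (other, score)
    return best
```

In Algorithm A the strongest neighbour found this way is the proposed parent of the tag; the exception and local-root rules still have to be applied on top of it.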
  • 72. Algorithm A: details (continued)
      Exception: when the given tag is also the proposed parent of its strongest neighbour, we choose the 2nd strongest neighbour as parent.
      Local root: when the given tag is also the proposed parent of all of its strong neighbours.
      Global assembly: the local root with the largest "entropy" becomes the global root; the rest of the local roots are linked in the order of their entropy.
  • 73. Algorithm B: details
      Link weight: let $n_{ij}^{R}$ be the random estimate for the number of co-occurrences of tags $i$ and $j$, obtained from the tag frequencies $n_i$ and $n_j$. The link weight corresponds to the z-score $z_{ij} = \dfrac{n_{ij} - n_{ij}^{R}}{\sigma_{R}(n_{ij})}$.
      Threshold: keep only links with $z_{ij} \geq 10$.
      Centrality: calculate the eigenvector centrality based on the remaining weighted adjacency matrix.
      Build the hierarchy: start from the lowest-centrality tags; choose the parent of a tag from its neighbours with higher centrality than the given tag; if there are several candidates, choose the most related one (according to the descendants already under the given tag).
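The steps of Algorithm B can be sketched as below, under stated assumptions. The eigenvector centrality is computed by a plain power iteration, and "most related" is interpreted here as the candidate with the largest summed link weight towards the tag and the descendants already placed under it. That interpretation and the function names are assumptions made for illustration, not the authors' code.

```python
def eigenvector_centrality(weights, n_iter=1000, tol=1e-12):
    """Power iteration on an undirected weighted graph.

    weights: dict {(i, j): w} with w > 0.
    Returns (adjacency dict, centrality dict).
    """
    adj = {}
    for (i, j), w in weights.items():
        adj.setdefault(i, {})[j] = w
        adj.setdefault(j, {})[i] = w
    if not adj:
        return adj, {}
    c = {t: 1.0 for t in adj}
    for _ in range(n_iter):
        # Iterating with A + I keeps the leading eigenvector of A but
        # avoids oscillation on bipartite graphs.
        new = {t: c[t] + sum(w * c[u] for u, w in adj[t].items()) for t in adj}
        norm = sum(v * v for v in new.values()) ** 0.5
        new = {t: v / norm for t, v in new.items()}
        converged = max(abs(new[t] - c[t]) for t in adj) < tol
        c = new
        if converged:
            break
    return adj, c

def build_hierarchy(z, threshold=10.0):
    """Return {tag: parent} from z-score link weights {(i, j): z_ij}."""
    strong = {pair: w for pair, w in z.items() if w >= threshold}
    adj, c = eigenvector_centrality(strong)
    parent = {}
    descendants = {t: set() for t in adj}
    # Process tags from the lowest centrality upwards.
    for tag in sorted(adj, key=lambda t: c[t]):
        candidates = [u for u in adj[tag] if c[u] > c[tag]]
        if not candidates:
            continue  # no neighbour with higher centrality: a root
        group = descendants[tag] | {tag}
        # Assumed relatedness: total link weight towards the tag's branch.
        best = max(candidates,
                   key=lambda u: sum(adj[u].get(g, 0.0) for g in group))
        parent[tag] = best
        descendants[best] |= group
    return parent
```

Since a parent always has strictly higher centrality than its child, the result cannot contain cycles; tags left without a parent act as roots.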
  • 74. Mutual information and entropy
      For discrete variables $x_i$ and $y_j$ with a joint probability distribution given by $p(x_i, y_j)$, the mutual information is defined as
      $I(x, y) \equiv \sum_{i}\sum_{j} p(x_i, y_j) \ln \dfrac{p(x_i, y_j)}{p(x_i)\,p(y_j)}$.
      The entropies of the variables are usually formulated as
      $H(x) = -\sum_{i} p(x_i) \ln p(x_i)$, $H(y) = -\sum_{j} p(y_j) \ln p(y_j)$.
      Thus, the mutual information can also be given as $I(x, y) = H(x) + H(y) - H(x, y)$.
      Based on this, the normalised mutual information in general is defined as
      $I_{\mathrm{norm}}(x, y) \equiv \dfrac{2 I(x, y)}{H(x) + H(y)}$.
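As a small numerical illustration of the definitions above, the following sketch computes the mutual information and its normalised form from a joint probability table given as a list of rows. The function names are illustrative, and terms with zero probability are skipped (the usual convention $0 \ln 0 = 0$).

```python
from math import log

def entropy(probabilities):
    """H = -sum p ln p, skipping zero-probability entries."""
    return -sum(p * log(p) for p in probabilities if p > 0)

def marginals(joint):
    """Row and column sums of a joint probability table."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return px, py

def mutual_information(joint):
    """I(x, y) = sum_ij p(x_i, y_j) ln[p(x_i, y_j) / (p(x_i) p(y_j))]."""
    px, py = marginals(joint)
    return sum(pxy * log(pxy / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, pxy in enumerate(row) if pxy > 0)

def normalised_mutual_information(joint):
    """I_norm = 2 I(x, y) / (H(x) + H(y))."""
    px, py = marginals(joint)
    return 2 * mutual_information(joint) / (entropy(px) + entropy(py))
```

For two independent variables the normalised value is 0, and for two identical variables it is 1, which is what makes it convenient as a similarity score.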
  • 75. Comparing DAGs with mutual information
      The probability for picking a tag at random from the descendants of tag $i$ in the exact hierarchy, $G_e$, and in the reconstructed hierarchy, $G_r$:
      $p_e(i) = \dfrac{|D_e(i)|}{N-1}$, $p_r(i) = \dfrac{|D_r(i)|}{N-1}$.
      The probability for picking a tag at random from the intersection of the two sets of descendants:
      $p_{e,r}(i) = \dfrac{|D_e(i) \cap D_r(i)|}{N-1}$.
      Based on this, the Normalised Mutual Information between the exact and reconstructed hierarchies:
      $I_{e,r} = -\dfrac{2\sum_{i=1}^{N} p_{e,r}(i) \ln \dfrac{p_{e,r}(i)}{p_e(i)\,p_r(i)}}{\sum_{i=1}^{N} p_e(i) \ln p_e(i) + \sum_{i=1}^{N} p_r(i) \ln p_r(i)} = -\dfrac{2\sum_{i=1}^{N} |D_e(i) \cap D_r(i)| \ln \dfrac{|D_e(i) \cap D_r(i)|\,(N-1)}{|D_e(i)| \cdot |D_r(i)|}}{\sum_{i=1}^{N} |D_e(i)| \ln \dfrac{|D_e(i)|}{N-1} + \sum_{i=1}^{N} |D_r(i)| \ln \dfrac{|D_r(i)|}{N-1}}$.
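The second form of $I_{e,r}$ above only needs the descendant sets, so it can be evaluated directly. The sketch below is an illustration under assumptions: hierarchies are passed as dicts mapping a tag to its children, empty descendant sets and empty intersections contribute zero (the convention $0 \ln 0 = 0$), and the names `descendants` and `dag_nmi` are hypothetical.

```python
from math import log

def descendants(children, node):
    """All tags reachable from `node` along parent -> child links."""
    seen = set()
    stack = list(children.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, ()))
    return seen

def dag_nmi(children_exact, children_recon, tags):
    """Normalised mutual information between two hierarchies on `tags`."""
    n = len(tags)
    numerator = 0.0
    denominator = 0.0
    for tag in tags:
        d_e = descendants(children_exact, tag)
        d_r = descendants(children_recon, tag)
        common = len(d_e & d_r)
        if common:
            numerator += common * log(common * (n - 1) / (len(d_e) * len(d_r)))
        if d_e:
            denominator += len(d_e) * log(len(d_e) / (n - 1))
        if d_r:
            denominator += len(d_r) * log(len(d_r) / (n - 1))
    # Assumption: return 0 when both entropy sums vanish.
    return -2 * numerator / denominator if denominator else 0.0
```

Two identical hierarchies give a value of 1, while a reconstruction that flattens every tag directly under the root shares no information with a deeper exact hierarchy and gives 0.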
  • 79. Results in numbers
      Hierarchy (target)                          GO subset    user defined
      Data (input)                                proteins     simulated tagging
      Matching links          algorithm A         21%          31%
                              algorithm B         20%          89%
                              P. H. & H. G.-M.    19%          48%
                              P. Schmitz          18%          1%
      Acceptable links        algorithm A         66%          35%
                              algorithm B         52%          91%
                              P. H. & H. G.-M.    51%          54%
                              P. Schmitz          65%          2%
      Normalised Mut. Info.   algorithm A         35%          18%
                              algorithm B         30%          83%
                              P. H. & H. G.-M.    30%          29%
                              P. Schmitz          30%          1%
      Linearised Mut. Info.   algorithm A         78%          66%
                              algorithm B         75%          97%
                              P. H. & H. G.-M.    75%          76%
                              P. Schmitz          75%          5%