Lecture: Semantic Word Clouds

Semantic Analysis in Language Technology
http://stp.lingfil.uu.se/~santinim/sais/2014/sais_2014.htm
Semantic Word Clouds
Marina Santini
santinim@stp.lingfil.uu.se
Department of Linguistics and Philology
Uppsala University, Uppsala, Sweden
Autumn 2014
Lect 10: Semantic Word Clouds

Acknowledgements
• Some slides borrowed from Sergey Pupyrev.

Outline
• Word Clouds
• 3 early algorithms
• 3 new algorithms
• Metrics & Quantitative Evaluation

Word Clouds
• Word clouds have become a standard tool for abstracting, visualizing and comparing texts…
• We could apply the same or similar techniques to the huge amounts of tags produced by users interacting in social networks

Comparison & Conceptualization Tool
• Word clouds as a tool for "conceptualizing" documents. Cf. ontologies
• Ex: 2008, comparison of speeches: Obama vs McCain

Word Clouds and Tag Clouds…
• … are often used to represent importance among terms (e.g., band popularity) or serve as a navigation tool (e.g., Google search results).

The Problem…
• How to compute semantic-preserving word clouds in which semantically related words are close to each other.

Wordle
http://www.wordle.net
• Practical tools, like Wordle, make word cloud visualization easy.
• Shortcoming: they do not capture the relationships between words in any way

Many word clouds are arranged randomly (look also at the scattered colours)

Semantic Patterns
• Humans instinctively tend to pick up patterns
• Instinctively, one could say that two words that are close to each other in a word cloud are semantically related.

So, it makes sense to place such related words close to each other (look also at the color distribution)

In linguistics and in LT…
• … if a pair of words often appears together in a sentence, then we can assume that this pair of words is semantically related.
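The co-occurrence assumption above can be illustrated with a small sketch. It counts, for every pair of words, the number of sentences that contain both. The function name `cooccurrence_counts`, the naive sentence splitting on `.`, `!` and `?`, and the sample text are assumptions made for illustration only; they are not taken from the lecture.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence_counts(text):
    """Count, for each unordered word pair, the sentences containing both words."""
    counts = Counter()
    # Naive sentence split (an assumption; a real tokenizer would do better).
    for sentence in re.split(r"[.!?]+", text):
        # Distinct lowercase words of the sentence, sorted so pairs are canonical.
        words = sorted(set(re.findall(r"[a-z]+", sentence.lower())))
        for pair in combinations(words, 2):
            counts[pair] += 1
    return counts

text = ("Word clouds show words. "
        "Semantic clouds place related words together. "
        "Words matter.")
counts = cooccurrence_counts(text)
print(counts[("clouds", "words")])  # number of sentences with both words
```

Pairs with higher counts would then be treated as more strongly related when the layout is computed.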

Semantic word clouds have higher user satisfaction compared to other layouts…

All recent word cloud visualization tools aim to incorporate semantics in the layout…

… but none of them provides any guarantee about the quality of the layout in terms of semantics

Early algorithms: Force-Directed Graph
• Most of the existing algorithms are based on force-directed graph layout.
• Force-directed graph drawing algorithms are a class of algorithms for drawing graphs in an aesthetically pleasing way
– Attractive forces between pairs to reduce empty space
– Repulsive forces ensure that words do not overlap
– A final force preserves semantic relations between words.
Force-directed graph drawing algorithms assign forces among the set of edges and the set of nodes of a graph drawing. Typically, spring-like attractive forces based on Hooke's law are used to attract pairs of endpoints of the graph's edges towards each other, while simultaneously repulsive forces like those of electrically charged particles based on Coulomb's law are used to separate all pairs of nodes.
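As a rough illustration of the two forces just described, the sketch below performs a single update step of a generic force-directed layout: a Hooke-like spring acts along every edge and a Coulomb-like repulsion acts between every pair of nodes. The function name `force_step` and all constants (`k_spring`, `rest_len`, `k_repulse`, `step`) are hypothetical choices, not values from any of the word cloud algorithms discussed here.

```python
import math

def force_step(pos, edges, k_spring=0.1, rest_len=1.0, k_repulse=0.5, step=0.1):
    """Apply one force-directed update and return the new positions.

    pos: dict node -> (x, y); edges: set of (u, v) pairs.
    """
    force = {v: [0.0, 0.0] for v in pos}
    nodes = list(pos)
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            dx = pos[v][0] - pos[u][0]
            dy = pos[v][1] - pos[u][1]
            dist = math.hypot(dx, dy) or 1e-9  # avoid division by zero
            # Coulomb-like repulsion between every pair (negative = push apart).
            pull = -k_repulse / dist ** 2
            if (u, v) in edges or (v, u) in edges:
                # Hooke-like spring on each edge: positive (pulls together)
                # when the edge is stretched beyond rest_len.
                pull += k_spring * (dist - rest_len)
            fx, fy = pull * dx / dist, pull * dy / dist
            force[u][0] += fx
            force[u][1] += fy
            force[v][0] -= fx
            force[v][1] -= fy
    return {v: (pos[v][0] + step * force[v][0], pos[v][1] + step * force[v][1])
            for v in pos}

# Two connected words that start far apart move towards each other.
new_pos = force_step({"word": (0.0, 0.0), "cloud": (5.0, 0.0)}, {("word", "cloud")})
```

Repeating the step until the positions stop changing gives a simple layout; real implementations usually add cooling schedules and handle box sizes rather than points.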

Newer Algorithms: rectangle representation of graphs
• Vertex-weighted and edge-weighted graph:
– The vertices of the graph are the words
  • Their weights correspond to some measure of importance (e.g. word frequencies)
– The edges capture the semantic relatedness of pairs of words (e.g. co-occurrence)
  • Their weights correspond to the strength of the relation
– Each vertex can be drawn as a box (rectangle) with dimensions determined by its weight
– The realized adjacency is the sum of the edge weights over all pairs of touching boxes.
– The goal is to maximize the realized adjacencies.
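The realized adjacency of a given layout can be computed directly from its definition. In the sketch below, boxes are axis-aligned rectangles `(x, y, w, h)`, and two boxes count as touching when they share a boundary segment of positive length. This touching test, the tolerance `eps`, and the sample data are assumptions for illustration.

```python
def touching(a, b, eps=1e-9):
    """True if boxes a and b, given as (x, y, w, h), share a side segment."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Signed overlap of the projections on each axis (0 means "just meeting").
    ox = min(ax + aw, bx + bw) - max(ax, bx)
    oy = min(ay + ah, by + bh) - max(ay, by)
    return (abs(ox) <= eps and oy > eps) or (abs(oy) <= eps and ox > eps)

def realized_adjacency(boxes, weights):
    """Sum the edge weights over all pairs of touching boxes.

    boxes: dict word -> (x, y, w, h); weights: dict (word, word) -> weight.
    """
    return sum(w for (u, v), w in weights.items()
               if touching(boxes[u], boxes[v]))

boxes = {"word": (0, 0, 2, 1), "cloud": (2, 0, 3, 1), "tag": (10, 10, 1, 1)}
weights = {("word", "cloud"): 0.8, ("word", "tag"): 0.5, ("cloud", "tag"): 0.3}
score = realized_adjacency(boxes, weights)  # only "word" and "cloud" touch
```

A layout algorithm that maximizes this score tries to make strongly related words share a side.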

Experimental Setup:
1) Term Extraction
2) Ranking
3) Similarity Computation

Early Algorithms
1. Wordle (Random)
2. Context-Preserving Word Cloud Visualization (CPWCV)
3. Seam Carving

Wordle → Random
• The Wordle algorithm places one word at a time in a greedy fashion, aiming to use space as efficiently as possible.
• First the words are sorted by weight in decreasing order.
• Then for each word in the order, a position is picked at random.
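The two steps above (sort by decreasing weight, then pick random positions) can be sketched as follows. The slides do not say how collisions are handled, so retrying with a new random position until the box fits is an assumption, as are the box-size rule (height equal to the weight, width proportional to word length), the canvas size, and the names `random_layout` and `overlaps`.

```python
import random

def overlaps(a, b):
    """True if boxes a and b, given as (x, y, w, h), share interior area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def random_layout(weights, width=100.0, height=100.0, seed=0, max_tries=1000):
    """Place words greedily, heaviest first, at random non-overlapping spots."""
    rng = random.Random(seed)
    placed = {}
    for word in sorted(weights, key=weights.get, reverse=True):
        h = weights[word]        # hypothetical: font height equals the weight
        w = 0.6 * h * len(word)  # hypothetical: fixed character width ratio
        for _ in range(max_tries):
            box = (rng.uniform(0, width - w), rng.uniform(0, height - h), w, h)
            # Keep the first random position that hits no placed word.
            if not any(overlaps(box, other) for other in placed.values()):
                placed[word] = box
                break
    return placed

layout = random_layout({"cloud": 5, "word": 3, "tag": 2, "layout": 1})
```

Because positions ignore word similarity entirely, related words end up near each other only by chance.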

[Figures: Random, steps 1 to 6]

Context-Preserving Word Cloud Visualization (CPWCV)
• First, a dissimilarity matrix is computed and Multidimensional Scaling (MDS) is performed
• Second, an effort is made to create a compact layout
Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a dataset.
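To make the MDS step more concrete, here is a minimal sketch that places items in the plane so that their pairwise distances approximate a given dissimilarity matrix. It minimizes the raw stress (the sum of squared differences between layout distances and dissimilarities) by plain gradient descent. This is one simple way to perform MDS and not necessarily the variant used by CPWCV; the learning rate, iteration count and sample matrix are assumptions.

```python
import math
import random

def stress(pos, dissim):
    """Sum of squared differences between layout distances and dissimilarities."""
    n = len(pos)
    return sum((math.dist(pos[i], pos[j]) - dissim[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def mds_layout(dissim, iters=500, lr=0.05, seed=0):
    """Gradient descent on the stress, starting from random 2D positions."""
    rng = random.Random(seed)
    n = len(dissim)
    pos = [[rng.random(), rng.random()] for _ in range(n)]
    for _ in range(iters):
        grad = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                d = math.hypot(dx, dy) or 1e-9  # avoid division by zero
                # Derivative of (d - dissim)^2 with respect to pos[i].
                coeff = 2 * (d - dissim[i][j]) / d
                grad[i][0] += coeff * dx
                grad[i][1] += coeff * dy
        for i in range(n):
            pos[i][0] -= lr * grad[i][0]
            pos[i][1] -= lr * grad[i][1]
    return pos

# Words 0 and 1 are similar (small dissimilarity); word 2 is unlike both.
dissim = [[0.0, 0.2, 0.9],
          [0.2, 0.0, 0.8],
          [0.9, 0.8, 0.0]]
positions = mds_layout(dissim)
```

In a word cloud setting the resulting points would serve as initial word positions, before any step that compacts the layout.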

[Figures: CPWCV, steps 1 to 3; step 2: repulsive force; step 3: attractive force]

Seam Carving
• Seam carving is a content-aware image resizing technique
• Basically, an algorithm for image resizing
• It was invented at Mitsubishi’s

[Figures: Seam Carving, steps 1 to 7; steps 2 and 6: space is divided into regions; step 3: empty paths trimmed out iteratively]

3 New Algorithms
1. Inflate and Push
2. Star Forest
3. Cycle Cover

Inflate-and-Push
• Simple heuristic method for word layout, which aims to preserve semantic relations between pairs of words.

[Figures: Inflate, steps 1 to 5; step 2: scaling down; step 3: semantically related words are placed close to each other; step 4: repulsive force to resolve overlaps]

Star Forest
• A star is a tree in which at most one vertex (the centre) has degree greater than 1, and a star forest is a forest whose connected components are all stars.
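The definition can be checked mechanically. The sketch below tests whether a simple undirected graph, given as a list of edges, is a star forest. It relies on the observation that in a star every edge joins the centre to a leaf, so every edge must have at least one endpoint of degree 1. The function name `is_star_forest` and the sample graphs are illustrative, and the check assumes there are no duplicate edges.

```python
from collections import Counter

def is_star_forest(edges):
    """True if every connected component of the graph is a star.

    edges: iterable of (u, v) pairs of a simple undirected graph.
    """
    edges = list(edges)
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Every edge must end in a leaf; this rules out cycles and 3-edge paths.
    return all(min(degree[u], degree[v]) == 1 for u, v in edges)

two_stars = [("cloud", "word"), ("cloud", "tag"), ("graph", "edge")]
path = [("a", "b"), ("b", "c"), ("c", "d")]
print(is_star_forest(two_stars), is_star_forest(path))
```

Here `two_stars` consists of a star centred on "cloud" plus a single-edge star, whereas the middle edge of `path` has no leaf endpoint.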

Star Forest: star = graph
• Dissimilarity matrix → disjoint stars = star forest
• Attractive force to get a compact layout

Cycle Cover
• This algorithm is based on a similarity matrix.
• First, a similarity path (= cycle) is created
• Then, the optimal level of compactness is computed

Quantitative Metrics

Criteria
1. Realized Adjacencies
– how close are similar words to each other?
2. Distortion
– how distant are dissimilar words?
3. Compactness
– how well utilized is the drawing area?
4. Uniform Area Utilization
– uniformity of the distribution (overpopulated vs sparse areas in the word cloud)
5. Aspect Ratio
– width and height of the bounding box
6. Running Time
– execution time
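Two of these criteria are easy to compute from a finished layout. The sketch below measures area utilization as the total word area divided by the bounding-box area, and aspect ratio as bounding-box width divided by height. These exact formulas are assumptions chosen for illustration (the slides only describe the criteria informally), and the code assumes the word boxes `(x, y, w, h)` do not overlap.

```python
def bounding_box(boxes):
    """Smallest axis-aligned rectangle (x, y, w, h) enclosing all boxes."""
    x0 = min(x for x, y, w, h in boxes)
    y0 = min(y for x, y, w, h in boxes)
    x1 = max(x + w for x, y, w, h in boxes)
    y1 = max(y + h for x, y, w, h in boxes)
    return x0, y0, x1 - x0, y1 - y0

def area_utilization(boxes):
    """Fraction of the bounding box covered by words (assumes no overlaps)."""
    _, _, bw, bh = bounding_box(boxes)
    return sum(w * h for _, _, w, h in boxes) / (bw * bh)

def aspect_ratio(boxes):
    """Width of the bounding box divided by its height."""
    _, _, bw, bh = bounding_box(boxes)
    return bw / bh

boxes = [(0, 0, 2, 1), (2, 0, 2, 1), (0, 1, 1, 1)]
utilization = area_utilization(boxes)  # word area 5 in a 4 x 2 bounding box
ratio = aspect_ratio(boxes)
```

A utilization close to 1 indicates a tightly packed cloud, while the aspect ratio shows whether the cloud suits a given display area.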

2 datasets
(1) WIKI, a set of 112 plain-text articles extracted from the English Wikipedia, each consisting of at least 200 distinct words
(2) PAPERS, a set of 56 research papers published in conferences on experimental algorithms (SEA and ALENEX) in 2011-2012.

Cycle Cover wins

Seam Carving wins

Random wins

Inflate wins

Random and Seam Carving win

All ok except Seam Carving

Demo

Final Words

The end
