1. Semantic Analysis in Language Technology
http://stp.lingfil.uu.se/~santinim/sais/2014/sais_2014.htm
Semantic Word Clouds
Marina Santini
santinim@stp.lingfil.uu.se
Department of Linguistics and Philology
Uppsala University, Uppsala, Sweden
Autumn 2014
Lect 10: Semantic Word Clouds
3. Outline
• Word Clouds
• 3 early algorithms
• 3 new algorithms
• Metrics & Quantitative Evaluation
4. Word Clouds
• Word clouds have become a standard tool for abstracting, visualizing and comparing texts…
• We could apply the same or similar techniques to the huge amounts of tags produced by users interacting in social networks.
5. Comparison & Conceptualization Tool
• Word clouds as a tool for "conceptualizing" documents. Cf. ontologies.
• Ex.: 2008, comparison of speeches: Obama vs. McCain
6. Word Clouds and Tag Clouds…
• … are often used to represent importance among terms (e.g., band popularity) or serve as a navigation tool (e.g., Google search results).
7. The Problem…
• How to compute semantic-preserving word clouds in which semantically related words are close to each other?
8. Wordle
http://www.wordle.net
• Practical tools, like Wordle, make word cloud visualization easy.
• Shortcoming: they do not capture the relationships between words in any way.
9. Many word clouds are arranged randomly (look also at the scattered colours)
10. Semantic Patterns
• Humans instinctively tend to pick up patterns.
• Instinctively, one could say that two words that are close to each other in a word cloud are semantically related.
11. So, it makes sense to place such related words close to each other (look also at the color distribution)
12. In linguistics and in LT…
• … if a pair of words often appears together in a sentence, then we can assume that this pair of words is semantically related.
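The co-occurrence assumption above can be illustrated with a minimal sketch. The function name `cooccurrence_counts`, the naive sentence splitter and the sample text are all illustrative assumptions, not part of the lecture material.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence_counts(text):
    """Count, for every unordered word pair, the sentences that contain both words."""
    counts = Counter()
    # Naive sentence split on '.', '!' or '?' (an assumption, not a real tokenizer).
    for sentence in re.split(r"[.!?]+", text.lower()):
        # Use each distinct word once per sentence; sort so pairs have a canonical order.
        words = sorted(set(re.findall(r"[a-z]+", sentence)))
        counts.update(combinations(words, 2))
    return counts

sample = "Cats chase mice. Mice fear cats. Dogs chase cats."
pairs = cooccurrence_counts(sample)
# Pairs that share more sentences are treated as more strongly related.
for pair, n in pairs.most_common(3):
    print(pair, n)
```

A higher count is read here as stronger semantic relatedness; real systems would typically normalize these counts.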
13. Semantic word clouds have higher user satisfaction compared to other layouts…
14. All recent word cloud visualization tools aim to incorporate semantics in the layout…
15. … but none of them provides any guarantee about the quality of the layout in terms of semantics
16. Early algorithms: Force-Directed Graph
• Most of the existing algorithms are based on force-directed graph layout.
• Force-directed graph drawing algorithms are a class of algorithms for drawing graphs in an aesthetically pleasing way
– Attractive forces between pairs to reduce empty space
– Repulsive forces ensure that words do not overlap
– A final force preserves semantic relations between words.
Force-directed graph drawing algorithms assign forces among the set of edges and the set of nodes of a graph drawing. Typically, spring-like attractive forces based on Hooke's law are used to attract pairs of endpoints of the graph's edges towards each other, while simultaneously repulsive forces like those of electrically charged particles based on Coulomb's law are used to separate all pairs of nodes.
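As a rough illustration of the two forces described above, the following sketch performs one update step on point-like nodes. The function `force_step` and all constants (`k_spring`, `rest`, `k_repulse`, `step`) are hypothetical choices for illustration; real word cloud layouts work with rectangles rather than points.

```python
import math

def force_step(pos, edges, k_spring=0.1, rest=1.0, k_repulse=0.5, step=0.1):
    """One iteration: Hooke-like springs along edges, Coulomb-like repulsion on all pairs."""
    force = {v: [0.0, 0.0] for v in pos}
    nodes = list(pos)
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            dx = pos[v][0] - pos[u][0]
            dy = pos[v][1] - pos[u][1]
            dist = math.hypot(dx, dy) or 1e-9  # avoid division by zero
            # Repulsion between every pair, proportional to 1 / d^2 (negative = push apart).
            pull = -k_repulse / dist ** 2
            # Spring attraction only along edges, proportional to the stretch.
            if (u, v) in edges or (v, u) in edges:
                pull += k_spring * (dist - rest)
            fx, fy = pull * dx / dist, pull * dy / dist
            force[u][0] += fx
            force[u][1] += fy
            force[v][0] -= fx
            force[v][1] -= fy
    return {v: (pos[v][0] + step * force[v][0], pos[v][1] + step * force[v][1])
            for v in pos}

layout = {"word": (0.0, 0.0), "cloud": (5.0, 0.0), "graph": (0.5, 0.5)}
edges = {("word", "cloud")}
for _ in range(50):
    layout = force_step(layout, edges)
print(layout)
```

Repeating the step lets connected nodes drift towards their rest distance while unconnected nodes move apart.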
17. Newer Algorithms: rectangle representation of graphs
• Vertex-weighted and edge-weighted graph:
– The vertices of the graph are the words
• Their weights correspond to some measure of importance (e.g. word frequencies)
– The edges capture the semantic relatedness of pairs of words (e.g. co-occurrence)
• Their weights correspond to the strength of the relation
– Each vertex can be drawn as a box (rectangle) with dimensions determined by its weight
– The realized adjacency is the sum of the edge weights for all pairs of touching boxes.
– The goal is to maximize the realized adjacencies.
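The realized adjacency of a given layout can be computed as in the sketch below. It assumes axis-aligned boxes given as `(x, y, width, height)` and treats two boxes as touching when they share a boundary segment of positive length; the helper names and the sample layout are illustrative.

```python
def touching(a, b, eps=1e-9):
    """True if two non-overlapping boxes share a boundary segment of positive length."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Positive = overlap length, zero = boxes abut, negative = gap between boxes.
    x_overlap = min(ax + aw, bx + bw) - max(ax, bx)
    y_overlap = min(ay + ah, by + bh) - max(ay, by)
    share_vertical_side = abs(x_overlap) <= eps and y_overlap > eps
    share_horizontal_side = abs(y_overlap) <= eps and x_overlap > eps
    return share_vertical_side or share_horizontal_side

def realized_adjacencies(boxes, edge_weights):
    """Sum the weights of all edges whose two boxes touch in the layout."""
    return sum(w for (u, v), w in edge_weights.items()
               if touching(boxes[u], boxes[v]))

boxes = {
    "word": (0, 0, 4, 2),
    "cloud": (4, 0, 5, 2),    # shares a vertical side with "word"
    "layout": (0, 2, 6, 2),   # sits on top of "word" and "cloud"
    "graph": (20, 20, 5, 2),  # isolated
}
edge_weights = {("word", "cloud"): 0.9, ("word", "layout"): 0.4, ("cloud", "graph"): 0.7}
print(realized_adjacencies(boxes, edge_weights))
```

Only the first two edges are realized here, because "graph" does not touch "cloud".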
18. Experimental Setup:
1) Term Extraction
2) Ranking
3) Similarity Computation
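The three steps can be sketched end to end as follows. The slide does not say which ranking or similarity functions were used, so term frequency and cosine similarity over sentence-occurrence vectors are assumptions here, as are the tiny stop word list and all function names.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "in", "to", "is", "are"}  # tiny illustrative list

def extract_terms(sentence):
    """Step 1: lowercase, split into alphabetic tokens, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]

def rank_terms(sentences, top_k):
    """Step 2: rank terms by raw frequency (an assumed ranking function)."""
    freq = Counter(t for s in sentences for t in extract_terms(s))
    return [term for term, _ in freq.most_common(top_k)]

def cosine_similarity(term_a, term_b, sentences):
    """Step 3: cosine of the 0/1 vectors marking the sentences each term occurs in."""
    va = [1 if term_a in extract_terms(s) else 0 for s in sentences]
    vb = [1 if term_b in extract_terms(s) else 0 for s in sentences]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(va)) * math.sqrt(sum(vb))
    return dot / norm if norm else 0.0

sentences = [
    "Word clouds visualize texts",
    "Semantic word clouds place related words together",
    "Texts contain words",
]
top = rank_terms(sentences, 4)
print(top)
print(cosine_similarity("word", "texts", sentences))
```

The ranked terms become the vertices of the graph, and the pairwise similarities become its edge weights.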
19. Early Algorithms
1. Wordle (Random)
2. Context-Preserving Word Cloud Visualization (CPWCV)
3. Seam Carving
20. Wordle → Random
• The Wordle algorithm places one word at a time in a greedy fashion, aiming to use space as efficiently as possible.
• First the words are sorted by weight in decreasing order.
• Then for each word in the order, a position is picked at random.
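A minimal sketch of this greedy placement is given below. The slide does not state how collisions are resolved, so the spiral search around the random starting point is an assumption; the function names, canvas size and box sizes are illustrative as well.

```python
import math
import random

def overlaps(a, b):
    """True if two (x, y, width, height) boxes have overlapping interiors."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def greedy_layout(sizes, weights, canvas=100.0, seed=0):
    """Place words one at a time, heaviest first, each starting from a random position."""
    rng = random.Random(seed)
    placed = {}
    for word in sorted(sizes, key=lambda w: weights[w], reverse=True):
        w, h = sizes[word]
        x0, y0 = rng.uniform(0, canvas), rng.uniform(0, canvas)
        t = 0.0
        while True:
            # Assumed collision handling: walk outwards along a spiral until the box fits.
            box = (x0 + t * math.cos(t), y0 + t * math.sin(t), w, h)
            if not any(overlaps(box, other) for other in placed.values()):
                placed[word] = box
                break
            t += 0.1
    return placed

sizes = {"semantic": (8, 2), "word": (4, 2), "cloud": (5, 2), "layout": (6, 2)}
weights = {"semantic": 10, "word": 7, "cloud": 5, "layout": 2}
layout = greedy_layout(sizes, weights)
for word, box in layout.items():
    print(word, box)
```

Note that nothing in this procedure looks at the relations between words, which is the shortcoming discussed earlier.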
27. Context-Preserving Word Cloud Visualization (CPWCV)
• First, a dissimilarity matrix is computed and Multidimensional Scaling (MDS) is performed. MDS is a means of visualizing the level of similarity of individual cases of a dataset.
• Second, an effort is made to create a compact layout.
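To make the MDS step concrete, here is a small pure-Python sketch that places items in 2D so that their distances approximate a given dissimilarity matrix. It uses iterative stress majorization (the Guttman transform, as in SMACOF); the slide does not say which MDS variant CPWCV uses, so this choice, the function name `mds_2d` and the circular starting layout are assumptions.

```python
import math

def mds_2d(d, iterations=1000):
    """Iteratively move n points in 2D so that their distances approach matrix d."""
    n = len(d)
    # Start on a circle so that the initial points are not collinear (arbitrary choice).
    x = [[math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n)]
         for i in range(n)]
    for _ in range(iterations):
        new_x = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                dist = math.hypot(x[i][0] - x[j][0], x[i][1] - x[j][1])
                # Ratio > 1 stretches the pair, ratio < 1 shrinks it.
                ratio = d[i][j] / dist if dist > 1e-12 else 0.0
                for k in range(2):
                    new_x[i][k] += ratio * (x[i][k] - x[j][k])
        x = [[c / n for c in row] for row in new_x]
    return x

# Dissimilarities of three words; they happen to form a 3-4-5 triangle.
dissimilarity = [[0, 3, 4],
                 [3, 0, 5],
                 [4, 5, 0]]
points = mds_2d(dissimilarity)
print(points)
```

If only similarities in [0, 1] are available, a common convention is to use 1 minus the similarity as the dissimilarity.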
31. Seam Carving
• Seam carving is a content-aware image resizing technique
• Basically, an algorithm for image resizing
• It was invented at Mitsubishi’s
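The core of seam carving is a dynamic program that finds the connected top-to-bottom path of lowest total energy, which is then removed. The sketch below shows that step on a small energy grid; the function names and the grid values are illustrative, and how the energy is defined for a word cloud is not covered here.

```python
def min_vertical_seam(energy):
    """Return (seam, total): one column index per row of the cheapest connected seam."""
    rows, cols = len(energy), len(energy[0])
    cost = [row[:] for row in energy]
    # cost[r][c] = cheapest seam ending at (r, c), coming from one of the cells above.
    for r in range(1, rows):
        for c in range(cols):
            lo, hi = max(c - 1, 0), min(c + 1, cols - 1)
            cost[r][c] += min(cost[r - 1][lo:hi + 1])
    # Backtrack from the cheapest cell in the last row.
    c = min(range(cols), key=lambda j: cost[-1][j])
    total = cost[-1][c]
    seam = [c]
    for r in range(rows - 1, 0, -1):
        lo, hi = max(c - 1, 0), min(c + 1, cols - 1)
        c = min(range(lo, hi + 1), key=lambda j: cost[r - 1][j])
        seam.append(c)
    seam.reverse()
    return seam, total

def remove_seam(grid, seam):
    """Delete one cell per row, making the grid one column narrower."""
    return [row[:c] + row[c + 1:] for row, c in zip(grid, seam)]

energy = [[1, 4, 3],
          [5, 2, 6],
          [7, 8, 1]]
seam, total = min_vertical_seam(energy)
print(seam, total)  # [0, 1, 2] 4
print(remove_seam(energy, seam))
```

Repeating the two functions shrinks the grid one column at a time while removing the least important cells first.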
39. 3 New Algorithms
1. Inflate and Push
2. Star Forest
3. Cycle Cover
40. Inflate-and-Push
• Simple heuristic method for word layout, which aims to preserve semantic relations between pairs of words.
46. Star Forest
• A star is a tree in which one center vertex is joined to all the other vertices, and a star forest is a forest whose connected components are all stars.
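The definition can be checked mechanically, as in this sketch. It assumes a simple undirected graph (no self-loops or duplicate edges) and counts an isolated vertex as a trivial star; the function name `is_star_forest` is illustrative.

```python
from collections import defaultdict

def is_star_forest(vertices, edges):
    """True if every connected component is a star (one center joined to all others)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen = set()
    for start in vertices:
        if start in seen:
            continue
        # Collect the connected component of `start` with a depth-first search.
        component, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v in component:
                continue
            component.add(v)
            stack.extend(adj[v] - component)
        seen |= component
        k = len(component)
        edge_count = sum(len(adj[v]) for v in component) // 2
        has_center = any(len(adj[v]) == k - 1 for v in component)
        # A star on k vertices is a tree (k - 1 edges) with a vertex of degree k - 1.
        if edge_count != k - 1 or not has_center:
            return False
    return True

stars = [("a", "b"), ("a", "c"), ("a", "d"), ("e", "f")]
path = [("a", "b"), ("b", "c"), ("c", "d")]
print(is_star_forest("abcdef", stars))  # True
print(is_star_forest("abcd", path))     # False
```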
47. Star Forest: star = graph
• Dissimilarity matrix → disjoint stars = star forest
• Attractive force to get a compact layout
48. Cycle Cover
• This algorithm is based on a similarity matrix.
• First, a similarity path (= cycle) is created
• Then, the optimal level of compactness is computed
50. Criteria
1. Realized Adjacencies
– how close are similar words to each other?
2. Distortion
– how distant are dissimilar words?
3. Compactness
– how well utilized is the drawing area?
4. Uniform Area Utilization
– uniformity of the distribution (overpopulated vs sparse areas in the word cloud)
5. Aspect Ratio
– width and height of the bounding box
6. Running Time
– execution time
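Two of the geometric criteria can be illustrated with a short sketch. The slide does not give the exact formulas, so the definitions below are assumptions: aspect ratio as bounding-box width divided by height, and area utilization as the total word area divided by the bounding-box area, for non-overlapping `(x, y, width, height)` boxes.

```python
def bounding_box(boxes):
    """Smallest axis-aligned rectangle (x, y, width, height) containing all boxes."""
    x0 = min(x for x, y, w, h in boxes)
    y0 = min(y for x, y, w, h in boxes)
    x1 = max(x + w for x, y, w, h in boxes)
    y1 = max(y + h for x, y, w, h in boxes)
    return (x0, y0, x1 - x0, y1 - y0)

def aspect_ratio(boxes):
    """Width of the bounding box divided by its height (assumed definition)."""
    _, _, width, height = bounding_box(boxes)
    return width / height

def area_utilization(boxes):
    """Fraction of the bounding box covered by words, assuming no overlaps."""
    _, _, width, height = bounding_box(boxes)
    used = sum(w * h for x, y, w, h in boxes)
    return used / (width * height)

tight = [(0, 0, 4, 2), (4, 0, 4, 2), (0, 2, 8, 2)]
loose = [(0, 0, 4, 2), (6, 0, 2, 2)]
print(aspect_ratio(tight), area_utilization(tight))  # 2.0 1.0
print(aspect_ratio(loose), area_utilization(loose))  # 4.0 0.75
```

The second layout leaves a gap between its two boxes, which lowers its utilization.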
51. 2 datasets
(1) WIKI, a set of 112 plain-text articles extracted from the English Wikipedia, each consisting of at least 200 distinct words
(2) PAPERS, a set of 56 research papers published in conferences on experimental algorithms (SEA and ALENEX) in 2011-2012.