This was my final project back in 2009, for the Natural Language Processing class in the CS department at the University of Pittsburgh, PA, USA, taught by Professor Rebecca Hwa.
The backup slides contain many details about LDA, hyperparameters, how to calculate the distributions based on MLE, etc.
1. LDA on Social Bookmarking Systems: an experiment on CiteULike
Introduction to Natural Language Processing, CS2731
Professor Rebecca Hwa
University of Pittsburgh
Denis Parra-Santander
December 16th 2009
2. Outline
• Topic Modeling (joke, to check the mood of the audience…)
• LDA: Introduction ("Sorry, I'm nervous…")
• Motivation (smart statement…)
• Definitions (Monte Carlo: a great place to spend your vacations; Dirichlet: [diʀiˈkleː], "Uuuh… Uuuh…")
• Experiments
• Evaluation method
• Results
• END
3. Topic modeling: Evolution
LSA [Deerwester et al. 90]: find "latent" structure or "concepts" in a text corpus
◦ Compare texts using a vector-based representation that is learned from the corpus
◦ Relies on SVD (for dimensionality reduction)
PLSA [Hofmann 99]: extends LSA by adding the idea of mixture decomposition derived from a latent class model
LDA [Blei et al. 2003]: extends PLSA by using a generative model, in particular by adding a Dirichlet prior
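To make the SVD step concrete, here is a minimal sketch (not part of the original slides): it builds a toy term-document count matrix and projects documents into a low-rank "concept" space with truncated SVD, then compares them there. The toy documents and the choice of two components are assumptions for illustration only.

```python
# Minimal LSA sketch: learn a low-rank "concept" space from a toy corpus via
# truncated SVD, then compare documents in that space.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "information about catalog pricing changes",
    "hands-on science ideas to try in the kitchen",
    "online catalog shopping cart and checkout",
]
X = CountVectorizer().fit_transform(docs)        # term-document count matrix
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = lsa.fit_transform(X)               # documents as points in the latent space
print(cosine_similarity(doc_vectors))            # pairwise document similarities
```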
4. LDA: Generative Model* (I/II)
Document 22, words: information, about, catalog, pricing, changes, 2009, welcome, looking, hands-on, science, ideas, try, kitchen
• LDA assumes that each word in the document was generated by a distribution of topics over words.
◦ Topic 15: science, experiment, learning, ideas, practice, information
◦ Topic 9: catalog, shopping, buy, internet, checkout, cart
• Paired with an inference mechanism (Gibbs sampling), it learns per-document distributions over topics and per-topic distributions over words.
*Original slide by Daniel Ramage, Stanford University
5. LDA I/II: Graphical Model
Graphical model representations: a categorical distribution Cat with arrows to the words w1, w2, w3, w4, …, wn.
Compact notation: a single node w1 inside a "plate" indexed by n, meaning "generate a word from Cat n times".
*Original slide by Roger Levy, UCSD
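As a tiny illustration of what the plate stands for (my own sketch, with a made-up vocabulary and probabilities), the compact diagram is just a loop: draw n i.i.d. words from one categorical distribution.

```python
# "Generate a word from Cat n times": n i.i.d. draws from one categorical distribution.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["science", "experiment", "catalog", "shopping"]
cat = [0.4, 0.3, 0.2, 0.1]                  # the categorical distribution "Cat"
n = 10                                      # the plate: repeat the draw n times
words = rng.choice(vocab, size=n, p=cat)
print(list(words))
```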
6. LDA II/II: Graphical Model
Plate diagram: D documents, each with Nd words; per-document topic distribution θ(d); per-word topic assignment zi and word wi; T topic-word distributions φ(j); hyperparameters α and β.
θ(d) ∼ Dirichlet(α)   (distribution over topics for each document)
zi ∼ Discrete(θ(d))   (topic assignment for each word)
φ(j) ∼ Dirichlet(β)   (distribution over words for each topic)
wi ∼ Discrete(φ(zi))   (word generated from its assigned topic)
α, β: Dirichlet priors
*Original slide by Roger Levy, UCSD
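A hedged sketch of the generative process described by the plate diagram above, with toy sizes (D, T, W, Nd) and hyperparameters chosen only for illustration:

```python
# LDA generative process, following the distributions listed above.
import numpy as np

rng = np.random.default_rng(42)
D, T, W, Nd = 3, 2, 6, 8                       # documents, topics, vocabulary size, words per doc
alpha, beta = 0.5, 0.1

phi = rng.dirichlet([beta] * W, size=T)        # phi(j) ~ Dirichlet(beta): word dist. per topic
corpus = []
for d in range(D):
    theta = rng.dirichlet([alpha] * T)         # theta(d) ~ Dirichlet(alpha): topic dist. per doc
    z = rng.choice(T, size=Nd, p=theta)        # z_i ~ Discrete(theta(d)): topic for each word
    w = [int(rng.choice(W, p=phi[zi])) for zi in z]   # w_i ~ Discrete(phi(z_i))
    corpus.append(w)
print(corpus)
```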
7. Learning the parameters
Maximum likelihood estimation (EM)
◦ e.g. Hofmann (1999)
Deterministic approximate algorithms
◦ variational EM; Blei, Ng & Jordan (2001; 2003)
◦ expectation propagation; Minka & Lafferty
(2002)
Markov chain Monte Carlo
◦ full Gibbs sampler; Pritchard et al. (2000)
◦ collapsed Gibbs sampler; Griffiths & Steyvers
(2004)
*Original slide by Roger Levy, UCSD
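As one concrete instance of the approaches listed above (variational inference, not the Gibbs-sampling toolkit used later in this project), here is a hedged sketch using scikit-learn's LatentDirichletAllocation on a toy corpus; the documents and topic count are assumptions for illustration.

```python
# Fitting LDA with variational Bayes (scikit-learn) on a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "science experiment ideas for hands-on learning",
    "catalog shopping cart checkout pricing",
    "science practice information and ideas",
]
X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.transform(X))              # per-document topic proportions
print(lda.components_.shape)         # per-topic word weights (topics x vocabulary terms)
```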
8. My Experiments
IdentifyTopics in a collection of documents
from a social bookmarking system
(citeULike) [Ramage et al. 2008]
Objective: Clusterise documents by LDA
QUESTION: If the documents have, in
addition to title and text, USERTAGS… how
can they help/influence/improve topic
identification/clustering?
9. Tools available
Many implementations of LDA based on Gibbs sampling:
LingPipe (Java)
Mallet (Java)
STMT (Scala) – I chose this one
10. The Dataset
Initially:
◦ Corpus: ~45k documents
◦ Definition of 99 topics (queries)
◦ Gold standard: document-topic identification from expert feedback, defining a ground truth
But then, the gold standard and RAM…
◦ Not all documents were relevant
◦ Unable to train the model with 45k, 20k, or 10k documents
And then, the tags: not all documents in the gold standard had associated tags (# > 2)
◦ Finally: training with 1.1k documents
◦ Experiments on 212 documents
13. Perplexity
Perplexity by content type and number of topics:

                 38 topics     52 topics     99 topics
Tags             1860.7642     1880.7974     1270.8032
Title + text     2526.7589     2447.5477     2755.1329

Using the Stanford Topic Modeling Toolbox (STMT); training with ~1.1k documents, 80% for training and 20% held out to calculate perplexity.
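For reference, a hedged sketch of how a perplexity figure like those above is computed from a held-out set: exponentiate the negative average per-word log-likelihood. The helper name and the toy numbers are mine, not STMT output.

```python
# perplexity = exp( - (sum of log-probabilities of held-out tokens) / (number of tokens) )
import math

def perplexity(token_log_probs):
    """token_log_probs: natural-log probability the model assigns to each held-out token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# sanity check: if every token has probability 1/2000, perplexity is exactly 2000
print(perplexity([math.log(1 / 2000)] * 500))
```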
14. F1 (& precision/recall)
F1, with precision and recall in parentheses:

                 38 topics               52 topics               99 topics
Tags             0.139 (0.118/0.167)     0.168 (0.187/0.152)     0.215 (0.267/0.18)
Title + text     0.1252 (0.122/0.128)    0.157 (0.151/0.163)     0.156 (0.198/0.129)
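Each cell above is the harmonic mean of the precision and recall shown in parentheses; a one-line check (reproducing the tags / 99-topics cell) looks like this:

```python
# F1 as the harmonic mean of precision and recall.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.267, 0.18), 3))   # -> 0.215, the tags / 99-topics cell
```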
15. Conclusions
Results are not the same as in the "motivational" paper, but they are consistent with its conclusions (the dataset is very domain-specific)
Pending: combining tags and documents, in particular MM-LDA
Importance to NLP: extensions of the model have been used to:
◦ learn syntactic and semantic factors that guide word choice
◦ identify authorship
◦ many others
17. "Invent new worlds and watch your word;
The adjective, when it doesn't give life, kills…"
Ars Poetica, Vicente Huidobro
("Inventa nuevos mundos y cuida tu palabra;
El adjetivo, cuando no da vida, mata…")
18. References
Heinrich, G. (2008). Parameter estimation for text analysis. Technical report, University of Leipzig.
Ramage, D., P. Heymann, C. D. Manning, and H. Garcia-Molina (2009). Clustering the tagged web. In WSDM '09: Proceedings of the Second ACM International Conference on Web Search and Data Mining, New York, NY, USA, pp. 54-63. ACM.
Steyvers, M. and T. Griffiths (2007). Probabilistic Topic Models. Lawrence Erlbaum Associates.
20. LSA: 3 claims (2 shared with LDA)
Semantic information can be derived from a word-document co-occurrence matrix
Dimensionality reduction is an essential part of this derivation
Words and documents can be represented as points in a Euclidean space => different from LDA: semantic properties of words and docs are expressed in terms of probabilistic topics
22. Inverting the generative model
Maximum likelihood estimation (EM)
◦ e.g. Hofmann (1999)
Deterministic approximate algorithms
◦ variational EM; Blei, Ng & Jordan (2001; 2003)
◦ expectation propagation; Minka & Lafferty (2002)
Markov chain Monte Carlo
◦ full Gibbs sampler; Pritchard et al. (2000)
◦ collapsed Gibbs sampler; Griffiths & Steyvers (2004)
23. The collapsed Gibbs sampler
Using conjugacy of Dirichlet and multinomial
distributions, integrate out continuous parameters
Defines a distribution on discrete ensembles z
$$P(\mathbf{w}\mid\mathbf{z}) = \int_{\Delta_W^T} P(\mathbf{w}\mid\mathbf{z},\Phi)\,p(\Phi)\,d\Phi
\qquad
P(\mathbf{z}) = \int_{\Delta_T^D} P(\mathbf{z}\mid\Theta)\,p(\Theta)\,d\Theta$$

$$P(\mathbf{z}\mid\mathbf{w}) = \frac{P(\mathbf{w}\mid\mathbf{z})\,P(\mathbf{z})}{\sum_{\mathbf{z}} P(\mathbf{w}\mid\mathbf{z})\,P(\mathbf{z})}$$

$$P(\mathbf{w}\mid\mathbf{z}) = \left(\frac{\Gamma(W\beta)}{\Gamma(\beta)^{W}}\right)^{T} \prod_{j=1}^{T} \frac{\prod_{w}\Gamma\!\left(n_{j}^{(w)}+\beta\right)}{\Gamma\!\left(n_{j}^{(\cdot)}+W\beta\right)}
\qquad
P(\mathbf{z}) = \left(\frac{\Gamma(T\alpha)}{\Gamma(\alpha)^{T}}\right)^{D} \prod_{d=1}^{D} \frac{\prod_{j}\Gamma\!\left(n_{d}^{(j)}+\alpha\right)}{\Gamma\!\left(n_{d}^{(\cdot)}+T\alpha\right)}$$
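A hedged sketch of how the two integrated terms above can be evaluated numerically from count matrices (using log-Gamma for stability); the function names and the count-matrix layout are my own conventions, not part of the slides.

```python
# n_jw[j, w]: number of times word w is assigned to topic j
# n_dj[d, j]: number of words in document d assigned to topic j
import numpy as np
from scipy.special import gammaln

def log_p_w_given_z(n_jw, beta):
    T, W = n_jw.shape
    return (T * (gammaln(W * beta) - W * gammaln(beta))
            + np.sum(gammaln(n_jw + beta))
            - np.sum(gammaln(n_jw.sum(axis=1) + W * beta)))

def log_p_z(n_dj, alpha):
    D, T = n_dj.shape
    return (D * (gammaln(T * alpha) - T * gammaln(alpha))
            + np.sum(gammaln(n_dj + alpha))
            - np.sum(gammaln(n_dj.sum(axis=1) + T * alpha)))
```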
24. The collapsed Gibbs sampler
Sample each zi conditioned on z-i
This is nicer than your average Gibbs sampler:
◦ memory: counts can be cached in two sparse matrices
◦ optimization: no special functions, simple arithmetic
◦ the distributions on Φ and Θ are analytic given z and w, and can
later be found for each sample
$$P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\; \frac{n_{-i,j}^{(w_i)}+\beta}{n_{-i,j}^{(\cdot)}+W\beta}\cdot\frac{n_{-i,j}^{(d_i)}+\alpha}{n_{-i,\cdot}^{(d_i)}+T\alpha}$$
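Below is a compact, hedged implementation of this update on a toy corpus of word-id lists; it is a sketch for illustration, not the STMT sampler used in the experiments, and the toy corpus, hyperparameters, and iteration count are arbitrary.

```python
# A compact collapsed Gibbs sampler implementing the update above on a toy corpus.
import numpy as np

def gibbs_lda(docs, T, W, alpha=0.1, beta=0.01, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    D = len(docs)
    n_jw = np.zeros((T, W))                    # word-topic counts
    n_dj = np.zeros((D, T))                    # document-topic counts
    z = [rng.integers(T, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):             # initialize counts from the random assignment
        for i, w in enumerate(doc):
            n_jw[z[d][i], w] += 1
            n_dj[d, z[d][i]] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                j = z[d][i]
                n_jw[j, w] -= 1; n_dj[d, j] -= 1       # remove word i from the counts (z_-i)
                p = ((n_jw[:, w] + beta) / (n_jw.sum(axis=1) + W * beta)
                     * (n_dj[d] + alpha))              # update above, up to a constant in j
                p /= p.sum()
                j = rng.choice(T, p=p)
                z[d][i] = j
                n_jw[j, w] += 1; n_dj[d, j] += 1
    return n_jw, n_dj

docs = [[0, 0, 1, 2], [2, 1, 1, 0], [3, 4, 4, 5], [5, 4, 3, 3]]   # word ids in a toy vocabulary
n_jw, n_dj = gibbs_lda(docs, T=2, W=6)
print(n_jw)   # each topic should tend to concentrate on one half of the vocabulary
```

The document-side denominator of the update is constant across topics, so the sketch drops it before normalizing.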
27. Extension: a model for meetings
Plate diagram: utterances u = 1…U, each with Nu words; per-utterance topic distribution θ(u); a switch su that decides whether to keep the previous utterance's distribution θ(u-1); topic assignments zi and words wi; T topic-word distributions φ(j); priors α and β.
θ(u) | su = 0 ∼ Delta(θ(u-1))
θ(u) | su = 1 ∼ Dirichlet(α)
zi ∼ Discrete(θ(u))
φ(j) ∼ Dirichlet(β)
wi ∼ Discrete(φ(zi))
(Purver, Kording, Griffiths, & Tenenbaum, 2006)
28. Sample of ICSI meeting corpus
(25 meetings)
no it's o_k.
it's it'll work.
well i can do that.
but then i have to end the presentation in the middle so i can go back to open up javabayes.
o_k fine.
here let's see if i can.
alright.
very nice.
is that better.
yeah.
o_k.
uh i'll also get rid of this click to add notes.
o_k. perfect
NEW TOPIC (not supplied to algorithm)
so then the features we decided or we decided we were talked about.
right.
uh the the prosody the discourse verb choice.
you know we had a list of things like to go and to visit and what not.
the landmark-iness of uh.
i knew you'd like that.
30. Comparison with human judgments
Topics recovered are much more coherent than those found
using random segmentation, no segmentation, or an HMM
31. Learning the number of topics
Can use standard Bayes factor methods to
evaluate models of different dimensionality
◦ e.g. importance sampling via MCMC
Alternative: nonparametric Bayes
◦ fixed number of topics per document,
unbounded number of topics per corpus
(Blei, Griffiths, Jordan, & Tenenbaum, 2004)
◦ unbounded number of topics for both (the
hierarchical Dirichlet process)
(Teh, Jordan, Beal, & Blei, 2004)
32. The Author-Topic model
(Rosen-Zvi, Griffiths, Smyth, & Steyvers, 2004)
Plate diagram: D documents with Nd words each; A authors, each with a topic distribution θ(a); T topic-word distributions φ(j); for each word, an author xi and a topic zi; priors α and β.
θ(a) ∼ Dirichlet(α)   (each author has a distribution over topics)
xi ∼ Uniform(A(d))   (the author of each word is chosen uniformly at random from the document's authors)
zi ∼ Discrete(θ(xi))
φ(j) ∼ Dirichlet(β)
wi ∼ Discrete(φ(zi))
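A hedged sketch of that extra author step in the generative process (toy sizes and a made-up author set for one document; illustration only):

```python
# The Author-Topic model's extra step: for each word, pick one of the document's
# authors uniformly at random, then a topic from that author's distribution, then the word.
import numpy as np

rng = np.random.default_rng(1)
A, T, W, Nd = 3, 2, 5, 6                     # authors, topics, vocabulary size, words in the doc
alpha, beta = 0.5, 0.1
theta = rng.dirichlet([alpha] * T, size=A)   # theta(a) ~ Dirichlet(alpha), one per author
phi = rng.dirichlet([beta] * W, size=T)      # phi(j)   ~ Dirichlet(beta), one per topic
authors_of_doc = [0, 2]                      # A(d): the document's author set

words, authors = [], []
for _ in range(Nd):
    x = rng.choice(authors_of_doc)           # x_i ~ Uniform(A(d))
    z = rng.choice(T, p=theta[x])            # z_i ~ Discrete(theta(x_i))
    w = rng.choice(W, p=phi[z])              # w_i ~ Discrete(phi(z_i))
    authors.append(int(x)); words.append(int(w))
print(list(zip(authors, words)))
```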
34. Who wrote what?
A method1 is described which like the kernel1 trick1 in support1 vector1 machines1 SVMs1 lets
us generalize distance1 based2 algorithms to operate in feature1 spaces usually nonlinearly
related to the input1 space This is done by identifying a class of kernels1 which can be
represented as norm1 based2 distances1 in Hilbert spaces It turns1 out that common kernel1
algorithms such as SVMs1 and kernel1 PCA1 are actually really distance1 based2 algorithms and
can be run2 with that class of kernels1 too As well as providing1 a useful new insight1 into how
these algorithms work the present2 work can form the basis1 for conceiving new algorithms
This paper presents2 a comprehensive approach for model2 based2 diagnosis2 which includes
proposals for characterizing and computing2 preferred2 diagnoses2 assuming that the system2
description2 is augmented with a system2 structure2 a directed2 graph2 explicating the
interconnections between system2 components2 Specifically we first introduce the notion of a
consequence2 which is a syntactically2 unconstrained propositional2 sentence2 that
characterizes all consistency2 based2 diagnoses2 and show2 that standard2 characterizations of
diagnoses2 such as minimal conflicts1 correspond to syntactic2 variations1 on a consequence2
Second we propose a new syntactic2 variation on the consequence2 known as negation2 normal
form NNF and discuss its merits compared to standard variations Third we introduce a basic
algorithm2 for computing consequences in NNF given a structured system2 description We
show that if the system2 structure2 does not contain cycles2 then there is always a linear size2
consequence2 in NNF which can be computed in linear time2 For arbitrary1 system2 structures2
we show a precise connection between the complexity2 of computing2 consequences and the
topology of the underlying system2 structure2 Finally we present2 an algorithm2 that
enumerates2 the preferred2 diagnoses2 characterized by a consequence2 The algorithm2 is
shown1 to take linear time2 in the size2 of the consequence2 if the preference criterion1 satisfies
some general conditions
Written by
(1) Scholkopf_B
Written by
(2) Darwiche_A
35. Analysis of PNAS abstracts
Test topic models with a real database
of scientific papers from PNAS
All 28,154 abstracts from 1991-2001
All words occurring in at least five
abstracts, not on “stop” list (20,551)
Total of 3,026,970 tokens in corpus
(Griffiths & Steyvers, 2004)
37. Cold topics vs. hot topics
Top words of the six example topics from the slide (the original slide groups them into cold topics, whose share of the corpus fell over 1991-2001, and hot topics, whose share rose):
Topic 2: SPECIES, GLOBAL, CLIMATE, CO2, WATER, ENVIRONMENTAL, YEARS, MARINE, CARBON, DIVERSITY, OCEAN, EXTINCTION, TERRESTRIAL, COMMUNITY, ABUNDANCE
Topic 134: MICE, DEFICIENT, NORMAL, GENE, NULL, MOUSE, TYPE, HOMOZYGOUS, ROLE, KNOCKOUT, DEVELOPMENT, GENERATED, LACKING, ANIMALS, REDUCED
Topic 179: APOPTOSIS, DEATH, CELL, INDUCED, BCL, CELLS, APOPTOTIC, CASPASE, FAS, SURVIVAL, PROGRAMMED, MEDIATED, INDUCTION, CERAMIDE, EXPRESSION
Topic 37: CDNA, AMINO, SEQUENCE, ACID, PROTEIN, ISOLATED, ENCODING, CLONED, ACIDS, IDENTITY, CLONE, EXPRESSED, ENCODES, RAT, HOMOLOGY
Topic 289: KDA, PROTEIN, PURIFIED, MOLECULAR, MASS, CHROMATOGRAPHY, POLYPEPTIDE, GEL, SDS, BAND, APPARENT, LABELED, IDENTIFIED, FRACTION, DETECTED
Topic 75: ANTIBODY, ANTIBODIES, MONOCLONAL, ANTIGEN, IGG, MAB, SPECIFIC, EPITOPE, HUMAN, MABS, RECOGNIZED, SERA, EPITOPES, DIRECTED, NEUTRALIZING
39. Effects of hyperparameters
α and β control the relative sparsity of Φ and Θ
◦ smaller α, fewer topics per document
◦ smaller β, fewer words per topic
Good assignments z strike a compromise between these two kinds of sparsity
[Plot: log Γ(x) as a function of x]

$$P(\mathbf{w}\mid\mathbf{z}) = \left(\frac{\Gamma(W\beta)}{\Gamma(\beta)^{W}}\right)^{T} \prod_{j=1}^{T} \frac{\prod_{w}\Gamma\!\left(n_{j}^{(w)}+\beta\right)}{\Gamma\!\left(n_{j}^{(\cdot)}+W\beta\right)}
\qquad
P(\mathbf{z}) = \left(\frac{\Gamma(T\alpha)}{\Gamma(\alpha)^{T}}\right)^{D} \prod_{d=1}^{D} \frac{\prod_{j}\Gamma\!\left(n_{d}^{(j)}+\alpha\right)}{\Gamma\!\left(n_{d}^{(\cdot)}+T\alpha\right)}$$
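A small sketch (my own, not from the slides) of the sparsity effect described above: with smaller α, a Dirichlet draw tends to put most of its mass on a few topics.

```python
# Smaller alpha -> each Dirichlet sample concentrates its mass on fewer topics.
import numpy as np

rng = np.random.default_rng(0)
T = 10
for alpha in (10.0, 1.0, 0.1):
    theta = rng.dirichlet([alpha] * T)
    top2 = np.sort(theta)[-2:].sum()
    print(f"alpha={alpha:>4}: share of mass on the two largest topics = {top2:.2f}")
```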