1. Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large Collections
Konstantin Vorontsov1,3 • Oleksandr Frei4 • Murat Apishev2
Peter Romov3 • Marina Suvorova2 • Anastasia Yanina1
1Moscow Institute of Physics and Technology,
2Moscow State University • 3Yandex • 4Schlumberger
Topic Models: Post-Processing and Applications,
CIKM’15 Workshop
October 19, 2015 • Melbourne, Australia
2. Probabilistic Topic Model (PTM) generating a text collection
Topic model explains terms w in documents d by topics t:
p(w|d) = \sum_{t \in T} p(w|t)\, p(t|d)
Example document:
A spectral-analytical approach to detecting fuzzy extended repeats in genomic sequences
has been developed. The method is based on multiscale estimation of the similarity of
nucleotide sequences in the space of expansion coefficients of GC- and GA-content curve
fragments over classical orthogonal bases. Conditions of optimal approximation have been
found that enable automatic recognition of repeats of various kinds (direct, inverted,
and tandem) on the spectral similarity matrix. The method works equally well at different
data scales. It can reveal traces of segmental duplications and megasatellite regions in
a genome, and regions of synteny when comparing a pair of genomes. It can also be used
for a detailed study of chromosome fragments (searching for fuzzy regions with a repeated
pattern of moderate length).
[Figure: the example document is explained as a mixture of topics; top terms p(w|t) of three of them:
topic 1: 0.018 recognition, 0.013 similarity, 0.011 pattern, ...
topic 2: 0.023 dna, 0.016 genome, 0.009 nucleotide, ...
topic 3: 0.014 basis, 0.009 spectrum, 0.006 orthogonal, ...]
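A minimal numeric illustration of the mixture formula p(w|d) = \sum_t p(w|t) p(t|d) above (a hedged sketch; the toy numbers and names are mine, not from the talk): p(w|d) is just a matrix product of Φ = (p(w|t)) and Θ = (p(t|d)).

```python
import numpy as np

# Toy model: 4 terms, 2 topics, 3 documents (numbers are illustrative).
phi = np.array([[0.5, 0.0],          # p(w|t): each column is a topic, sums to 1
                [0.3, 0.1],
                [0.2, 0.3],
                [0.0, 0.6]])
theta = np.array([[1.0, 0.5, 0.0],   # p(t|d): each column is a document, sums to 1
                  [0.0, 0.5, 1.0]])

# p(w|d) = sum_t p(w|t) p(t|d) is exactly the matrix product Phi @ Theta.
p_wd = phi @ theta
print(p_wd.sum(axis=0))   # each column sums to 1: a distribution over terms
```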
3. Inverse problem: text collection → PTM
Given: D is a set (collection) of documents
W is a set (vocabulary) of terms
ndw = how many times term w appears in document d
Find: parameters \phi_{wt} = p(w|t), \theta_{td} = p(t|d) of the topic model
p(w|d) = \sum_{t \in T} \phi_{wt}\, \theta_{td}
under nonnegativity and normalization constraints
\phi_{wt} \ge 0, \;\; \sum_{w \in W} \phi_{wt} = 1; \qquad \theta_{td} \ge 0, \;\; \sum_{t \in T} \theta_{td} = 1.
The ill-posed problem of matrix factorization:
\Phi \Theta = (\Phi S)(S^{-1} \Theta) = \Phi' \Theta'
for all S such that \Phi', \Theta' are stochastic.
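A quick numeric check of this non-uniqueness (a hedged NumPy sketch; the matrices are made up): any invertible S that keeps both factors stochastic yields a different factorization with exactly the same model p(w|d).

```python
import numpy as np

phi = np.array([[0.7, 0.1],           # p(w|t), columns sum to 1
                [0.2, 0.3],
                [0.1, 0.6]])
theta = np.array([[0.5, 0.4],         # p(t|d), columns sum to 1
                  [0.5, 0.6]])
S = np.array([[0.8, 0.3],             # column-stochastic mixing matrix
              [0.2, 0.7]])

phi2 = phi @ S                        # Phi' = Phi S
theta2 = np.linalg.inv(S) @ theta     # Theta' = S^{-1} Theta

assert np.allclose(phi @ theta, phi2 @ theta2)                 # same p(w|d)
assert np.allclose(phi2.sum(axis=0), 1) and (phi2 >= 0).all()
assert np.allclose(theta2.sum(axis=0), 1) and (theta2 >= 0).all()
```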
4. PLSA — Probabilistic Latent Semantic Analysis [Hofmann, 1999]
Constrained maximization of the log-likelihood:
L(\Phi, \Theta) = \sum_{d,w} n_{dw} \ln \sum_{t} \phi_{wt} \theta_{td} \;\to\; \max_{\Phi,\Theta}
EM-algorithm is a simple iteration method for the nonlinear system
E-step:  p_{tdw} = \mathrm{norm}_{t \in T}\big(\phi_{wt} \theta_{td}\big)
M-step:  \phi_{wt} = \mathrm{norm}_{w \in W}\Big(\sum_{d \in D} n_{dw}\, p_{tdw}\Big); \qquad
         \theta_{td} = \mathrm{norm}_{t \in T}\Big(\sum_{w \in d} n_{dw}\, p_{tdw}\Big)
where \mathrm{norm}_{t \in T}(x_t) = \max\{x_t, 0\} \big/ \sum_{s \in T} \max\{x_s, 0\} is vector normalization.
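A self-contained NumPy sketch of this EM loop on a dense count matrix (illustrative only; production implementations such as BigARTM work with sparse batches):

```python
import numpy as np

def plsa_em(n_dw, num_topics, num_iters=30, seed=0):
    """PLSA EM. n_dw: dense |D| x |W| matrix of term counts."""
    rng = np.random.default_rng(seed)
    D, W = n_dw.shape
    phi = rng.random((W, num_topics));   phi /= phi.sum(axis=0)     # p(w|t)
    theta = rng.random((num_topics, D)); theta /= theta.sum(axis=0)  # p(t|d)
    for _ in range(num_iters):
        n_wt = np.zeros_like(phi)
        n_td = np.zeros_like(theta)
        for d in range(D):
            # E-step: p_tdw proportional to phi_wt * theta_td, normalized over t
            p = phi * theta[:, d]                              # (W, T)
            p /= np.maximum(p.sum(axis=1, keepdims=True), 1e-30)
            c = n_dw[d][:, None] * p                           # n_dw * p_tdw
            n_wt += c
            n_td[:, d] = c.sum(axis=0)
        # M-step: renormalize the expected counts column-wise
        phi = n_wt / n_wt.sum(axis=0, keepdims=True)
        theta = n_td / n_td.sum(axis=0, keepdims=True)
    return phi, theta
```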
5. Graphical Models and Bayesian Inference
In the Bayesian approach, graphical models are used to build
sophisticated generative models.
David M. Blei. Probabilistic topic models // Communications of the ACM,
2012. Vol. 55, No. 4, Pp. 77–84.
6. Graphical Models and Bayesian Inference
In the Bayesian approach, a lot of calculus has to be done for each model
to get from the problem statement to the solution algorithm:
Yi Wang. Distributed Gibbs Sampling of Latent Dirichlet Allocation: The Gritty
Details. 2008.
7. ARTM — Additive Regularization of Topic Model
Maximum log-likelihood with additive regularization criterion R:
\sum_{d,w} n_{dw} \ln \sum_{t} \phi_{wt} \theta_{td} + R(\Phi, \Theta) \;\to\; \max_{\Phi,\Theta}
EM-algorithm is a simple iteration method for the system
E-step:  p_{tdw} = \mathrm{norm}_{t \in T}\big(\phi_{wt} \theta_{td}\big)
M-step:  \phi_{wt} = \mathrm{norm}_{w \in W}\Big(\sum_{d \in D} n_{dw}\, p_{tdw} + \phi_{wt} \frac{\partial R}{\partial \phi_{wt}}\Big); \qquad
         \theta_{td} = \mathrm{norm}_{t \in T}\Big(\sum_{w \in d} n_{dw}\, p_{tdw} + \theta_{td} \frac{\partial R}{\partial \theta_{td}}\Big)
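The norm operator and the regularized M-step translate directly into code; a hedged NumPy sketch (function and variable names are mine):

```python
import numpy as np

def norm_pos(x, axis=0):
    """The norm operator from the slides: clip negatives to zero, then normalize."""
    x = np.maximum(x, 0.0)
    s = x.sum(axis=axis, keepdims=True)
    return np.divide(x, s, out=np.zeros_like(x), where=s > 0)

def regularized_m_step(n_wt, n_td, phi, theta, dR_dphi, dR_dtheta):
    """One ARTM M-step: expected counts plus phi * dR/dphi (resp. theta * dR/dtheta)."""
    phi_new = norm_pos(n_wt + phi * dR_dphi)        # normalize over w for each topic
    theta_new = norm_pos(n_td + theta * dR_dtheta)  # normalize over t for each document
    return phi_new, theta_new
```

With R ≡ 0 this reduces to the PLSA M-step above.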
8. Example: Latent Dirichlet Allocation [Blei, Ng, Jordan, 2003]
Maximum a posteriori (MAP) with Dirichlet prior:
\underbrace{\sum_{d,w} n_{dw} \ln \sum_{t} \phi_{wt} \theta_{td}}_{\text{log-likelihood } L(\Phi,\Theta)}
+ \underbrace{\sum_{t,w} \beta_w \ln \phi_{wt} + \sum_{d,t} \alpha_t \ln \theta_{td}}_{\text{regularization criterion } R(\Phi,\Theta)}
\;\to\; \max_{\Phi,\Theta}
EM-algorithm is a simple iteration method for the system
E-step:  p_{tdw} = \mathrm{norm}_{t \in T}\big(\phi_{wt} \theta_{td}\big)
M-step:  \phi_{wt} = \mathrm{norm}_{w \in W}\Big(\sum_{d \in D} n_{dw}\, p_{tdw} + \beta_w\Big); \qquad
         \theta_{td} = \mathrm{norm}_{t \in T}\Big(\sum_{w \in d} n_{dw}\, p_{tdw} + \alpha_t\Big)
9. Many Bayesian PTMs can be reinterpreted as regularizers in ARTM
smoothing (LDA) for background and stop-words topics
sparsing (anti-LDA) for domain-specific topics
topic decorrelation
topic coherence maximization
supervised learning for classification and regression
semi-supervised learning
using document citations and links
determining the number of topics via entropy sparsing
modeling topical hierarchies
modeling temporal topic dynamics
using vocabularies in multilingual topic models
etc.
Vorontsov K. V., Potapenko A. A. Additive Regularization of Topic Models //
Machine Learning, 2015. Vol. 101, No. 1, Pp. 303–323.
10. ARTM — Additive Regularization of Topic Model
Maximum log-likelihood with additive combination of regularizers:
\sum_{d,w} n_{dw} \ln \sum_{t} \phi_{wt} \theta_{td} + \sum_{i=1}^{n} \tau_i R_i(\Phi, \Theta) \;\to\; \max_{\Phi,\Theta},
where \tau_i are regularization coefficients.
EM-algorithm is a simple iteration method for the system
E-step:  p_{tdw} = \mathrm{norm}_{t \in T}\big(\phi_{wt} \theta_{td}\big)
M-step:  \phi_{wt} = \mathrm{norm}_{w \in W}\Big(\sum_{d \in D} n_{dw}\, p_{tdw} + \phi_{wt} \sum_{i=1}^{n} \tau_i \frac{\partial R_i}{\partial \phi_{wt}}\Big); \qquad
         \theta_{td} = \mathrm{norm}_{t \in T}\Big(\sum_{w \in d} n_{dw}\, p_{tdw} + \theta_{td} \sum_{i=1}^{n} \tau_i \frac{\partial R_i}{\partial \theta_{td}}\Big)
11. Assumptions: what topics would be well-interpretable?
Topics S ⊂ T contain domain-specific terms:
p(w|t), t ∈ S, are sparse and different (weakly correlated).
Topics B ⊂ T contain background terms:
p(w|t), t ∈ B, are dense and consist of common-lexis words.
[Figure: the matrices \Phi_{W \times T} and \Theta_{T \times D}, with domain-specific topics S and background topics B as column/row blocks.]
12. Smoothing regularization (rethinking LDA)
The non-sparsity assumption for background topics t ∈ B:
φwt are similar to a given distribution βw ;
θtd are similar to a given distribution αt.
\sum_{t \in B} \mathrm{KL}_w(\beta_w \,\|\, \phi_{wt}) \to \min_{\Phi}; \qquad
\sum_{d \in D} \mathrm{KL}_t(\alpha_t \,\|\, \theta_{td}) \to \min_{\Theta}.
We minimize the sum of these KL-divergences to get a regularizer:
R(\Phi, \Theta) = \beta_0 \sum_{t \in B} \sum_{w \in W} \beta_w \ln \phi_{wt} + \alpha_0 \sum_{d \in D} \sum_{t \in B} \alpha_t \ln \theta_{td} \;\to\; \max.
The regularized M-step applied for all t ∈ B coincides with LDA:
\phi_{wt} \propto n_{wt} + \beta_0 \beta_w, \qquad \theta_{td} \propto n_{td} + \alpha_0 \alpha_t,
which gives a new, non-Bayesian interpretation of LDA [Blei et al., 2003].
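For completeness, the one-line derivation that connects this to the generic M-step of slide 7 (the slides skip it):

```latex
R = \beta_0 \sum_{w\in W} \beta_w \ln \phi_{wt}
\;\Longrightarrow\;
\frac{\partial R}{\partial \phi_{wt}} = \frac{\beta_0\,\beta_w}{\phi_{wt}}
\;\Longrightarrow\;
\phi_{wt}\,\frac{\partial R}{\partial \phi_{wt}} = \beta_0\,\beta_w .
```

Substituting into \phi_{wt} = \mathrm{norm}(n_{wt} + \phi_{wt}\,\partial R/\partial\phi_{wt}) gives \phi_{wt} \propto n_{wt} + \beta_0\beta_w; the \theta_{td} update is identical in form.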
13. Sparsing regularizer (further rethinking LDA)
The sparsity assumption for domain-specific topics t ∈ S:
distributions φwt, θtd contain many zero probabilities.
We maximize the sum of KL-divergences KL(\beta_w \| \phi_{wt}) and KL(\alpha_t \| \theta_{td}):
R(\Phi, \Theta) = -\beta_0 \sum_{t \in S} \sum_{w \in W} \beta_w \ln \phi_{wt} - \alpha_0 \sum_{d \in D} \sum_{t \in S} \alpha_t \ln \theta_{td} \;\to\; \max.
The regularized M-step gives “anti-LDA”, for all t ∈ S:
\phi_{wt} \propto (n_{wt} - \beta_0 \beta_w)_{+}, \qquad \theta_{td} \propto (n_{td} - \alpha_0 \alpha_t)_{+}.
Varadarajan J., Emonet R., Odobez J.-M. A sparsity constraint for topic
models — application to temporal activity mining // NIPS-2010 Workshop on
Practical Applications of Sparse Modeling: Open Issues and New Directions.
14. Regularization for topics decorrelation
The dissimilarity assumption for domain-specific topics t ∈ S:
if topics are interpretable then they must differ significantly.
We minimize the sum of covariances between the column vectors \phi_t by maximizing:
R(\Phi) = -\frac{\tau}{2} \sum_{t \in S} \sum_{s \in S \setminus \{t\}} \sum_{w \in W} \phi_{wt} \phi_{ws} \;\to\; \max.
The regularized M-step makes the columns of \Phi more distant:
\phi_{wt} \propto \Big(n_{wt} - \tau\, \phi_{wt} \sum_{s \in S \setminus \{t\}} \phi_{ws}\Big)_{+}.
Tan Y., Ou Z. Topic-weak-correlated latent Dirichlet allocation // 7th Int’l
Symp. Chinese Spoken Language Processing (ISCSLP), 2010. — Pp. 224–228.
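Since \partial R/\partial\phi_{wt} = -\tau \sum_{s \ne t} \phi_{ws}, the update above takes a few lines of NumPy (a sketch with my own naming, not BigARTM code):

```python
import numpy as np

def norm_pos(x):
    x = np.maximum(x, 0.0)
    s = x.sum(axis=0, keepdims=True)
    return np.divide(x, s, out=np.zeros_like(x), where=s > 0)

def decorrelation_m_step(n_wt, phi, tau):
    """M-step with the decorrelation regularizer: push columns of phi apart."""
    other = phi.sum(axis=1, keepdims=True) - phi   # sum over topics s != t, per term w
    return norm_pos(n_wt - tau * phi * other)
```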
15. Example: Combination of sparsing, smoothing, and decorrelation
smoothing of background topics B in Φ and Θ
sparsing of domain-specific topics S = T \setminus B in Φ and Θ
decorrelation of topics in Φ
R(\Phi, \Theta) = \beta_1 \sum_{t \in B} \sum_{w \in W} \beta_w \ln \phi_{wt} + \alpha_1 \sum_{d \in D} \sum_{t \in B} \alpha_t \ln \theta_{td}
 - \beta_0 \sum_{t \in S} \sum_{w \in W} \beta_w \ln \phi_{wt} - \alpha_0 \sum_{d \in D} \sum_{t \in S} \alpha_t \ln \theta_{td}
 - \gamma \sum_{t \in T} \sum_{s \in T \setminus \{t\}} \sum_{w \in W} \phi_{wt} \phi_{ws},
where β0, α0, β1, α1, γ are regularization coefficients.
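All three effects fit in one Φ update; a hedged NumPy sketch (the boolean mask is_bg marking background topics B, and all other names, are mine):

```python
import numpy as np

def norm_pos(x):
    x = np.maximum(x, 0.0)
    s = x.sum(axis=0, keepdims=True)
    return np.divide(x, s, out=np.zeros_like(x), where=s > 0)

def combined_phi_step(n_wt, phi, beta, is_bg, beta0, beta1, gamma):
    """Smooth background topics, sparse domain-specific ones, decorrelate all."""
    smooth_or_sparse = np.where(is_bg[None, :],
                                beta1 * beta[:, None],    # t in B: smoothing
                                -beta0 * beta[:, None])   # t in S: sparsing
    decorr = gamma * phi * (phi.sum(axis=1, keepdims=True) - phi)
    return norm_pos(n_wt + smooth_or_sparse - decorr)
```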
16. Multimodal Probabilistic Topic Modeling
Given a text document collection, a Probabilistic Topic Model finds:
p(t|d) — a topic distribution for each document d,
p(w|t) — a term distribution for each topic t.
[Figure: topic modeling turns a collection of text documents into a Documents × Topics matrix (topics of documents) and per-topic lists of words and keyphrases.]
17. Multimodal Probabilistic Topic Modeling
A Multimodal Topic Model additionally finds topic distributions for
document metadata: authors p(a|t), time p(y|t), and so on.
[Figure: the same scheme extended with metadata modalities: authors, date/time, conference, organization, URL, etc.]
18. Multimodal Probabilistic Topic Modeling
A Multimodal Topic Model finds topic distributions for terms p(w|t),
authors p(a|t), time p(y|t), objects on images p(o|t),
linked documents p(d'|t), advertising banners p(b|t), and users p(u|t).
[Figure: the same scheme further extended with ads, images, links, and users modalities.]
19. Multimodal extension of ARTM
W^m is the vocabulary of tokens of the m-th modality, m ∈ M
W = W^1 \sqcup \dots \sqcup W^M is the joint vocabulary of all modalities
Maximum multimodal log-likelihood with regularization:
\sum_{m \in M} \lambda_m \sum_{d \in D} \sum_{w \in W^m} n_{dw} \ln \sum_{t} \phi_{wt} \theta_{td} + R(\Phi, \Theta) \;\to\; \max_{\Phi,\Theta}
EM-algorithm is a simple iteration method for the system
E-step:  p_{tdw} = \mathrm{norm}_{t \in T}\big(\phi_{wt} \theta_{td}\big)
M-step:  \phi_{wt} = \mathrm{norm}_{w \in W^m}\Big(\sum_{d \in D} \lambda_{m(w)}\, n_{dw}\, p_{tdw} + \phi_{wt} \frac{\partial R}{\partial \phi_{wt}}\Big); \qquad
         \theta_{td} = \mathrm{norm}_{t \in T}\Big(\sum_{w \in d} \lambda_{m(w)}\, n_{dw}\, p_{tdw} + \theta_{td} \frac{\partial R}{\partial \theta_{td}}\Big)
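Compared to the unimodal EM above, the only changes are the weight \lambda_{m(w)} on the counts and the per-modality normalization of Φ; a hedged NumPy sketch (names are mine):

```python
import numpy as np

# modality_of[w] = modality id of token w; lam[m] = weight of modality m.
def weighted_counts(n_dw_row, p_tdw, lam, modality_of):
    """Scale n_dw * p_tdw by lambda_{m(w)} before accumulating n_wt, n_td."""
    w_weights = lam[modality_of]                     # (|W|,)
    return (w_weights * n_dw_row)[:, None] * p_tdw   # (|W|, |T|)

def normalize_phi_per_modality(n_wt, modality_of, num_modalities):
    """Each modality's block of rows of Phi is normalized separately over W^m."""
    phi = np.zeros_like(n_wt)
    for m in range(num_modalities):
        rows = (modality_of == m)
        block = np.maximum(n_wt[rows], 0.0)
        s = block.sum(axis=0, keepdims=True)
        phi[rows] = np.divide(block, s, out=np.zeros_like(block), where=s > 0)
    return phi
```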
20. Example: Multi-lingual topic model of Wikipedia
Top 10 words with p(w|t) probabilities (in %) from a two-language
topic model built on Russian and English Wikipedia articles connected
by mutual interlanguage links.
Topic 68                                  Topic 79
research      4.56   институт      6.03   goals     4.48   матч      6.02
technology    3.14   университет   3.35   league    3.99   игрок     5.56
engineering   2.63   программа     3.17   club      3.76   сборная   4.51
institute     2.37   учебный       2.75   season    3.49   фк        3.25
science       1.97   технический   2.70   scored    2.72   против    3.20
program       1.60   технология    2.30   cup       2.57   клуб      3.14

Topic 88                                  Topic 251
opera         7.36   опера         7.82   windows     8.00   windows      6.05
conductor     1.69   оперный       3.13   microsoft   4.03   microsoft    3.76
orchestra     1.14   дирижер       2.82   server      2.93   версия       1.86
wagner        0.97   певец         1.65   software    1.38   приложение   1.86
soprano       0.78   певица        1.51   user        1.03   сервер       1.63
performance   0.78   театр         1.14   security    0.92   server       1.54
21. Example: Recommending articles from blog
Recommendation quality for a baseline matrix-factorization model,
a unimodal model with the user-likes modality only, and two
multimodal models that also incorporate words and user-specified
data (tags and categories).
Model Recall@5 Recall@10 Recall@20
collaborative filtering 0.591 0.652 0.678
likes 0.62 0.59 0.65
likes + words 0.79 0.64 0.68
all modalities 0.80 0.71 0.69
22. BigARTM project
BigARTM features:
Parallel + Online + Multimodal + Regularized Topic Modeling
Out-of-core one-pass processing of Big Data
Built-in library of regularizers and quality measures
BigARTM community:
Code on GitHub: https://github.com/bigartm
Links to docs, discussion group, builds
http://bigartm.org
BigARTM license and programming environment:
Freely available for commercial use (BSD 3-Clause license)
Cross-platform — Windows, Linux, Mac OS X (32 bit, 64 bit)
Programming APIs: command-line, C++, and Python
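A hedged usage sketch with the BigARTM Python API (class and argument names follow the 2015-era documentation at http://bigartm.org; treat exact signatures as assumptions and check the current docs):

```python
import artm

# Batches prepared beforehand on disk (out-of-core, one-pass processing).
bv = artm.BatchVectorizer(data_path='wiki_batches', data_format='batches')

model = artm.ARTM(
    num_topics=100,
    scores=[artm.PerplexityScore(name='perplexity')],
    regularizers=[
        artm.SmoothSparsePhiRegularizer(name='sparse_phi', tau=-0.1),
        artm.DecorrelatorPhiRegularizer(name='decorrelator', tau=1.5e5),
    ])

model.fit_offline(batch_vectorizer=bv, num_collection_passes=10)
print(model.score_tracker['perplexity'].last_value)
```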
23. The BigARTM project: parallel architecture
Concurrent processing of batches D = D_1 \sqcup \dots \sqcup D_B
Simple single-threaded code for ProcessBatch
User controls when to update the model in online algorithm
Deterministic (reproducible) results from run to run
24. Online EM-algorithm for Multi-ARTM
Input: collection D split into batches D_b, b = 1, ..., B;
Output: matrix \Phi;
1:  initialize \phi_{wt} for all w \in W, t \in T;
2:  n_{wt} := 0, \tilde n_{wt} := 0 for all w \in W, t \in T;
3:  for all batches D_b, b = 1, ..., B:
4:      iterate each document d \in D_b at a constant matrix \Phi:
        (\tilde n_{wt}) := (\tilde n_{wt}) + ProcessBatch(D_b, \Phi);
5:      if (synchronize) then
6:          n_{wt} := n_{wt} + \tilde n_{wt} for all w \in W, t \in T;
7:          \phi_{wt} := \mathrm{norm}_{w \in W^m}\big(n_{wt} + \phi_{wt}\,\partial R/\partial\phi_{wt}\big) for all w \in W^m, m \in M, t \in T;
8:          \tilde n_{wt} := 0 for all w \in W, t \in T;
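The same loop as runnable Python (a sketch: single modality for brevity, synchronization after every sync_every batches; process_batch is sketched on the next slide, and reg_grad_phi and the other names are mine):

```python
import numpy as np

def online_em(batches, phi, reg_grad_phi, sync_every=4):
    """Online EM over batches: accumulate n~_wt, fold into n_wt periodically."""
    n_wt = np.zeros_like(phi)          # long-term counts
    tilde_n_wt = np.zeros_like(phi)    # counts since the last synchronization
    for b, batch in enumerate(batches, start=1):
        tilde_n_wt += process_batch(batch, phi)     # E-steps at a constant phi
        if b % sync_every == 0 or b == len(batches):
            n_wt += tilde_n_wt
            raw = np.maximum(n_wt + phi * reg_grad_phi(phi), 0.0)  # regularized M-step
            phi = raw / np.maximum(raw.sum(axis=0, keepdims=True), 1e-30)
            tilde_n_wt[:] = 0.0
    return phi
```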
25. Online EM-algorithm for Multi-ARTM: ProcessBatch
ProcessBatch iterates documents d \in D_b at a constant matrix \Phi.
matrix (\tilde n_{wt}) := ProcessBatch(set of documents D_b, matrix \Phi)
1:  \tilde n_{wt} := 0 for all w \in W, t \in T;
2:  for all d \in D_b:
3:      initialize \theta_{td} := 1/|T| for all t \in T;
4:      repeat
5:          p_{tdw} := \mathrm{norm}_{t \in T}(\phi_{wt}\theta_{td}) for all w \in d, t \in T;
6:          n_{td} := \sum_{w \in d} \lambda_{m(w)}\, n_{dw}\, p_{tdw} for all t \in T;
7:          \theta_{td} := \mathrm{norm}_{t \in T}\big(n_{td} + \theta_{td}\,\partial R/\partial\theta_{td}\big) for all t \in T;
8:      until \theta_d converges;
9:      \tilde n_{wt} := \tilde n_{wt} + \lambda_{m(w)}\, n_{dw}\, p_{tdw} for all w \in d, t \in T;
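A matching Python sketch of ProcessBatch (single modality, a fixed number of inner iterations instead of a convergence test, no Θ regularizer; documents are dense count vectors for simplicity):

```python
import numpy as np

def process_batch(batch, phi, num_inner_iters=10):
    """One pass over a batch at a constant phi; returns the batch's n~_wt counts."""
    W, T = phi.shape
    tilde_n_wt = np.zeros((W, T))
    for n_dw in batch:                     # n_dw: counts of document d, shape (W,)
        theta = np.full(T, 1.0 / T)        # uniform initialization of theta_d
        for _ in range(num_inner_iters):   # the "repeat ... until converges" loop
            p = phi * theta                # p_tdw before normalization, (W, T)
            p /= np.maximum(p.sum(axis=1, keepdims=True), 1e-30)
            n_td = (n_dw[:, None] * p).sum(axis=0)
            theta = n_td / np.maximum(n_td.sum(), 1e-30)
        tilde_n_wt += n_dw[:, None] * p
    return tilde_n_wt
```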
26. BigARTM vs Gensim vs Vowpal Wabbit
3.7M articles from Wikipedia, 100K unique words
                      procs   train     inference   perplexity
BigARTM                 1      35 min     72 sec       4000
Gensim.LdaModel         1     369 min    395 sec       4161
VowpalWabbit.LDA        1      73 min    120 sec       4108
BigARTM                 4       9 min     20 sec       4061
Gensim.LdaMulticore     4      60 min    222 sec       4111
BigARTM                 8     4.5 min     14 sec       4304
Gensim.LdaMulticore     8      57 min    224 sec       4455
procs = number of parallel threads
inference = time to infer θd for 100K held-out documents
perplexity is calculated on held-out documents.
27. Running BigARTM in parallel
3.7M articles from Wikipedia, 100K unique words
Amazon EC2 c3.8xlarge (16 physical cores + hyperthreading)
No extra memory cost for adding more threads
28. Summary / Questions?
1 Additive Regularization
  the PLSA EM algorithm with regularization on the M-step
  a simple way to incorporate assumptions into the topic model
2 Multimodal Topic Models
  incorporate metadata into the topic model
  handle parallel text corpora
3 BigARTM¹: an open-source implementation of Multi-ARTM
  parallel online algorithm
  ready for large collections
  supports multimodal collections
  built-in collection of regularizers

¹ http://bigartm.org/