1. Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large Collections
Konstantin Vorontsov1,3 • Oleksandr Frei4 • Murat Apishev2
Peter Romov3 • Marina Suvorova2 • Anastasia Yanina1
1Moscow Institute of Physics and Technology,
2Moscow State University • 3Yandex • 4Schlumberger
Topic Models: Post-Processing and Applications,
CIKM’15 Workshop
October 19, 2015 • Melbourne, Australia
2. Probabilistic Topic Model (PTM) generating a text collection
Topic model explains terms w in documents d by topics t:
p(w|d) = \sum_{t \in T} p(w|t)\, p(t|d)
Example document:
A spectral-analytical approach to detecting fuzzy extended repeats in genomic sequences
has been developed. The method is based on multiscale estimation of the similarity of
nucleotide sequences in the space of expansion coefficients of GC- and GA-content curve
fragments over classical orthogonal bases. Conditions of optimal approximation have been
found that enable automatic recognition of repeats of various kinds (direct, inverted,
and tandem) on the spectral similarity matrix. The method works equally well at different
data scales. It can reveal traces of segmental duplications and megasatellite regions in
a genome, and regions of synteny when comparing a pair of genomes. It can also be used
for a detailed study of chromosome fragments (searching for fuzzy regions with a repeated
pattern of moderate length).
[Figure: the example document is explained as a mixture of topics; top terms p(w|t) of three of them:
topic 1: 0.018 recognition, 0.013 similarity, 0.011 pattern, ...
topic 2: 0.023 dna, 0.016 genome, 0.009 nucleotide, ...
topic 3: 0.014 basis, 0.009 spectrum, 0.006 orthogonal, ...]
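A minimal numeric illustration of the mixture formula p(w|d) = \sum_t p(w|t) p(t|d) above (a hedged sketch; the toy numbers and names are mine, not from the talk): p(w|d) is just a matrix product of Φ = (p(w|t)) and Θ = (p(t|d)).

```python
import numpy as np

# Toy model: 4 terms, 2 topics, 3 documents (numbers are illustrative).
phi = np.array([[0.5, 0.0],          # p(w|t): each column is a topic, sums to 1
                [0.3, 0.1],
                [0.2, 0.3],
                [0.0, 0.6]])
theta = np.array([[1.0, 0.5, 0.0],   # p(t|d): each column is a document, sums to 1
                  [0.0, 0.5, 1.0]])

# p(w|d) = sum_t p(w|t) p(t|d) is exactly the matrix product Phi @ Theta.
p_wd = phi @ theta
print(p_wd.sum(axis=0))   # each column sums to 1: a distribution over terms
```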
3. Inverse problem: text collection → PTM
Given: D is a set (collection) of documents
W is a set (vocabulary) of terms
ndw = how many times term w appears in document d
Find: parameters \phi_{wt} = p(w|t), \theta_{td} = p(t|d) of the topic model
p(w|d) = \sum_{t \in T} \phi_{wt}\, \theta_{td}
under nonnegativity and normalization constraints
\phi_{wt} \ge 0, \;\; \sum_{w \in W} \phi_{wt} = 1; \qquad \theta_{td} \ge 0, \;\; \sum_{t \in T} \theta_{td} = 1.
The ill-posed problem of matrix factorization:
\Phi \Theta = (\Phi S)(S^{-1} \Theta) = \Phi' \Theta'
for all S such that \Phi', \Theta' are stochastic.
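A quick numeric check of this non-uniqueness (a hedged NumPy sketch; the matrices are made up): any invertible S that keeps both factors stochastic yields a different factorization with exactly the same model p(w|d).

```python
import numpy as np

phi = np.array([[0.7, 0.1],           # p(w|t), columns sum to 1
                [0.2, 0.3],
                [0.1, 0.6]])
theta = np.array([[0.5, 0.4],         # p(t|d), columns sum to 1
                  [0.5, 0.6]])
S = np.array([[0.8, 0.3],             # column-stochastic mixing matrix
              [0.2, 0.7]])

phi2 = phi @ S                        # Phi' = Phi S
theta2 = np.linalg.inv(S) @ theta     # Theta' = S^{-1} Theta

assert np.allclose(phi @ theta, phi2 @ theta2)                 # same p(w|d)
assert np.allclose(phi2.sum(axis=0), 1) and (phi2 >= 0).all()
assert np.allclose(theta2.sum(axis=0), 1) and (theta2 >= 0).all()
```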
4. PLSA — Probabilistic Latent Semantic Analysis [Hofmann, 1999]
Constrained maximization of the log-likelihood:
L(\Phi, \Theta) = \sum_{d,w} n_{dw} \ln \sum_{t} \phi_{wt} \theta_{td} \;\to\; \max_{\Phi,\Theta}
EM-algorithm is a simple iteration method for the nonlinear system
E-step:  p_{tdw} = \mathrm{norm}_{t \in T}\big(\phi_{wt} \theta_{td}\big)
M-step:  \phi_{wt} = \mathrm{norm}_{w \in W}\Big(\sum_{d \in D} n_{dw}\, p_{tdw}\Big); \qquad
         \theta_{td} = \mathrm{norm}_{t \in T}\Big(\sum_{w \in d} n_{dw}\, p_{tdw}\Big)
where \mathrm{norm}_{t \in T}(x_t) = \max\{x_t, 0\} \big/ \sum_{s \in T} \max\{x_s, 0\} is vector normalization.
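A self-contained NumPy sketch of this EM loop on a dense count matrix (illustrative only; production implementations such as BigARTM work with sparse batches):

```python
import numpy as np

def plsa_em(n_dw, num_topics, num_iters=30, seed=0):
    """PLSA EM. n_dw: dense |D| x |W| matrix of term counts."""
    rng = np.random.default_rng(seed)
    D, W = n_dw.shape
    phi = rng.random((W, num_topics));   phi /= phi.sum(axis=0)     # p(w|t)
    theta = rng.random((num_topics, D)); theta /= theta.sum(axis=0)  # p(t|d)
    for _ in range(num_iters):
        n_wt = np.zeros_like(phi)
        n_td = np.zeros_like(theta)
        for d in range(D):
            # E-step: p_tdw proportional to phi_wt * theta_td, normalized over t
            p = phi * theta[:, d]                              # (W, T)
            p /= np.maximum(p.sum(axis=1, keepdims=True), 1e-30)
            c = n_dw[d][:, None] * p                           # n_dw * p_tdw
            n_wt += c
            n_td[:, d] = c.sum(axis=0)
        # M-step: renormalize the expected counts column-wise
        phi = n_wt / n_wt.sum(axis=0, keepdims=True)
        theta = n_td / n_td.sum(axis=0, keepdims=True)
    return phi, theta
```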
5. Graphical Models and Bayesian Inference
In the Bayesian approach, graphical models are used to build
sophisticated generative models.
David M. Blei. Probabilistic topic models // Communications of the ACM,
2012. Vol. 55, No. 4, Pp. 77–84.
6. Graphical Models and Bayesian Inference
In the Bayesian approach, a lot of calculus has to be done for each model
to get from the problem statement to the solution algorithm:
Yi Wang. Distributed Gibbs Sampling of Latent Dirichlet Allocation: The Gritty
Details. 2008.
7. ARTM — Additive Regularization of Topic Model
Maximum log-likelihood with additive regularization criterion R:
\sum_{d,w} n_{dw} \ln \sum_{t} \phi_{wt} \theta_{td} + R(\Phi, \Theta) \;\to\; \max_{\Phi,\Theta}
EM-algorithm is a simple iteration method for the system
E-step:  p_{tdw} = \mathrm{norm}_{t \in T}\big(\phi_{wt} \theta_{td}\big)
M-step:  \phi_{wt} = \mathrm{norm}_{w \in W}\Big(\sum_{d \in D} n_{dw}\, p_{tdw} + \phi_{wt} \frac{\partial R}{\partial \phi_{wt}}\Big); \qquad
         \theta_{td} = \mathrm{norm}_{t \in T}\Big(\sum_{w \in d} n_{dw}\, p_{tdw} + \theta_{td} \frac{\partial R}{\partial \theta_{td}}\Big)
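The norm operator and the regularized M-step translate directly into code; a hedged NumPy sketch (function and variable names are mine):

```python
import numpy as np

def norm_pos(x, axis=0):
    """The norm operator from the slides: clip negatives to zero, then normalize."""
    x = np.maximum(x, 0.0)
    s = x.sum(axis=axis, keepdims=True)
    return np.divide(x, s, out=np.zeros_like(x), where=s > 0)

def regularized_m_step(n_wt, n_td, phi, theta, dR_dphi, dR_dtheta):
    """One ARTM M-step: expected counts plus phi * dR/dphi (resp. theta * dR/dtheta)."""
    phi_new = norm_pos(n_wt + phi * dR_dphi)        # normalize over w for each topic
    theta_new = norm_pos(n_td + theta * dR_dtheta)  # normalize over t for each document
    return phi_new, theta_new
```

With R ≡ 0 this reduces to the PLSA M-step above.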
8. Example: Latent Dirichlet Allocation [Blei, Ng, Jordan, 2003]
Maximum a posteriori (MAP) with Dirichlet prior:
\underbrace{\sum_{d,w} n_{dw} \ln \sum_{t} \phi_{wt} \theta_{td}}_{\text{log-likelihood } L(\Phi,\Theta)}
+ \underbrace{\sum_{t,w} \beta_w \ln \phi_{wt} + \sum_{d,t} \alpha_t \ln \theta_{td}}_{\text{regularization criterion } R(\Phi,\Theta)}
\;\to\; \max_{\Phi,\Theta}
EM-algorithm is a simple iteration method for the system
E-step:  p_{tdw} = \mathrm{norm}_{t \in T}\big(\phi_{wt} \theta_{td}\big)
M-step:  \phi_{wt} = \mathrm{norm}_{w \in W}\Big(\sum_{d \in D} n_{dw}\, p_{tdw} + \beta_w\Big); \qquad
         \theta_{td} = \mathrm{norm}_{t \in T}\Big(\sum_{w \in d} n_{dw}\, p_{tdw} + \alpha_t\Big)
9. Many Bayesian PTMs can be reinterpreted as regularizers in ARTM
smoothing (LDA) for background and stop-words topics
sparsing (anti-LDA) for domain-specific topics
topic decorrelation
topic coherence maximization
supervised learning for classification and regression
semi-supervised learning
using document citations and links
determining the number of topics via entropy sparsing
modeling topical hierarchies
modeling temporal topic dynamics
using vocabularies in multilingual topic models
etc.
Vorontsov K. V., Potapenko A. A. Additive Regularization of Topic Models //
Machine Learning, 2015. Vol. 101, No. 1, Pp. 303–323.
10. ARTM — Additive Regularization of Topic Model
Maximum log-likelihood with additive combination of regularizers:
\sum_{d,w} n_{dw} \ln \sum_{t} \phi_{wt} \theta_{td} + \sum_{i=1}^{n} \tau_i R_i(\Phi, \Theta) \;\to\; \max_{\Phi,\Theta},
where \tau_i are regularization coefficients.
EM-algorithm is a simple iteration method for the system
E-step:  p_{tdw} = \mathrm{norm}_{t \in T}\big(\phi_{wt} \theta_{td}\big)
M-step:  \phi_{wt} = \mathrm{norm}_{w \in W}\Big(\sum_{d \in D} n_{dw}\, p_{tdw} + \phi_{wt} \sum_{i=1}^{n} \tau_i \frac{\partial R_i}{\partial \phi_{wt}}\Big); \qquad
         \theta_{td} = \mathrm{norm}_{t \in T}\Big(\sum_{w \in d} n_{dw}\, p_{tdw} + \theta_{td} \sum_{i=1}^{n} \tau_i \frac{\partial R_i}{\partial \theta_{td}}\Big)
11. Assumptions: what topics would be well-interpretable?
Topics S ⊂ T contain domain-specific terms:
p(w|t), t ∈ S, are sparse and different (weakly correlated).
Topics B ⊂ T contain background terms:
p(w|t), t ∈ B, are dense and consist of common-lexis words.
[Figure: the matrices \Phi_{W \times T} and \Theta_{T \times D}, with domain-specific topics S and background topics B as column/row blocks.]
12. Smoothing regularization (rethinking LDA)
The non-sparsity assumption for background topics t ∈ B:
φwt are similar to a given distribution βw ;
θtd are similar to a given distribution αt.
\sum_{t \in B} \mathrm{KL}_w(\beta_w \,\|\, \phi_{wt}) \to \min_{\Phi}; \qquad
\sum_{d \in D} \mathrm{KL}_t(\alpha_t \,\|\, \theta_{td}) \to \min_{\Theta}.
We minimize the sum of these KL-divergences to get a regularizer:
R(\Phi, \Theta) = \beta_0 \sum_{t \in B} \sum_{w \in W} \beta_w \ln \phi_{wt} + \alpha_0 \sum_{d \in D} \sum_{t \in B} \alpha_t \ln \theta_{td} \;\to\; \max.
The regularized M-step applied for all t ∈ B coincides with LDA:
\phi_{wt} \propto n_{wt} + \beta_0 \beta_w, \qquad \theta_{td} \propto n_{td} + \alpha_0 \alpha_t,
which gives a new, non-Bayesian interpretation of LDA [Blei et al., 2003].
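For completeness, the one-line derivation that connects this to the generic M-step of slide 7 (the slides skip it):

```latex
R = \beta_0 \sum_{w\in W} \beta_w \ln \phi_{wt}
\;\Longrightarrow\;
\frac{\partial R}{\partial \phi_{wt}} = \frac{\beta_0\,\beta_w}{\phi_{wt}}
\;\Longrightarrow\;
\phi_{wt}\,\frac{\partial R}{\partial \phi_{wt}} = \beta_0\,\beta_w .
```

Substituting into \phi_{wt} = \mathrm{norm}(n_{wt} + \phi_{wt}\,\partial R/\partial\phi_{wt}) gives \phi_{wt} \propto n_{wt} + \beta_0\beta_w; the \theta_{td} update is identical in form.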
13. Sparsing regularizer (further rethinking LDA)
The sparsity assumption for domain-specific topics t ∈ S:
distributions φwt, θtd contain many zero probabilities.
We maximize the sum of KL-divergences KL(\beta_w \| \phi_{wt}) and KL(\alpha_t \| \theta_{td}):
R(\Phi, \Theta) = -\beta_0 \sum_{t \in S} \sum_{w \in W} \beta_w \ln \phi_{wt} - \alpha_0 \sum_{d \in D} \sum_{t \in S} \alpha_t \ln \theta_{td} \;\to\; \max.
The regularized M-step gives “anti-LDA”, for all t ∈ S:
\phi_{wt} \propto (n_{wt} - \beta_0 \beta_w)_{+}, \qquad \theta_{td} \propto (n_{td} - \alpha_0 \alpha_t)_{+}.
Varadarajan J., Emonet R., Odobez J.-M. A sparsity constraint for topic
models — application to temporal activity mining // NIPS-2010 Workshop on
Practical Applications of Sparse Modeling: Open Issues and New Directions.
14. Regularization for topics decorrelation
The dissimilarity assumption for domain-specific topics t ∈ S:
if topics are interpretable then they must differ significantly.
We minimize the sum of covariances between the column vectors \phi_t by maximizing:
R(\Phi) = -\frac{\tau}{2} \sum_{t \in S} \sum_{s \in S \setminus \{t\}} \sum_{w \in W} \phi_{wt} \phi_{ws} \;\to\; \max.
The regularized M-step makes the columns of \Phi more distant:
\phi_{wt} \propto \Big(n_{wt} - \tau\, \phi_{wt} \sum_{s \in S \setminus \{t\}} \phi_{ws}\Big)_{+}.
Tan Y., Ou Z. Topic-weak-correlated latent Dirichlet allocation // 7th Int’l
Symp. Chinese Spoken Language Processing (ISCSLP), 2010. — Pp. 224–228.
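Since \partial R/\partial\phi_{wt} = -\tau \sum_{s \ne t} \phi_{ws}, the update above takes a few lines of NumPy (a sketch with my own naming, not BigARTM code):

```python
import numpy as np

def norm_pos(x):
    x = np.maximum(x, 0.0)
    s = x.sum(axis=0, keepdims=True)
    return np.divide(x, s, out=np.zeros_like(x), where=s > 0)

def decorrelation_m_step(n_wt, phi, tau):
    """M-step with the decorrelation regularizer: push columns of phi apart."""
    other = phi.sum(axis=1, keepdims=True) - phi   # sum over topics s != t, per term w
    return norm_pos(n_wt - tau * phi * other)
```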
15. Example: Combination of sparsing, smoothing, and decorrelation
smoothing of background topics B in Φ and Θ
sparsing of domain-specific topics S = T \setminus B in Φ and Θ
decorrelation of topics in Φ
R(\Phi, \Theta) = \beta_1 \sum_{t \in B} \sum_{w \in W} \beta_w \ln \phi_{wt} + \alpha_1 \sum_{d \in D} \sum_{t \in B} \alpha_t \ln \theta_{td}
 - \beta_0 \sum_{t \in S} \sum_{w \in W} \beta_w \ln \phi_{wt} - \alpha_0 \sum_{d \in D} \sum_{t \in S} \alpha_t \ln \theta_{td}
 - \gamma \sum_{t \in T} \sum_{s \in T \setminus \{t\}} \sum_{w \in W} \phi_{wt} \phi_{ws},
where β0, α0, β1, α1, γ are regularization coefficients.
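All three effects fit in one Φ update; a hedged NumPy sketch (the boolean mask is_bg marking background topics B, and all other names, are mine):

```python
import numpy as np

def norm_pos(x):
    x = np.maximum(x, 0.0)
    s = x.sum(axis=0, keepdims=True)
    return np.divide(x, s, out=np.zeros_like(x), where=s > 0)

def combined_phi_step(n_wt, phi, beta, is_bg, beta0, beta1, gamma):
    """Smooth background topics, sparse domain-specific ones, decorrelate all."""
    smooth_or_sparse = np.where(is_bg[None, :],
                                beta1 * beta[:, None],    # t in B: smoothing
                                -beta0 * beta[:, None])   # t in S: sparsing
    decorr = gamma * phi * (phi.sum(axis=1, keepdims=True) - phi)
    return norm_pos(n_wt + smooth_or_sparse - decorr)
```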
16. Multimodal Probabilistic Topic Modeling
Given a text document collection, a Probabilistic Topic Model finds:
p(t|d) — a topic distribution for each document d,
p(w|t) — a term distribution for each topic t.
[Figure: topic modeling turns a collection of text documents into a Documents × Topics matrix (topics of documents) and per-topic lists of words and keyphrases.]
17. Multimodal Probabilistic Topic Modeling
A Multimodal Topic Model additionally finds topic distributions for
document metadata: authors p(a|t), time p(y|t), and so on.
[Figure: the same scheme extended with metadata modalities: authors, date/time, conference, organization, URL, etc.]
18. Multimodal Probabilistic Topic Modeling
A Multimodal Topic Model finds topic distributions for terms p(w|t),
authors p(a|t), time p(y|t), objects on images p(o|t),
linked documents p(d'|t), advertising banners p(b|t), and users p(u|t).
[Figure: the same scheme further extended with ads, images, links, and users modalities.]
19. Multimodal extension of ARTM
W^m is the vocabulary of tokens of the m-th modality, m ∈ M
W = W^1 \sqcup \dots \sqcup W^M is the joint vocabulary of all modalities
Maximum multimodal log-likelihood with regularization:
\sum_{m \in M} \lambda_m \sum_{d \in D} \sum_{w \in W^m} n_{dw} \ln \sum_{t} \phi_{wt} \theta_{td} + R(\Phi, \Theta) \;\to\; \max_{\Phi,\Theta}
EM-algorithm is a simple iteration method for the system
E-step:  p_{tdw} = \mathrm{norm}_{t \in T}\big(\phi_{wt} \theta_{td}\big)
M-step:  \phi_{wt} = \mathrm{norm}_{w \in W^m}\Big(\sum_{d \in D} \lambda_{m(w)}\, n_{dw}\, p_{tdw} + \phi_{wt} \frac{\partial R}{\partial \phi_{wt}}\Big); \qquad
         \theta_{td} = \mathrm{norm}_{t \in T}\Big(\sum_{w \in d} \lambda_{m(w)}\, n_{dw}\, p_{tdw} + \theta_{td} \frac{\partial R}{\partial \theta_{td}}\Big)
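Compared to the unimodal EM above, the only changes are the weight \lambda_{m(w)} on the counts and the per-modality normalization of Φ; a hedged NumPy sketch (names are mine):

```python
import numpy as np

# modality_of[w] = modality id of token w; lam[m] = weight of modality m.
def weighted_counts(n_dw_row, p_tdw, lam, modality_of):
    """Scale n_dw * p_tdw by lambda_{m(w)} before accumulating n_wt, n_td."""
    w_weights = lam[modality_of]                     # (|W|,)
    return (w_weights * n_dw_row)[:, None] * p_tdw   # (|W|, |T|)

def normalize_phi_per_modality(n_wt, modality_of, num_modalities):
    """Each modality's block of rows of Phi is normalized separately over W^m."""
    phi = np.zeros_like(n_wt)
    for m in range(num_modalities):
        rows = (modality_of == m)
        block = np.maximum(n_wt[rows], 0.0)
        s = block.sum(axis=0, keepdims=True)
        phi[rows] = np.divide(block, s, out=np.zeros_like(block), where=s > 0)
    return phi
```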
20. Example: Multi-lingual topic model of Wikipedia
Top 10 words with p(w|t) probabilities (in %) from a two-language
topic model built on Russian and English Wikipedia articles connected
by mutual interlanguage links.
Topic 68                                  Topic 79
research      4.56   институт      6.03   goals     4.48   матч      6.02
technology    3.14   университет   3.35   league    3.99   игрок     5.56
engineering   2.63   программа     3.17   club      3.76   сборная   4.51
institute     2.37   учебный       2.75   season    3.49   фк        3.25
science       1.97   технический   2.70   scored    2.72   против    3.20
program       1.60   технология    2.30   cup       2.57   клуб      3.14

Topic 88                                  Topic 251
opera         7.36   опера         7.82   windows     8.00   windows      6.05
conductor     1.69   оперный       3.13   microsoft   4.03   microsoft    3.76
orchestra     1.14   дирижер       2.82   server      2.93   версия       1.86
wagner        0.97   певец         1.65   software    1.38   приложение   1.86
soprano       0.78   певица        1.51   user        1.03   сервер       1.63
performance   0.78   театр         1.14   security    0.92   server       1.54
21. Example: Recommending articles from blog
Recommendation quality for a baseline matrix-factorization model,
a unimodal model with the user-likes modality only, and two
multimodal models that also incorporate words and user-specified
data (tags and categories).
Model Recall@5 Recall@10 Recall@20
collaborative filtering 0.591 0.652 0.678
likes 0.62 0.59 0.65
likes + words 0.79 0.64 0.68
all modalities 0.80 0.71 0.69
22. BigARTM project
BigARTM features:
Parallel + Online + Multimodal + Regularized Topic Modeling
Out-of-core one-pass processing of Big Data
Built-in library of regularizers and quality measures
BigARTM community:
Code on GitHub: https://github.com/bigartm
Links to docs, discussion group, builds
http://bigartm.org
BigARTM license and programming environment:
Freely available for commercial use (BSD 3-Clause license)
Cross-platform — Windows, Linux, Mac OS X (32 bit, 64 bit)
Programming APIs: command-line, C++, and Python
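A hedged usage sketch with the BigARTM Python API (class and argument names follow the 2015-era documentation at http://bigartm.org; treat exact signatures as assumptions and check the current docs):

```python
import artm

# Batches prepared beforehand on disk (out-of-core, one-pass processing).
bv = artm.BatchVectorizer(data_path='wiki_batches', data_format='batches')

model = artm.ARTM(
    num_topics=100,
    scores=[artm.PerplexityScore(name='perplexity')],
    regularizers=[
        artm.SmoothSparsePhiRegularizer(name='sparse_phi', tau=-0.1),
        artm.DecorrelatorPhiRegularizer(name='decorrelator', tau=1.5e5),
    ])

model.fit_offline(batch_vectorizer=bv, num_collection_passes=10)
print(model.score_tracker['perplexity'].last_value)
```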
23. The BigARTM project: parallel architecture
Concurrent processing of batches D = D_1 \sqcup \dots \sqcup D_B
Simple single-threaded code for ProcessBatch
User controls when to update the model in online algorithm
Deterministic (reproducible) results from run to run
24. Online EM-algorithm for Multi-ARTM
Input: collection D split into batches D_b, b = 1, ..., B;
Output: matrix \Phi;
1:  initialize \phi_{wt} for all w \in W, t \in T;
2:  n_{wt} := 0, \tilde n_{wt} := 0 for all w \in W, t \in T;
3:  for all batches D_b, b = 1, ..., B:
4:      iterate each document d \in D_b at a constant matrix \Phi:
        (\tilde n_{wt}) := (\tilde n_{wt}) + ProcessBatch(D_b, \Phi);
5:      if (synchronize) then
6:          n_{wt} := n_{wt} + \tilde n_{wt} for all w \in W, t \in T;
7:          \phi_{wt} := \mathrm{norm}_{w \in W^m}\big(n_{wt} + \phi_{wt}\,\partial R/\partial\phi_{wt}\big) for all w \in W^m, m \in M, t \in T;
8:          \tilde n_{wt} := 0 for all w \in W, t \in T;
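The same loop as runnable Python (a sketch: single modality for brevity, synchronization after every sync_every batches; process_batch is sketched on the next slide, and reg_grad_phi and the other names are mine):

```python
import numpy as np

def online_em(batches, phi, reg_grad_phi, sync_every=4):
    """Online EM over batches: accumulate n~_wt, fold into n_wt periodically."""
    n_wt = np.zeros_like(phi)          # long-term counts
    tilde_n_wt = np.zeros_like(phi)    # counts since the last synchronization
    for b, batch in enumerate(batches, start=1):
        tilde_n_wt += process_batch(batch, phi)     # E-steps at a constant phi
        if b % sync_every == 0 or b == len(batches):
            n_wt += tilde_n_wt
            raw = np.maximum(n_wt + phi * reg_grad_phi(phi), 0.0)  # regularized M-step
            phi = raw / np.maximum(raw.sum(axis=0, keepdims=True), 1e-30)
            tilde_n_wt[:] = 0.0
    return phi
```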
25. Online EM-algorithm for Multi-ARTM: ProcessBatch
ProcessBatch iterates documents d \in D_b at a constant matrix \Phi.
matrix (\tilde n_{wt}) := ProcessBatch(set of documents D_b, matrix \Phi)
1:  \tilde n_{wt} := 0 for all w \in W, t \in T;
2:  for all d \in D_b:
3:      initialize \theta_{td} := 1/|T| for all t \in T;
4:      repeat
5:          p_{tdw} := \mathrm{norm}_{t \in T}(\phi_{wt}\theta_{td}) for all w \in d, t \in T;
6:          n_{td} := \sum_{w \in d} \lambda_{m(w)}\, n_{dw}\, p_{tdw} for all t \in T;
7:          \theta_{td} := \mathrm{norm}_{t \in T}\big(n_{td} + \theta_{td}\,\partial R/\partial\theta_{td}\big) for all t \in T;
8:      until \theta_d converges;
9:      \tilde n_{wt} := \tilde n_{wt} + \lambda_{m(w)}\, n_{dw}\, p_{tdw} for all w \in d, t \in T;
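A matching Python sketch of ProcessBatch (single modality, a fixed number of inner iterations instead of a convergence test, no Θ regularizer; documents are dense count vectors for simplicity):

```python
import numpy as np

def process_batch(batch, phi, num_inner_iters=10):
    """One pass over a batch at a constant phi; returns the batch's n~_wt counts."""
    W, T = phi.shape
    tilde_n_wt = np.zeros((W, T))
    for n_dw in batch:                     # n_dw: counts of document d, shape (W,)
        theta = np.full(T, 1.0 / T)        # uniform initialization of theta_d
        for _ in range(num_inner_iters):   # the "repeat ... until converges" loop
            p = phi * theta                # p_tdw before normalization, (W, T)
            p /= np.maximum(p.sum(axis=1, keepdims=True), 1e-30)
            n_td = (n_dw[:, None] * p).sum(axis=0)
            theta = n_td / np.maximum(n_td.sum(), 1e-30)
        tilde_n_wt += n_dw[:, None] * p
    return tilde_n_wt
```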
26. BigARTM vs Gensim vs Vowpal Wabbit
3.7M articles from Wikipedia, 100K unique words
                      procs   train     inference   perplexity
BigARTM                 1      35 min     72 sec       4000
Gensim.LdaModel         1     369 min    395 sec       4161
VowpalWabbit.LDA        1      73 min    120 sec       4108
BigARTM                 4       9 min     20 sec       4061
Gensim.LdaMulticore     4      60 min    222 sec       4111
BigARTM                 8     4.5 min     14 sec       4304
Gensim.LdaMulticore     8      57 min    224 sec       4455
procs = number of parallel threads
inference = time to infer θd for 100K held-out documents
perplexity is calculated on held-out documents.
27. Running BigARTM in parallel
3.7M articles from Wikipedia, 100K unique words
Amazon EC2 c3.8xlarge (16 physical cores + hyperthreading)
No extra memory cost for adding more threads
28. Summary / Questions?
1 Additive Regularization
  the PLSA EM algorithm with regularization on the M-step
  a simple way to incorporate assumptions into the topic model
2 Multimodal Topic Models
  incorporate metadata into the topic model
  handle parallel text corpora
3 BigARTM¹: an open-source implementation of Multi-ARTM
  parallel online algorithm
  ready for large collections
  supports multimodal collections
  built-in collection of regularizers

¹ http://bigartm.org/