I gave a talk at the conference Algebraic Statistics 2020.
As background for our research, I briefly explained singular learning theory, which can be interpreted as an intersection of algebraic statistics and statistical learning theory.
The main part of this presentation introduces our recent studies on parameter region restriction in singular learning theory. I presented results on the learning coefficient (real log canonical threshold) of NMF and LDA, which are typical models whose parameter regions are restricted.
1. Bayesian Generalization Error and
Real Log Canonical Threshold in
Non-negative Matrix Factorization
and Latent Dirichlet Allocation
NAOKI HAYASHI (1,2)
2020/06/25
(1) NTT DATA MATHEMATICAL SYSTEMS INC.
SIMULATION & MINING DIVISION
(2) TOKYO INSTITUTE OF TECHNOLOGY
SUMIO WATANABE LABORATORY
1
2. Symbol Notations
q(x): the true (i.e. data-generating) distribution
p(x|w): statistical model given parameter w
φ(w): prior distribution
X^n = (X_1, …, X_n): i.i.d. sample (r.v.) from q(x)
P(X^n|w): likelihood
ψ(w|X^n): posterior distribution
Z(X^n): marginal likelihood (a.k.a. evidence)
p*(x): predictive distribution
Note: P(X^n|w), ψ(w|X^n), Z(X^n), and p*(x) depend on X^n;
thus, they are random variables in function spaces.
2
5. 1. Singular Learning Theory
Problem Setting
• Important random variables are the generalization error G_n and the marginal likelihood Z_n = Z(X^n).
‒ G_n = ∫ q(x) log [q(x) / p*(x)] dx.
It measures how different the true distribution and the predictive distribution are, in the sense of a new data-generating process.
‒ Z_n = ∫ ∏_{i=1}^{n} p(X_i|w) φ(w) dw.
It measures how similar the true distribution is to the model, in the sense of the dataset-generating process.
5
6. 1. Singular Learning Theory
Problem Setting
• How do they behave?
6
7. 1. Singular Learning Theory
Regular Case
• From regular learning theory, if the posterior can be approximated by a normal dist., the following hold:
‒ E[G_n] = d/(2n) + o(1/n),
‒ −log Z_n = nS_n + (d/2) log n + O_p(1),
where d is the dim. of params. and S_n is the empirical entropy.
• AIC and BIC are based on regular learning theory.
7
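The regular-case expansion of −log Z_n can be checked exactly in a conjugate model. The sketch below (mine, not from the slides) uses a Bernoulli model with a uniform prior, where Z is a Beta function in closed form, and verifies that the coefficient of log n approaches d/2 = 1/2 for d = 1 parameter.

```python
import math

def neg_log_marginal_bernoulli(k, n):
    """Exact -log Z for a Bernoulli model with a uniform (Beta(1,1)) prior:
    Z = Integral_0^1 theta^k (1 - theta)^(n - k) d(theta) = B(k+1, n-k+1)."""
    return -(math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2))

def max_log_likelihood_bernoulli(k, n):
    """Maximized log likelihood at theta_hat = k / n."""
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)

# -log Z_n = n*S_n + (d/2) log n + O_p(1) with d = 1 here; using the maximized
# log likelihood in place of -n*S_n, the coefficient of log n tends to 1/2.
n = 100_000
k = n // 2  # a balanced sample, for illustration
coef = (neg_log_marginal_bernoulli(k, n)
        + max_log_likelihood_bernoulli(k, n)) / math.log(n)
print(round(coef, 3))
```

The O_p(1) remainder makes the estimated coefficient slightly below 0.5 at finite n, as expected.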
8. 1. Singular Learning Theory
Regular Case
8
How about singular cases?
(singular = non-regular)
9. 1. Singular Learning Theory
Singular Case
• Hierarchical models and latent variable models are typical singular models.
• Their likelihood and posterior cannot be approximated by any normal dist.
‒ Simple example: the log likelihood is −b^2 (b − a^3)^2 in w = (a, b) space.
9
10. 1. Singular Learning Theory
Singular Case
10
Regular learning theory cannot clarify the behavior of their generalization errors and marginal likelihoods.
11. 1. Singular Learning Theory
Singular Case
• Singular learning theory provides a general theory for the above issue.
• Under some technical assumptions, the following hold, even if the posterior cannot be approximated by any normal dist.:
‒ E[G_n] = λ/n − (m − 1)/(n log n) + o(1/(n log n)),
‒ −log Z_n = nS_n + λ log n − (m − 1) log log n + O_p(1),
where λ, m are constants which depend on p(x|w), φ(w), and q(x).
11
[1] Watanabe. 2001
12. 1. Singular Learning Theory
Singular Case
What are these constants?
12
[1] Watanabe. 2001
13. 1. Singular Learning Theory
Invariants in Algebraic Geometry
• Def. The real log canonical threshold (RLCT) is defined as the negative of the maximum pole of the zeta function
ζ(z) = ∫_W K(w)^z b(w) dw,
‒ where K(w) and b(w) are non-negative and analytic.
• Thm. Put K(w) = KL(q||p) and b(w) = φ(w); then the RLCT is the learning coefficient λ, and the order of the maximum pole is the multiplicity m.
13
This is an important result in singular learning theory.
[1] Watanabe. 2001
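For the special case of a monomial K with a uniform prior, the zeta function factorizes and the pole bookkeeping can be done directly. A minimal sketch (the function name is mine; this special case is the building block after resolution of singularities):

```python
from fractions import Fraction

def rlct_monomial(exponents):
    """RLCT and multiplicity for K(w) = prod_i w_i^(k_i) on [0, 1]^d with
    b(w) = 1.  Here zeta(z) = prod_i 1/(k_i*z + 1), with poles at z = -1/k_i;
    the maximum pole gives lambda = 1/max(k_i), and the multiplicity m counts
    how many factors share that pole."""
    k_max = max(exponents)
    lam = Fraction(1, k_max)
    m = sum(1 for k in exponents if k == k_max)
    return lam, m

# K(w) = a^2 b^2: zeta(z) = 1/(2z+1)^2, so lambda = 1/2 with m = 2.
print(rlct_monomial([2, 2]))  # (Fraction(1, 2), 2)
```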
14. 1. Singular Learning Theory
Invariants in Algebraic Geometry
14
The lines are the set K(w)=0 in the parameter space
and the star is the “deepest” singularity.
15. 1. Singular Learning Theory
Invariants in Algebraic Geometry
15
The deepest singularity corresponds to the maximum pole of the zeta function
ζ(z) = C/(z + λ)^m + ⋯.
(Figure: poles of ζ on the complex plane ℂ; the maximum pole lies at z = −λ.)
16. 1. Singular Learning Theory
Invariants in Algebraic Geometry
• Properties of λ and m are as follows:
‒ λ is a positive rational number and m is a positive integer.
‒ They are birational invariants. We can determine them using blowing-ups, mathematically supported by Hironaka's Singularity Resolution Theorem.
• Applications of λ and m are as follows:
‒ Nagata showed that an exchange probability in replica MCMC is represented by using λ.
‒ Drton proposed sBIC, which approximates log Z_n by using the RLCTs and the multiplicities of candidates (p_i, φ, q = p_j):
sBIC_ij "=" loglik(w_MLE) − λ_ij log n + (m_ij − 1) log log n.
16
[3] Nagata. 2008
[2] Hironaka. 1964
[4] Drton. 2017-a
17. 1. Singular Learning Theory
Invariants in Algebraic Geometry
> “We can determine them using blowing-ups”
In fact, statisticians and machine learning researchers have studied RLCTs for concrete models:
Besides, Imai has proposed a consistent estimator of an RLCT ([9] Imai. 2019). He'll talk about it tomorrow at this conf.
17
Singular model / Author and year
Gaussian mixture model / Yamazaki et al. in 2003 [5]
Reduced rank regression = MF / Aoyagi et al. in 2005 [6]
Naïve Bayesian network / Rusakov et al. in 2005 [7]
Markov model / Zwiernik in 2011 [8]
… / …
20. 2. Parameter Region Restriction
Motivation
• Parameter regions are often restricted for the sake of interpretability.
1. Non-negative restriction
2. Simplex restriction, etc.
20
(Figure: coefficients of a logistic regression of purchase existence for a product, before and after a non-negative restriction; explanatory variables: TVCM, DM, Rating, Reviews.)
[14] Kohjima. 2016
21. 2. Parameter Region Restriction
Motivation
21
What happens when restrictions are added?
How does the generalization error change?
22. 2. Parameter Region Restriction
Motivation
• A future application of clarifying the effect of restriction on generalization is as follows.
22
(Figure: a statistician and a customer in conversation.)
23. 2. Parameter Region Restriction
Motivation
23
Customer: We want to know what happens when the contributions of explanatory variables are restricted to non-negative. We need high prediction accuracy and an interpretable model.
24. 2. Parameter Region Restriction
Motivation
24
Statistician: We can answer it. If the parameter is restricted to non-negative, the prediction performance is reduced by Foo points when n = Bar. To achieve the needed accuracy, we recommend increasing n to Bar+Baz.
25. 2. Parameter Region Restriction
Revisiting Analytic Set
25
The lines are the set K(w)=0 in the parameter space and the star is the “deepest” singularity, corresponding to the maximum pole of the zeta function
ζ(z) = C/(z + λ)^m + ⋯.
29. 2. Parameter Region Restriction
Revisiting Analytic Set
29
When a restriction is added, the deepest singularity changes; i.e., the RLCT and the multiplicity become different.
A simple theoretical example is in the Appendix.
30. 2. Parameter Region Restriction
Recent Studies
Two recent studies of singular learning theory for parameter-restricted models are introduced.
• Non-negative matrix factorization (NMF)
‒ Based on our previous works:
https://doi.org/10.1016/j.neucom.2017.04.068 [10]
https://doi.org/10.1109/ssci.2017.8280811 [11]
https://doi.org/10.1016/j.neunet.2020.03.009 [12]
• Latent Dirichlet allocation (LDA)
‒ Based on our previous work:
https://doi.org/10.1007/s42979-020-0071-3 [13]
30
32. 2. Parameter Region Restriction
Non-negative matrix factorization
• NMF as a statistical model is formulated as follows.
‒ Data matrices: X^n = (X(1), …, X(n)), each M × N, are i.i.d. and subject to q(X_ij) = Poi(X_ij | (A_0 B_0)_ij).
True factorization: A_0: M × H_0, B_0: H_0 × N
‒ Set the model p(X_ij|A, B) = Poi(X_ij | (AB)_ij) and the prior φ(A, B) = ∏_{i,k} Gam(A_ik|φ_A, θ_A) ∏_{k,j} Gam(B_kj|φ_B, θ_B).
Learner factorization: A: M × H, B: H × N
32
(Figure: graphical model with a plate over n; P(X, A, B) = P(X|A, B) P(A) P(B).)
Poi(x|c) = c^x e^{−c} / x! if c > 0, and δ(x) if c = 0
Gam(a|φ, θ) = θ^φ / Γ(φ) · a^{φ−1} e^{−θa}
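The generating process above can be sketched in a few lines. This is my own stdlib-only illustration of the slide's model (Gamma entries for the true factors, Poisson observations); the function names and the small Knuth Poisson sampler are mine.

```python
import math
import random

def sample_poisson(rng, lam):
    """Knuth's multiplication method; fine for the small means used here."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def generate_nmf_data(rng, M, N, H0, phi=1.0, theta=1.0):
    """Draw A0 (M x H0) and B0 (H0 x N) entrywise from Gam(phi, theta)
    (rate theta, i.e. scale 1/theta), then X entrywise from Poi((A0 B0)_ij)."""
    A0 = [[rng.gammavariate(phi, 1.0 / theta) for _ in range(H0)] for _ in range(M)]
    B0 = [[rng.gammavariate(phi, 1.0 / theta) for _ in range(N)] for _ in range(H0)]
    X = [[sample_poisson(rng, sum(A0[i][k] * B0[k][j] for k in range(H0)))
          for j in range(N)] for i in range(M)]
    return A0, B0, X

rng = random.Random(0)
A0, B0, X = generate_nmf_data(rng, M=3, N=4, H0=2)
```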
33. 2. Parameter Region Restriction
Non-negative matrix factorization
33
Matrix X is factorized into a product of two matrices: X (M × N) ≈ A (M × H) B (H × N).
[14] Kohjima. 2016
34. 2. Parameter Region Restriction
RLCT of NMF
• The RLCT of NMF λ satisfies the following inequality:
λ ≤ (1/2) [(H − H_0) min{Mφ_A, Nφ_B} + H_0 (M + N − 1)].
The equality holds if H = H_0 = 1 or H_0 = 0.
• A tighter upper bound is also derived if φ_A = φ_B = 1.
• This result gives a lower bound of the variational approximation error in NMF.
34
[15] Kohjima. 2017
[12] H. 2020-a
[11] H. and Watanabe 2017-b
[10] H. and Watanabe 2017-a
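The upper bound above is a closed-form expression, so it is easy to tabulate. A small sketch (the function name is mine; it transcribes the slide's inequality):

```python
from fractions import Fraction

def nmf_rlct_upper_bound(M, N, H, H0, phi_A=1, phi_B=1):
    """Upper bound of the RLCT of NMF from the slide:
    lambda <= (1/2) * ((H - H0)*min(M*phi_A, N*phi_B) + H0*(M + N - 1))."""
    return Fraction(1, 2) * ((H - H0) * min(M * phi_A, N * phi_B)
                             + H0 * (M + N - 1))

# With H0 = 0 the bound reduces to H * min(M*phi_A, N*phi_B) / 2.
print(nmf_rlct_upper_bound(M=4, N=4, H=2, H0=0))  # 4
```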
36. 2. Parameter Region Restriction
Latent Dirichlet Allocation
• Situation:
‒ LDA treats a bag of words.
‒ Each document (= word list) has some latent topics.
E.g. A mathematics paper is a document. It is expected that there exist a “name” topic, a “math” topic, etc. In the “name” topic, the appearance probability of mathematicians' names may be high.
36
(Figure: a document contains topics such as NAME and MATH; the NAME topic has words like Riemann, Lebesgue, Atiyah, Hironaka; the MATH topic has words like integral, measure, distribution, singularity.)
word
37. 2. Parameter Region Restriction
Latent Dirichlet Allocation
• Documents 𝑧 𝑛 and words 𝑥 𝑛 are observable and
topics 𝑦 𝑛 are not.
• LDA assumes that words occur given documents.
37
MATH
NAME
…
Riemann,
Lebesgue,
Atiyah,
Hironaka,
…
integral,
measure,
distribution,
singularity,
…
document
topic
word
word
n
xyz
𝑥 𝑛
∼ 𝑞 𝑥 𝑧
𝑝 𝑥, 𝑦 𝑧, 𝑤
estimate
38. 2. Parameter Region Restriction
Latent Dirichlet Allocation
38
Data (= word) generating process of LDA
(Figure: for each of Document 1, …, Document N, a topic is drawn for each word position and then a word is drawn from that topic; topics such as NAME, FOOD, MATH emit words such as Alice, sushi, Riemann, integral, pudding, Lebesgue.)
[13] H. and Watanabe 2020-b
39. 2. Parameter Region Restriction
Latent Dirichlet Allocation
39
A topic proportion b_j = (b_1j, …, b_Hj) corresponds to each document.
[13] H. and Watanabe 2020-b
40. 2. Parameter Region Restriction
Latent Dirichlet Allocation
40
A word appearance probability a_k = (a_1k, …, a_Mk) corresponds to each topic.
[13] H. and Watanabe 2020-b
42. 2. Parameter Region Restriction
Stochastic Matrix Factorization
• In the NMF, consider formally replacing the non-negative parameter matrices with stochastic matrices.
‒ A stochastic matrix is a non-negative matrix each of whose columns sums to 1. Example:
[0.1 0.1 0.4 0]
[0.5 0.1 0.4 0]
[0.4 0.8 0.2 1].
• This replaced model is called stochastic matrix factorization (SMF).
42
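The column-sum-one property is the only thing distinguishing a stochastic matrix from a general non-negative one, so any non-negative matrix with positive column sums can be normalized into one. A minimal sketch (function name mine):

```python
def to_stochastic(matrix):
    """Normalize each column of a non-negative matrix so it sums to 1,
    producing a (column-)stochastic matrix."""
    cols = list(zip(*matrix))
    sums = [sum(c) for c in cols]
    return [[matrix[i][j] / sums[j] for j in range(len(sums))]
            for i in range(len(matrix))]

S = to_stochastic([[1, 2], [3, 2]])
col_sums = [sum(col) for col in zip(*S)]
print(col_sums)  # [1.0, 1.0]
```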
43. 2. Parameter Region Restriction
Equivalence of LDA and SMF
• Let K(w) = Σ_z Σ_x q(x|z) q(z) log [q(x|z) / p(x|z, A, B)] and H(w) = ||AB − A_0 B_0||^2, where q(x|z) = Σ_y p(x, y|z, A_0, B_0).
• It can be proved that K(w) ∼ H(w), i.e., the RLCT of LDA is equal to that of SMF.
43
[13] H. 2020-b
44. 2. Parameter Region Restriction
Equivalence of LDA and SMF
44
Thus, we only have to consider SMF to determine λ and m of LDA.
[13] H. 2020-b
45. 2. Parameter Region Restriction
RLCT of LDA = RLCT of SMF
• If the prior is positive and bounded, the RLCT of LDA λ satisfies the following inequality:
λ ≤ (1/2) [(H − H_0) min{M_1, N} + (H_0 − 1)(M_1 + N − 2) + M_1],
where M_1 = M − 1.
The equality holds if H = H_0 = 1, 2.
• Also, if H = 2 and H_0 = 1,
λ = (1/2) (max{M, N} + M − 2).
45
[13] H. 2020-b
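As with the NMF bound, this inequality is closed-form and easy to evaluate. A small sketch (function name mine, transcribing the slide's bound):

```python
from fractions import Fraction

def lda_rlct_upper_bound(M, N, H, H0):
    """Upper bound of the RLCT of LDA (= SMF) from the slide, with M1 = M - 1:
    lambda <= (1/2) * ((H - H0)*min(M1, N) + (H0 - 1)*(M1 + N - 2) + M1)."""
    M1 = M - 1
    return Fraction(1, 2) * ((H - H0) * min(M1, N)
                             + (H0 - 1) * (M1 + N - 2) + M1)

# H = H0 = 1 is an equality case: the bound is (M - 1) / 2.
print(lda_rlct_upper_bound(M=5, N=3, H=1, H0=1))  # 2
```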
47. 2. Parameter Region Restriction
Effect of Restriction
• The parameter region of NMF is the set of non-negative matrices.
• The parameter region of SMF is the set of stochastic matrices.
• If the parameter region is the set of real matrices, the model is called matrix factorization (MF).
‒ In the non-restricted case, Aoyagi clarified the exact values of λ and m.
• How different are they?
47
[6] Aoyagi. 2005
48. 2. Parameter Region Restriction
Effect of Restriction
• Because of the restriction, the rank of a matrix has a different meaning.
‒ In NMF, the minimal inner dimension of the true matrix factorization is NOT the rank but the non-negative rank.
‒ The boundary of the parameter region causes the usual rank to differ from the non-negative rank.
‒ Since SMF is a restricted NMF, its minimal factorizations are also affected.
48
[16] Cohen, et al. 1993
49. 2. Parameter Region Restriction
Effect of Restriction
49
H_0 = 0 in NMF and H_0 = 1 in SMF are such cases. In fact, the RLCTs are not equal to the ones in the non-restricted case.
[16] Cohen, et al. 1993
50. 2. Parameter Region Restriction
Effect of Restriction
• In general, narrowing the parameter region increases RLCTs; the generalization error increases.
‒ NMF is such a case. The difference of the RLCTs gives the effect.
50
51. 2. Parameter Region Restriction
Effect of Restriction
51
• However, a restriction that decreases the parameter dimension does not increase them.
‒ SMF is such a case because the simplex constraint obviously decreases the dimension: a_Mk = 1 − Σ_{i=1}^{M−1} a_ik.
52. 2. Parameter Region Restriction
Effect of Restriction
52
One future work is a more precise evaluation of the effect of parameter region restriction.
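The dimension reduction from the simplex constraint can be made concrete: a column of a stochastic matrix with M entries has only M − 1 free parameters, since the last entry is determined as a_Mk = 1 − Σ a_ik. A minimal sketch (function name mine):

```python
def free_to_simplex(free_coords):
    """Map M - 1 free coordinates to a point on the simplex by setting the
    last coordinate to 1 - sum of the others, as in a_Mk = 1 - sum_i a_ik.
    The input must be non-negative with sum <= 1."""
    s = sum(free_coords)
    assert all(c >= 0 for c in free_coords) and s <= 1
    return list(free_coords) + [1 - s]

column = free_to_simplex([0.1, 0.5])  # M = 3: only 2 free parameters
print(column)  # sums to 1 (up to float rounding)
```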
54. 3. Summary
• Singular learning theory determines the asymptotic behavior of the Bayesian generalization error and the marginal likelihood by using algebraic geometry, even if the model is singular.
• Within that theory, as a foundation for clarifying the effect of parameter region restriction, we derived upper bounds of the RLCTs for typical restricted models.
54
55. References
[1] Watanabe S. Algebraic geometrical methods for hierarchical learning
machines. Neural Netw. 2001;13(4):1049–60.
[2] Hironaka H. Resolution of singularities of an algebraic variety over a field of characteristic zero. Ann Math. 1964;79:109–326.
[3] Nagata K, Watanabe S. Asymptotic behavior of exchange ratio in exchange Monte Carlo method. Neural Netw. 2008;21(7):980–8.
[4] Drton M, Plummer M. A Bayesian information criterion for singular models. J R Stat Soc B. 2017;79:323–80, with discussion.
[5] Yamazaki K, Watanabe S. Singularities in mixture models and upper bounds of stochastic complexity. Neural Netw. 2003;16(7):1029–38.
[6] Aoyagi M, Watanabe S. Stochastic complexities of reduced rank regression in Bayesian estimation. Neural Netw. 2005;18(7):924–33.
[7] Rusakov D, Geiger D. Asymptotic model selection for naive Bayesian networks. J Mach Learn Res. 2005;6(Jan):1–35.
[8] Zwiernik P. An asymptotic behaviour of the marginal likelihood for general Markov models. J Mach Learn Res. 2011;12(Nov):3283–310.
[9] Imai T. Estimating real log canonical threshold. arXiv:1906.01341. 1–28.
55
56. References
[10] H N, Watanabe S. Upper bound of Bayesian generalization error in non-negative matrix factorization. Neurocomputing. 2017;266C(29 November):21–8.
[11] H N, Watanabe S. Tighter upper bound of real log canonical threshold of non-negative matrix factorization and its application to Bayesian inference. In IEEE Symposium Series on Computational Intelligence (IEEE SSCI). 2017 (pp. 718–725).
[12] H N. Variational approximation error in non-negative matrix factorization. Neural Netw. 2020;126(June):65–75.
[13] H N, Watanabe S. Asymptotic Bayesian generalization error in latent Dirichlet allocation and stochastic matrix factorization. SN Computer Science. 2020;1(2):1–22.
[14] Kohjima M, Matsubayashi T, Sawada H. Multiple data analysis and non-negative matrix/tensor factorization [I]: multiple data analysis and its advances. IEICE Transaction. 2016;99(6):543–550. In Japanese.
[15] Kohjima M, Watanabe S. Phase transition structure of variational Bayesian nonnegative matrix factorization. In International Conference on Artificial Neural Networks (ICANN). 2017 (pp. 146–154).
[16] Cohen JE, Rothblum UG. Nonnegative ranks, decompositions, and factorizations of nonnegative matrices. Linear Algebra Appl. 1993;190:149–168.
56
58. Question1
How much is the difference
Q. How much is the difference of the RLCTs between the restricted case (our result) and the non-restricted case (Aoyagi's result)?
A. The effect of restriction occurs on the boundary of the parameter space.
I cannot give the exact value of the difference in all cases (I haven't memorized Aoyagi's result…); however, in the cases where our result clarifies the exact value of the RLCT in the restricted case, the difference between the RLCTs is the exact answer.
58
59. Question1
How much is the difference
Q. How much is the difference of the RLCTs between the restricted case (our result) and the non-restricted case (Aoyagi's result)?
From Aoyagi's result, in the case H_0 = 0 in NMF (same dimension and boundary case), the difference is as follows:
diff = λ_NMF − λ_MF = (1/2) H min{M, N} − λ_MF.
Note: Aoyagi's result assumes the prior is positive and bounded; thus, the hyperparameters in NMF are set to 1 in the above.
Note 2: λ_MF changes with the conditions on the matrix sizes; the next slide describes them.
59
60. Question1
How much is the difference
In the case H_0 = 0, the RLCT in Aoyagi's result is as follows. Assume min{M, N} ≥ 2 and min{M, N} ≥ H ≥ 1.
If N < M + H and M < N + H and H < M + N and M + H + N is even,
λ_MF = (1/8) [2H(M + N) − (M − N)^2 − H^2],
Else if N < M + H and M < N + H and H < M + N and M + H + N is odd,
λ_MF = (1/8) [2H(M + N) − (M − N)^2 − H^2 + 1],
Else if M + H < N (→ M < N), λ_MF = (1/2) MH,
Else if N + H < M (→ N < M), λ_MF = (1/2) NH.
60
61. Question1
How much is the difference
In the case H_0 = 0, the difference in the question is as follows. Assume min{M, N} ≥ 2 and min{M, N} ≥ H ≥ 1.
If N < M + H and M < N + H and H < M + N and M + H + N is even,
diff = (1/8) [4H min{M, N} − 2H(M + N) + (M − N)^2 + H^2],
Else if N < M + H and M < N + H and H < M + N and M + H + N is odd,
diff = (1/8) [4H min{M, N} − 2H(M + N) + (M − N)^2 + H^2 − 1],
Else if M + H < N (→ M < N), diff = 0,
Else if N + H < M (→ N < M), diff = 0.
61
E.g. If M = N = 4 and H = 2, the difference is equal to (1/8)·(4·2·4 − 2·2·8 + 0 + 4) = 1/2 by using the first case.
62. Question2
The exact value of the RLCT of LDA
Q. In the inequality that gives an upper bound of the RLCT of LDA, the equality holds if H = H_0 = 1 or 2. However, the exact value in the case of H = 2 and H_0 = 1 is also found. What does it mean?
A. When H = H_0 = 1 or 2, the upper bound is equal to the exact value of the RLCT. If H = 2 and H_0 = 1, the upper bound is strictly larger than the exact value, but we can find the exact value by using a method different from the one used to derive the upper bound and its equality condition.
62
64. Appendix
Simple Example
• Let K(w) = (ab + cd)^2 and φ(w) ≡ 1, where w ∈ [−1, 1]^4 =: W.
• Using the blowing-up (a, b, c, d) = (a, bd, c, d), we have K(w) = d^2 (ab + c)^2 and its Jacobian is |d|.
• Besides, applying the linear transformation (a, b, c, d) = (a, b, c + ab, d), K(w) = c^2 d^2.
64
65. Appendix
Simple Example
• Then, the zeta function is
∫_{−1}^{1} ∫_{−1}^{1} ∫_{ab−1}^{ab+1} ∫_{−1}^{1} c^{2z} d^{2z} |d| dw.
• For any −1 ≤ a, b ≤ 1, we can take a neighborhood (nbhd) of c = 0 as an open set.
• The RLCT does not change when we change the integral interval from [ab − 1, ab + 1] to [−ε, ε] for any ε > 0.
• Thus, we consider
∫_{−1}^{1} ∫_{−1}^{1} ∫_{−ε}^{ε} ∫_{−1}^{1} c^{2z} |d|^{2z+1} da db dc dd ∝ 1/[(z + 1)(2z + 1)]
and obtain λ = 1/2, m = 1.
65
66. Appendix
Simple Example
66
How about it when w is non-negative?
67. Appendix
Simple Example
• Let K(w) = (ab + cd)^2 and φ(w) ≡ 1, where w ∈ [0, 1]^4 =: W. (Non-negative case)
• Using the blowing-up (a, b, c, d) = (a, bd, c, d), we have K(w) = d^2 (ab + c)^2 and its Jacobian is |d|.
• Besides, applying the linear transformation (a, b, c, d) = (a, b, c + ab, d), K(w) = c^2 d^2.
• Then, the zeta function is
∫_{0}^{1} ∫_{0}^{1} ∫_{ab}^{ab+1} ∫_{0}^{1} c^{2z} d^{2z} d dw.
67
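The difference between the two domains can also be seen numerically. Since the volume V(t) = vol{w : K(w) < t} scales like C·t^λ up to log factors, a crude Monte Carlo slope estimate (my own sketch, not from the talk) already shows that the non-negative domain has the larger exponent:

```python
import math
import random

def volume_scaling_exponent(sample_w, K, t1, t2, n=100_000, seed=1):
    """Crude Monte Carlo estimate of lambda via V(t) = vol{w : K(w) < t}
    ~ C * t^lambda (log factors ignored): the slope of log V(t) between
    two thresholds t1 < t2."""
    rng = random.Random(seed)
    c1 = c2 = 0
    for _ in range(n):
        k = K(sample_w(rng))
        c1 += k < t1
        c2 += k < t2
    return math.log(c2 / c1) / math.log(t2 / t1)

K = lambda w: (w[0] * w[1] + w[2] * w[3]) ** 2
full = lambda rng: [rng.uniform(-1.0, 1.0) for _ in range(4)]  # w in [-1,1]^4
nonneg = lambda rng: [rng.random() for _ in range(4)]          # w in [0,1]^4

est_full = volume_scaling_exponent(full, K, 1e-4, 1e-2)
est_nonneg = volume_scaling_exponent(nonneg, K, 1e-4, 1e-2)
print(est_full < est_nonneg)  # the restricted domain has the larger exponent
```

The estimates are biased by the ignored log factors, so only the ordering (restricted > unrestricted, consistent with λ = 1 vs. λ = 1/2) should be read off.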
69. Appendix
Simple Example
∫_{0}^{1} ∫_{0}^{1} ∫_{ab}^{ab+1} ∫_{0}^{1} c^{2z} d^{2z} d dw
• Because of non-negativity, we cannot take any nbhd of c = 0 as an open set.
‒ There is no ε > 0 such that the RLCT does not change when we change the integral interval from [ab, ab + 1] to [0, ε].
• We must use another method (not directly calculating the zeta function) to determine λ and m.
69
70. Appendix
Simple Example
In this case, we can obtain their exact values by using the following “equivalence lemma”.
70
• Let K: W → ℝ and H: W → ℝ be non-negative and analytic functions.
• If there exist constants c_1, c_2 > 0 such that
c_1 K(w) ≤ H(w) ≤ c_2 K(w),
then their RLCTs and their multiplicities are the same:
λ_K = λ_H, m_K = m_H.
The equivalence relation K ∼ H is defined to hold when the RLCTs and the multiplicities of K and H are the same.
71. Appendix
Simple Example
Revisiting the setting: K(w) = (ab + cd)^2, φ(w) ≡ 1, where w = (a, b, c, d) ∈ [0, 1]^4 =: W.
• 2(a^2 b^2 + c^2 d^2) − (ab + cd)^2 = (ab − cd)^2 ≥ 0.
• (ab + cd)^2 − (a^2 b^2 + c^2 d^2) = 2abcd ≥ 0.
• Thus, the following holds:
a^2 b^2 + c^2 d^2 ≤ K(w) ≤ 2(a^2 b^2 + c^2 d^2),
i.e. the RLCT is the same as that of a^2 b^2 + c^2 d^2.
• By simple calculation, we obtain λ = 1, m = 2.
71
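The sandwich inequality above is exactly what the equivalence lemma needs (with c_1 = 1, c_2 = 2), and it can be spot-checked numerically. A small sketch (mine):

```python
import random

def K(a, b, c, d):
    return (a * b + c * d) ** 2

def H(a, b, c, d):
    return a ** 2 * b ** 2 + c ** 2 * d ** 2

# Spot-check H <= K <= 2H on random points of [0, 1]^4, which is what makes
# K ~ H under the equivalence lemma.  A tiny slack absorbs float rounding
# near points where ab is close to cd.
rng = random.Random(0)
ok = True
for _ in range(10_000):
    w = [rng.random() for _ in range(4)]
    ok &= H(*w) <= K(*w) + 1e-12 and K(*w) <= 2 * H(*w) + 1e-12
print(ok)  # True
```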
72. Appendix
Simple Example
72
They are different from the non-restricted case (λ = 1/2, m = 1).