Minimax optimal alternating minimization
for kernel nonparametric tensor learning

Taiji Suzuki†‡
joint work with Heishiro Kanagawa†, Hayato Kobayashi⋄, Nobuyuki Shimizu⋄ and Yukihiro Tagami⋄

† Tokyo Institute of Technology, Department of Mathematical Computing Sciences
‡ JST, PRESTO and AIP, RIKEN
⋄ Yahoo! Japan

19th/Jan/2017
NIPS2016 reading group hosted by PFN (PFN 主催 NIPS2016 読み会)
1 / 41
Outline
1 Introduction
2 Basics of low rank tensor decomposition
3 Nonparametric tensor estimation
Alternating minimization
Convergence analysis
Real data analysis: multitask learning
2 / 41
Outline
1 Introduction
2 Basics of low rank tensor decomposition
3 Nonparametric tensor estimation
Alternating minimization
Convergence analysis
Real data analysis: multitask learning
3 / 41
High dimensional parameter estimation

Vector: Sparsity
  Method: Lasso, Sure Screening
  Application: Feature selection, gene data analysis

Matrix: Low rank
  Method: PCA, trace norm regularization
  Application: Dimensionality reduction, recommendation systems, three layer NN

Tensor: Low rank (this study)
  Higher order relation

4 / 41
“Tensors” in NIPS2016
Zhao Song, David Woodruff, Huan Zhang:
“Sublinear Time Orthogonal Tensor Decomposition”
Shandian Zhe, Kai Zhang, Pengyuan Wang, Kuang-chih Lee, Zenglin Xu,
Yuan Qi, Zoubin Ghahramani:
“Distributed Flexible Nonlinear Tensor Factorization”
Guillaume Rabusseau, Hachem Kadri:
“Low-Rank Regression with Tensor Responses”
Chuan-Yung Tsai, Andrew M. Saxe, David Cox:
“Tensor Switching Networks”
Tao Wu, Austin R. Benson, David F. Gleich:
“General Tensor Spectral Co-clustering for Higher-Order Data”
Yining Wang, Anima Anandkumar:
“Online and Differentially-Private Tensor Decomposition”
Edwin Stoudenmire, David J. Schwab:
“Supervised Learning with Tensor Networks”
5 / 41
Tensor workshop
Amnon Shashua: On depth efficiency of convolutional networks: the use of
hierarchical tensor decomposition for network design and analysis.
Deep neural network can be formulated as hierarchical Tucker
decomposition. (Cohen et al., 2016; Cohen & Shashua, 2016)
Three layer NN corresponds to (generalized) CP-decomposition.
DNN truly has more expressive power than shallow ones.
$A(h_y) = \sum_{z=1}^{Z} a^y_z (F a_{z,1}) \otimes_g \cdots \otimes_g (F a_{z,N})$
Lek-Heng Lim: Tensor network ranks
and other interesting talks.
6 / 41
This presentation
Suzuki, Kanagawa, Kobayashi, Shimizu and Tagami: Minimax optimal alternating
minimization for kernel nonparametric tensor learning. NIPS2016, pp. 3783–3791.
Nonparametric low rank tensor estimation
Alternating minimization method:
efficient computation + nice statistical property.
 
After t iterations, the estimation error is bounded by
$\tilde{O}\left( dK n^{-\frac{1}{1+s}} + dK \left(\tfrac{3}{4}\right)^{t} \right)$.
 
Related papers:
Suzuki: Convergence rate of Bayesian tensor estimator and its minimax optimality.
ICML2015, pp. 1273–1282, 2015.
Kanagawa, Suzuki, Kobayashi, Shimizu and Tagami: Gaussian process
nonparametric tensor estimator and its minimax optimality. ICML2016, pp.
1632–1641, 2016.
7 / 41
Error bound comparison

Parametric tensor model (CP-decomposition)
  Method:       Least squares                       | Convex reg. (via matricization)         | Bayes
  Error bound:  $\frac{\prod_{k=1}^{K} M_k}{n}$     | $\frac{d^{K/2}\sqrt{\prod_k M_k}}{n}$   | $\frac{d\left(\sum_{k=1}^{K} M_k\right)\log(n)}{n}$
  K: dimension (order) of the tensor, d: rank, $M_k$: size of the k-th mode
  Convex reg.: Tomioka et al. (2011); Tomioka and Suzuki (2013); Zheng and Tomioka (2015); Mu et al. (2014)
  Bayes: Suzuki (2015)

Nonparametric tensor model (CP-decomposition)
  Method:       Naive method             | Bayes / Alternating min.
  Error bound:  $n^{-\frac{1}{1+Ks}}$    | $dK n^{-\frac{1}{1+s}}$
  K: dimension (order) of the tensor, s: complexity of the model space
  Bayes: Kanagawa et al. (2016)
  Alternating minimization: This paper.

8 / 41
Outline
1 Introduction
2 Basics of low rank tensor decomposition
3 Nonparametric tensor estimation
Alternating minimization
Convergence analysis
Real data analysis: multitask learning
9 / 41
Tensor decompositions
CP-decomposition
Tucker decomposition
Tensor train
Tensor network
10 / 41
Tensor rank: CP-rank

[Figure: a third-order tensor X written as a sum of rank-one tensors $a_r \otimes b_r \otimes c_r$, r = 1, ..., d]

CP-decomposition
  Canonical Polyadic decomposition (Hitchcock, 1927)
  CANDECOMP/PARAFAC (Carroll & Chang, 1970; Harshman, 1970)
  $X_{ijk} = \sum_{r=1}^{d} a_{ir} b_{jr} c_{kr} =: [[A, B, C]]$
  (a numerical check of this formula follows the slide).

CP-decomposition defines the CP-rank of a tensor.
Computing the CP-decomposition is NP-hard in general.
(But under a mild assumption, it can be solved efficiently (De Lathauwer, 2006; De Lathauwer et al., 2004; Leurgans et al., 1993).)
An orthogonal decomposition does not necessarily exist (even for a symmetric tensor).

11 / 41
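To make the CP formula concrete, here is a minimal NumPy sketch (illustration only, not from the slides): it builds a rank-d third-order tensor from factor matrices A, B, C and checks the entrywise formula against an explicit sum of rank-one outer products. All sizes are arbitrary.

```python
import numpy as np

# Build X_{ijk} = sum_{r=1}^{d} a_{ir} b_{jr} c_{kr} = [[A, B, C]] and verify it.
rng = np.random.default_rng(0)
I, J, K, d = 5, 6, 7, 3
A, B, C = rng.standard_normal((I, d)), rng.standard_normal((J, d)), rng.standard_normal((K, d))

X = np.einsum('ir,jr,kr->ijk', A, B, C)            # CP reconstruction [[A, B, C]]
X_rank1 = sum(np.multiply.outer(np.multiply.outer(A[:, r], B[:, r]), C[:, r]) for r in range(d))
assert np.allclose(X, X_rank1)                     # the same tensor, entry by entry
print(X.shape)
```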
Tensor rank: Tucker-rank

[Figure: a third-order tensor X expressed as a core tensor G multiplied by factor matrices A, B, C]

Tucker decomposition (Tucker, 1966)
  $X_{ijk} = \sum_{l=1}^{r_1} \sum_{m=1}^{r_2} \sum_{n=1}^{r_3} g_{lmn} a_{il} b_{jm} c_{kn} =: [[G; A, B, C]]$.
  G is called the core tensor.
  Tucker-rank = $(r_1, r_2, r_3)$

12 / 41
Matricization

Mode-k unfolding:
  $A^{(k)} \in \mathbb{R}^{M_k \times N/M_k}, \quad N = \prod_{k=1}^{K} M_k$.

[Figure: unfolding a tensor A into the matrix $A^{(k)}$; for a Tucker decomposition $A = [[G; A, B, C]]$, the unfolding factors as $A (G \times_2 B \times_3 C)$-type products]

$r_k = \mathrm{rank}(A^{(k)})$ gives the Tucker-rank (see the unfolding sketch below).

13 / 41
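A small sketch (not from the slides) of mode-k unfolding and of reading off the Tucker-rank as the matrix rank of each unfolding. The column ordering convention of the unfolding varies across the literature; it does not affect the rank.

```python
import numpy as np

def unfold(T, mode):
    """Mode-k unfolding: an M_k x (N / M_k) matrix, N = product of all mode sizes."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

rng = np.random.default_rng(0)
A, B, C = (rng.standard_normal((m, 3)) for m in (5, 6, 7))
X = np.einsum('ir,jr,kr->ijk', A, B, C)            # a CP tensor of rank 3
tucker_rank = tuple(np.linalg.matrix_rank(unfold(X, k)) for k in range(3))
print(tucker_rank)                                  # (3, 3, 3) for generic factors
```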
Other tensor decomposition models

Tensor train (Oseledets, 2011)
  $T_{i_1, i_2, \ldots, i_K} = \sum_{\alpha_1, \ldots, \alpha_{K-1}} G_1(i_1, \alpha_1) G_2(\alpha_1, i_2, \alpha_2) \cdots G_{K-1}(\alpha_{K-2}, i_{K-1}, \alpha_{K-1}) G_K(\alpha_{K-1}, i_K)$
  [Figure: a chain of cores $G_1, \ldots, G_5$ connected by bond indices, with free indices $i_1, \ldots, i_5$]
  (an entrywise evaluation is sketched below)

Tensor network

14 / 41
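A sketch of evaluating one entry of a tensor-train representation: the sum over the bond indices $\alpha_1, \ldots, \alpha_{K-1}$ reduces to a chain of small matrix products. Core shapes are $(r_{k-1}, M_k, r_k)$ with $r_0 = r_K = 1$; all sizes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dims, ranks = [4, 5, 6, 7, 4], [1, 2, 3, 3, 2, 1]
cores = [rng.standard_normal((ranks[k], dims[k], ranks[k + 1])) for k in range(len(dims))]

def tt_entry(cores, idx):
    """T[i1, ..., iK] via the chain G1(i1, :) @ G2(:, i2, :) @ ... @ GK(:, iK)."""
    v = np.ones((1, 1))
    for G, i in zip(cores, idx):
        v = v @ G[:, i, :]                          # contract the current bond index
    return float(v[0, 0])

print(tt_entry(cores, (1, 2, 3, 0, 2)))
```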
Applications

Recommendation systems
Relational data
Multi-task learning
Signal processing (space (2D) × time)
Natural language processing (vector representation of words)

[Figure: tensor completion for rating prediction on a User × Item × Context array of partially observed ratings]

15 / 41
Other applications

EEG analysis (De Vos et al., 2007)
  time × frequency × space
  EEG monitoring: epileptic seizure onset localization
Denoising by tensor train (Phien et al., 2016)
  [Figures: recovery by different tensor learning methods; casting an image into a higher-order tensor]

16 / 41
Outline
1 Introduction
2 Basics of low rank tensor decomposition
3 Nonparametric tensor estimation
Alternating minimization
Convergence analysis
Real data analysis: multitask learning
17 / 41
Nonparametric tensor regression model

Nonparametric regression model
  $y_i = f(x_i) + \epsilon_i$.
Goal: Estimate f from the data $D_n = \{x_i, y_i\}_{i=1}^{n}$.

Nonparametric tensor model:
  $f(x^{(1)}, \ldots, x^{(K)}) = \sum_{r=1}^{d} f_r^{(1)}(x^{(1)}) \times \cdots \times f_r^{(K)}(x^{(K)})$
  We suppose that $f_r^{(k)} \in \mathcal{H}$ where $\mathcal{H}$ is an RKHS
  (a small simulation of this model is sketched after this slide).

Parametric tensor model:
  $f(x^{(1)}, \ldots, x^{(K)}) = \sum_{r=1}^{d} \langle x^{(1)}, u_r^{(1)} \rangle \times \cdots \times \langle x^{(K)}, u_r^{(K)} \rangle$

Matrix case:
  $\sum_{r=1}^{d} \langle x^{(1)}, u_r^{(1)} \rangle \langle x^{(2)}, u_r^{(2)} \rangle = (x^{(1)})^{\top} \left( \sum_{r=1}^{d} u_r^{(1)} u_r^{(2)\top} \right) x^{(2)}$,
  where the middle factor is a matrix.

18 / 41
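A minimal sketch (illustration only, not the paper's code) of the nonparametric tensor model with K = 3 modes and rank d = 2. The sin/cos components merely stand in for generic RKHS elements; all names and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, p = 2, 3, 4                                  # rank, tensor order, feature dim per mode
W = rng.standard_normal((d, K, p))                 # projection directions (illustrative)

def component(r, k, x):
    """f_r^(k): a simple nonlinear function of the k-th mode's input."""
    return np.sin(x @ W[r, k]) if r == 0 else np.cos(x @ W[r, k])

def f(xs):
    """f(x^(1), ..., x^(K)) = sum_r prod_k f_r^(k)(x^(k)); xs is a tuple of K vectors."""
    return sum(np.prod([component(r, k, xs[k]) for k in range(K)]) for r in range(d))

xs = [rng.standard_normal(p) for _ in range(K)]
y = f(xs) + 0.1 * rng.standard_normal()            # noisy observation y_i = f(x_i) + eps_i
print(y)
```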
Application: Nonlinear recommendation

$f(x^{(1)}, x^{(2)}) = x^{(1)\top} A x^{(2)} = \sum_{r=1}^{d} \langle x^{(1)}, u_r^{(1)} \rangle \langle u_r^{(2)}, x^{(2)} \rangle$

$x^{(1)}$: user feature, $x^{(2)}$: movie feature.

19 / 41
Application: Nonlinear recommendation

$f(x^{(1)}, x^{(2)}) = \sum_{r=1}^{d} f_r^{(1)}(x^{(1)}) f_r^{(2)}(x^{(2)})$

$x^{(1)}$: user feature, $x^{(2)}$: movie feature.

19 / 41
Application: Smoothing

(De Vos et al., 2007) (Cao et al., 2016)

$\min_{\{u_r^{(k)}\}_{r,k}} \left\| X - \sum_{r=1}^{d} u_r^{(1)} \otimes u_r^{(2)} \otimes u_r^{(3)} \right\|^2 + \sum_{r=1}^{d} \sum_{k=1}^{3} u_r^{(k)\top} G u_r^{(k)}$

[Figure: smoothing of the factor components $u_1, \ldots, u_5$ along the time axis t]

$u^{\top} G u = \sum_{j} (u_j - u_{j+1})^2$

Smoothing ⇒ Kernel method
(Zdunek, 2012; Yokota et al., 2015a; Yokota et al., 2015b)

20 / 41
Application: Multi-task learning

[Figure: a grid of related tasks indexed by task type 1 and task type 2, each with its own regression function f*]

Related tasks aligned with two indexes (s, t).
$f_{(s,t)}$: the regression function for task (s, t).
$f_r(x)$ (r = 1, ..., d): latent factors behind the tasks that give an expression of $f_{(s,t)}$ as
  $f_{(s,t)}(x) = \sum_{r=1}^{d} \beta_{r,(s,t)} f_r(x) = \sum_{r=1}^{d} \alpha_{r,s} \alpha_{r,t} f_r(x)$
(a small sketch of this model follows this slide).

We estimate $\alpha_{r,s} \in \mathbb{R}$, $\alpha_{r,t} \in \mathbb{R}$, $f_r \in \mathcal{H}_r$ by using a Gaussian process prior.

21 / 41
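A minimal sketch (not from the paper) of the multi-task model on this slide: S × T tasks share d latent factor functions $f_r$, and task (s, t) uses the weights $\alpha_{r,s}\alpha_{r,t}$. The tanh factors below stand in for RKHS or Gaussian-process elements; all sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
S, T, d, p = 4, 3, 2, 5
alpha_s = rng.standard_normal((d, S))        # alpha_{r,s}
alpha_t = rng.standard_normal((d, T))        # alpha_{r,t}
ws = rng.standard_normal((d, p))
factors = [lambda x, w=ws[r]: np.tanh(x @ w) for r in range(d)]   # stand-ins for f_r

def f_task(s, t, x):
    """Regression function of task (s, t), built from the shared latent factors."""
    return sum(alpha_s[r, s] * alpha_t[r, t] * factors[r](x) for r in range(d))

x = rng.standard_normal(p)
print([[round(f_task(s, t, x), 3) for t in range(T)] for s in range(S)])
```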
Estimation methods

$f(x^{(1)}, \ldots, x^{(K)}) = \sum_{r=1}^{d} f_r^{(1)}(x^{(1)}) \times \cdots \times f_r^{(K)}(x^{(K)})$

1. Alternating minimization (MAP estimator) (NIPS2016)
   Repeating convex optimization. Fast computation.
   Stronger assumptions are required for minimax optimality.
   Local optimality is still problematic.
2. Bayes estimator (ICML2016)
   Nice statistical performance. Minimax optimal.
   Heavy computation.
3. Convex regularization (Signoretto et al., 2013)

Question
  Estimation error guarantee: How does the error decrease? Is it optimal?
  Computational complexity?
  Performance on real data?

22 / 41
Outline
1 Introduction
2 Basics of low rank tensor decomposition
3 Nonparametric tensor estimation
Alternating minimization
Convergence analysis
Real data analysis: multitask learning
23 / 41
Alternating minimization method

Update $f_r^{(k)}$ for a chosen (r, k) while the other components are fixed:

$F(\{f_r^{(k)}\}_{r,k}) := \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \sum_{r=1}^{d^*} \prod_{k=1}^{K} f_r^{(k)}(x_i^{(k)}) \right)^2$  (empirical error)

$\hat{f}_r^{(k)} \leftarrow \arg\min_{f_r^{(k)} \in \mathcal{H}} \left\{ F(f_r^{(k)} \mid \{\hat{f}_{r'}^{(k')}\}_{(r',k') \neq (r,k)}) + \lambda \| f_r^{(k)} \|_{\mathcal{H}}^2 \right\}$.

The objective function is non-convex.
But it is convex w.r.t. one component $f_r^{(k)}$ (kernel ridge regression).
It should converge to a local optimum, as in coordinate descent (a code sketch of the update follows this slide).

24 / 41
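A minimal sketch of the alternating update, assuming an RBF kernel, synthetic data, and the representer-theorem parametrization $f_r^{(k)}(\cdot) = \sum_i \alpha_i k(x_i^{(k)}, \cdot)$. This illustrates the scheme on the slide; it is not the authors' implementation, and the helper names, sizes, and initialization are assumptions. With the other components fixed, the update for block (r, k) is a weighted kernel ridge regression whose normal equations are solved in closed form.

```python
import numpy as np

def rbf_gram(X, gamma=0.5):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
n, K, d, p, lam = 200, 3, 2, 2, 1e-3
Xs = [rng.standard_normal((n, p)) for _ in range(K)]            # x_i^(k), k = 1..K
y = (np.sin(Xs[0][:, 0]) * np.cos(Xs[1][:, 0]) * Xs[2][:, 0]    # a low-rank truth + noise
     + Xs[0][:, 1] * Xs[1][:, 1] * np.sin(Xs[2][:, 1])
     + 0.1 * rng.standard_normal(n))
Gs = [rbf_gram(X) for X in Xs]                                  # Gram matrices per mode

# Rough initialization: every component starts close to the constant function 1.
alpha = [[np.linalg.solve(Gs[k] + n * lam * np.eye(n), 1.0 + 0.1 * rng.standard_normal(n))
          for k in range(K)] for _ in range(d)]

def fvals(r, k):                                                # f_r^(k)(x_i^(k)) at the data
    return Gs[k] @ alpha[r][k]

def predict():
    return sum(np.prod([fvals(r, k) for k in range(K)], axis=0) for r in range(d))

print("initial MSE:", float(np.mean((y - predict()) ** 2)))
for it in range(15):                                            # alternating minimization
    for r in range(d):
        for k in range(K):
            others = np.prod([fvals(r, kk) for kk in range(K) if kk != k], axis=0)
            resid = y - sum(np.prod([fvals(rr, kk) for kk in range(K)], axis=0)
                            for rr in range(d) if rr != r)
            # Kernel ridge regression for block (r, k) with the rest fixed:
            # min_alpha (1/n) || resid - diag(others) G alpha ||^2 + lam * alpha^T G alpha
            alpha[r][k] = np.linalg.solve((others ** 2)[:, None] * Gs[k] + n * lam * np.eye(n),
                                          others * resid)
print("final MSE:  ", float(np.mean((y - predict()) ** 2)))
```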
Reproducing Kernel Hilbert Space (RKHS)

kernel function ⇔ Reproducing Kernel Hilbert Space (RKHS)
$k(x, x')$ ⇔ $\mathcal{H}_k$

Reproducing property: for $f \in \mathcal{H}_k$, the function value at x is recovered as
  $f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}_k}$.

Representer theorem:
  $\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 + C \| f \|_{\mathcal{H}}^2
  \;\Longleftrightarrow\;
  \min_{\alpha \in \mathbb{R}^n} \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{n} \alpha_j k(x_j, x_i) \right)^2 + C \alpha^{\top} K \alpha,$
  where $K_{i,j} = k(x_i, x_j)$ and $\hat{f} = \sum_{i=1}^{n} \alpha_i k(x_i, \cdot)$
  (a worked example follows this slide).

Gaussian kernel
Polynomial kernel
Graph kernel, time series kernel, ...

25 / 41
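A worked example (illustration only, toy data and RBF kernel are assumptions) of kernel ridge regression via the representer theorem: the RKHS minimizer is $\hat{f} = \sum_i \alpha_i k(x_i, \cdot)$, and setting the gradient of the finite-dimensional objective to zero gives $(K + nCI)\alpha = y$.

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    return np.exp(-gamma * ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))

rng = np.random.default_rng(0)
n = 100
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

C = 1e-2
Kmat = rbf(X, X)
alpha = np.linalg.solve(Kmat + n * C * np.eye(n), y)   # minimizer of (1/n)||y - K a||^2 + C a^T K a

Xtest = np.linspace(-3, 3, 5)[:, None]
f_hat = rbf(Xtest, X) @ alpha                          # f_hat(x) = sum_i alpha_i k(x_i, x)
print(f_hat)
```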
Outline
1 Introduction
2 Basics of low rank tensor decomposition
3 Nonparametric tensor estimation
Alternating minimization
Convergence analysis
Real data analysis: multitask learning
26 / 41
Complexity of the RKHS

$0 < s < 1$: represents the complexity of the model.

Spectral decomposition:
  $k(x, x') = \sum_{\ell=1}^{\infty} \mu_\ell \phi_\ell(x) \phi_\ell(x')$,
  where $\{\phi_\ell\}_{\ell=1}^{\infty}$ is an ONS in $L_2(P)$.

Spectrum Condition (s)
  There exists $0 < s < 1$ such that
  $\mu_\ell \leq C \ell^{-\frac{1}{s}} \quad (\forall \ell)$.

s represents the complexity of the RKHS.
Large s means complex; small s means simple (an empirical illustration follows this slide).
The optimal learning rate in a single kernel learning setting is
  $\| \hat{f} - f^* \|_{L_2(P)}^2 = O_p(n^{-\frac{1}{1+s}})$.

27 / 41
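An empirical look at the spectrum condition (a sketch under assumed data and kernel choices): the eigenvalues of the scaled Gram matrix $K/n$ approximate the spectrum $\{\mu_\ell\}$ of the kernel integral operator, so their decay can be compared against $C\,\ell^{-1/s}$. For the Gaussian kernel the eigenvalues decay faster than any polynomial, so the condition holds even for small s (a "simple" RKHS in this scale).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0, 1, size=(n, 1))
Kmat = np.exp(-((x - x.T) ** 2) / 0.1)               # RBF Gram matrix on [0, 1]
mu = np.sort(np.linalg.eigvalsh(Kmat / n))[::-1]     # mu_1 >= mu_2 >= ... (empirical spectrum)
for ell in (1, 2, 5, 10, 20):
    print(ell, mu[ell - 1])                          # compare against C * ell**(-1/s)
```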
Convergence of the alternating minimization method

Assumption
  $f^*$ satisfies the incoherence condition (its definition is in the next slide).
  $P(X) = P(X_1) \times \cdots \times P(X_K)$.
  Some other technical conditions.

$\hat{f}^{[t]}$: the estimator after the t-th step.

Theorem (Main result)
  There exist constants $C_1, C_2$ such that, if $d(\hat{f}^{[0]}, f^*) \leq C_1$, then with probability $1 - \delta$, we have
  $\| \hat{f}^{[t]} - f^* \|_{L_2(P)}^2 \leq C_2 \Big( \underbrace{d^* K (3/4)^t}_{\text{Optimization error}} + \underbrace{d^* K n^{-\frac{1}{1+s}}}_{\text{Estimation error}} \log(1/\delta) \Big)$.

Linear convergence to a local optimum (log(n) updates are sufficient):
  $d(\hat{f}^{[0]}, f^*) = O_p(1) \;\Longrightarrow\; \| \hat{f}^{[\log(n)]} - f^* \|_{L_2}^2 = O_p\big( d^* K n^{-\frac{1}{1+s}} \big)$.

If we start from a good initial point, then it achieves the minimax optimal error.
Naive method: $O(n^{-\frac{1}{1+Ks}})$ (curse of dimensionality).

28 / 41
Details of technical conditions

Incoherence: $\exists \mu^* < 1$ s.t.
  $| \langle f_r^{*(k)}, f_{r'}^{*(k)} \rangle | \leq \mu^* \| f_r^{*(k)} \|_{L_2} \| f_{r'}^{*(k)} \|_{L_2} \quad (r \neq r')$.
  [Figure: two component functions $f_r^{*(k)}$, $f_{r'}^{*(k)}$ forming a non-degenerate angle]

Lower and upper bound of $f^*$:
  $0 < v_{\min} \leq \| f_r^{*(k)} \|_{L_2} \leq v_{\max} \quad (\forall r, k)$.

sup-norm condition: $\exists s_2 \in (0, 1)$ s.t.
  $\| f \|_{\infty} \leq C \| f \|_{L_2}^{1 - s_2} \| f \|_{\mathcal{H}}^{s_2} \quad (\forall f \in \mathcal{H})$.

For $v_r = \big\| \prod_{k=1}^{K} f_r^{*(k)} \big\|_{L_2}$, $\hat{v}_r = \big\| \prod_{k=1}^{K} \hat{f}_r^{(k)} \big\|_{L_2}$, $f_r^{**(k)} = \frac{f_r^{*(k)}}{\| f_r^{*(k)} \|_{L_2}}$, $\hat{\hat{f}}_r^{(k)} = \frac{\hat{f}_r^{(k)}}{\| \hat{f}_r^{(k)} \|_{L_2}}$,
  $d(\hat{f}, f^*) = \max_{(r,k)} \big\{ | \hat{v}_r - v_r | + v_r \| \hat{\hat{f}}_r^{(k)} - f_r^{**(k)} \|_{L_2} \big\}$.

29 / 41
Illustration of the theoretical result

[Figure: predictive risk and empirical risk around the true function, for a small sample and a large sample]

The predictive risk behaves like a convex function locally around the true function.
The empirical risk gets closer to the predictive one as the sample size increases.
Technique: local Rademacher complexity

30 / 41
Tools used in the proof

Rademacher complexity ($\mathcal{H}$: function space)
  $\mathrm{E}_x \Big[ \sup_{h \in \mathcal{H}} \underbrace{\frac{1}{n} \sum_{i=1}^{n} h(x_i)}_{\text{Empirical error}} - \underbrace{\mathrm{E}[h]}_{\text{Predictive error}} \Big]
  \leq 2\, \mathrm{E}_{x,\sigma} \Big[ \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i h(x_i) \Big]
  \leq \frac{C}{\sqrt{n}},$
  where $\{\sigma_i\}_{i=1}^{n}$ are i.i.d. Rademacher random variables ($P(\sigma_i = 1) = P(\sigma_i = -1) = 1/2$).
  [Figure: a uniform bound between the predictive risk and the empirical risk]

Local Rademacher complexity + peeling device:
  Utilize the strong convexity of the squared loss.
  $\mathrm{E}_{x,\sigma} \Big[ \sup_{h \in \mathcal{H}} \frac{| \frac{1}{n} \sum_{i=1}^{n} \sigma_i h(x_i) |}{\| h \|_{L_2} + \lambda} \Big]
  \leq C \Big( \frac{\lambda^{-\frac{s}{2}}}{\sqrt{n}} \vee \lambda^{-\frac{1}{2}} n^{-\frac{1}{1+s}} \Big)$
  Tighter around the true function (a Monte Carlo illustration follows this slide).
  [Figure: a localized bound that tightens near the true function]

31 / 41
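A small Monte Carlo illustration (a sketch under assumed data and kernel; not part of the proof): for the unit ball of an RKHS, $\mathcal{H} = \{f : \|f\|_{\mathcal{H}} \le 1\}$, the supremum inside the expectation has a closed form, $\sup_f \frac{1}{n}\sum_i \sigma_i f(x_i) = \frac{1}{n}\sqrt{\sigma^\top K \sigma}$, so the Rademacher complexity can be estimated by averaging over random sign draws and compared with the $C/\sqrt{n}$ rate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
x = rng.uniform(-1, 1, size=(n, 1))
Kmat = np.exp(-((x - x.T) ** 2) / 0.2)                        # RBF Gram matrix

def rademacher_complexity(Kmat, n_draws=2000):
    n = Kmat.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))        # i.i.d. Rademacher signs
    sup_vals = np.sqrt(np.einsum('si,ij,sj->s', sigma, Kmat, sigma).clip(0)) / n
    return float(np.mean(sup_vals))                           # E_sigma of the closed-form sup

print(rademacher_complexity(Kmat), 1.0 / np.sqrt(n))          # compare with C / sqrt(n)
```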
Convergence of alternating minimization

[Figure: relative MSE $\mathrm{E}[\| \hat{f}^{[t]} - f^* \|^2]$ (log scale, $10^0$ down to $10^{-4}$) vs. the number of iterations t, for sample sizes n = 400, 800, 1200, 1600, 2000, 2400, 2800]

32 / 41
Minimax optimality

The derived upper bound is minimax optimal (up to a constant).

A set of tensors with rank $d^*$:
  $\mathcal{H}_{(d^*, K)}(R) := \Big\{ f = \sum_{r=1}^{d^*} \prod_{k=1}^{K} f_r^{(k)} \;\Big|\; f_r^{(k)} \in \mathcal{H}_{(r,k)}(R) \Big\}$.

Theorem (Minimax risk)
  $\inf_{\hat{f}} \sup_{f^* \in \mathcal{H}_{(d^*, K)}(R)} \mathrm{E}[\| f^* - \hat{f} \|_{L_2(P_X)}^2] \gtrsim d^* K n^{-\frac{1}{1+s}},$
  where the inf is taken over all estimators $\hat{f}$.

The Bayes estimator attains the minimax risk.

33 / 41
Issue of local optimality

The convergence is only proven for a good initial solution that is sufficiently close to the optimal one.

[Figure: the empirical risk may have spurious local minima away from the true function]

Question: Does the algorithm converge to the global optimum?
→ Open question.

34 / 41
NIPS2016 papers about local optimality

■ Every local minimum of the matrix completion problem is the global minimum (with high probability).
  $\min_{U \in \mathbb{R}^{M \times k}} \sum_{(i,j) \in E} \big( Y_{i,j} - (U U^{\top})_{i,j} \big)^2$
  (a gradient-descent sketch of this objective follows this slide)
  Rong Ge, Jason D. Lee, Tengyu Ma: “Matrix Completion has No Spurious Local Minimum.”
  Srinadh Bhojanapalli, Behnam Neyshabur, Nati Srebro: “Global Optimality of Local Search for Low Rank Matrix Recovery.”
■ Deep NN also satisfies a similar property.
  Kenji Kawaguchi: “Deep Learning without Poor Local Minima.”
  (Essentially, the proof is valid only for linear deep neural networks.)

Strictly saddle function: every critical point has a negative curvature direction or is the global optimum.
Trust region methods (Conn et al., 2000) and noisy stochastic gradient (Ge et al., 2015) can reach the global optimum for strictly saddle objectives (Sun et al., 2015).

35 / 41
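A minimal sketch of plain gradient descent on the factorized matrix completion objective from this slide, on a small synthetic instance. The data generation, step size, and iteration count are assumptions for illustration; the point is only that local search on this non-convex objective tends to fit the observed entries.

```python
import numpy as np

rng = np.random.default_rng(0)
M, k = 30, 2
Ustar = rng.standard_normal((M, k))
Y = Ustar @ Ustar.T                                 # ground-truth low-rank matrix
obs = rng.random((M, M)) < 0.5
mask = np.triu(obs) | np.triu(obs).T                # symmetric observation set E

U = rng.standard_normal((M, k)) * 0.1               # small random initialization
lr = 1e-3
for _ in range(3000):
    R = mask * (U @ U.T - Y)                        # residual on observed entries only
    U -= lr * 4 * R @ U                             # gradient of sum of squared residuals (mask symmetric)
print("observed fit:", float(np.sum((mask * (U @ U.T - Y)) ** 2)))
```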
Outline
1 Introduction
2 Basics of low rank tensor decomposition
3 Nonparametric tensor estimation
Alternating minimization
Convergence analysis
Real data analysis: multitask learning
36 / 41
Numerical experiments on real data

Multi-task learning [nonlinear regression]
  Restaurant data: multi-task learning, 138 customers × 3 aspects (138 × 3 tasks).
  We want to predict the rating (3 levels) of the restaurant for each customer and each aspect.
  Each restaurant is described by a 44-dimensional feature vector.

[Figure: MSE vs. sample size (500–2500) for GRBF, GRBF(2)+lin(1), GRBF(1)+lin(2), linear, scaled latent, ALS (Best), ALS (50)]

Nonparametric methods (Bayes and alternating minimization) achieved the best performance.

37 / 41
Multi-task learning [nonlinear regression]
  Restaurant data: multi-task learning, 138 customers × 3 aspects (138 × 3 tasks), with a different kernel between tasks.

[Figure: MSE (0.35–0.60) vs. sample size (500–2500) for AMP(RBF), AMP(Linear), Lin(2)+RBF(1), Lin(1)+RBF(2), GP-MTL]

GP-MTL: Gaussian process method (Bayes).
AMP: alternating minimization method.
The Bayes method is slightly better.

38 / 41
Numerical experiments on real data

Multi-task learning [nonlinear regression]
  School data: multi-task learning, 139 schools × 3 years (139 × 3 tasks).

[Figure: explained variance (28–40) vs. sample size (2000–10000) for GRBF, GRBF(2)+lin(1), GRBF(1)+lin(2), linear, scaled latent, ALS (50), ALS (Best)]

Explained variance $= 100 \times \frac{\mathrm{Var}(Y) - \mathrm{MSE}}{\mathrm{Var}(Y)}$

The nonparametric Bayes method and the alternating minimization method achieved the best performance.

39 / 41
Online shopping sales prediction

Predict the online shopping (Yahoo! shopping) sales.
  shop × item × customer (508 shops, 100 items)
  Predict the number of certain items that a customer will buy in a shop.
  A customer is represented by a feature vector.
  We construct a kernel defined by the nearest-neighbor graph between shops.

[Figure: MSE (10–16) vs. sample size (4000–14000) for GP-MTL(cosdis), GP-MTL(cossim), AMP(cosdis), AMP(cossim)]

Figure: sales prediction of online shops. Comparison between different metrics between shops.

40 / 41
Summary

The convergence rate of a nonlinear tensor estimator was given.
The alternating minimization method achieves minimax optimality.
The theoretical analysis requires some strong assumptions, for example, on the choice of the initial guess.

Estimation error of the alternating minimization procedure:
  $\| \hat{f}^{[t]} - f^* \|_{L_2(\Pi)}^2 \leq C \Big( dK n^{-\frac{1}{1+s}} + dK (3/4)^t \Big),$
  where $\hat{f}^{[t]}$ is the solution of the t-th update during the procedure.

41 / 41
Cao, W., Wang, Y., Sun, J., Meng, D., Yang, C., Cichocki, A., & Xu, Z. (2016).
Total variation regularized tensor RPCA for background subtraction from
compressive measurements. IEEE Transactions on Image Processing, 25,
4075–4090.
Carroll, J. D., & Chang, J.-J. (1970). Analysis of individual differences in
multidimensional scaling via an n-way generalization of “Eckart-Young”
decomposition. Psychometrika, 35, 283–319.
Cohen, N., Sharir, O., & Shashua, A. (2016). On the expressive power of deep
learning: A tensor analysis. The 29th Annual Conference on Learning Theory
(pp. 698–728).
Cohen, N., & Shashua, A. (2016). Convolutional rectifier networks as generalized
tensor decompositions. Proceedings of the 33rd International Conference on
Machine Learning (pp. 955–963).
Conn, A. R., Gould, N. I., & Toint, P. L. (2000). Trust region methods, vol. 1.
SIAM.
De Lathauwer, L. (2006). A link between the canonical decomposition in
multilinear algebra and simultaneous matrix diagonalization. SIAM Journal on
Matrix Analysis and Applications, 28, 642–666.
De Lathauwer, L., De Moor, B., & Vandewalle, J. (2004). Computation of the
canonical decomposition by means of a simultaneous generalized Schur
decomposition. SIAM Journal on Matrix Analysis and Applications, 26, 295–327.
41 / 41
De Vos, M., Vergult, A., De Lathauwer, L., De Clercq, W., Van Huffel, S.,
Dupont, P., Palmini, A., & Van Paesschen, W. (2007). Canonical
decomposition of ictal scalp EEG reliably detects the seizure onset zone.
NeuroImage, 37, 844–854.
Ge, R., Huang, F., Jin, C., & Yuan, Y. (2015). Escaping from saddle
points—online stochastic gradient for tensor decomposition. Proceedings of
The 28th Conference on Learning Theory (pp. 797–842).
Harshman, R. A. (1970). Foundations of the PARAFAC procedure: Models and
conditions for an “explanatory” multi-modal factor analysis. UCLA Working
Papers in Phonetics, 16, 1–84.
Hitchcock, F. L. (1927). Multiple invariants and generalized rank of a p-way
matrix or tensor. Journal of Mathematics and Physics, 7, 39–79.
Kanagawa, H., Suzuki, T., Kobayashi, H., Shimizu, N., & Tagami, Y. (2016).
Gaussian process nonparametric tensor estimator and its minimax optimality.
Proceedings of the 33rd International Conference on Machine Learning
(ICML2016) (pp. 1632–1641).
Leurgans, S., Ross, R., & Abel, R. (1993). A decomposition for three-way arrays.
SIAM Journal on Matrix Analysis and Applications, 14, 1064–1083.
Mu, C., Huang, B., Wright, J., & Goldfarb, D. (2014). Square deal: Lower
bounds and improved relaxations for tensor recovery. Proceedings of the 31st
International Conference on Machine Learning (pp. 73–81).
41 / 41
Oseledets, I. V. (2011). Tensor-train decomposition. SIAM Journal on Scientific
Computing, 33, 2295–2317.
Phien, H. N., Tuan, H. D., Bengua, J. A., & Do, M. N. (2016). Efficient tensor
completion: Low-rank tensor train. arXiv preprint arXiv:1601.01083.
Signoretto, M., Lathauwer, L. D., & Suykens, J. A. K. (2013). Learning tensors in
reproducing kernel Hilbert spaces with multilinear spectral penalties. CoRR,
abs/1310.4977.
Sun, J., Qu, Q., & Wright, J. (2015). When are nonconvex problems not scary?
arXiv preprint arXiv:1510.06096.
Suzuki, T. (2015). Convergence rate of Bayesian tensor estimator and its minimax
optimality. Proceedings of the 32nd International Conference on Machine
Learning (ICML2015) (pp. 1273–1282).
Tomioka, R., & Suzuki, T. (2013). Convex tensor decomposition via structured
Schatten norm regularization. Advances in Neural Information Processing
Systems 26 (pp. 1331–1339). NIPS2013.
Tomioka, R., Suzuki, T., Hayashi, K., & Kashima, H. (2011). Statistical
performance of convex tensor decomposition. Advances in Neural Information
Processing Systems 24 (pp. 972–980). NIPS2011.
Tucker, L. R. (1966). Some mathematical notes on three-mode factor analysis.
Psychometrika, 31, 279–311.
41 / 41
Yokota, T., Zdunek, R., Cichocki, A., & Yamashita, Y. (2015a). Smooth
nonnegative matrix and tensor factorizations for robust multi-way data analysis.
Signal Processing, 113, 234–249.
Yokota, T., Zhao, Q., & Cichocki, A. (2015b). Smooth PARAFAC decomposition for
tensor completion. arXiv preprint arXiv:1505.06611.
Zdunek, R. (2012). Approximation of feature vectors in nonnegative matrix
factorization with Gaussian radial basis functions. International Conference on
Neural Information Processing (pp. 616–623).
Zheng, Q., & Tomioka, R. (2015). Interpolating convex and non-convex tensor
decompositions via the subspace norm. Advances in Neural Information
Processing Systems (pp. 3088–3095).
41 / 41
Weitere ähnliche Inhalte

Was ist angesagt?

K-means, EM and Mixture models
K-means, EM and Mixture modelsK-means, EM and Mixture models
K-means, EM and Mixture models
Vu Pham
 
CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures
CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: MixturesCVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures
CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures
zukun
 

Was ist angesagt? (20)

Lecture 12 (Image transformation)
Lecture 12 (Image transformation)Lecture 12 (Image transformation)
Lecture 12 (Image transformation)
 
The world of loss function
The world of loss functionThe world of loss function
The world of loss function
 
Macrocanonical models for texture synthesis
Macrocanonical models for texture synthesisMacrocanonical models for texture synthesis
Macrocanonical models for texture synthesis
 
ARCHITECTURAL CONDITIONING FOR DISENTANGLEMENT OF OBJECT IDENTITY AND POSTURE...
ARCHITECTURAL CONDITIONING FOR DISENTANGLEMENT OF OBJECT IDENTITY AND POSTURE...ARCHITECTURAL CONDITIONING FOR DISENTANGLEMENT OF OBJECT IDENTITY AND POSTURE...
ARCHITECTURAL CONDITIONING FOR DISENTANGLEMENT OF OBJECT IDENTITY AND POSTURE...
 
Lecture 19: Implementation of Histogram Image Operation
Lecture 19: Implementation of Histogram Image OperationLecture 19: Implementation of Histogram Image Operation
Lecture 19: Implementation of Histogram Image Operation
 
Analysis_molf
Analysis_molfAnalysis_molf
Analysis_molf
 
Brief intro : Invariance and Equivariance
Brief intro : Invariance and EquivarianceBrief intro : Invariance and Equivariance
Brief intro : Invariance and Equivariance
 
Dynamic response of structures with uncertain properties
Dynamic response of structures with uncertain propertiesDynamic response of structures with uncertain properties
Dynamic response of structures with uncertain properties
 
Quantitative Propagation of Chaos for SGD in Wide Neural Networks
Quantitative Propagation of Chaos for SGD in Wide Neural NetworksQuantitative Propagation of Chaos for SGD in Wide Neural Networks
Quantitative Propagation of Chaos for SGD in Wide Neural Networks
 
Mgm
MgmMgm
Mgm
 
SchNet: A continuous-filter convolutional neural network for modeling quantum...
SchNet: A continuous-filter convolutional neural network for modeling quantum...SchNet: A continuous-filter convolutional neural network for modeling quantum...
SchNet: A continuous-filter convolutional neural network for modeling quantum...
 
Second Order Perturbations During Inflation Beyond Slow-roll
Second Order Perturbations During Inflation Beyond Slow-rollSecond Order Perturbations During Inflation Beyond Slow-roll
Second Order Perturbations During Inflation Beyond Slow-roll
 
G234247
G234247G234247
G234247
 
K-means, EM and Mixture models
K-means, EM and Mixture modelsK-means, EM and Mixture models
K-means, EM and Mixture models
 
CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures
CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: MixturesCVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures
CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures
 
Adaptive dynamic programming algorithm for uncertain nonlinear switched systems
Adaptive dynamic programming algorithm for uncertain nonlinear switched systemsAdaptive dynamic programming algorithm for uncertain nonlinear switched systems
Adaptive dynamic programming algorithm for uncertain nonlinear switched systems
 
Implicit schemes for wave models
Implicit schemes for wave modelsImplicit schemes for wave models
Implicit schemes for wave models
 
02 2d systems matrix
02 2d systems matrix02 2d systems matrix
02 2d systems matrix
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
Particle filtering
Particle filteringParticle filtering
Particle filtering
 

Andere mochten auch

PAC-Bayesian Bound for Gaussian Process Regression and Multiple Kernel Additi...
PAC-Bayesian Bound for Gaussian Process Regression and Multiple Kernel Additi...PAC-Bayesian Bound for Gaussian Process Regression and Multiple Kernel Additi...
PAC-Bayesian Bound for Gaussian Process Regression and Multiple Kernel Additi...
Taiji Suzuki
 

Andere mochten auch (20)

機械学習におけるオンライン確率的最適化の理論
機械学習におけるオンライン確率的最適化の理論機械学習におけるオンライン確率的最適化の理論
機械学習におけるオンライン確率的最適化の理論
 
Sparse estimation tutorial 2014
Sparse estimation tutorial 2014Sparse estimation tutorial 2014
Sparse estimation tutorial 2014
 
Introduction of "TrailBlazer" algorithm
Introduction of "TrailBlazer" algorithmIntroduction of "TrailBlazer" algorithm
Introduction of "TrailBlazer" algorithm
 
Jokyokai
JokyokaiJokyokai
Jokyokai
 
PAC-Bayesian Bound for Gaussian Process Regression and Multiple Kernel Additi...
PAC-Bayesian Bound for Gaussian Process Regression and Multiple Kernel Additi...PAC-Bayesian Bound for Gaussian Process Regression and Multiple Kernel Additi...
PAC-Bayesian Bound for Gaussian Process Regression and Multiple Kernel Additi...
 
Ibis2016
Ibis2016Ibis2016
Ibis2016
 
統計的学習理論チュートリアル: 基礎から応用まで (Ibis2012)
統計的学習理論チュートリアル: 基礎から応用まで (Ibis2012)統計的学習理論チュートリアル: 基礎から応用まで (Ibis2012)
統計的学習理論チュートリアル: 基礎から応用まで (Ibis2012)
 
Interaction Networks for Learning about Objects, Relations and Physics
Interaction Networks for Learning about Objects, Relations and PhysicsInteraction Networks for Learning about Objects, Relations and Physics
Interaction Networks for Learning about Objects, Relations and Physics
 
Conditional Image Generation with PixelCNN Decoders
Conditional Image Generation with PixelCNN DecodersConditional Image Generation with PixelCNN Decoders
Conditional Image Generation with PixelCNN Decoders
 
InfoGAN: Interpretable Representation Learning by Information Maximizing Gen...
InfoGAN: Interpretable Representation Learning by Information Maximizing Gen...InfoGAN: Interpretable Representation Learning by Information Maximizing Gen...
InfoGAN: Interpretable Representation Learning by Information Maximizing Gen...
 
Introduction of “Fairness in Learning: Classic and Contextual Bandits”
Introduction of “Fairness in Learning: Classic and Contextual Bandits”Introduction of “Fairness in Learning: Classic and Contextual Bandits”
Introduction of “Fairness in Learning: Classic and Contextual Bandits”
 
Improving Variational Inference with Inverse Autoregressive Flow
Improving Variational Inference with Inverse Autoregressive FlowImproving Variational Inference with Inverse Autoregressive Flow
Improving Variational Inference with Inverse Autoregressive Flow
 
Learning to learn by gradient descent by gradient descent
Learning to learn by gradient descent by gradient descentLearning to learn by gradient descent by gradient descent
Learning to learn by gradient descent by gradient descent
 
Value iteration networks
Value iteration networksValue iteration networks
Value iteration networks
 
Fast and Probvably Seedings for k-Means
Fast and Probvably Seedings for k-MeansFast and Probvably Seedings for k-Means
Fast and Probvably Seedings for k-Means
 
Differential privacy without sensitivity [NIPS2016読み会資料]
Differential privacy without sensitivity [NIPS2016読み会資料]Differential privacy without sensitivity [NIPS2016読み会資料]
Differential privacy without sensitivity [NIPS2016読み会資料]
 
Safe and Efficient Off-Policy Reinforcement Learning
Safe and Efficient Off-Policy Reinforcement LearningSafe and Efficient Off-Policy Reinforcement Learning
Safe and Efficient Off-Policy Reinforcement Learning
 
Dual Learning for Machine Translation (NIPS 2016)
Dual Learning for Machine Translation (NIPS 2016)Dual Learning for Machine Translation (NIPS 2016)
Dual Learning for Machine Translation (NIPS 2016)
 
Matching networks for one shot learning
Matching networks for one shot learningMatching networks for one shot learning
Matching networks for one shot learning
 
有名論文から学ぶディープラーニング 2016.03.25
有名論文から学ぶディープラーニング 2016.03.25有名論文から学ぶディープラーニング 2016.03.25
有名論文から学ぶディープラーニング 2016.03.25
 

Ähnlich wie Minimax optimal alternating minimization \\ for kernel nonparametric tensor learning

slides_low_rank_matrix_optim_farhad
slides_low_rank_matrix_optim_farhadslides_low_rank_matrix_optim_farhad
slides_low_rank_matrix_optim_farhad
Farhad Gholami
 
A New Enhanced Method of Non Parametric power spectrum Estimation.
A New Enhanced Method of Non Parametric power spectrum Estimation.A New Enhanced Method of Non Parametric power spectrum Estimation.
A New Enhanced Method of Non Parametric power spectrum Estimation.
CSCJournals
 
presentation
presentationpresentation
presentation
jie ren
 
New data structures and algorithms for \\post-processing large data sets and ...
New data structures and algorithms for \\post-processing large data sets and ...New data structures and algorithms for \\post-processing large data sets and ...
New data structures and algorithms for \\post-processing large data sets and ...
Alexander Litvinenko
 

Ähnlich wie Minimax optimal alternating minimization \\ for kernel nonparametric tensor learning (20)

Random Matrix Theory and Machine Learning - Part 4
Random Matrix Theory and Machine Learning - Part 4Random Matrix Theory and Machine Learning - Part 4
Random Matrix Theory and Machine Learning - Part 4
 
MVPA with SpaceNet: sparse structured priors
MVPA with SpaceNet: sparse structured priorsMVPA with SpaceNet: sparse structured priors
MVPA with SpaceNet: sparse structured priors
 
Decomposition and Denoising for moment sequences using convex optimization
Decomposition and Denoising for moment sequences using convex optimizationDecomposition and Denoising for moment sequences using convex optimization
Decomposition and Denoising for moment sequences using convex optimization
 
Tucker tensor analysis of Matern functions in spatial statistics
Tucker tensor analysis of Matern functions in spatial statistics Tucker tensor analysis of Matern functions in spatial statistics
Tucker tensor analysis of Matern functions in spatial statistics
 
Doubly Accelerated Stochastic Variance Reduced Gradient Methods for Regulariz...
Doubly Accelerated Stochastic Variance Reduced Gradient Methods for Regulariz...Doubly Accelerated Stochastic Variance Reduced Gradient Methods for Regulariz...
Doubly Accelerated Stochastic Variance Reduced Gradient Methods for Regulariz...
 
ENBIS 2018 presentation on Deep k-Means
ENBIS 2018 presentation on Deep k-MeansENBIS 2018 presentation on Deep k-Means
ENBIS 2018 presentation on Deep k-Means
 
QMC: Operator Splitting Workshop, Using Sequences of Iterates in Inertial Met...
QMC: Operator Splitting Workshop, Using Sequences of Iterates in Inertial Met...QMC: Operator Splitting Workshop, Using Sequences of Iterates in Inertial Met...
QMC: Operator Splitting Workshop, Using Sequences of Iterates in Inertial Met...
 
sdof-1211798306003307-8.pptx
sdof-1211798306003307-8.pptxsdof-1211798306003307-8.pptx
sdof-1211798306003307-8.pptx
 
slides_low_rank_matrix_optim_farhad
slides_low_rank_matrix_optim_farhadslides_low_rank_matrix_optim_farhad
slides_low_rank_matrix_optim_farhad
 
Large-scale structure non-Gaussianities with modal methods (Ascona)
Large-scale structure non-Gaussianities with modal methods (Ascona)Large-scale structure non-Gaussianities with modal methods (Ascona)
Large-scale structure non-Gaussianities with modal methods (Ascona)
 
A New Enhanced Method of Non Parametric power spectrum Estimation.
A New Enhanced Method of Non Parametric power spectrum Estimation.A New Enhanced Method of Non Parametric power spectrum Estimation.
A New Enhanced Method of Non Parametric power spectrum Estimation.
 
Fourier_Pricing_ICCF_2022.pdf
Fourier_Pricing_ICCF_2022.pdfFourier_Pricing_ICCF_2022.pdf
Fourier_Pricing_ICCF_2022.pdf
 
Computing the masses of hyperons and charmed baryons from Lattice QCD
Computing the masses of hyperons and charmed baryons from Lattice QCDComputing the masses of hyperons and charmed baryons from Lattice QCD
Computing the masses of hyperons and charmed baryons from Lattice QCD
 
Random Matrix Theory and Machine Learning - Part 3
Random Matrix Theory and Machine Learning - Part 3Random Matrix Theory and Machine Learning - Part 3
Random Matrix Theory and Machine Learning - Part 3
 
OPTIMIZED RATE ALLOCATION OF HYPERSPECTRAL IMAGES IN COMPRESSED DOMAIN USING ...
OPTIMIZED RATE ALLOCATION OF HYPERSPECTRAL IMAGES IN COMPRESSED DOMAIN USING ...OPTIMIZED RATE ALLOCATION OF HYPERSPECTRAL IMAGES IN COMPRESSED DOMAIN USING ...
OPTIMIZED RATE ALLOCATION OF HYPERSPECTRAL IMAGES IN COMPRESSED DOMAIN USING ...
 
presentation
presentationpresentation
presentation
 
Computational methods and vibrational properties applied to materials modeling
Computational methods and vibrational properties applied to materials modelingComputational methods and vibrational properties applied to materials modeling
Computational methods and vibrational properties applied to materials modeling
 
Chapter 2-2.pdf
Chapter 2-2.pdfChapter 2-2.pdf
Chapter 2-2.pdf
 
New data structures and algorithms for \\post-processing large data sets and ...
New data structures and algorithms for \\post-processing large data sets and ...New data structures and algorithms for \\post-processing large data sets and ...
New data structures and algorithms for \\post-processing large data sets and ...
 
Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Cat...
Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Cat...Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Cat...
Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Cat...
 

Mehr von Taiji Suzuki

[ICLR2021 (spotlight)] Benefit of deep learning with non-convex noisy gradien...
[ICLR2021 (spotlight)] Benefit of deep learning with non-convex noisy gradien...[ICLR2021 (spotlight)] Benefit of deep learning with non-convex noisy gradien...
[ICLR2021 (spotlight)] Benefit of deep learning with non-convex noisy gradien...
Taiji Suzuki
 
[NeurIPS2020 (spotlight)] Generalization bound of globally optimal non convex...
[NeurIPS2020 (spotlight)] Generalization bound of globally optimal non convex...[NeurIPS2020 (spotlight)] Generalization bound of globally optimal non convex...
[NeurIPS2020 (spotlight)] Generalization bound of globally optimal non convex...
Taiji Suzuki
 

Mehr von Taiji Suzuki (8)

[ICLR2021 (spotlight)] Benefit of deep learning with non-convex noisy gradien...
[ICLR2021 (spotlight)] Benefit of deep learning with non-convex noisy gradien...[ICLR2021 (spotlight)] Benefit of deep learning with non-convex noisy gradien...
[ICLR2021 (spotlight)] Benefit of deep learning with non-convex noisy gradien...
 
[NeurIPS2020 (spotlight)] Generalization bound of globally optimal non convex...
[NeurIPS2020 (spotlight)] Generalization bound of globally optimal non convex...[NeurIPS2020 (spotlight)] Generalization bound of globally optimal non convex...
[NeurIPS2020 (spotlight)] Generalization bound of globally optimal non convex...
 
深層学習の数理:カーネル法, スパース推定との接点
深層学習の数理:カーネル法, スパース推定との接点深層学習の数理:カーネル法, スパース推定との接点
深層学習の数理:カーネル法, スパース推定との接点
 
Iclr2020: Compression based bound for non-compressed network: unified general...
Iclr2020: Compression based bound for non-compressed network: unified general...Iclr2020: Compression based bound for non-compressed network: unified general...
Iclr2020: Compression based bound for non-compressed network: unified general...
 
数学で解き明かす深層学習の原理
数学で解き明かす深層学習の原理数学で解き明かす深層学習の原理
数学で解き明かす深層学習の原理
 
深層学習の数理
深層学習の数理深層学習の数理
深層学習の数理
 
はじめての機械学習
はじめての機械学習はじめての機械学習
はじめての機械学習
 
Jokyokai2
Jokyokai2Jokyokai2
Jokyokai2
 

Kürzlich hochgeladen

Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Sérgio Sacani
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
PirithiRaju
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
PirithiRaju
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
RohitNehra6
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
Sérgio Sacani
 

Kürzlich hochgeladen (20)

Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptx
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
fundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomologyfundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomology
 

Minimax optimal alternating minimization \\ for kernel nonparametric tensor learning

  • 1. Minimax optimal alternating minimization for kernel nonparametric tensor learning †‡ Taiji Suzuki joint work with † Heishiro Kanagawa, ⋄ Hayato Kobayashi, ⋄ Nobuyuki Shimizu and ⋄ Yukihiro Tagami † Tokyo Institute of Technology Department of Mathematical Computing Sciences ‡ JST, PRESTO and AIP, RIKEN ⋄ Yahoo! Japan. 19th/Jan/2017 PFN 主催 NIPS2016 読み会 1 / 41
  • 2. Outline 1 Introduction 2 Basics of low rank tensor decomposition 3 Nonparametric tensor estimation Alternating minimization Convergence analysis Real data analysis: multitask learning 2 / 41
  • 3. Outline 1 Introduction 2 Basics of low rank tensor decomposition 3 Nonparametric tensor estimation Alternating minimization Convergence analysis Real data analysis: multitask learning 3 / 41
  • 4. High dimensional parameter estimation Vector Sparsity Method Lasso Sure Screening Application Feature selection Gene data analysis 4 / 41
  • 5. High dimensional parameter estimation Vector Sparsity Matrix Low rank Method Lasso Sure Screening Application Feature selection Gene data analysis Method PCA Trace norm reg. Application Dim. Reduction Recommendation system Three layer NN 4 / 41
  • 6. High dimensional parameter estimation Vector Sparsity Matrix Low rank Tensor Low rank Method Lasso Sure Screening Application Feature selection Gene data analysis Method PCA Trace norm reg. Application Dim. Reduction Recommendation system Three layer NN This study Higher order relation 4 / 41
  • 7. “Tensors” in NIPS2016 Zhao Song, David Woodruff, Huan Zhang: “Sublinear Time Orthogonal Tensor Decomposition” Shandian Zhe, Kai Zhang, Pengyuan Wang, Kuang-chih Lee, Zenglin Xu, Yuan Qi, Zoubin Ghahramani “Distributed Flexible Nonlinear Tensor Factorization” Guillaume Rabusseau, Hachem Kadri: “Low-Rank Regression with Tensor Responses” Chuan-Yung Tsai, Andrew M. Saxe, Andrew M. Saxe, David Cox: “Tensor Switching Networks” Tao Wu, Austin R. Benson, David F. Gleich: “General Tensor Spectral Co-clustering for Higher-Order Data” Yining Wang, Anima Anandkumar: “Online and Differentially-Private Tensor Decomposition” Edwin Stoudenmire, David J. Schwab: “Supervised Learning with Tensor Networks” 5 / 41
  • 8. Tensor workshop Amnon Shashua: On depth efficiency of convolutional networks: the use of hierarchical tensor decomposition for network design and analysis. Deep neural network can be formulated as hierarchical Tucker decomposition. (Cohen et al., 2016; Cohen & Shashua, 2016) Three layer NN corresponds to (generalized) CP-decomposition. DNN truly has more expressive power than shallow ones. A(hy ) = ∑Z z=1 ay z (Faz,1 ) ⊗g · · · ⊗g (Faz,N ) Lek-Heng Lim: Tensor network ranks and other interesting talks. 6 / 41
  • 9. This presentation Suzuki, Kanagawa, Kobayashi, Shimizu and Tagami: Minimax optimal alternating minimization for kernel nonparametric tensor learning. NIPS2016, pp. 3783–3791. Nonparametric low rank tensor estimation Alternating minimization method: efficient computation + nice statistical property. After t iterations, the estimation erro is bounded by ˜O ( dKn− 1 1+s + dK ( 3 4 )t ) . Related papers: Suzuki: Convergence rate of Bayesian tensor estimator and its minimax optimality. ICML2015, pp. 1273–1282, 2015. Kanagawa, Suzuki, Kobayashi, Shimizu and Tagami: Gaussian process nonparametric tensor estimator and its minimax optimality. ICML2016, pp. 1632–1641, 2016. 7 / 41
  • 10. Error bound comparison Parametric tensor model (CP-decomposition) Method Least squares Convex reg. Bayes via matricization Error bound ∏K k=1 Mk n dK/2 √∏ k Mk n d( ∑K k=1 Mk ) log(n) n K: dimension of the tensor, d: rank, Mk : size Convex reg.: Tomioka et al. (2011); Tomioka and Suzuki (2013); Zheng and Tomioka (2015); Mu et al. (2014) Bayes: Suzuki (2015) Nonparametric tensor model (CP-decomposition) Method Naive method Bayes/ Alternating min. Error bound n− 1 1+Ks dKn− 1 1+s K: size, s: complexity of the model space Bayes: Kanagawa et al. (2016) Alternating minimization: This paper. 8 / 41
  • 11. Outline 1 Introduction 2 Basics of low rank tensor decomposition 3 Nonparametric tensor estimation Alternating minimization Convergence analysis Real data analysis: multitask learning 9 / 41
  • 13. Tensor rank: CP-rank = A B C + +…+ a1 b1 c1 a2 b2 c2 ad bd cd = CP-decomposition Canonical Polyadic decomp. (Hitchcock, 1927; Hitchcock, 1927) CANDECOMP/PARAFAC (Carroll Chang, 1970; Harshman, 1970) Xijk = ∑d r=1 air bjr ckr =: [[A, B, C]]. CP-decomposition defines CP-rank of a tensor. CP-decomposition is NP-hard. (But under a mild assumption, it can be solved efficiently (De Lathauwer, 2006; De Lathauwer et al., 2004; Leurgans et al., 1993)) Orthogonal decomposition does not necessary exist (even for symmetric tensor). 11 / 41
  • 14. Tensor rank: Tucker-rank = X G A B C Tucker-decomposition (Tucker, 1966) Xijk = ∑r1 l=1 ∑r2 m=1 ∑r3 n=1 glmnail bjmckn =: [[G; A, B, C]]. G is called core tensor. Tucker-rank = (r1, r2, r3) 12 / 41
  • 15. Matricization Mode-k unfolding: A(k) ∈ RMk ×N/Mk , (N = K∏ k=1 Mk ). A A (k) G A B C A Gx2Bx3C rk = rank(A(k) ) gives the Tucker-rank. 13 / 41
  • 16. Other tensor decomposition models Tensor train (Oseledets, 2011) Ti1,i2,...,iK = ∑ α1,...,αK−1 G1(i1, α1)G2(α1, i2, α2) · · · GK−1(αK−2, iK−1, αK−1)GK (αK−1, iK ) i2 i3 i4 i1 i5 G1 G2 G3 G4 G5 Tensor network 14 / 41
  • 17. Applications Recommendation system Relational data Multi-task learning Signal processing (space (2D) × time) Natural language processing (vector representation of words) 1 13 1 2 2 2 4 2 4 21 3 2 4 1 2 3 2 3 4 2 1 3 2 1 4 1 13 2 4 41 4 1 3 3 2 1 3 2 User Item Context Rating Prediction Tensor completion 15 / 41
  • 18. Other applications EEG analysis (De Vos et al., 2007) time × frequency × space EEG monitoring: Epileptic seizure onset localization Denoising by tensor train (Phien et al., 2016) Recovery by different tensor learning methods Casting an image into a higher-order tensor 16 / 41
  • 19. Outline 1 Introduction 2 Basics of low rank tensor decomposition 3 Nonparametric tensor estimation Alternating minimization Convergence analysis Real data analysis: multitask learning 17 / 41
  • 20. Nonparametric tensor regression model Nonparametric regression model yi = f (xi) + ϵi. Goal: Estimate f from the data Dn = {xi , yi }n i=1. Nonparametric tensor model: f (x(1) , . . . , x(K) ) = d∑ r=1 f (1) r (x(1) ) × · · · × f (K) r (x(K) ) We suppose that f (k) r ∈ H where H is an RKHS. Parametric tensor model: f (x(1) , . . . , x(K) ) = d∑ r=1 ⟨x(1) , u(1) r ⟩ × · · · × ⟨x(K) , u(K) r ⟩ Matrix case: ∑d r=1⟨x(1) , u (1) r ⟩⟨x(2) , u (2) r ⟩ = (x(1) )⊤ ( d∑ r=1 u(1) r u(2) r ⊤ ) matrix x(2) . 18 / 41
  • 21. Application: Nonlinear recommendation f (x(1) , x(2) ) = x(1)⊤ Ax(2) = d∑ r=1 ⟨x(1) , u(1) r ⟩⟨u(2) r , x(2) ⟩ x(1) : User feature,x(2) : Movie feature. 19 / 41
  • 22. Application: Nonlinear recommendation f (x(1) , x(2) ) = d∑ r=1 f (1) r (x(1) )f (2) r (x(2) ) x(1) : User feature,x(2) : Movie feature. 19 / 41
  • 23. Application: Smoothing (De Vos et al., 2007) (Cao et al., 2016) min {u (k) r }r,k X − d∑ r=1 u(1) r ⊗ u(2) r ⊗ u(3) r 2 + d∑ r=1 3∑ k=1 u(k)⊤ r Gu(k) r Smoothing t u1 u2 u3 u4 u5 u⊤ Gu = ∑ j (uj − uj+1)2 Smoothing ⇒ Kernel method (Zdunek, 2012; Yokota et al., 2015a; Yokota et al., 2015b) 20 / 41
  • 24. Application: Multi-task learning Tasktype1 Task type 2 Function (f*) Related tasks aligned with two indexes (s, t). f(s,t): the regression function for task (s, t). fr (x) (r = 1, . . . , d): factors behind tasks that give an expression of f(s,t) as f(s,t)(x) = d∑ r=1 βr,(s,t) fr (x) Latent factor = d∑ r=1 αr,sαr,tfr (x) We estimate αr,s ∈ R, αr,t ∈ R, fr ∈ Hr by using Gaussian process prior. 21 / 41
  • 25. Estimation methods f (x(1) , . . . , x(K) ) = d∑ r=1 f (1) r (x(1) ) × · · · × f (K) r (x(K) ) 1. Alternating minimization (MAP estimator) (NIPS2016) Repeating convex optimization. Fast computation. Stronger assumptions are required for minimax optimality. Local optimality is still problematic. 2. Bayes estimator (ICML2016) Nice statistical performance. Minimax optimal. Heavy computation. 3. Convex regularization (Signoretto et al., 2013)   Question Estimation error guarantee: How does the error decrease? Is it optimal? Computational complexity? Performance on real data? 22 / 41
  • 26. Outline 1 Introduction 2 Basics of low rank tensor decomposition 3 Nonparametric tensor estimation Alternating minimization Convergence analysis Real data analysis: multitask learning 23 / 41
  • 27. Alternating minimization method Update f (k) r for a chosen (r, k) while other components are fixed: F({f (k) r }r,k ) := 1 n n∑ i=1 ( yi − d∗ ∑ r=1 K∏ k=1 f (k) r (x (k) i ) )2 (Empirical error) ˆf (k) r ← arg min f (k) r ∈H { F(f (k) r |{ˆf (k′ ) r′ }(r′,k′)̸=(r,k)) + λ∥f (k) r ∥2 H } . The objective function is non-convex. But it is convex w.r.t. one component f (k) r (kernel ridge regression). It should converge to a local optimal (e.g. coordinate descent). 24 / 41
  • 28. Reproducing Kernel Hilbert Space (RKHS) kernel function ⇔ Reproducing Kernel Hilbert Space (RKHS) k(x, x′ ) ⇔ Hk Reproducibility: for f ∈ Hk, the function value at x is recovered as f (x) = ⟨f , k(x, ·)⟩Hk . Representer theorem: min f ∈H 1 n n∑ i=1 (yi − f (xi ))2 + C∥f ∥2 H ⇐⇒ min α∈Rn 1 n n∑ i=1 ( yi − n∑ j=1 αj k(xj , xi ) )2 + Cα⊤ Kα, where Ki,j = k(xi , xj ). ˆf = ∑n i=1 αi k(xi , ·). Gaussian kernel Polynomial kernel Graph kernel, time series kernel, ... 25 / 41
  • 29. Outline 1 Introduction 2 Basics of low rank tensor decomposition 3 Nonparametric tensor estimation Alternating minimization Convergence analysis Real data analysis: multitask learning 26 / 41
  • 30. Complexity of the RKHS 0 s 1: representing complexity of the model. Spectrum decomposition: k(x, x′ ) = ∑∞ ℓ=1 µℓϕℓ(x)ϕℓ(x′ ), where {ϕℓ}∞ ℓ=1 is ONS in L2(P). Spectrum Condition (s) There exists 0 s 1 such that µℓ ≤ Cℓ− 1 s (∀ℓ). s represents the complexity of RKHS. Large s means complex. Small s means simple. The optimal learning rate in a single kernel learning setting is ∥ˆf − f ∗ ∥2 L2(P) = Op(n− 1 1+s ). 27 / 41
  • 31. Convergence of alternating minimization method Assumption f ∗ satisfies the incoherence condition (its definition is in the next slide). P(X) = P(X1) × · · · × P(XK ). Some other technical conditions. ˆf [t] : the estimator after the t-th step. Theorem (Main result) There exit constants C1, C2 such that, if d(ˆf [0] , f ∗ ) ≤ C1, then with probability 1 − δ, we have ∥ˆf [t] − f ∗ ∥2 L2(P) ≤ C2 ( d∗ K (3/4) t Optimization error + d∗ Kn− 1 1+s Estimation error log(1/δ) ) . Linear convergence to a local optimal (log(n) times update is sufficient) d(ˆf [0] , f ∗ ) = Op(1) =⇒ ∥ˆf [log(n)] − f ∗ ∥2 L2 = Op ( d∗ Kn− 1 1+s ) . If we start from a good initial point, then it achieves the minimax optimal error. Naive method: O(n− 1 1+Ks ) (curse of dimensionality). 28 / 41
  • 32. Details of technical conditions Incoherence: ∃µ∗ 1 s.t. |⟨f ∗(k) r , f ∗(k) r′ ⟩| ≤ µ∗ ∥f ∗(k) r ∥L2 ∥f ∗(k) r′ ∥L2 (r ̸= r′ ). fr*(k) fr'*(k) Lower and upper bound of f ∗ : 0 vmin ≤ ∥f ∗(k) r ∥L2 ≤ vmax (∀r, k). sup-norm condition: 0 ∃s2 1 s.t. ∥f ∥∞ ≤ C∥f ∥1−s2 L2 ∥f ∥s2 H (∀f ∈ H) 29 / 41
  • 33. Details of technical conditions Incoherence: ∃µ∗ 1 s.t. |⟨f ∗(k) r , f ∗(k) r′ ⟩| ≤ µ∗ ∥f ∗(k) r ∥L2 ∥f ∗(k) r′ ∥L2 (r ̸= r′ ). fr*(k) fr'*(k) Lower and upper bound of f ∗ : 0 vmin ≤ ∥f ∗(k) r ∥L2 ≤ vmax (∀r, k). sup-norm condition: 0 ∃s2 1 s.t. ∥f ∥∞ ≤ C∥f ∥1−s2 L2 ∥f ∥s2 H (∀f ∈ H) For vr = ∥ ∏K k=1 f ∗(k) r ∥L2 , ˆvr = ∥ ∏K k=1 ˆf (k) r ∥L2 , f ∗∗(k) r = f ∗(k) r ∥f ∗(k) r ∥L2 , ˆˆf (k) r = ˆf (k) r ∥ˆf (k) r ∥L2 , d(ˆf , f ∗ ) = max (r,k) {|ˆvr − vr | + vr ∥ˆˆf (k) r − f ∗∗(k) r ∥L2 }. 29 / 41
  • 34. Illustration of the theoretical result True Pred. Error Emp. Error True Pred. Risk Emp. Risk True Pred. Risk Emp. Risk Small sample Large sample The predictive risk shapes like a convex function locally around the true function. The empirical risk gets closer to the predictive one as the sample size increases. Technique: Local Rademacher complexity 30 / 41
  • 35. Tools used in the proof Rademacher complexity H: function space Ex [ sup h∈H 1 n n∑ i=1 h(xi ) Empirical error − E[h] Predictive error ] ≤2Ex,σ [ sup h∈H 1 n n∑ i=1 σi h(xi ) ] ≤ C √ n where {σi }n i=1 are i.i.d. Rademacher random va- riables (P(σi = 1) = P(σi = −1)). Pred. Risk Emp. Risk Uniform bound Local Rademacher complexity + peeling device: Utilize strong convexity of the squared loss Ex,σ [ sup h∈H |1 n ∑n i=1 σi h(xi )| ∥h∥L2 + λ ] ≤ C λ− s 2 √ n ∨ λ− 1 2 n− 1 1+s Tighter around the true function Pred. Risk Emp. Risk Uniform bound 31 / 41
  • 36. Convergence of alternating minimization 0 5 10 15 20 25 Number of iterations 10-4 10-3 10-2 10-1 100relativeMSE n=400 n=800 n=1200 n=1600 n=2000 n=2400 n=2800 Relative MSE E[∥ˆf [t] − f ∗ ∥2 ] v.s. the number of iteration t for different sample sizes n. 32 / 41
  • 37. Minimax optimality The derived upper bound is minimax optimal (up to a constant). A set of tensors with rank d∗ : H(d∗,K)(R) := { f = d∗ ∑ r=1 K∏ k=1 f (k) r f (k) r ∈ H(r,k)(R) } . Theorem (Minimax risk) inf ˆf sup f ∗∈H(d∗,K)(R) E[∥f ∗ − ˆf ∥2 L2(PX )] ≳ d∗ Kn− 1 1+s , where inf is taken over all estimators ˆf . The Bayes estimator attains the minimax risk. 33 / 41
  • 38. Issue of local optimality The convergence is only proven for a good initial solution that is sufficiently close to the optimal one. True Pred. Risk Emp. Risk Question: Does the algorithm converge to the global optimal? → Open question. 34 / 41
  • 39. NIPS2016 papers about local optimality ■ Every local minimum of the matrix completion problem is the global minimum (with high probability). min U∈RM×k ∑ (i,j)∈E (Yi,j − (UU⊤ )i,j )2 Rong Ge, Jason D. Lee, Tengyu Ma: “Matrix Completion has No Spurious Local Minimum.” Srinadh Bhojanapalli, Behnam Neyshabur, Nati Srebro: “Global Optimality of Local Search for Low Rank Matrix Recovery.” ■ Deep NN also satisfies a similar property. Kenji Kawaguchi: “Deep Learning without Poor Local Minima.” (Essentially, the proof is valid only for linear deep neural network.) Strictly saddle function: Every critical point has a negative curvature direction or is the global optimum. Trust region method (Conn et al., 2000), noisy stochastic gradient (Ge et al., 2015) can reach the global optimum for strictly saddle obj (Sun et al., 2015). 35 / 41
  • 40. Outline
1 Introduction
2 Basics of low rank tensor decomposition
3 Nonparametric tensor estimation
  Alternating minimization
  Convergence analysis
  Real data analysis: multitask learning
36 / 41
  • 41. Numerical experiments on real data
Multi-task learning [nonlinear regression]
Restaurant data: multi-task learning, 138 customers × 3 aspects (138 × 3 tasks).
We want to predict the 3-level rating of a restaurant for each customer and each aspect.
Each restaurant is described by a 44-dimensional feature vector.
37 / 41
  • 42. Numerical experiments on real data
Multi-task learning [nonlinear regression]
Restaurant data: multi-task learning, 138 customers × 3 aspects (138 × 3 tasks).
We want to predict the 3-level rating of a restaurant for each customer and each aspect.
Each restaurant is described by a 44-dimensional feature vector.
[Figure: MSE (about 0.4 to 1.8) vs. sample size (500 to 2500) for GRBF, GRBF(2)+lin(1), GRBF(1)+lin(2), linear, scaled latent, ALS (Best), ALS (50).]
The nonparametric methods (Bayes and alternating minimization) achieved the best performance.
37 / 41
  • 43. Multi-task learning [nonlinear regression]
Restaurant data: multi-task learning, 138 customers × 3 aspects (138 × 3 tasks), with a different kernel between tasks.
[Figure: MSE (about 0.35 to 0.60) vs. sample size (500 to 2500) for AMP(RBF), AMP(Linear), Lin(2)+RBF(1), Lin(1)+RBF(2), GP-MTL.]
GP-MTL: Gaussian process method (Bayes). AMP: alternating minimization method.
The Bayes method is slightly better.
38 / 41
  • 44. Numerical experiments on real data
Multi-task learning [nonlinear regression]
School data: multi-task learning, 139 schools × 3 years (139 × 3 tasks).
[Figure: explained variance (about 28 to 40) vs. sample size (2000 to 10000) for GRBF, GRBF(2)+lin(1), GRBF(1)+lin(2), linear, scaled latent, ALS (50), ALS (Best).]
Explained variance $= 100 \times \frac{\mathrm{Var}(Y) - \mathrm{MSE}}{\mathrm{Var}(Y)}$.
The nonparametric Bayes method and the alternating minimization method achieved the best performance.
39 / 41
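The explained-variance score can be computed directly from the predictions; a small helper with made-up example numbers:

```python
import numpy as np

def explained_variance_pct(y_true, y_pred):
    """100 * (Var(Y) - MSE) / Var(Y): 100 is a perfect fit, 0 is no better than predicting the mean."""
    mse = np.mean((y_true - y_pred) ** 2)
    return 100.0 * (np.var(y_true) - mse) / np.var(y_true)

print(explained_variance_pct(np.array([1.0, 2.0, 3.0, 4.0]),
                             np.array([1.1, 1.9, 3.2, 3.8])))
```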
  • 45. Online shopping sales prediction
Predict the online shopping (Yahoo! shopping) sales: shop × item × customer (508 shops, 100 items).
Predict the number of certain items that a customer will buy in a shop.
A customer is represented by a feature vector.
We construct a kernel defined by the nearest-neighbor graph between shops.
[Figure: MSE (about 10 to 16) vs. sample size (4000 to 14000) for GP-MTL(cosdis), GP-MTL(cossim), AMP(cosdis), AMP(cossim). Sales prediction of online shops; comparison between different metrics between shops.]
40 / 41
  • 46. Summary
The convergence rate of the nonlinear tensor estimator was given.
The alternating minimization method achieves minimax optimality.
The theoretical analysis requires some strong assumptions, for example, on the choice of the initial guess.
Estimation error of the alternating minimization procedure:
  $\|\hat{f}^{[t]} - f^*\|_{L_2(\Pi)}^2 \le C \big( dK\, n^{-\frac{1}{1+s}} + dK (3/4)^t \big)$,
where $\hat{f}^{[t]}$ is the solution at the $t$-th update of the procedure.
41 / 41
  • 47. Cao, W., Wang, Y., Sun, J., Meng, D., Yang, C., Cichocki, A., Xu, Z. (2016). Total variation regularized tensor RPCA for background subtraction from compressive measurements. IEEE Transactions on Image Processing, 25, 4075–4090.
Carroll, J. D., Chang, J.-J. (1970). Analysis of individual differences in multidimensional scaling via an N-way generalization of “Eckart-Young” decomposition. Psychometrika, 35, 283–319.
Cohen, N., Sharir, O., Shashua, A. (2016). On the expressive power of deep learning: A tensor analysis. The 29th Annual Conference on Learning Theory (pp. 698–728).
Cohen, N., Shashua, A. (2016). Convolutional rectifier networks as generalized tensor decompositions. Proceedings of the 33rd International Conference on Machine Learning (pp. 955–963).
Conn, A. R., Gould, N. I., Toint, P. L. (2000). Trust Region Methods, vol. 1. SIAM.
De Lathauwer, L. (2006). A link between the canonical decomposition in multilinear algebra and simultaneous matrix diagonalization. SIAM Journal on Matrix Analysis and Applications, 28, 642–666.
De Lathauwer, L., De Moor, B., Vandewalle, J. (2004). Computation of the canonical decomposition by means of a simultaneous generalized Schur decomposition. SIAM Journal on Matrix Analysis and Applications, 26, 295–327.
41 / 41
  • 48. De Vos, M., Vergult, A., De Lathauwer, L., De Clercq, W., Van Huffel, S., Dupont, P., Palmini, A., Van Paesschen, W. (2007). Canonical decomposition of ictal scalp EEG reliably detects the seizure onset zone. NeuroImage, 37, 844–854.
Ge, R., Huang, F., Jin, C., Yuan, Y. (2015). Escaping from saddle points—online stochastic gradient for tensor decomposition. Proceedings of the 28th Conference on Learning Theory (pp. 797–842).
Harshman, R. A. (1970). Foundations of the PARAFAC procedure: Models and conditions for an “explanatory” multi-modal factor analysis. UCLA Working Papers in Phonetics, 16, 1–84.
Hitchcock, F. L. (1927). Multiple invariants and generalized rank of a p-way matrix or tensor. Journal of Mathematics and Physics, 7, 39–79.
Kanagawa, H., Suzuki, T., Kobayashi, H., Shimizu, N., Tagami, Y. (2016). Gaussian process nonparametric tensor estimator and its minimax optimality. Proceedings of the 33rd International Conference on Machine Learning (ICML2016) (pp. 1632–1641).
Leurgans, S., Ross, R., Abel, R. (1993). A decomposition for three-way arrays. SIAM Journal on Matrix Analysis and Applications, 14, 1064–1083.
Mu, C., Huang, B., Wright, J., Goldfarb, D. (2014). Square deal: Lower bounds and improved relaxations for tensor recovery. Proceedings of the 31st International Conference on Machine Learning (pp. 73–81).
41 / 41
  • 49. Oseledets, I. V. (2011). Tensor-train decomposition. SIAM Journal on Scientific Computing, 33, 2295–2317.
Phien, H. N., Tuan, H. D., Bengua, J. A., Do, M. N. (2016). Efficient tensor completion: Low-rank tensor train. arXiv preprint arXiv:1601.01083.
Signoretto, M., De Lathauwer, L., Suykens, J. A. K. (2013). Learning tensors in reproducing kernel Hilbert spaces with multilinear spectral penalties. CoRR, abs/1310.4977.
Sun, J., Qu, Q., Wright, J. (2015). When are nonconvex problems not scary? arXiv preprint arXiv:1510.06096.
Suzuki, T. (2015). Convergence rate of Bayesian tensor estimator and its minimax optimality. Proceedings of the 32nd International Conference on Machine Learning (ICML2015) (pp. 1273–1282).
Tomioka, R., Suzuki, T. (2013). Convex tensor decomposition via structured Schatten norm regularization. Advances in Neural Information Processing Systems 26 (pp. 1331–1339). NIPS2013.
Tomioka, R., Suzuki, T., Hayashi, K., Kashima, H. (2011). Statistical performance of convex tensor decomposition. Advances in Neural Information Processing Systems 24 (pp. 972–980). NIPS2011.
Tucker, L. R. (1966). Some mathematical notes on three-mode factor analysis. Psychometrika, 31, 279–311.
41 / 41
  • 50. Yokota, T., Zdunek, R., Cichocki, A., Yamashita, Y. (2015a). Smooth nonnegative matrix and tensor factorizations for robust multi-way data analysis. Signal Processing, 113, 234–249.
Yokota, T., Zhao, Q., Cichocki, A. (2015b). Smooth PARAFAC decomposition for tensor completion. arXiv preprint arXiv:1505.06611.
Zdunek, R. (2012). Approximation of feature vectors in nonnegative matrix factorization with Gaussian radial basis functions. International Conference on Neural Information Processing (pp. 616–623).
Zheng, Q., Tomioka, R. (2015). Interpolating convex and non-convex tensor decompositions via the subspace norm. Advances in Neural Information Processing Systems (pp. 3088–3095).
41 / 41