Diese Präsentation wurde erfolgreich gemeldet.
Wir verwenden Ihre LinkedIn Profilangaben und Informationen zu Ihren Aktivitäten, um Anzeigen zu personalisieren und Ihnen relevantere Inhalte anzuzeigen. Sie können Ihre Anzeigeneinstellungen jederzeit ändern.
Nächste SlideShare
×

# Estimating structured vector autoregressive models

2.816 Aufrufe

Veröffentlicht am

ICML2016読み会の資料です

Veröffentlicht in: Technologie
• Full Name
Comment goes here.

Are you sure you want to Yes No
• Als Erste(r) kommentieren

### Estimating structured vector autoregressive models

1. 1. ICML2016 Estimating Structured Vector Autoregressive Models NEC ICML2016 2016/7/21
2. 2. • Vector Autoregressive   • • i.i.d. • →( ) Lasso Group-Lasso i.i.d. • →( ) R ( ) • R ap- y in lar- (1) uch , is dis- ion ani, hine ume nancial time series (Tsay, 2005) to modeling the dynamical systems (Ljung, 1998) and estimating brain function con- nectivity (Valdes-Sosa et al., 2005), among others. A VAR model of order d is deﬁned as xt = A1xt 1 + A2xt 2 + · · · + Adxt d + ✏t , (2) where xt 2 Rp denotes a multivariate time series, Ak 2 Rp⇥p , k = 1, . . . , d are the parameters of the model, and d 1 is the order of the model. In this work, we as- sume that the noise ✏t 2 Rp follows a Gaussian distribu- tion, ✏t ⇠ N(0, ⌃), with E(✏t✏T t ) = ⌃ and E(✏t✏T t+⌧ ) = 0, for ⌧ 6= 0. The VAR process is assumed to be stable and stationary (Lutkepohl, 2007), while the noise covariance matrix ⌃ is assumed to be positive deﬁnite with bounded largest eigenvalue, i.e., ⇤min(⌃) > 0 and ⇤max(⌃) < 1. In the current context, the parameters {Ak} are assumed vec(Y ) = (Ip⇥p ⌦ X)vec(B) + vec(E) y = Z + ✏, where y 2 RNp , Z = (Ip⇥p ⌦ X) 2 RNp⇥dp2 , 2 Rdp2 , ✏ 2 RNp , and ⌦ is the Kronecker product. The covari- ance matrix of the noise ✏ is now E[✏✏T ] = ⌃ ⌦ IN⇥N . Consequently, the regularized estimator takes the form ˆ = argmin 2Rdp2 1 N ||y Z ||2 2 + N R( ), (4) where R( ) can be any vector norm, separable along the rows of matrices Ak. Speciﬁcally, if we denote = [ T 1 . . . T p ]T and Ak(i, :) as the row of matrix Ak for k = 1, . . . , d, then our assumption is equivalent to R( )= pX i=1 R i = pX i=1 R ✓h A1(i, :)T . . .Ad(i, :)T iT ◆ . (5) where A all the for 2 of origi Pd k=1 A and Sec 2.3. Pro In what trix X i results w bounds Deﬁne we assu is distri matrix C C = 2
3. 3.   -   Lasso  [Basu15]   VAR i.i.d. [Banerjee14] VAR   3
4. 4. • i.i.d. • • • high probability ||Z ||2 || ||2 p N = O(1) is a pos- he estimation er- ded by k k2  ote that the norm assumed to be the rom our assump- und on the regu- + w2 (⌦R) N2 ⌘ . As nd d grows and 2 ✏1) + log(p)) we (1+✏1) w2 (⌦R) N2 ◆ . ion, we will show some ⌫ > 0 and p 1 , where Sdp 1 deﬁned as ⌦Ej = R( j) o , for r > T , for j is of size isfy N O N + N2 . With high then the restricted eigenvalue condition ||Z || || ||2 for 2 cone(⌦E) holds, so that  = O(1 itive constant. Moreover, the norm of the est ror in optimization problem (4) is bounded b O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ (cone(⌦Ej )). Note tha compatibility constant (cone(⌦Ej )) is assume same for all j = 1, . . . , p, which follows from o tion in (5). Consider now Theorem 3.3 and the bound on larization parameter N O ⇣ w(⌦R) p N + w2 ( N the dimensionality of the problem p and d the number of samples N increases, the ﬁrst t will dominate the second one w2 (⌦R) N2 . This c by computing N for which the two terms bec 2 ↑ w, the regularization parameter nd on R⇤ [ 1 N ZT ✏]  ↵, for lish the required relationship 2 Rdp |R(u)  1}, and deﬁne be a Gaussian width of set ny ✏1 > 0 and ✏2 > 0 with ( min(✏2 2, ✏1) + log(p)) we (⌦R) p N + c1(1+✏1) w2 (⌦R) N2 ◆ e constants. alue condition, we will show ⌫, for some ⌫ > 0 and L indicate stronger dependency in the data, thus more samples for the RE conditions to hold with h ability. Analyzing Theorems 3.3 and 3.4 we can interpre tablished results as follows. As the size and di ality N, p and d of the problem increase, we e the scale of the results and use the order notatio note the constants. Select a number of sample N O(w2 (⇥)) and let the regularization param isfy N O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ . With high pr then the restricted eigenvalue condition ||Z ||2 || ||2 for 2 cone(⌦E) holds, so that  = O(1) itive constant. Moreover, the norm of the estim ror in optimization problem (4) is bounded by O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ (cone(⌦Ej )). Note that compatibility constant (cone(⌦Ej )) is assumed same for all j = 1, . . . , p, which follows from our tion in (5). y ⌘ > 0, inf 2cone(⌦E ) ||(Ip⇥p⌦X) ||2 || ||2 ⌫, L 2 p M cw(⇥) ⌘ and c, c1, c2 are nts, and L, M are deﬁned in (9) and (13). 3.4, we can choose ⌘ = 1 2 p NL and set 2 p M cw(⇥) ⌘ and since p N > 0 ed, we can establish a lower bound on the ples N: p N > 2 p M+cw(⇥) p L/2 = O(w(⇥)). bound and using (9) and (13), we can con- 3.3 W sep ize su OW 3. To ram VAR h= 1 X(h) = E[Xj,:XT j+h,:] 2 Rdp⇥dp , then we can write inf 1ldp !2[0,2⇡] ⇤l[⇢X(!)]  ⇤k[CU ] 1kNdp  sup 1ldp !2[0,2⇡] ⇤l[⇢X(!)]. The closed form expression of spectral density is ⇢X(!) = I Ae i! 1 ⌃E h I Ae i! 1 i⇤ , where ⌃E is the covariance matrix of a noise vector and A are as deﬁned in expression (6). Thus, an upper bound on CU can be obtained as ⇤max[CU ]  ⇤max(⌃) ⇤min(A) , where we deﬁned ⇤min(A) = min !2[0,2⇡] ⇤min(A(!)) for A(!) = I AT ei! I Ae i! . (12) Referring back to covariance matrix Qa in (11), we get ⇤max[Qa]  ⇤max(⌃)/⇤min(A) = M. (13) We note that for a general VAR model, there might not exist closed-form expressions for ⇤max(A) and ⇤min(A). How- ever, for some special cases there are results establishing the bounds on these quantities (e.g., see Proposition 2.2 in Lemma 3.2 Assume tha condition holds ||Z || for 2 cone(⌦E) an cone(⌦E) is a cone of th || ||2  1 + r where (cone(⌦E)) is a ﬁned as (cone(⌦E)) = Note that the above error and (16) hold, then the (17). However, the result tities, involving Z and ✏, the following we establis Estimating Structured VAR density is that it has a closed form expression (see Section 9.4 of (Priestley, 1981)) ⇢(!)= I dX k=1 Ake ki! ! 1 ⌃ 2 4 I dX k=1 Ake ki! ! 1 3 5 ⇤ , where ⇤ denotes a Hermitian of a matrix. Therefore, from (8) we can establish the following lower bound ⇤min[CX] ⇤min(⌃)/⇤max(A) = L, (9) 3. Regularized Es Denote by = ˆ optimization problem rameter. The focus of under which the optim tees on the accuracy of term is bounded: || || lish such conditions, w et al., 2014). Speciﬁca4
5. 5. • : p vector VAR(d) • • - n - ) h s - n , e e nancial time series (Tsay, 2005) to modeling the dynamical systems (Ljung, 1998) and estimating brain function con- nectivity (Valdes-Sosa et al., 2005), among others. A VAR model of order d is deﬁned as xt = A1xt 1 + A2xt 2 + · · · + Adxt d + ✏t , (2) where xt 2 Rp denotes a multivariate time series, Ak 2 Rp⇥p , k = 1, . . . , d are the parameters of the model, and d 1 is the order of the model. In this work, we as- sume that the noise ✏t 2 Rp follows a Gaussian distribu- tion, ✏t ⇠ N(0, ⌃), with E(✏t✏T t ) = ⌃ and E(✏t✏T t+⌧ ) = 0, for ⌧ 6= 0. The VAR process is assumed to be stable and stationary (Lutkepohl, 2007), while the noise covariance matrix ⌃ is assumed to be positive deﬁnite with bounded largest eigenvalue, i.e., ⇤min(⌃) > 0 and ⇤max(⌃) < 1. In the current context, the parameters {Ak} are assumed timation problems of the form: ˆ = argmin 2Rq 1 M ky Z k2 2 + M R( ) , (1) {(yi, zi), i = 1, . . . , M}, yi 2 R, zi 2 Rq , such = [yT 1 , . . . , yT M ]T and Z = [zT 1 , . . . , zT M ]T , is ining set of M independently and identically dis- d (i.i.d.) samples, M > 0 is a regularization eter, and R(·) denotes a suitable norm (Tibshirani, dings of the 33rd International Conference on Machine g, New York, NY, USA, 2016. JMLR: W&CP volume pyright 2016 by the author(s). model of order d is deﬁned as xt = A1xt 1 + A2xt 2 + · where xt 2 Rp denotes a mult Rp⇥p , k = 1, . . . , d are the par d 1 is the order of the mod sume that the noise ✏t 2 Rp fo tion, ✏t ⇠ N (0, ⌃), with E(✏t✏T t for ⌧ 6= 0. The VAR process is stationary (Lutkepohl, 2007), w matrix ⌃ is assumed to be posi largest eigenvalue, i.e., ⇤min(⌃) In the current context, the para Estimating Structured VAR characterizing sample complexity and error bounds. 2.1. Regularized Estimator To estimate the parameters of the VAR model, we trans- form the model in (2) into the form suitable for regularized estimator (1). Let (x0, x1, . . . , xT ) denote the T + 1 sam- ples generated by the stable VAR model in (2), then stack- ing them together we obtain 2 6 6 6 4 xT d xT d+1 ... xT T 3 7 7 7 5 = 2 6 6 6 4 xT d 1 xT d 2 . . . xT 0 xT d xT d 1 . . . xT 1 ... ... ... ... xT T 1 xT T 2 . . . xT T d 3 7 7 7 5 2 6 6 6 4 AT 1 AT 2 ... AT d 3 7 7 7 5 + 2 6 6 6 4 ✏T d ✏T d+1 ... ✏T T 3 7 7 7 5 which can also be compactly written as Y = XB + E, (3) where Y 2 RN⇥p , X 2 RN⇥dp , B 2 Rdp⇥p , and E 2 RN⇥p for N = T d+1. Vectorizing (column-wise) each matrix in (3), we get vec(Y ) = (Ip⇥p ⌦ X)vec(B) + vec(E) y = Z + ✏, of matrix X (and conseque dependencies, following (B utilize the spectral represen VAR models to control the d 2.2. Stability of VAR Mode Since VAR models are (line analysis we need to establi VAR model (2) is stable, i.e not diverge over time. For u venient to rewrite VAR mod alent VAR model of order 1 2 6 6 6 4 xt xt 1 ... xt (d 1) 3 7 7 7 5 = 2 6 6 6 6 6 4 A1 A2 . . . I 0 . . . 0 I . . . ... ... ... 0 0 . . . | {z A where A 2 Rdp⇥dp . There all the eigenvalues of A sa for 2 C, | | < 1. Equi of original parameters Ak, P (1). Let (x0, x1, . . . , xT ) denote the T + 1 sam- ated by the stable VAR model in (2), then stack- ogether we obtain = 2 6 6 6 4 xT d 1 xT d 2 . . . xT 0 xT d xT d 1 . . . xT 1 ... ... ... ... xT T 1 xT T 2 . . . xT T d 3 7 7 7 5 2 6 6 6 4 AT 1 AT 2 ... AT d 3 7 7 7 5 + 2 6 6 6 4 ✏T d ✏T d+1 ... ✏T T 3 7 7 7 5 also be compactly written as Y = XB + E, (3) 2 RN⇥p , X 2 RN⇥dp , B 2 Rdp⇥p , and E 2 N = T d+1. Vectorizing (column-wise) each (3), we get ec(Y ) = (Ip⇥p ⌦ X)vec(B) + vec(E) Since VAR mo analysis we ne VAR model (2 not diverge ove venient to rewr alent VAR mod 2 6 6 6 4 xt xt 1 ... xt (d 1) 3 7 7 7 5 = 2 6 6 6 6 6 4 | where A 2 R all the eigenva for 2 C, | les generated by the stable VAR model in (2), then stack- ng them together we obtain 2 6 6 6 4 xT d xT d+1 ... xT T 3 7 7 7 5 = 2 6 6 6 4 xT d 1 xT d 2 . . . xT 0 xT d xT d 1 . . . xT 1 ... ... ... ... xT T 1 xT T 2 . . . xT T d 3 7 7 7 5 2 6 6 6 4 AT 1 AT 2 ... AT d 3 7 7 7 5 + 2 6 6 6 4 ✏T d ✏T d+1 ... ✏T T 3 7 7 7 5 hich can also be compactly written as Y = XB + E, (3) here Y 2 RN⇥p , X 2 RN⇥dp , B 2 Rdp⇥p , and E 2 N⇥p for N = T d+1. Vectorizing (column-wise) each matrix in (3), we get vec(Y ) = (Ip⇥p ⌦ X)vec(B) + vec(E) Since V analysi VAR m not div venient alent V 2 6 6 6 4 x xt xt ( where all the ples generated by the stable VAR model in (2), then stack- ing them together we obtain 2 6 6 6 4 xT d xT d+1 ... xT T 3 7 7 7 5 = 2 6 6 6 4 xT d 1 xT d 2 . . . xT 0 xT d xT d 1 . . . xT 1 ... ... ... ... xT T 1 xT T 2 . . . xT T d 3 7 7 7 5 2 6 6 6 4 AT 1 AT 2 ... AT d 3 7 7 7 5 + 2 6 6 6 4 ✏T d ✏T d+1 ... ✏T T 3 7 7 7 5 which can also be compactly written as Y = XB + E, (3) where Y 2 RN⇥p , X 2 RN⇥dp , B 2 Rdp⇥p , and E 2 RN⇥p for N = T d+1. Vectorizing (column-wise) each matrix in (3), we get vec(Y ) = (Ip⇥p ⌦ X)vec(B) + vec(E) ples generated by the stable VAR model in (2), then stack- ing them together we obtain 2 6 6 6 4 xT d xT d+1 ... xT T 3 7 7 7 5 = 2 6 6 6 4 xT d 1 xT d 2 . . . xT 0 xT d xT d 1 . . . xT 1 ... ... ... ... xT T 1 xT T 2 . . . xT T d 3 7 7 7 5 2 6 6 6 4 AT 1 AT 2 ... AT d 3 7 7 7 5 + 2 6 6 6 4 ✏T d ✏T d+1 ... ✏T T 3 7 7 7 5 which can also be compactly written as Y = XB + E, (3) where Y 2 RN⇥p , X 2 RN⇥dp , B 2 Rdp⇥p , and E 2 RN⇥p for N = T d+1. Vectorizing (column-wise) each matrix in (3), we get vec(Y ) = (Ip⇥p ⌦ X)vec(B) + vec(E) ing them together we obtain 2 6 6 6 4 xT d xT d+1 ... xT T 3 7 7 7 5 = 2 6 6 6 4 xT d 1 xT d 2 . . . xT 0 xT d xT d 1 . . . xT 1 ... ... ... ... xT T 1 xT T 2 . . . xT T d 3 7 7 7 5 2 6 6 6 4 AT 1 AT 2 ... AT d 3 7 7 7 5 + 2 6 6 6 4 ✏T d ✏T d+1 ... ✏T T 3 7 7 7 5 which can also be compactly written as Y = XB + E, (3) where Y 2 RN⇥p , X 2 RN⇥dp , B 2 Rdp⇥p , and E 2 RN⇥p for N = T d+1. Vectorizing (column-wise) each matrix in (3), we get vec(Y ) = (Ip⇥p ⌦ X)vec(B) + vec(E) y = Z + ✏, where y 2 RNp , Z = (Ip⇥p ⌦ X) 2 RNp⇥dp2 , 2 Rdp2 , ✏ 2 RNp , and ⌦ is the Kronecker product. The covari- analysis we VAR model not diverge o venient to re alent VAR m 2 6 6 6 4 xt xt 1 ... xt (d 1) 3 7 7 7 5 where A 2 all the eigen for 2 C, of original p Pd k=1 Ak 1 k and Section xT T xT T 1 xT T 2 . . . xT T d AT d ✏T T which can also be compactly written as Y = XB + E, (3) where Y 2 RN⇥p , X 2 RN⇥dp , B 2 Rdp⇥p , and E 2 RN⇥p for N = T d+1. Vectorizing (column-wise) each matrix in (3), we get vec(Y ) = (Ip⇥p ⌦ X)vec(B) + vec(E) y = Z + ✏, where y 2 RNp , Z = (Ip⇥p ⌦ X) 2 RNp⇥dp2 , 2 Rdp2 , ✏ 2 RNp , and ⌦ is the Kronecker product. The covari- ance matrix of the noise ✏ is now E[✏✏T ] = ⌃ ⌦ IN⇥N . Consequently, the regularized estimator takes the form ˆ = argmin 2Rdp2 1 N ||y Z ||2 2 + N R( ), (4) 2 6 6 6 4 xt xt 1 ... xt (d 1) 3 7 7 7 5 = where A 2 all the eigen for 2 C, of original p Pd k=1 Ak 1 k and Section 2.3. Propert In what follo trix X in (3) results will th i.i.d. 5
6. 6.   2 ⌦E     Lasso regularization norms. I the main ideas of our proof te delegated to the supplement. To establish lower bound on N , we derive an upper bou some ↵ > 0, which will estab N ↵ R⇤ [ 1 N ZT ✏]. Theorem 3.3 Let ⌦R = {u 2 w(⌦R) = E[ sup u2⌦R hg, ui] to ⌦R for g ⇠ N(0, I). For a probability at least 1 c exp can establish that R⇤  1 N ZT ✏  ✓ c2(1+✏2) w Thm4.3 Thm4.4 Lem.4.1 Lem.4.2 o hold with high prob- e can interpret the es- e size and dimension- crease, we emphasize order notations to de- er of samples at least ization parameter sat- With high probability tion ||Z ||2 || ||2 p N  = O(1) is a pos- of the estimation er- bounded by k k2  )). Note that the norm ction 3.4 we will present ique, with all the details regularization parameter n R⇤ [ 1 N ZT ✏]  ↵, for the required relationship p |R(u)  1}, and deﬁne a Gaussian width of set > 0 and ✏2 > 0 with min(✏2 2, ✏1) + log(p)) we ) + c1(1+✏1) w2 (⌦R) N2 ◆ nstants. condition, we will show N as showing that large values of M and small va L indicate stronger dependency in the data, thus req more samples for the RE conditions to hold with high ability. Analyzing Theorems 3.3 and 3.4 we can interpret tablished results as follows. As the size and dime ality N, p and d of the problem increase, we emp the scale of the results and use the order notations note the constants. Select a number of samples a N O(w2 (⇥)) and let the regularization paramet isfy N O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ . With high prob then the restricted eigenvalue condition ||Z ||2 || ||2 for 2 cone(⌦E) holds, so that  = O(1) is itive constant. Moreover, the norm of the estimat ror in optimization problem (4) is bounded by k O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ (cone(⌦Ej )). Note that the compatibility constant (cone(⌦Ej )) is assumed to same for all j = 1, . . . , p, which follows from our as o understand the bound on s of M and small values of y in the data, thus requiring ions to hold with high prob- 3.4 we can interpret the es- As the size and dimension- m increase, we emphasize e the order notations to de- number of samples at least gularization parameter sat- R) ⌘ . With high probability condition ||Z ||2 || ||2 p N that  = O(1) is a pos- norm of the estimation er- 4) is bounded by k k2  ⌦Ej )). Note that the norm ⌦Ej )) is assumed to be the 6
7. 7.   2 ⌦E     Lasso regularization norms. I the main ideas of our proof te delegated to the supplement. To establish lower bound on N , we derive an upper bou some ↵ > 0, which will estab N ↵ R⇤ [ 1 N ZT ✏]. Theorem 3.3 Let ⌦R = {u 2 w(⌦R) = E[ sup u2⌦R hg, ui] to ⌦R for g ⇠ N(0, I). For a probability at least 1 c exp can establish that R⇤  1 N ZT ✏  ✓ c2(1+✏2) w Thm4.3 Thm4.4 Lem.4.1 o hold with high prob- e can interpret the es- e size and dimension- crease, we emphasize order notations to de- er of samples at least ization parameter sat- With high probability tion ||Z ||2 || ||2 p N  = O(1) is a pos- of the estimation er- bounded by k k2  )). Note that the norm ction 3.4 we will present ique, with all the details regularization parameter n R⇤ [ 1 N ZT ✏]  ↵, for the required relationship p |R(u)  1}, and deﬁne a Gaussian width of set > 0 and ✏2 > 0 with min(✏2 2, ✏1) + log(p)) we ) + c1(1+✏1) w2 (⌦R) N2 ◆ nstants. condition, we will show N as showing that large values of M and small va L indicate stronger dependency in the data, thus req more samples for the RE conditions to hold with high ability. Analyzing Theorems 3.3 and 3.4 we can interpret tablished results as follows. As the size and dime ality N, p and d of the problem increase, we emp the scale of the results and use the order notations note the constants. Select a number of samples a N O(w2 (⇥)) and let the regularization paramet isfy N O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ . With high prob then the restricted eigenvalue condition ||Z ||2 || ||2 for 2 cone(⌦E) holds, so that  = O(1) is itive constant. Moreover, the norm of the estimat ror in optimization problem (4) is bounded by k O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ (cone(⌦Ej )). Note that the compatibility constant (cone(⌦Ej )) is assumed to same for all j = 1, . . . , p, which follows from our as Lem.4.2 o understand the bound on s of M and small values of y in the data, thus requiring ions to hold with high prob- 3.4 we can interpret the es- As the size and dimension- m increase, we emphasize e the order notations to de- number of samples at least gularization parameter sat- R) ⌘ . With high probability condition ||Z ||2 || ||2 p N that  = O(1) is a pos- norm of the estimation er- 4) is bounded by k k2  ⌦Ej )). Note that the norm ⌦Ej )) is assumed to be the 7
8. 8. Restricted Eigenvalue Condition •   (well-conditioned) • cone Lasso-type • R⇤  1 N ZT ✏  ✓ c2(1+✏2) w(⌦R) p N + c1(1+✏1) w2 (⌦R) N2 ◆ where c, c1 and c2 are positive constants. To establish restricted eigenvalue condition, we will show that inf 2cone(⌦E ) ||(Ip⇥p⌦X) ||2 || ||2 ⌫, for some ⌫ > 0 and then set p N = ⌫. Theorem 3.4 Let ⇥ = cone(⌦Ej ) Sdp 1 , where Sdp 1 is a unit sphere. The error set ⌦Ej is deﬁned as ⌦Ej =n j 2 Rdp R( ⇤ j + j)  R( ⇤ j ) + 1 r R( j) o , for r > 1, j = 1, . . . , p, and = [ T 1 , . . . , T p ]T , for j is of size dp ⇥ 1, and ⇤ = [ ⇤T 1 . . . ⇤T p ]T , for ⇤ j 2 Rdp . The set ⌦Ej is a part of the decomposition in ⌦E = ⌦E1 ⇥ · · · ⇥ ⌦E due to the assumption on the row-wise separability of for itiv ror O ⇣ com sam tion Con lari the the wil by w(⌦ p w(⌦ low   ( )⌦E N ↵ R [N Z ✏]. Theorem 3.3 Let ⌦R = {u 2 Rd w(⌦R) = E[ sup u2⌦R hg, ui] to be ⌦R for g ⇠ N(0, I). For any ✏ probability at least 1 c exp( m can establish that R⇤  1 N ZT ✏  ✓ c2(1+✏2) w(⌦R p N where c, c1 and c2 are positive co To establish restricted eigenvalue that inf 2cone(⌦E ) ||(Ip⇥p⌦X) ||2 || ||2 then set p N = ⌫. Theorem 3.4 Let ⇥ = cone(⌦Ej⌦E 8
9. 9. Restricted Error Set • • • r=2 ⇤ ⇤ + ⌦E (L1)(L1) conecone ce matrix 1.2 of the (11) XT N,: ⇤T 2 ng all the n order to Qa), ob- y stacking as in (8), process in ⇡], where write X(!)]. 1 i⇤ , for some constant r > 1, where R⇤ ⇥ 1 N ZT ✏ ⇤ is a dual form of the vector norm R(·), which is deﬁned as R⇤ [ 1 N ZT ✏] = sup R(U)1 ⌦ 1 N ZT ✏, U ↵ , for U 2 Rdp2 , where U = [uT 1 , uT 2 , . . . , uT p ]T and ui 2 Rdp . Then the error vector k k2 belongs to the set ⌦E= ⇢ 2 Rdp2 R( ⇤ + )  R( ⇤ )+ 1 r R( ) . (15) The second condition in (Banerjee et al., 2014) establishes the upper bound on the estimation error. Lemma 3.2 Assume that the restricted eigenvalue (RE) condition holds ||Z ||2 || ||2 p N, (16) ting Structured VAR tion 3 5 ⇤ , rom (9) r (10) need 3. Regularized Estimation Guarantees Denote by = ˆ ⇤ the error between the solution of optimization problem (4) and ⇤ , the true value of the pa- rameter. The focus of our work is to determine conditions under which the optimization problem in (4) has guaran- tees on the accuracy of the obtained solution, i.e., the error term is bounded: || ||2  for some known . To estab- lish such conditions, we utilize the framework of (Banerjee et al., 2014). Speciﬁcally, estimation error analysis is based on the following known results adapted to our settings. The ﬁrst one characterizes the restricted error set ⌦E, where the error belongs. Lemma 3.1 Assume that  (r>1) ⌦E 2⌦E 9
10. 10.   2 ⌦E     Lasso regularization norms. I the main ideas of our proof te delegated to the supplement. To establish lower bound on N , we derive an upper bou some ↵ > 0, which will estab N ↵ R⇤ [ 1 N ZT ✏]. Theorem 3.3 Let ⌦R = {u 2 w(⌦R) = E[ sup u2⌦R hg, ui] to ⌦R for g ⇠ N(0, I). For a probability at least 1 c exp can establish that R⇤  1 N ZT ✏  ✓ c2(1+✏2) w Thm4.3 Thm4.4 Lem.4.1 Lem.4.2 o hold with high prob- e can interpret the es- e size and dimension- crease, we emphasize order notations to de- er of samples at least ization parameter sat- With high probability tion ||Z ||2 || ||2 p N  = O(1) is a pos- of the estimation er- bounded by k k2  )). Note that the norm ction 3.4 we will present ique, with all the details regularization parameter n R⇤ [ 1 N ZT ✏]  ↵, for the required relationship p |R(u)  1}, and deﬁne a Gaussian width of set > 0 and ✏2 > 0 with min(✏2 2, ✏1) + log(p)) we ) + c1(1+✏1) w2 (⌦R) N2 ◆ nstants. condition, we will show N as showing that large values of M and small va L indicate stronger dependency in the data, thus req more samples for the RE conditions to hold with high ability. Analyzing Theorems 3.3 and 3.4 we can interpret tablished results as follows. As the size and dime ality N, p and d of the problem increase, we emp the scale of the results and use the order notations note the constants. Select a number of samples a N O(w2 (⇥)) and let the regularization paramet isfy N O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ . With high prob then the restricted eigenvalue condition ||Z ||2 || ||2 for 2 cone(⌦E) holds, so that  = O(1) is itive constant. Moreover, the norm of the estimat ror in optimization problem (4) is bounded by k O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ (cone(⌦Ej )). Note that the compatibility constant (cone(⌦Ej )) is assumed to same for all j = 1, . . . , p, which follows from our as o understand the bound on s of M and small values of y in the data, thus requiring ions to hold with high prob- 3.4 we can interpret the es- As the size and dimension- m increase, we emphasize e the order notations to de- number of samples at least gularization parameter sat- R) ⌘ . With high probability condition ||Z ||2 || ||2 p N that  = O(1) is a pos- norm of the estimation er- 4) is bounded by k k2  ⌦Ej )). Note that the norm ⌦Ej )) is assumed to be the 10
11. 11. • 4.2 • • • k k2 ond condition in (Banerjee et al., 2014) establishes er bound on the estimation error. a 3.2 Assume that the restricted eigenvalue (RE) on holds ||Z ||2 || ||2 p N, (16) 2 cone(⌦E) and some constant  > 0, where E) is a cone of the error set, then || ||2  1 + r r N  (cone(⌦E)), (17) (cone(⌦E)) is a norm compatibility constant, de- (cone(⌦E)) = sup R(U) ||U||2 . , density of the VAR process in X(h)e hi! , ! 2 [0, 2⇡], where Rdp⇥dp , then we can write ⇤k[CU ] kNdp  sup 1ldp !2[0,2⇡] ⇤l[⇢X(!)]. on of spectral density is 1 ⌃E h I Ae i! 1 i⇤ , e matrix of a noise vector and A on (6). Thus, an upper bound on max[CU ]  ⇤max(⌃) ⇤min(A) , where we n 2⇡] ⇤min(A(!)) for AT ei! I Ae i! . (12) nce matrix Qa in (11), we get ax(⌃)/⇤min(A) = M. (13) The second condition in (Banerjee et al., 2014 the upper bound on the estimation error. Lemma 3.2 Assume that the restricted eigen condition holds ||Z ||2 || ||2 p N, for 2 cone(⌦E) and some constant  > cone(⌦E) is a cone of the error set, then || ||2  1 + r r N  (cone(⌦E)), where (cone(⌦E)) is a norm compatibility c ﬁned as (cone(⌦E)) = sup U2cone(⌦E ) R(U) ||U||2 . Note that the above error bound is deterministic rocess in ⇡], where rite (!)]. 1 i⇤ , or and A bound on where we (12) e get (13) the upper bound on the estimation error. Lemma 3.2 Assume that the restricted eigenvalue (RE) condition holds ||Z ||2 || ||2 p N, (16) for 2 cone(⌦E) and some constant  > 0, where cone(⌦E) is a cone of the error set, then || ||2  1 + r r N  (cone(⌦E)), (17) where (cone(⌦E)) is a norm compatibility constant, de- ﬁned as (cone(⌦E)) = sup U2cone(⌦E ) R(U) ||U||2 . The second condition in (Banerjee et al., 2014) establishes he upper bound on the estimation error. Lemma 3.2 Assume that the restricted eigenvalue (RE) ondition holds ||Z ||2 || ||2 p N, (16) or 2 cone(⌦E) and some constant  > 0, where one(⌦E) is a cone of the error set, then || ||2  1 + r r N  (cone(⌦E)), (17) where (cone(⌦E)) is a norm compatibility constant, de- ﬁned as (cone(⌦E)) = sup U2cone(⌦E ) R(U) ||U||2 . the upper bound on the estimation error. Lemma 3.2 Assume that the restricted eigenvalue (RE) condition holds ||Z ||2 || ||2 p N, (16) for 2 cone(⌦E) and some constant  > 0, where cone(⌦E) is a cone of the error set, then || ||2  1 + r r N  (cone(⌦E)), (17) where (cone(⌦E)) is a norm compatibility constant, de- ﬁned as (cone(⌦E)) = sup U2cone(⌦E ) R(U) ||U||2 .   E k = 1, . . . , d. Since L1 is decomposable, it can that (cone(⌦Ej ))  4 p s. Next, since ⌦R Rdp |R(u)  1}, then using Lemma 3 in (Baner 2014) and Gaussian width results in (Chandrasekap d A on we 12) 13) Lemma 3.2 Assume that the restricted eigenvalue (RE) condition holds ||Z ||2 || ||2 p N, (16) for 2 cone(⌦E) and some constant  > 0, where cone(⌦E) is a cone of the error set, then || ||2  1 + r r N  (cone(⌦E)), (17) where (cone(⌦E)) is a norm compatibility constant, de- ﬁned as (cone(⌦E)) = sup U2cone(⌦E ) R(U) ||U||2 . Note that the above error bound is deterministic, i.e., if (14) s: 11
12. 12. • • • • Lasso k k2 or 2 cone(⌦E) and some constant  > 0, where one(⌦E) is a cone of the error set, then || ||2  1 + r r N  (cone(⌦E)), (17) where (cone(⌦E)) is a norm compatibility constant, de- ﬁned as (cone(⌦E)) = sup U2cone(⌦E ) R(U) ||U||2 . Note that the above error bound is deterministic, i.e., if (14) nd (16) hold, then the error satisﬁes the upper bound in 17). However, the results are deﬁned in terms of the quan- ities, involving Z and ✏, which are random. Therefore, in he following we establish high probability bounds on the egularization parameter in (14) and RE condition in (16). etails meter , for nship eﬁne of set with ) we R) ◆ show 0 and dp 1 L indicate stronger dependency in the data, thus requiring more samples for the RE conditions to hold with high prob- ability. Analyzing Theorems 3.3 and 3.4 we can interpret the es- tablished results as follows. As the size and dimension- ality N, p and d of the problem increase, we emphasize the scale of the results and use the order notations to de- note the constants. Select a number of samples at least N O(w2 (⇥)) and let the regularization parameter sat- isfy N O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ . With high probability then the restricted eigenvalue condition ||Z ||2 || ||2 p N for 2 cone(⌦E) holds, so that  = O(1) is a pos- itive constant. Moreover, the norm of the estimation er- ror in optimization problem (4) is bounded by k k2  O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ (cone(⌦Ej )). Note that the norm compatibility constant (cone(⌦Ej )) is assumed to be the same for all j = 1, . . . , p, which follows from our assump- tion in (5). Consider now Theorem 3.3 and the bound on the regu- larization parameter N O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ . As the dimensionality of the problem p and d grows and ze and dimension- ase, we emphasize er notations to de- of samples at least tion parameter sat- th high probability n ||Z ||2 || ||2 p N = O(1) is a pos- the estimation er- unded by k k2  Note that the norm s assumed to be the s from our assump- ound on the regu- ) + w2 (⌦R) ⌘ . As N equired relationship u)  1}, and deﬁne aussian width of set 0 and ✏2 > 0 with ✏2 2, ✏1) + log(p)) we c1(1+✏1) w2 (⌦R) N2 ◆ nts. dition, we will show or some ⌫ > 0 and Sdp 1 , where Sdp 1 s deﬁned as ⌦Ej =o Analyzing Theorems 3.3 and 3.4 we can interpret the es- tablished results as follows. As the size and dimension- ality N, p and d of the problem increase, we emphasize the scale of the results and use the order notations to de- note the constants. Select a number of samples at least N O(w2 (⇥)) and let the regularization parameter sat- isfy N O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ . With high probability then the restricted eigenvalue condition ||Z ||2 || ||2 p N for 2 cone(⌦E) holds, so that  = O(1) is a pos- itive constant. Moreover, the norm of the estimation er- ror in optimization problem (4) is bounded by k k2  O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ (cone(⌦Ej )). Note that the norm compatibility constant (cone(⌦Ej )) is assumed to be the same for all j = 1, . . . , p, which follows from our assump- tion in (5). Consider now Theorem 3.3 and the bound on the regu- larization parameter N O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ . As the dimensionality of the problem p and d grows and the number of samples N increases, the ﬁrst term w(⌦R) p N 2 □ 0 and Sdp 1 Ej = r r > of size he set · · · ⇥ lity of ui] to d u 2 2⌘2 + ⌫, c2 are Consider now Theorem 3.3 and the bound on the regu- larization parameter N O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ . As the dimensionality of the problem p and d grows and the number of samples N increases, the ﬁrst term w(⌦R) p N will dominate the second one w2 (⌦R) N2 . This can be seen by computing N for which the two terms become equal w(⌦R) p N = w2 (⌦R) N2 , which happens at N = w 2 3 (⌦R) < w(⌦R). Therefore, we can rewrite our results as fol- lows: once the restricted eigenvalue condition holds and N O ⇣ w(⌦R) p N ⌘ , the error norm is upper-bounded by k k2  O ⇣ w(⌦R) p N ⌘ (cone(⌦Ej )). 3.3. Special Cases While the presented results are valid for any norm R(·), Estimating Structured VAR k = 1, . . . , d. Since L1 is decomposable, it can be shown hat (cone(⌦Ej ))  4 p s. Next, since ⌦R = {u 2 Rdp |R(u)  1}, then using Lemma 3 in (Banerjee et al., 2014) and Gaussian width results in (Chandrasekaran et al., 2012), we can establish that w(⌦R)  O( p log(dp)). Therefore, based on Theorem 4.3 and the discussion at the nd of Section 3.2, the bound on the regularization parame- er takes the form N O ⇣p log(dp)/N ⌘ . Hence, the es- ⇣p ⌘ quence of weights and | |(1) quence of absolute values of , In (Chen & Banerjee, 2015) it O( p log(dp)/¯c), where ¯c is and the norm compatibility co 2c2 1 p s/¯c. Therefore, based on O ⇣p log(dp)/(¯cN) ⌘ and the e by k k2  O ⇣ 2c1 ¯c p s log(dp) Estimating Structured VAR ce L1 is decomposable, it can be shown ))  4 p s. Next, since ⌦R = {u 2 then using Lemma 3 in (Banerjee et al., an width results in (Chandrasekaran et al., stablish that w(⌦R)  O( p log(dp)). on Theorem 4.3 and the discussion at the , the bound on the regularization parame- N O ⇣p log(dp)/N ⌘ . Hence, the es- ounded by k k2  O ⇣p s log(dp)/N ⌘ (log(dp)). quence of weights and | |(1) quence of absolute values of In (Chen & Banerjee, 2015) it O( p log(dp)/¯c), where ¯c is and the norm compatibility co 2c2 1 p s/¯c. Therefore, based on O ⇣p log(dp)/(¯cN) ⌘ and the by k k2  O ⇣ 2c1 ¯c p s log(dp) We note that the bound obtained is similar to the bound obtaine k = 1, . . . , d. Since L1 is decompo that (cone(⌦Ej ))  4 p s. Nex Rdp |R(u)  1}, then using Lemma 2014) and Gaussian width results in 2012), we can establish that w(⌦ Therefore, based on Theorem 4.3 an end of Section 3.2, the bound on the ter takes the form N O ⇣p log(d timation error is bounded by k k2  as long as N > O(log(dp)).  k = 1, . . . , d. Since L1 is decomposable, it ca that (cone(⌦Ej ))  4 p s. Next, since ⌦R Rdp |R(u)  1}, then using Lemma 3 in (Ban 2014) and Gaussian width results in (Chandrase 2012), we can establish that w(⌦R)  O( Therefore, based on Theorem 4.3 and the discu end of Section 3.2, the bound on the regularizat⇣p ⌘12
13. 13.   2 ⌦E     Lasso regularization norms. I the main ideas of our proof te delegated to the supplement. To establish lower bound on N , we derive an upper bou some ↵ > 0, which will estab N ↵ R⇤ [ 1 N ZT ✏]. Theorem 3.3 Let ⌦R = {u 2 w(⌦R) = E[ sup u2⌦R hg, ui] to ⌦R for g ⇠ N(0, I). For a probability at least 1 c exp can establish that R⇤  1 N ZT ✏  ✓ c2(1+✏2) w o understand the bound on s of M and small values of y in the data, thus requiring ions to hold with high prob- 3.4 we can interpret the es- As the size and dimension- m increase, we emphasize e the order notations to de- number of samples at least gularization parameter sat- R) ⌘ . With high probability condition ||Z ||2 || ||2 p N that  = O(1) is a pos- norm of the estimation er- 4) is bounded by k k2  ⌦Ej )). Note that the norm ⌦Ej )) is assumed to be the Thm4.3 Thm4.4 Lem.4.1 Lem.4.2 o hold with high prob- e can interpret the es- e size and dimension- crease, we emphasize order notations to de- er of samples at least ization parameter sat- With high probability tion ||Z ||2 || ||2 p N  = O(1) is a pos- of the estimation er- bounded by k k2  )). Note that the norm ction 3.4 we will present ique, with all the details regularization parameter n R⇤ [ 1 N ZT ✏]  ↵, for the required relationship p |R(u)  1}, and deﬁne a Gaussian width of set > 0 and ✏2 > 0 with min(✏2 2, ✏1) + log(p)) we ) + c1(1+✏1) w2 (⌦R) N2 ◆ nstants. condition, we will show N as showing that large values of M and small va L indicate stronger dependency in the data, thus req more samples for the RE conditions to hold with high ability. Analyzing Theorems 3.3 and 3.4 we can interpret tablished results as follows. As the size and dime ality N, p and d of the problem increase, we emp the scale of the results and use the order notations note the constants. Select a number of samples a N O(w2 (⇥)) and let the regularization paramet isfy N O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ . With high prob then the restricted eigenvalue condition ||Z ||2 || ||2 for 2 cone(⌦E) holds, so that  = O(1) is itive constant. Moreover, the norm of the estimat ror in optimization problem (4) is bounded by k O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ (cone(⌦Ej )). Note that the compatibility constant (cone(⌦Ej )) is assumed to same for all j = 1, . . . , p, which follows from our as13
14. 14. ( ) : R - • R - • • → ) can be any vector norm, separable along the matrices Ak. Speciﬁcally, if we denote = T ]T and Ak(i, :) as the row of matrix Ak for , d, then our assumption is equivalent to pX =1 R i = pX i=1 R ✓h A1(i, :)T . . .Ad(i, :)T iT ◆ . (5) clutter and without loss of generality, we assume R(·) to be the same for each row i. Since the ecouples across rows, it is straightforward to ex- analysis to the case when a different norm is used ow of Ak, e.g., L1 for row one, L2 for row two, t norm (Argyriou et al., 2012) for row three, etc. hat within a row, the norm need not be decompos- results wil bounds fo Deﬁne an we assum is distribu matrix CX CX = 2 6 6 4 where ( since CX bounded a where R( ) can be any vector norm, separable along the rows of matrices Ak. Speciﬁcally, if we denote = [ T 1 . . . T p ]T and Ak(i, :) as the row of matrix Ak for k = 1, . . . , d, then our assumption is equivalent to R( )= pX i=1 R i = pX i=1 R ✓h A1(i, :)T . . .Ad(i, :)T iT ◆ . (5) To reduce clutter and without loss of generality, we assume the norm R(·) to be the same for each row i. Since the analysis decouples across rows, it is straightforward to ex- tend our analysis to the case when a different norm is used for each row of Ak, e.g., L1 for row one, L2 for row two, K-support norm (Argyriou et al., 2012) for row three, etc. Observe that within a row, the norm need not be decompos- A2A1 A3 A4 ˆ = argmin 2Rdp2 1 N ||y Z ||2 2 + N R( ), (4) ) can be any vector norm, separable along the matrices Ak. Speciﬁcally, if we denote = T ]T and Ak(i, :) as the row of matrix Ak for , d, then our assumption is equivalent to pX =1 R i = pX i=1 R ✓h A1(i, :)T . . .Ad(i, :)T iT ◆ . (5) clutter and without loss of generality, we assume R(·) to be the same for each row i. Since the ecouples across rows, it is straightforward to ex- analysis to the case when a different norm is used In what fo trix X in results wil bounds fo Deﬁne an we assum is distribu matrix CX CX = 2 6 6 4 where ( ˆ = argmin 2Rdp2 1 N ||y Z ||2 2 + N R( ), (4) where R( ) can be any vector norm, separable along the rows of matrices Ak. Speciﬁcally, if we denote = [ T 1 . . . T p ]T and Ak(i, :) as the row of matrix Ak for k = 1, . . . , d, then our assumption is equivalent to R( )= pX i=1 R i = pX i=1 R ✓h A1(i, :)T . . .Ad(i, :)T iT ◆ . (5) To reduce clutter and without loss of generality, we assume the norm R(·) to be the same for each row i. Since the analysis decouples across rows, it is straightforward to ex- tend our analysis to the case when a different norm is used A2A1 A3 A4 = argmin 2Rdp2 N ||y Z ||2 + N R( ), (4) ere R( ) can be any vector norm, separable along the ws of matrices Ak. Speciﬁcally, if we denote = . . . T p ]T and Ak(i, :) as the row of matrix Ak for = 1, . . . , d, then our assumption is equivalent to ( )= pX i=1 R i = pX i=1 R ✓h A1(i, :)T . . .Ad(i, :)T iT ◆ . (5) reduce clutter and without loss of generality, we assume norm R(·) to be the same for each row i. Since the lysis decouples across rows, it is straightforward to ex- d our analysis to the case when a different norm is used each row of Ak, e.g., L1 for row one, L2 for row two, support norm (Argyriou et al., 2012) for row three, etc. trix resu bou Deﬁ we is d mat C whe sinc bou on ap- ially in regular- (1) q , such T M ]T , is lly dis- rization shirani, Machine volume ranging from describing the behavior of economic and ﬁ- nancial time series (Tsay, 2005) to modeling the dynamical systems (Ljung, 1998) and estimating brain function con- nectivity (Valdes-Sosa et al., 2005), among others. A VAR model of order d is deﬁned as xt = A1xt 1 + A2xt 2 + · · · + Adxt d + ✏t , (2) where xt 2 Rp denotes a multivariate time series, Ak 2 Rp⇥p , k = 1, . . . , d are the parameters of the model, and d 1 is the order of the model. In this work, we as- sume that the noise ✏t 2 Rp follows a Gaussian distribu- tion, ✏t ⇠ N(0, ⌃), with E(✏t✏T t ) = ⌃ and E(✏t✏T t+⌧ ) = 0, for ⌧ 6= 0. The VAR process is assumed to be stable and stationary (Lutkepohl, 2007), while the noise covariance matrix ⌃ is assumed to be positive deﬁnite with bounded largest eigenvalue, i.e., ⇤min(⌃) > 0 and ⇤max(⌃) < 1. In the current context, the parameters {Ak} are assumed Aj Theorem 4.3 Let ⌦R = {u 2 R |R(u)  1} of set ⌦R for g ⇠ N(0, I). For any ✏1 > 0 an log(p)) we can establish that R⇤  1 N ZT ✏  ✓ c2( where c, c1 and c2 are positive constants. To establish restricted eigenvalue condition, ⌫ > 0 and then set p N = ⌫. Theorem 4.4 Let ⇥ = cone(⌦Ej ) Sdp 1, w ⌦Ej = n j 2 Rdp R( ⇤ j + j)  R( ⇤ j ) + for j is of size dp ⇥ 1, and ⇤ = [ ⇤T 1 . . . ⇤ p in ⌦E = ⌦E1 ⇥ · · · ⇥ ⌦Ep due to the assum 14
15. 15. • 3.4 • • ⇤T 1 . . . ⇤T p ]T , for ⇤ j 2 Rdp . The set decomposition in ⌦E = ⌦E1 ⇥ · · · ⇥ mption on the row-wise separability of Also deﬁne w(⇥) = E[sup u2⇥ hg, ui] to of set ⇥ for g ⇠ N(0, I) and u 2 bability at least 1 c1 exp( c2⌘2 + > 0, inf 2cone(⌦E ) ||(Ip⇥p⌦X) ||2 || ||2 ⌫, 2 p M cw(⇥) ⌘ and c, c1, c2 are nd L, M are deﬁned in (9) and (13). we can choose ⌘ = 1 2 p NL and set p M cw(⇥) ⌘ and since p N > 0 e can establish a lower bound on the N: p N > 2 p M+cw(⇥) p L/2 = O(w(⇥)). d and using (9) and (13), we can con- er of samples needed to satisfy the condition is smaller if ⇤min(A) and w(⌦R) p N = w (⌦R) N2 , which happens at N w(⌦R). Therefore, we can rewrite ou lows: once the restricted eigenvalue con N O ⇣ w(⌦R) p N ⌘ , the error norm is u k k2  O ⇣ w(⌦R) p N ⌘ (cone(⌦Ej )). 3.3. Special Cases While the presented results are valid fo separable along the rows of Ak, it is instr ize our analysis to a few popular regula such as L1 and Group Lasso, Sparse G OWL norms. 3.3.1. LASSO To establish results for L1 norm, we ass rameter ⇤ is s-sparse, which in our case resent the largest number of non-zero ele i = 1, . . . , p, i.e., the combined i-th r ⌦Ej is a part of the decomposition in ⌦E = ⌦Ep due to the assumption on the row-wise norm R(·) in (5). Also deﬁne w(⇥) = E be a Gaussian width of set ⇥ for g ⇠ N Rdp . Then with probability at least 1 c log(p)), for any ⌘ > 0, inf 2cone(⌦E ) ||(Ip⇥p || where ⌫ = p NL 2 p M cw(⇥) ⌘ a positive constants, and L, M are deﬁned in 3.2. Discussion From Theorem 3.4, we can choose ⌘ = 1 2p N = p NL 2 p M cw(⇥) ⌘ and s must be satisﬁed, we can establish a lowe number of samples N: p N > 2 p M+cw(⇥ p L/2 Examining this bound and using (9) and (1 clude that the number of samples needed restricted eigenvalue condition is smaller i 1, j = 1, . . . , p, and = [ 1 , . . . , p ] , for j is of size dp ⇥ 1, and ⇤ = [ ⇤T 1 . . . ⇤T p ]T , for ⇤ j 2 Rdp . The set ⌦Ej is a part of the decomposition in ⌦E = ⌦E1 ⇥ · · · ⇥ ⌦Ep due to the assumption on the row-wise separability of norm R(·) in (5). Also deﬁne w(⇥) = E[sup u2⇥ hg, ui] to be a Gaussian width of set ⇥ for g ⇠ N(0, I) and u 2 Rdp . Then with probability at least 1 c1 exp( c2⌘2 + log(p)), for any ⌘ > 0, inf 2cone(⌦E ) ||(Ip⇥p⌦X) ||2 || ||2 ⌫, where ⌫ = p NL 2 p M cw(⇥) ⌘ and c, c1, c2 are positive constants, and L, M are deﬁned in (9) and (13). 3.2. Discussion From Theorem 3.4, we can choose ⌘ = 1 2 p NL and set p N = p NL 2 p M cw(⇥) ⌘ and since p N > 0 must be satisﬁed, we can establish a lower bound on the number of samples N: p N > 2 p M+cw(⇥) p L/2 = O(w(⇥)). Examining this bound and using (9) and (13), we can con- clude that the number of samples needed to satisfy the restricted eigenvalue condition is smaller if ⇤min(A) and w( p w( low N k 3.3 W se ize su OW 3. To ra re i phere. The error set ⌦Ej is deﬁned as ⌦Ej = dp R( ⇤ j + j)  R( ⇤ j ) + 1 r R( j) o , for r > . . , p, and = [ T 1 , . . . , T p ]T , for j is of size nd ⇤ = [ ⇤T 1 . . . ⇤T p ]T , for ⇤ j 2 Rdp . The set part of the decomposition in ⌦E = ⌦E1 ⇥ · · · ⇥ o the assumption on the row-wise separability of ) in (5). Also deﬁne w(⇥) = E[sup u2⇥ hg, ui] to ssian width of set ⇥ for g ⇠ N (0, I) and u 2 n with probability at least 1 c1 exp( c2⌘2 + or any ⌘ > 0, inf 2cone(⌦E ) ||(Ip⇥p⌦X) ||2 || ||2 ⌫, = p NL 2 p M cw(⇥) ⌘ and c, c1, c2 are onstants, and L, M are deﬁned in (9) and (13). ssion orem 3.4, we can choose ⌘ = 1 2 p NL and set p NL 2 p M cw(⇥) ⌘ and since p N > 0 atisﬁed, we can establish a lower bound on the p 2 p M+cw(⇥) the number of samples N will dominate the second o by computing N for whic w(⌦R) p N = w2 (⌦R) N2 , which w(⌦R). Therefore, we c lows: once the restricted e N O ⇣ w(⌦R) p N ⌘ , the er k k2  O ⇣ w(⌦R) p N ⌘ (con 3.3. Special Cases While the presented result separable along the rows of ize our analysis to a few such as L1 and Group La OWL norms. 3.3.1. LASSO To establish results for L is a unit sphere. The error set ⌦Ej is deﬁned as ⌦Ej =n j 2 Rdp R( ⇤ j + j)  R( ⇤ j ) + 1 r R( j) o , for r > 1, j = 1, . . . , p, and = [ T 1 , . . . , T p ]T , for j is of size dp ⇥ 1, and ⇤ = [ ⇤T 1 . . . ⇤T p ]T , for ⇤ j 2 Rdp . The set ⌦Ej is a part of the decomposition in ⌦E = ⌦E1 ⇥ · · · ⇥ ⌦Ep due to the assumption on the row-wise separability of norm R(·) in (5). Also deﬁne w(⇥) = E[sup u2⇥ hg, ui] to be a Gaussian width of set ⇥ for g ⇠ N(0, I) and u 2 Rdp . Then with probability at least 1 c1 exp( c2⌘2 + log(p)), for any ⌘ > 0, inf 2cone(⌦E ) ||(Ip⇥p⌦X) ||2 || ||2 ⌫, where ⌫ = p NL 2 p M cw(⇥) ⌘ and c, c1, c2 are positive constants, and L, M are deﬁned in (9) and (13). 3.2. Discussion From Theorem 3.4, we can choose ⌘ = 1 2 p NL and set p N = p NL 2 p M cw(⇥) ⌘ and since p N > 0 the num will dom by comp w(⌦R) p N = w(⌦R). lows: on N O k k2  3.3. Spec While th separable ize our a such as OWL no 3.3.1. L deﬁne w(⇥) = E[sup u2⇥ hg, ui] to be a Gaussian width of set ⇥ for g ⇠ N probability at least 1 c1 exp( c2⌘2 + log(p)), for any ⌘ > 0 inf ||(Ip⇥p ⌦ X) ||2 ⌫, : Gaussian width n this Section we present the main results of our work, followed by the discussion on their prop lustrating some special cases based on popular Lasso and Group Lasso regularization norms. I 4 we will present the main ideas of our proof technique, with all the details delegated to the App nd D. o establish lower bound on the regularization parameter N , we derive an upper bound on R⇤[ 1 N Z or some ↵ > 0, which will establish the required relationship N ↵ R⇤[ 1 N ZT ✏]. heorem 4.3 Let ⌦R = {u 2 Rdp|R(u)  1}, and deﬁne w(⌦R) = E[ sup u2⌦R hg, ui] to be a Gaus set ⌦R for g ⇠ N(0, I). For any ✏1 > 0 and ✏2 > 0 with probability at least 1 c exp( min g(p)) we can establish that R⇤  1 N ZT ✏  ✓ c2(1+✏2) w(⌦R) p N + c1(1+✏1) w2(⌦R) N2 ◆ here c, c1 and c2 are positive constants. o establish restricted eigenvalue condition, we will show that inf 2cone(⌦E) ||(Ip⇥p⌦X) ||2 || ||2 ⌫, > 0 and then set p N = ⌫. heorem 4.4 Let ⇥ = cone(⌦Ej ) Sdp 1, where Sdp 1 is a unit sphere. The error set ⌦Ej is Ej = n j 2 Rdp R( ⇤ j + j)  R( ⇤ j ) + 1 r R( j) o , for r > 1, j = 1, . . . , p, and = [ T 1 , . ] to be a Gaussian width of set ⇥ for g ⇠ N(0, I) and u 2 Rdp. Then with p( c2⌘2 + log(p)), for any ⌘ > 0 inf ||(Ip⇥p ⌦ X) ||2 ⌫, ing Theorems 3.3 and 3.4 we can interpret the es- ed results as follows. As the size and dimension- , p and d of the problem increase, we emphasize le of the results and use the order notations to de- e constants. Select a number of samples at least O(w2 (⇥)) and let the regularization parameter sat- N O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ . With high probability e restricted eigenvalue condition ||Z ||2 || ||2 p N 2 cone(⌦E) holds, so that  = O(1) is a pos- onstant. Moreover, the norm of the estimation er- optimization problem (4) is bounded by k k2  ⌦R) N + w2 (⌦R) N2 ⌘ (cone(⌦Ej )). Note that the norm ibility constant (cone(⌦Ej )) is assumed to be the or all j = 1, . . . , p, which follows from our assump- (5). er now Theorem 3.3 and the bound on the regu-⇣ w(⌦R) p w2 (⌦R) ⌘ 15
16. 16. • p   •   • • • ion on the row-wise separability of o deﬁne w(⇥) = E[sup u2⇥ hg, ui] to set ⇥ for g ⇠ N(0, I) and u 2 bility at least 1 c1 exp( c2⌘2 + 0, inf 2cone(⌦E) ||(Ip⇥p⌦X) ||2 || ||2 ⌫, p M cw(⇥) ⌘ and c, c1, c2 are L, M are deﬁned in (9) and (13). can choose ⌘ = 1 2 p NL and set cw(⇥) ⌘ and since p N > 0 an establish a lower bound on the p N > 2 p M+cw(⇥) p L/2 = O(w(⇥)). nd using (9) and (13), we can con- of samples needed to satisfy the ndition is smaller if ⇤min(A) and lows: once the restricted eigenvalue condi N O ⇣ w(⌦R) p N ⌘ , the error norm is upp k k2  O ⇣ w(⌦R) p N ⌘ (cone(⌦Ej )). 3.3. Special Cases While the presented results are valid for a separable along the rows of Ak, it is instruc ize our analysis to a few popular regulariz such as L1 and Group Lasso, Sparse Gro OWL norms. 3.3.1. LASSO To establish results for L1 norm, we assum rameter ⇤ is s-sparse, which in our case is resent the largest number of non-zero elem i = 1, . . . , p, i.e., the combined i-th row norm R(·) in (5). Also deﬁne w(⇥) = E[s u be a Gaussian width of set ⇥ for g ⇠ N(0 Rdp . Then with probability at least 1 c1 e log(p)), for any ⌘ > 0, inf 2cone(⌦E) ||(Ip⇥p⌦ || | where ⌫ = p NL 2 p M cw(⇥) ⌘ and positive constants, and L, M are deﬁned in (9 3.2. Discussion From Theorem 3.4, we can choose ⌘ = 1 2 p p N = p NL 2 p M cw(⇥) ⌘ and sin must be satisﬁed, we can establish a lower b number of samples N: p N > 2 p M+cw(⇥) p L/2 Examining this bound and using (9) and (13), clude that the number of samples needed t restricted eigenvalue condition is smaller if ⇤ cw(⇥) ⌘ and c, c1, c2 are positive constants, and L and M are deﬁned in (9) choose ⌘ = 1 2 p NL and set p N = p NL 2 p M cw(⇥) ⌘ and since we can establish a lower bound on the number of samples N p N > 2 p M + cw(⇥) p L/2 = O(w(⇥)). (18) sing (9) and (13), we can conclude that the number of samples needed to satisfy dition is smaller if ⇤min(A) and ⇤min(⌃) are larger and ⇤max(A) and ⇤max(⌃) eans that matrices A and A in (10) and (12) must be well conditioned and the eigenvalues well inside the unit circle (see Section 3.2). Alternatively, we can wing that large values of M and small values of L indicate stronger dependency ore samples for the RE conditions to hold with high probability. d 4.4 we can interpret the established results as follows. As the size and dimen- oblem increase, we emphasize the scale of the results and use the order notations ect a number of samples at least N O(w2(⇥)) and let the regularization pa- 2 ⌘ Ej ⌦Ep due to the assumption on t norm R(·) in (5). Also deﬁne be a Gaussian width of set ⇥ Rdp . Then with probability at log(p)), for any ⌘ > 0, in 2co where ⌫ = p NL 2 p M cw positive constants, and L, M ar 3.2. Discussion From Theorem 3.4, we can ch p N = p NL 2 p M cw(⇥ must be satisﬁed, we can estab number of samples N: p N > Examining this bound and usin clude that the number of sam restricted eigenvalue condition probability at least 1 c1 exp( c2⌘ + log(p)), for any ⌘ > 0 inf 2cone(⌦E) ||(Ip⇥p ⌦ X) ||2 || ||2 ⌫, where ⌫ = p NL 2 p M cw(⇥) ⌘ and c, c1, c2 are positive constants, and L and M are deﬁne and (13). 4.2 Discussion From Theorem 4.4, we can choose ⌘ = 1 2 p NL and set p N = p NL 2 p M cw(⇥) ⌘ an p N > 0 must be satisﬁed, we can establish a lower bound on the number of samples N p N > 2 p M + cw(⇥) p L/2 = O(w(⇥)). Examining this bound and using (9) and (13), we can conclude that the number of samples needed to the restricted eigenvalue condition is smaller if ⇤min(A) and ⇤min(⌃) are larger and ⇤max(A) and ⇤ are smaller. In turn, this means that matrices A and A in (10) and (12) must be well conditioned VAR process is stable, with eigenvalues well inside the unit circle (see Section 3.2). Alternatively, also understand (18) as showing that large values of M and small values of L indicate stronger depe in the data, thus requiring more samples for the RE conditions to hold with high probability. Analyzing Theorems 4.3 and 4.4 we can interpret the established results as follows. As the size and sionality N, p and d of the problem increase, we emphasize the scale of the results and use the order no to denote the constants. Select a number of samples at least N O(w2(⇥)) and let the regularizat rameter satisfy N O ⇣ w(⌦R) p + w2(⌦R) 2 ⌘ . With high probability then the restricted eigenvalue co and 3.4 we can interpret the es- ws. As the size and dimension- problem increase, we emphasize nd use the order notations to de- ct a number of samples at least the regularization parameter sat- w2 (⌦R) N2 ⌘ . With high probability value condition ||Z ||2 || ||2 p N ds, so that  = O(1) is a pos- , the norm of the estimation er- em (4) is bounded by k k2  cone(⌦Ej )). Note that the norm cone(⌦Ej )) is assumed to be the which follows from our assump- 3.3 and the bound on the regu- O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ . As e problem p and d grows and r 1, j = 1, . . . , p, and = [ T 1 , . . . , T p ]T , for j is of size dp ⇥ 1, and ⇤ = [ ⇤T 1 . . . ⇤T p ]T , for ⇤ j 2 Rdp . The set ⌦Ej is a part of the decomposition in ⌦E = ⌦E1 ⇥ · · · ⇥ ⌦Ep due to the assumption on the row-wise separability of norm R(·) in (5). Also deﬁne w(⇥) = E[sup u2⇥ hg, ui] to be a Gaussian width of set ⇥ for g ⇠ N(0, I) and u 2 Rdp . Then with probability at least 1 c1 exp( c2⌘2 + log(p)), for any ⌘ > 0, inf 2cone(⌦E ) ||(Ip⇥p⌦X) ||2 || ||2 ⌫, where ⌫ = p NL 2 p M cw(⇥) ⌘ and c, c1, c2 are positive constants, and L, M are deﬁned in (9) and (13). 3.2. Discussion From Theorem 3.4, we can choose ⌘ = 1 2 p NL and set p N = p NL 2 p M cw(⇥) ⌘ and since p N > 0 must be satisﬁed, we can establish a lower bound on the number of samples N: p N > 2 p M+cw(⇥) p = O(w(⇥)). by w(⌦ p N w(⌦ low N k k 3.3. Wh sepa ize such OW 3.3 To ), for any ⌘ > 0 ||(Ip⇥p ⌦ X) ||2 || ||2 ⌫, c1, c2 are positive constants, and L and M are deﬁned in (9) L and set p N = p NL 2 p M cw(⇥) ⌘ and since lower bound on the number of samples N M + cw(⇥) p L/2 = O(w(⇥)). (18) we can conclude that the number of samples needed to satisfy ⇤min(A) and ⇤min(⌃) are larger and ⇤max(A) and ⇤max(⌃) A and A in (10) and (12) must be well conditioned and the 0 < = E[sup u2⇥ hg, ui] to be a Gaussian width of set ⇥ for g ⇠ N(0, I) and u 2 R least 1 c1 exp( c2⌘2 + log(p)), for any ⌘ > 0 inf 2cone(⌦E) ||(Ip⇥p ⌦ X) ||2 || ||2 ⌫, NL 2 p M cw(⇥) ⌘ and c, c1, c2 are positive constants, and L and M ar sion m 4.4, we can choose ⌘ = 1 2 p NL and set p N = p NL 2 p M cw(⇥) ust be satisﬁed, we can establish a lower bound on the number of samples N p N > 2 p M + cw(⇥) p L/2 = O(w(⇥)). is bound and using (9) and (13), we can conclude that the number of samples n eigenvalue condition is smaller if ⇤min(A) and ⇤min(⌃) are larger and ⇤max(A n turn, this means that matrices A and A in (10) and (12) must be well cond16