
Side 2019 #10

SIdE (Italian Econometric Association) Summer School, on Machine Learning Algorithms for Econometricians #10


  1. #10 Time Series and Forecasting. Arthur Charpentier (Université du Québec à Montréal), Machine Learning & Econometrics, SIDE Summer School, July 2019. @freakonometrics, freakonometrics.hypotheses.org
  2. Time Series. A time series is a sequence of observations $(y_t)$ ordered in time. Write $y_t = s_t + u_t$, with systematic part $s_t$ (signal / trend) and 'residual' term $u_t$; $(u_t)$ is supposed to be a strictly stationary time series, while $(s_t)$ might be a 'linear' trend plus a seasonal cycle. See Buys-Ballot (1847, Les changements périodiques de température, dépendants de la nature du soleil et de la lune, mis en rapport avec le pronostic du temps, déduits d'observations néerlandaises de 1729 à 1846); the original is probably in Dutch.
  3. Time Series. Consider the general prediction ${}_t\hat y_{t+h} = m(\text{information available at time } t)$.
      hp <- read.csv("http://freakonometrics.free.fr/multiTimeline.csv", skip = 2)
      T <- 86 - 24
      trainY <- ts(hp[1:T, 2], frequency = 12, start = c(2012, 6))
      validY <- ts(hp[(T + 1):nrow(hp), 2], frequency = 12, start = c(2017, 8))
  4. Time Series. In $y_t = s_t + u_t$, $s_t$ can be a trend plus a seasonal cycle.
      stats::decompose(trainY)
  5. Time Series. The Buys-Ballot model is based on $y_t = s_t + u_t$, with $s_t = \beta_0 + \beta_1 t + \sum_{h=1}^{12} \gamma_h\,\mathbf{1}(t \bmod 12 = h)$, i.e. a linear trend plus monthly effects.
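     As an illustration of the Buys-Ballot regression above (linear trend plus monthly dummies), here is a minimal R sketch fitted on the trainY series defined earlier; the object names bb_df and bb_fit are purely illustrative.
      # Buys-Ballot regression: linear trend + monthly (seasonal) dummies
      bb_df <- data.frame(
        y     = as.numeric(trainY),
        t     = seq_along(trainY),
        month = factor(cycle(trainY))   # seasonal index 1..12
      )
      bb_fit <- lm(y ~ t + month, data = bb_df)
      summary(bb_fit)
      s_hat <- fitted(bb_fit)           # fitted systematic part s_t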
  6. Time Series: Linear Trend? Various (machine learning) techniques can be used, such as Laurinec (2017, Using regression trees for forecasting double-seasonal time series with trend in R), Bontempi et al. (2012, Machine Learning Strategies for Time Series Forecasting), Dietterich (2002, Machine Learning for Sequential Data: A Review).
  7. Time Series: Exponential Smoothing. “when Gardner (2005) appeared, many believed that exponential smoothing should be disregarded because it was either a special case of ARIMA modeling or an ad hoc procedure with no statistical rationale. As McKenzie (1985) observed, this opinion was expressed in numerous references to my paper. Since 1985, the special case argument has been turned on its head, and today we know that exponential smoothing methods are optimal for a very general class of state-space models that is in fact broader than the ARIMA class.” from Hyndman et al. (2008, Forecasting with Exponential Smoothing)
  8. Time Series: Exponential Smoothing. Simple exponential smoothing: from the time series $(y_t)$ define a smoothed version $s_t = \alpha\,y_t + (1-\alpha)\,s_{t-1} = s_{t-1} + \alpha\,(y_t - s_{t-1})$ for some $\alpha\in(0,1)$ and starting point $s_0 = y_1$. The forecast is ${}_t\hat y_{t+h} = s_t$. It is called exponential smoothing since $s_t = \alpha y_t + (1-\alpha)s_{t-1} = \alpha y_t + \alpha(1-\alpha)y_{t-1} + (1-\alpha)^2 s_{t-2} = \alpha\big[y_t + (1-\alpha)y_{t-1} + (1-\alpha)^2 y_{t-2} + \cdots + (1-\alpha)^{t-1}y_1\big] + (1-\alpha)^t s_0$, which corresponds to an exponentially weighted moving average. Cross-validation techniques need to be adapted.
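     A minimal sketch of simple exponential smoothing in R, both by hand (transcribing the recursion above) and through stats::HoltWinters() with trend and seasonality switched off; ses_hand and ses_fit are illustrative names.
      # simple exponential smoothing, by hand
      ses_hand <- function(y, alpha) {
        s <- numeric(length(y))
        s[1] <- y[1]                    # starting point s_0 = y_1
        for (t in 2:length(y)) s[t] <- alpha * y[t] + (1 - alpha) * s[t - 1]
        s
      }
      s_hat <- ses_hand(as.numeric(trainY), alpha = 0.3)
      # same model via stats::HoltWinters (no trend, no seasonal component)
      ses_fit <- HoltWinters(trainY, beta = FALSE, gamma = FALSE)
      ses_fit$alpha   # alpha chosen by minimizing one-step-ahead squared errors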
  9-11. Time Series: Exponential Smoothing. Optimal $\alpha$? $\hat\alpha \in \underset{\alpha}{\operatorname{argmin}}\left\{\sum_{t=2}^{T}\big(y_t - {}_{t-1}\hat y_t\big)^2\right\}$ (leave-one-out strategy). (Three slides with identical text, differing only in the accompanying figures, which are not reproduced here.)
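     A hedged sketch of this selection of $\alpha$: a grid search scoring each candidate value by the sum of squared one-step-ahead prediction errors on trainY; sse_one_step and alphas are illustrative names.
      # one-step-ahead sum of squared errors for a given alpha
      sse_one_step <- function(alpha, y) {
        s <- y[1]
        sse <- 0
        for (t in 2:length(y)) {
          sse <- sse + (y[t] - s)^2     # the forecast of y_t is s_{t-1}
          s <- alpha * y[t] + (1 - alpha) * s
        }
        sse
      }
      alphas <- seq(0.01, 0.99, by = 0.01)
      sse <- sapply(alphas, sse_one_step, y = as.numeric(trainY))
      alphas[which.min(sse)]            # optimal alpha on the grid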
  12-14. See Hyndman et al. (2008, Forecasting with Exponential Smoothing). Double exponential smoothing: from the time series $(y_t)$ define a smoothed version $s_t = \alpha y_t + (1-\alpha)(s_{t-1} + b_{t-1})$ and $b_t = \beta(s_t - s_{t-1}) + (1-\beta)b_{t-1}$, for some $\alpha\in(0,1)$, some trend parameter $\beta\in(0,1)$, and starting points $s_0 = y_0$ and $b_0 = y_1 - y_0$. The forecast is ${}_t\hat y_{t+h} = s_t + h\cdot b_t$. (Three slides with identical text, differing only in the accompanying figures.)
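     A minimal sketch of double (Holt) exponential smoothing on trainY, using stats::HoltWinters() with the seasonal component switched off; holt_fit is an illustrative name.
      # Holt's linear trend method: level + trend, no seasonal component
      holt_fit <- HoltWinters(trainY, gamma = FALSE)
      c(holt_fit$alpha, holt_fit$beta)  # estimated smoothing parameters
      library(forecast)
      plot(forecast(holt_fit, h = 30))  # forecasts s_t + h * b_t
      lines(validY, col = "red")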
  15. Time Series: Exponential Smoothing. Seasonal exponential smoothing with lag $L$ (Holt-Winters): from the time series $(y_t)$ define a smoothed version
      $s_t = \alpha\,\dfrac{y_t}{c_{t-L}} + (1-\alpha)(s_{t-1} + b_{t-1})$,
      $b_t = \beta(s_t - s_{t-1}) + (1-\beta)b_{t-1}$,
      $c_t = \gamma\,\dfrac{y_t}{s_t} + (1-\gamma)c_{t-L}$,
      for some $\alpha\in(0,1)$, some trend parameter $\beta\in(0,1)$, some seasonal smoothing factor $\gamma\in(0,1)$, and starting point $s_0 = y_0$. The forecast is ${}_t\hat y_{t+h} = (s_t + h\,b_t)\,c_{t-L+1+(h-1)\bmod L}$. See stats::HoltWinters().
  16. Time Series: Exponential Smoothing.
      hw_fit <- stats::HoltWinters(trainY)
      library(forecast)
      plot(forecast(hw_fit, h = 30))
      lines(validY, col = "red")
  17. Time Series: State Space Models. See De Livera, Hyndman & Snyder (2011, Forecasting Time Series With Complex Seasonal Patterns Using Exponential Smoothing), based on a Box-Cox transformation of $y_t$: $y_t^{(\lambda)} = \dfrac{y_t^{\lambda} - 1}{\lambda}$ if $\lambda \neq 0$ (otherwise $\log y_t$). See forecast::tbats().
      library(forecast)
      forecast::tbats(trainY)$lambda
      # [1] 0.2775889
  18. Time Series: State Space Models.
      library(forecast)
      tbats_fit <- tbats(trainY)
      plot(forecast(tbats_fit, h = 30))
      lines(validY, col = "red")
  19. Time Series: State Space Models. Exponential smoothing state space model with Box-Cox transformation, ARMA errors, Trend and Seasonal components (TBATS):
      $y_t^{(\lambda)} = \ell_{t-1} + \phi b_{t-1} + \sum_{i=1}^{T} s^{(i)}_{t-m_i} + d_t$
      where
      • $(\ell_t)$ is some local level, $\ell_t = \ell_{t-1} + \phi b_{t-1} + \alpha d_t$
      • $(b_t)$ is some trend with damping, $b_t = \phi b_{t-1} + \alpha d_t$
      • $(d_t)$ is some ARMA process for the stationary component, $d_t = \sum_{i=1}^{p}\varphi_i d_{t-i} + \varepsilon_t + \sum_{j=1}^{q}\theta_j \varepsilon_{t-j}$
      • $(s^{(i)}_t)$ is the $i$-th seasonal component
  20. Time Series: State Space Models. Let $x_t$ denote the state variables (e.g. level, slope, seasonal). Classical statistical approach: compute the likelihood from the errors $\varepsilon_1, \cdots, \varepsilon_T$; see forecast::ets(). Innovations state space models: let $x_t = (s_t, b_t, c_t)$ and suppose the $\varepsilon_t$ are i.i.d. $\mathcal N(0, \sigma^2)$.
      State equation: $x_t = f(x_{t-1}) + g(x_{t-1})\varepsilon_t$
      Observation equation: $y_t = \mu_t + e_t = h(x_{t-1}) + \sigma(x_{t-1})\varepsilon_t$
      Inference is based on $\log L = n \log\left(\sum_{t=1}^{T} \dfrac{\varepsilon_t^2}{\sigma(x_{t-1})}\right) + 2\sum_{t=1}^{T} \log\big|\sigma(x_{t-1})\big|$.
      One can use time series cross-validation, evaluated on a rolling forecast origin.
  21. Time Series: State Space Models.
      library(forecast)
      ets_fit <- forecast::ets(trainY)
      plot(forecast(ets_fit, h = 30))
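     To compare the Holt-Winters, TBATS and ETS fits above on the held-out validY series, one possible sketch uses forecast::accuracy() (hw_fit, tbats_fit and ets_fit are the objects fitted on the previous slides).
      library(forecast)
      h <- length(validY)
      accuracy(forecast(hw_fit, h = h), validY)     # Holt-Winters
      accuracy(forecast(tbats_fit, h = h), validY)  # TBATS
      accuracy(forecast(ets_fit, h = h), validY)    # ETS
      # compare e.g. the RMSE / MAE columns of the "Test set" rows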
  22. Time Series: Automatic ARIMA. Consider a general seasonal ARIMA process, $\Phi_s(L^s)\,\Phi(L)\,(1-L)^d (1-L^s)^{d_s}\, y_t = c + \Theta_s(L^s)\,\Theta(L)\,\varepsilon_t$. See forecast::auto.arima() (a drift term can be allowed with allowdrift = TRUE). “Automatic algorithms will become more general - handling a wide variety of time series” (Rob Hyndman)
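     A hedged sketch of automatic seasonal ARIMA selection on trainY with forecast::auto.arima(); arima_fit is an illustrative name.
      library(forecast)
      arima_fit <- auto.arima(trainY, allowdrift = TRUE)
      summary(arima_fit)                # selected (p,d,q)(P,D,Q)[12] orders
      plot(forecast(arima_fit, h = 30))
      lines(validY, col = "red")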
  23. Time Series: RNN (recurrent neural nets). RNN: recurrent neural network, a class of neural networks where connections between nodes form a directed graph along a temporal sequence. Recurrent neural networks are networks with loops, allowing information to persist. Classical neural net: $y_i = m(\boldsymbol x_i)$. Recurrent neural net: $y_t = m(\boldsymbol x_t, y_{t-1}) = m(\boldsymbol x_t, m(\boldsymbol x_{t-1}, y_{t-2})) = \cdots$
  24. Time Series: RNN (recurrent neural nets). $A$ is the neural net, $h$ is the output ($y$) and $x$ some covariates (source https://colah.github.io/). See Sutskever (2017, Training Recurrent Neural Networks). From recurrent networks to LSTM.
  25. Time Series: RNN and LSTM. (Source: Greff et al. (2017, LSTM: A Search Space Odyssey).) See Hochreiter & Schmidhuber (1997, Long Short-Term Memory).
  26. Time Series: RNN and LSTM. A classical RNN (with a single layer) is shown in the figure (source https://colah.github.io/). “In theory, RNNs are absolutely capable of handling such ‘long-term dependencies’. A human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don’t seem to be able to learn them”, see Bengio et al. (1994, Learning long-term dependencies with gradient descent is difficult).
  27. Time Series: RNN and LSTM. “RNNs can keep track of arbitrary long-term dependencies in the input sequences. The problem of ‘vanilla RNNs’ is computational (or practical) in nature: when training a vanilla RNN using back-propagation, the gradients which are back-propagated can ‘vanish’ (that is, they can tend to zero) or ‘explode’ (that is, they can tend to infinity), because of the computations involved in the process” (from Wikipedia)
  28. Time Series: LSTM. $C$ is the long-term state, $H$ is the short-term state.
      forget gate: $f_t = \operatorname{sigmoid}(A_f[h_{t-1}, x_t] + b_f)$
      input gate: $i_t = \operatorname{sigmoid}(A_i[h_{t-1}, x_t] + b_i)$
      new memory cell: $\tilde c_t = \tanh(A_c[h_{t-1}, x_t] + b_c)$
  29. Time Series: LSTM.
      final memory cell: $c_t = f_t \cdot c_{t-1} + i_t \cdot \tilde c_t$
      output gate: $o_t = \operatorname{sigmoid}(A_o[h_{t-1}, x_t] + b_o)$, with $h_t = o_t \cdot \tanh(c_t)$
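     A minimal sketch of one LSTM cell update in base R, directly transcribing the gate equations of the last two slides; the weight matrices and biases are random placeholders, purely for illustration.
      sigmoid <- function(z) 1 / (1 + exp(-z))
      lstm_step <- function(x_t, h_prev, c_prev, W, b) {
        z <- c(h_prev, x_t)                 # [h_{t-1}, x_t]
        f <- sigmoid(W$f %*% z + b$f)       # forget gate
        i <- sigmoid(W$i %*% z + b$i)       # input gate
        ct <- tanh(W$c %*% z + b$c)         # new memory cell
        c_new <- f * c_prev + i * ct        # final memory cell
        o <- sigmoid(W$o %*% z + b$o)       # output gate
        h_new <- o * tanh(c_new)            # short-term state
        list(h = h_new, c = c_new)
      }
      # toy dimensions: 1 covariate, hidden size 3, random weights
      d <- 1; k <- 3
      W <- setNames(lapply(1:4, function(j) matrix(rnorm(k * (k + d)), k, k + d)),
                    c("f", "i", "c", "o"))
      b <- setNames(lapply(1:4, function(j) rnorm(k)), c("f", "i", "c", "o"))
      lstm_step(x_t = 0.5, h_prev = rep(0, k), c_prev = rep(0, k), W, b)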
  30. Elicitable Measures & Forecasting. “Elicitable” means “being a minimizer of a suitable expected score”, see Gneiting (2011, Making and Evaluating Point Forecasts). $T$ is an elicitable functional if there exists a scoring function $S : \mathbb R\times\mathbb R \to [0,\infty)$ such that $T(Y) = \underset{x\in\mathbb R}{\operatorname{argmin}}\int_{\mathbb R} S(x,y)\,dF(y) = \underset{x\in\mathbb R}{\operatorname{argmin}}\;\mathbb E\big[S(x,Y)\big]$ where $Y\sim F$.
      Example: the mean, $T(Y) = \mathbb E[Y]$, is elicited by $S(x,y) = \|x-y\|_2^2$
      Example: the median, $T(Y) = \operatorname{median}[Y]$, is elicited by $S(x,y) = \|x-y\|_1$
      Example: the quantile, $T(Y) = Q_Y(\tau)$, is elicited by $S(x,y) = \tau(y-x)_+ + (1-\tau)(y-x)_-$
      Example: the expectile, $T(Y) = E_Y(\tau)$, is elicited by $S(x,y) = \tau(y-x)_+^2 + (1-\tau)(y-x)_-^2$
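     As a quick numerical illustration of elicitability, a sketch minimizing the empirical quantile score on a simulated sample and comparing the minimizer to the empirical quantile; quantile_score is an illustrative helper.
      set.seed(1)
      y <- rnorm(1e4)
      tau <- 0.9
      # pinball / quantile score S(x, y) = tau*(y-x)_+ + (1-tau)*(y-x)_-
      quantile_score <- function(x, y, tau) {
        mean(tau * pmax(y - x, 0) + (1 - tau) * pmax(x - y, 0))
      }
      opt <- optimize(quantile_score, interval = range(y), y = y, tau = tau)
      opt$minimum        # minimizer of the expected score ...
      quantile(y, tau)   # ... close to the empirical 90% quantile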
  31. Forecasts and Predictions. Mathematical statistics is based on inference and testing, using probabilistic properties. If a model can reproduce past observations, it is supposed to provide good predictions. Why not consider a collection of scenarios likely to occur over a given time horizon (drawing from a (predictive) probability distribution)? The closer the forecast $\hat y_t$ is to the observed $y_t$, the better the model, either according to the 1-norm, $|y_t - \hat y_t|$, or to the 2-norm, $(y_t - \hat y_t)^2$. While this is interesting information about the central tendency, it cannot be used to anticipate extremal events.
  32. Forecasts and Predictions. More formally, we try to compare two very different objects: a function (the predictive probability distribution) and a real number (the observed value). A natural idea is to introduce a score, as in Good (1952, Rational Decisions) or Winkler (1969, Scoring Rules and the Evaluation of Probability Assessors), used in meteorology by Murphy & Winkler (1987, A General Framework for Forecast Verification). Let $F$ denote the predictive distribution, expressing the uncertainty attributed to future values, conditional on the available information.
  33. Probabilistic Forecasts. Notion of probabilistic forecasts, Gneiting & Raftery (2007, Strictly Proper Scoring Rules, Prediction, and Estimation). In a general setting, we want to predict the value taken by a random variable $Y$. Let $F$ denote a cumulative distribution function and $\mathcal A$ the information available when the forecast is made. $F$ is the ideal forecast for $Y$ given $\mathcal A$ if the law of $Y\mid\mathcal A$ has distribution $F$. Suppose $F$ is continuous and set $Z_F = F(Y)$, the probability integral transform of $Y$. $F$ is probabilistically calibrated if $Z_F \sim \mathcal U([0,1])$; $F$ is marginally calibrated if $\mathbb E[F(y)] = \mathbb P[Y\le y]$ for any $y\in\mathbb R$.
  34. Probabilistic Forecasts. Observe that for an ideal forecast, $F(y) = \mathbb P[Y\le y\mid\mathcal A]$, then
      • $\mathbb E[F(y)] = \mathbb E\big[\mathbb P[Y\le y\mid\mathcal A]\big] = \mathbb P[Y\le y]$: this forecast is marginally calibrated
      • $\mathbb P[Z_F\le z] = \mathbb E\big[\mathbb P[Z_F\le z\mid\mathcal A]\big] = z$: this forecast is probabilistically calibrated
      Suppose $\mu\sim\mathcal N(0,1)$ and that the ideal forecast is $Y\mid\mu\sim\mathcal N(\mu,1)$, e.g. if $Y_t\sim\mathcal N(0,1)$ and $Y_{t+1} = y_t + \varepsilon_t \sim\mathcal N(y_t,1)$. One can consider $F = \mathcal N(0,2)$ as a naïve forecast: this distribution is marginally calibrated and probabilistically calibrated (the ideal forecast here being the conditional one, $\mathcal N(\mu,1)$). One can consider for $F$ a mixture of $\mathcal N(\mu, 2)$ and $\mathcal N(\mu\pm 1, 2)$, where “$\pm 1$” means $+1$ or $-1$ each with probability $1/2$ (a 'hesitating' forecast): this distribution is probabilistically calibrated, but not marginally calibrated.
  35. Probabilistic Forecasts. Indeed $\mathbb P[F(Y)\le u] = u$, writing
      $\mathbb P[F(Y)\le u] = \dfrac{\mathbb P[\Phi(Y)\le u] + \mathbb P[\Phi(Y+1)\le u]}{2} + \dfrac{\mathbb P[\Phi(Y)\le u] + \mathbb P[\Phi(Y-1)\le u]}{2}$
      One can consider $F = \mathcal N(-\mu, 1)$: this distribution is marginally calibrated, but not probabilistically calibrated. In practice, we have a sequence $(Y_t, F_t)$ of pairs. The set of forecasts $(F_t)$ is said to perform well if, for all $t$, the predictive distributions $F_t$ are precise (sharp) and well calibrated. Precision relates to the concentration of the predictive density around a central value (the degree of uncertainty). Calibration relates to the coherence between the predictive distribution $F_t$ and the observations $y_t$.
  36. Probabilistic Forecasts. Calibration is poor if 80%-confidence intervals (implied from the predictive distributions, i.e. $\big(F_t^{-1}(\alpha), F_t^{-1}(1-\alpha)\big)$) do not contain the $y_t$'s about 8 times out of 10. To test marginal calibration, compare the empirical cumulative distribution function $\bar G(y) = \lim_{n\to\infty}\frac1n\sum_{t=1}^{n}\mathbf 1_{Y_t\le y}$ with the average of the predictive distributions $\bar F(y) = \lim_{n\to\infty}\frac1n\sum_{t=1}^{n}F_t(y)$. To test probabilistic calibration, test whether the sample $\{F_t(Y_t)\}$ has a uniform distribution - the PIT approach, see Dawid (1984, Present Position and Potential Developments: The Prequential Approach).
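     A hedged sketch of a PIT check for probabilistic calibration, using the Gaussian one-step-ahead predictive distributions of an AR(1) model fitted on a simulated series; all object names are illustrative.
      set.seed(42)
      y <- arima.sim(list(ar = 0.7), n = 500)
      fit <- arima(y, order = c(1, 0, 0))
      phi <- coef(fit)["ar1"]; mu <- coef(fit)["intercept"]; s <- sqrt(fit$sigma2)
      # one-step-ahead predictive distribution F_t = N(mu + phi*(y_{t-1} - mu), s^2)
      pit <- pnorm(y[-1], mean = mu + phi * (y[-length(y)] - mu), sd = s)
      hist(pit, breaks = 20, main = "PIT values")
      # a roughly flat histogram suggests probabilistic calibration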
  37. Probabilistic Forecasts. One can also consider a score $S(F, y)$ for any distribution $F$ and any observation $y$. The score is said to be proper if $\mathbb E[S(G, Y)] \le \mathbb E[S(F, Y)]$ for all $F, G$, where $Y\sim G$. In practice, this expected value is approximated using $\frac1n\sum_{t=1}^{n} S(F_t, Y_t)$. One classical rule is the logarithmic score, $S(F, y) = -\log F'(y)$ if $F$ is (absolutely) continuous with density $F'$. Another classical rule is the continuous ranked probability score (CRPS, see Hersbach (2000, Decomposition of the Continuous Ranked Probability Score for Ensemble Prediction Systems)), $S(F, y) = \int_{-\infty}^{+\infty}\big(F(x) - \mathbf 1_{x\ge y}\big)^2\,dx = \int_{-\infty}^{y}F(x)^2\,dx + \int_{y}^{+\infty}\big(F(x)-1\big)^2\,dx$.
  38. Probabilistic Forecasts. The empirical version is $\bar S = \frac1n\sum_{t=1}^{n} S(F_t, y_t) = \frac1n\sum_{t=1}^{n}\int_{-\infty}^{+\infty}\big(F_t(x) - \mathbf 1_{x\ge y_t}\big)^2\,dx$, studied in Murphy (1970, The ranked probability score and the probability score: a comparison). This rule is proper since $\mathbb E[S(F, Y)] = \int_{-\infty}^{\infty}\mathbb E\big[(F(x) - \mathbf 1_{x\ge Y})^2\big]\,dx = \int_{-\infty}^{\infty}\Big([F(x) - G(x)]^2 + G(x)[1 - G(x)]\Big)\,dx$, which is minimal when $F = G$.
  39. Probabilistic Forecasts. If $F$ is the $\mathcal N(\mu, \sigma^2)$ distribution,
      $S(F, y) = \sigma\left[\dfrac{y-\mu}{\sigma}\left(2\Phi\!\left(\dfrac{y-\mu}{\sigma}\right)-1\right) + 2\varphi\!\left(\dfrac{y-\mu}{\sigma}\right) - \dfrac{1}{\sqrt\pi}\right]$
      Observe that $S(F, y) = \mathbb E|X - y| - \frac12\,\mathbb E|X - X'|$ where $X, X'\sim F$ are independent copies, cf. Gneiting & Raftery (2007, Strictly Proper Scoring Rules, Prediction, and Estimation). If we use for $F$ the empirical cumulative distribution function $\widehat F_n(y) = \frac1n\sum_{i=1}^{n}\mathbf 1_{y_i\le y}$, then $S(\widehat F_n, y) = \dfrac2n\sum_{i=1}^{n}(y_{i:n} - y)\left(\mathbf 1_{y\le y_{i:n}} - \dfrac{i - 1/2}{n}\right)$.
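     A small sketch cross-checking the two CRPS representations above on a simulated sample, taking F to be the empirical distribution of that sample; crps_empirical and crps_expectation are illustrative helper names.
      set.seed(1)
      x <- rnorm(200)    # sample defining the empirical forecast distribution
      y0 <- 0.5          # observed value
      # S(F_n, y) = (2/n) * sum (y_{i:n} - y) * (1{y <= y_{i:n}} - (i - 1/2)/n)
      crps_empirical <- function(x, y) {
        n <- length(x); xs <- sort(x)
        (2 / n) * sum((xs - y) * ((y <= xs) - (seq_len(n) - 0.5) / n))
      }
      # S(F, y) = E|X - y| - (1/2) E|X - X'|, with X, X' independent from F
      crps_expectation <- function(x, y) {
        mean(abs(x - y)) - 0.5 * mean(abs(outer(x, x, "-")))
      }
      crps_empirical(x, y0)
      crps_expectation(x, y0)   # the two values agree up to floating-point error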
  40. Probabilistic Forecasts. Consider a Gaussian AR(p) time series, $Y_t = c + \varphi_1 Y_{t-1} + \cdots + \varphi_p Y_{t-p} + \varepsilon_t$ with $\varepsilon_t\sim\mathcal N(0, \sigma^2)$; then the forecast at horizon 1 yields $F_t \sim \mathcal N\big({}_{t-1}\hat Y_t, \sigma^2\big)$ where ${}_{t-1}\hat Y_t = c + \varphi_1 Y_{t-1} + \cdots + \varphi_p Y_{t-p}$.
  41. Probabilistic Forecasts. Suppose that $Y$ can be explained by covariates $\boldsymbol x = (x_1, \cdots, x_m)$. Consider some kernel-based conditional density estimate
      $\hat p(y\mid\boldsymbol x) = \dfrac{\hat p(y, \boldsymbol x)}{\hat p(\boldsymbol x)} = \dfrac{\sum_{i=1}^{n} K_h(y - y_i)\,K_h(\boldsymbol x - \boldsymbol x_i)}{\sum_{i=1}^{n} K_h(\boldsymbol x - \boldsymbol x_i)}$
      In the case of a linear model, there exists $\boldsymbol\theta$ such that $p(y\mid\boldsymbol x) = p(y\mid\boldsymbol\theta^{\mathsf T}\boldsymbol x)$, and
      $\hat p(y\mid\boldsymbol\theta^{\mathsf T}\boldsymbol x = s) = \dfrac{\sum_{i=1}^{n} K_h(y - y_i)\,K_h(s - \boldsymbol\theta^{\mathsf T}\boldsymbol x_i)}{\sum_{i=1}^{n} K_h(s - \boldsymbol\theta^{\mathsf T}\boldsymbol x_i)}$
      The parameter $\boldsymbol\theta$ can be estimated using a proxy of the log-likelihood, $\hat{\boldsymbol\theta} = \operatorname{argmax}\left\{\sum_{i=1}^{n}\log \hat p(y_i\mid\boldsymbol\theta^{\mathsf T}\boldsymbol x_i)\right\}$.
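     A hedged sketch of this kernel-based conditional density on a single index $s = \boldsymbol\theta^{\mathsf T}\boldsymbol x$, with Gaussian kernels, a fixed illustrative bandwidth, and a crude grid search for the direction $\boldsymbol\theta$; every name here is a placeholder.
      set.seed(1)
      n <- 500
      x <- matrix(rnorm(2 * n), n, 2)
      theta0 <- c(1, -0.5)
      y <- as.numeric(x %*% theta0 + rnorm(n, sd = 0.5))
      h <- 0.5                               # illustrative bandwidth
      # p_hat(y0 | theta'x = s0), with Gaussian kernels K_h
      cond_dens <- function(y0, s0, y, s, h) {
        w <- dnorm(s0 - s, sd = h)           # K_h(s0 - theta'x_i)
        sum(dnorm(y0 - y, sd = h) * w) / sum(w)
      }
      # proxy log-likelihood over unit directions theta = (cos a, sin a)
      loglik <- function(theta) {
        s <- as.numeric(x %*% theta)
        sum(log(sapply(seq_len(n), function(i)
          cond_dens(y[i], s[i], y = y[-i], s = s[-i], h = h))))   # leave-one-out
      }
      angles <- seq(-pi / 2, pi / 2, length.out = 50)
      ll <- sapply(angles, function(a) loglik(c(cos(a), sin(a))))
      c(cos(angles[which.max(ll)]), sin(angles[which.max(ll)]))   # estimated direction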
  42. Time Series: Stacking. See Clemen (1989, Combining forecasts: A review and annotated bibliography). See opera::oracle(Y = Y, experts = X, loss.type = 'square', model = 'convex'), from the opera package (Online Prediction by Expert Aggregation).
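     A hedged sketch of forecast combination on the validation period, stacking the Holt-Winters, TBATS and ETS point forecasts from the earlier slides as experts; the mixture() call with the MLpol rule is one common choice, and all object names are illustrative.
      library(forecast)
      library(opera)
      h <- length(validY)
      experts <- cbind(
        HW    = as.numeric(forecast(hw_fit, h = h)$mean),
        TBATS = as.numeric(forecast(tbats_fit, h = h)$mean),
        ETS   = as.numeric(forecast(ets_fit, h = h)$mean)
      )
      # best fixed convex combination of the experts, in hindsight
      oracle(Y = as.numeric(validY), experts = experts,
             loss.type = "square", model = "convex")
      # online (sequential) aggregation of the experts
      mix <- mixture(Y = as.numeric(validY), experts = experts,
                     loss.type = "square", model = "MLpol")
      summary(mix)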
  43. Natural Language Processing & Probabilistic Language Models. Idea: $\mathbb P[\text{today is Wednesday}] > \mathbb P[\text{today Wednesday is}]$. Idea: $\mathbb P[\text{today is Wednesday}] > \mathbb P[\text{today is Wendy}]$. E.g., try to predict the missing word: “I grew up in France, I speak fluent ___”.
  44. Natural Language Processing & Probabilistic Language Models. Use of the chain rule, $\mathbb P[A_1, A_2, \cdots, A_n] = \prod_{i=1}^{n}\mathbb P[A_i\mid A_1, A_2, \cdots, A_{i-1}]$, e.g.
      P(the wine is so good) = P(the) · P(wine | the) · P(is | the wine) · P(so | the wine is) · P(good | the wine is so)
      Markov assumption and $k$-gram model: $\mathbb P[A_1, A_2, \cdots, A_n] \approx \prod_{i=1}^{n}\mathbb P[A_i\mid A_{i-k}, \cdots, A_{i-1}]$
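     To make the $k$-gram idea concrete, a toy sketch of a bigram ($k = 1$) model in R, estimated from a tiny made-up corpus; the corpus and all names are purely illustrative.
      corpus <- c("the wine is so good", "the wine is red", "the food is so good")
      tokens <- unlist(strsplit(corpus, " "))
      # bigram counts and conditional probabilities P(w_i | w_{i-1})
      bigrams <- table(head(tokens, -1), tail(tokens, -1))
      cond_prob <- prop.table(bigrams, margin = 1)
      cond_prob["wine", "is"]   # estimated P(is | wine)
      # probability of a sentence under the Markov (bigram) approximation,
      # ignoring the start-of-sentence term for simplicity
      sentence_prob <- function(words) {
        prod(sapply(2:length(words), function(i) cond_prob[words[i - 1], words[i]]))
      }
      sentence_prob(c("the", "wine", "is", "so", "good"))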
