
Lausanne 2019 #1


Actuarial Summer School


  1. Use & Value of Unusual Data. Arthur Charpentier, UQAM, Actuarial Summer School 2019. @freakonometrics, freakonometrics.hypotheses.org
  2. The Team for the Week. Jean-Philippe Boucher, UQAM (Montréal, Canada): professor, holder of The Co-operators Chair in Actuarial Risk Analysis, author of several research articles in actuarial science; @J P Boucher, jpboucher.uqam.ca. Arthur Charpentier, UQAM (Montréal, Canada): professor, editor of Computational Actuarial Science with R, former director of the Data Science for Actuaries program of the French Institute of Actuaries; @freakonometrics, freakonometrics.hypotheses.org. Ewen Gallic, AMSE (Aix-Marseille, France): assistant professor, teaching computational econometrics, data science and machine learning; @3wen, egallic.fr
  3. Practical Issues
  4. General Context (in Insurance). Networks and P2P Insurance (via https://sharable.life and source)
  5. General Context (in Insurance). Networks and P2P Insurance, via Insurance Business (2016)
  6. General Context (in Insurance). Telematics and Usage-Based Insurance (source, source, The Actuary)
  7. General Context (in Insurance). Satellite pictures (source) and Index-Based Flood Insurance (IBFI), http://ibfi.iwmi.org/
  8. General Context (in Insurance). Very small granularity (1 pixel is 15 × 15 cm², see detail)
  9. More and More (Exponentially More) Data. In 1969, (legend says) 5 photographs were taken by Neil Armstrong. Today, more than 800,000,000 pictures are uploaded every day on Facebook (about 2 billion photos per day including Instagram, Messenger and WhatsApp).
  10. Data Complexity. Pictures, drawings, videos, handwritten text, ECG, medical imagery, satellite pictures, tweets, phone calls, etc. It is hard to distinguish structured from unstructured data...
  11. Data Complexity
  12. Insurance Data. From questionnaires to data frames: entries $x_{i,j}$, where row $i$ is an insured, a policy or a claim, and column $j$ is a variable (a feature).
  13. From Pictures to Data. Consider pictures from saa-iss.ch/testalbum/nggallery/all-old-galleries/
  14. From Pictures to Data. Face detection with face-api.js (justadudewhohacks.github.io)
  15. From Pictures to Data. Expression recognition with face-api.js (justadudewhohacks.github.io)
  16. From Pictures to Data. Face recognition with face-api.js (justadudewhohacks.github.io)
  17. Pictures. Extracting information from a picture: what is on the picture? What is the cost of such a claim? These are high-dimensional data: a colour picture of $850 \times 1280$ pixels gives $x_i \in \mathbb{R}^d$ with $d > 3$ million. New statistical tools are needed: (deep) neural networks perform well, or the dimension can be reduced. Pictures: Odermatt (2003, Karambolage).
  18. Text. Kaufman et al. (2016, Natural Language Processing-Enabled and Conventional Data Capture Methods for Input to Electronic Health Records), Chen et al. (2018, A Natural Language Processing System That Links Medical Terms in Electronic Health Record Notes to Lay Definitions) or Simon (2018, Natural Language Processing for Healthcare Customers).
  19. Statistics & ML Toolbox: PCA. A projection method, used to reduce dimensionality. Principal Component Analysis (PCA) finds a small number of directions in input space that explain the variation in the input data, and represents the data by projecting along those directions. With $n$ observations in dimension $d$, stored in a matrix $X \in \mathcal{M}_{n\times d}$, the natural idea is to project linearly (multiply by a projection matrix) onto a much lower-dimensional space, $k < d$, searching for orthogonal directions in space with the highest variance. The information is stored in the covariance matrix $\Sigma = X^\top X$ when the variables are centered.
  20. Statistics & ML Toolbox: PCA. Find the $k$ eigenvectors of $\Sigma$ with the largest eigenvalues: the principal components. Assemble these eigenvectors into a $d\times k$ matrix $\widetilde{\Sigma}$, and express $d$-dimensional vectors $x$ by projecting them to the $k$-dimensional $\widetilde{x} = \widetilde{\Sigma}^\top x$. We have data $X\in\mathcal{M}_{n\times d}$, i.e. $n$ vectors $x_i\in\mathbb{R}^d$. Let $\omega_1\in\mathbb{R}^d$ denote weights; we want to maximize the variance of $z=\omega_1^\top x$, $$\max\ \frac{1}{n}\sum_{i=1}^n\big(\omega_1^\top x_i-\omega_1^\top\bar{x}\big)^2 = \max\ \omega_1^\top\Sigma\,\omega_1 \quad\text{s.t. } \|\omega_1\|=1.$$
  21. Statistics & ML Toolbox: PCA. To solve $$\max\ \frac{1}{n}\sum_{i=1}^n\big(\omega_1^\top x_i-\omega_1^\top\bar{x}\big)^2 = \max\ \omega_1^\top\Sigma\,\omega_1 \quad\text{s.t. } \|\omega_1\|=1,$$ we can use a Lagrange multiplier and solve the constrained optimization problem $$\max_{\omega_1,\lambda_1}\ \omega_1^\top\Sigma\,\omega_1 + \lambda_1\big(1-\omega_1^\top\omega_1\big).$$ Differentiating gives $\Sigma\omega_1-\lambda_1\omega_1=0$, i.e. $\Sigma\omega_1=\lambda_1\omega_1$: $\omega_1$ is an eigenvector of $\Sigma$, with the Lagrange multiplier as eigenvalue. It should be the one with the largest eigenvalue.
  22. Statistics & ML Toolbox: PCA. For the second component, $$\max\ \omega_2^\top\Sigma\,\omega_2 \quad\text{subject to } \|\omega_2\|=1,\ \omega_2^\top\omega_1=0.$$ The Lagrangian is $\omega_2^\top\Sigma\,\omega_2+\lambda_2(1-\omega_2^\top\omega_2)-\lambda_1\omega_2^\top\omega_1$; differentiating, we get $\Sigma\omega_2=\lambda_2\omega_2$, and $\lambda_2$ is the second largest eigenvalue. Set $z=Wx$ and $\widetilde{x}=Mz$. We want to solve (where $\|\cdot\|$ is the standard $\ell_2$ norm) $$\min_{W,M}\ \sum_{i=1}^n\|x_i-\widetilde{x}_i\|^2, \quad\text{i.e.}\quad \min_{W,M}\ \sum_{i=1}^n\|x_i-MWx_i\|^2.$$
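As an illustration of the eigen-decomposition view of PCA described above, here is a minimal Python sketch (not part of the original slides); the simulated data and the choice $k=2$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy data: n = 200 observations in dimension d = 5
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# center the variables, so that Sigma is the (empirical) covariance matrix
Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / len(Xc)

# eigendecomposition; keep the k eigenvectors with the largest eigenvalues
eigval, eigvec = np.linalg.eigh(Sigma)        # eigh: Sigma is symmetric
order = np.argsort(eigval)[::-1]
k = 2
W = eigvec[:, order[:k]]                      # d x k matrix of principal directions

Z = Xc @ W              # projected (k-dimensional) scores
X_hat = Z @ W.T         # reconstruction in the original space

print("explained variance ratio:", eigval[order[:k]].sum() / eigval.sum())
```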
  23. Statistics & ML Toolbox: GLM. Generalized Linear Models are a flexible generalization of linear regression that allows $y$ to have a distribution other than the normal distribution. The classical regression is (in matrix form) $$\underset{n\times 1}{y} = \underset{n\times k}{X}\,\underset{k\times 1}{\beta} + \underset{n\times 1}{\varepsilon}, \quad\text{or}\quad y_i = x_i^\top\beta + \varepsilon_i,$$ with $Y_i\sim\mathcal{N}(\mu_i,\sigma^2)$ and $\mu_i=\mathbb{E}[Y_i\mid x_i]=x_i^\top\beta$. It is possible to consider a more general regression. For a classification $y\in\{0,1\}^n$, $$Y_i\sim\mathcal{B}(p_i), \quad\text{with } p_i=\mathbb{E}[Y_i\mid x_i]=\frac{e^{x_i^\top\beta}}{1+e^{x_i^\top\beta}}.$$ Recall that we assumed that $(Y_i,X_i)$ are i.i.d. with unknown distribution $\mathbb{P}$.
  24. Statistics & ML Toolbox: GLM. In the (standard) linear model, we assume that $(Y\mid X=x)\sim\mathcal{N}(x^\top\beta,\sigma^2)$. The maximum-likelihood estimator of $\beta$ is then $$\widehat{\beta}=\underset{\beta\in\mathbb{R}^{p+1}}{\text{argmax}}\left\{-\sum_{i=1}^n\big(y_i-x_i^\top\beta\big)^2\right\}.$$ For the logistic regression, $(Y\mid X=x)\sim\mathcal{B}\big(\text{logit}^{-1}(x^\top\beta)\big)$, and the maximum-likelihood estimator of $\beta$ is $$\underset{\beta\in\mathbb{R}^{p+1}}{\text{argmax}}\ \sum_{i=1}^n\Big[y_i\log\big(\text{logit}^{-1}(x_i^\top\beta)\big)+(1-y_i)\log\big(1-\text{logit}^{-1}(x_i^\top\beta)\big)\Big].$$
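To make the maximum-likelihood computation concrete, here is a hedged Python sketch that maximizes the logistic log-likelihood numerically (equivalently, minimizes its negative); the simulated design, the true coefficients and the use of scipy's BFGS routine are illustrative choices, not something prescribed by the slides.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # add the intercept column
beta_true = np.array([-0.5, 1.0, -2.0, 0.5])                 # illustrative "true" beta
pr = 1 / (1 + np.exp(-X @ beta_true))
y = rng.binomial(1, pr)

def neg_log_lik(beta):
    # negative log-likelihood of the logistic model
    eta = X @ beta
    return -np.sum(y * eta - np.log1p(np.exp(eta)))

beta_hat = minimize(neg_log_lik, x0=np.zeros(p + 1), method="BFGS").x
print(beta_hat)   # should be close to beta_true
```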
  25. Statistics & ML Toolbox: loss and risk. In a general setting, consider some loss function $\ell:\mathbb{R}\times\mathbb{R}\to\mathbb{R}_+$, e.g. $\ell(y,m)=(y-m)^2$ (quadratic, $\ell_2$ norm). Average risk: the average risk associated with $m_n$ is $$\mathcal{R}_{\mathbb{P}}(m_n)=\mathbb{E}_{\mathbb{P}}\big[\ell\big(Y,m_n(X)\big)\big].$$
  26. Statistics & ML Toolbox: loss and risk. In a general setting, consider some loss function $\ell:\mathbb{R}\times\mathbb{R}\to\mathbb{R}_+$, e.g. $\ell(y,m)=(y-m)^2$ (quadratic, $\ell_2$ norm). Empirical risk: the empirical risk associated with $m_n$ is $$\mathcal{R}_n(m_n)=\frac{1}{n}\sum_{i=1}^n\ell\big(y_i,m_n(x_i)\big).$$
  27. Statistics & ML Toolbox: loss and risk. If we minimize the empirical risk, we overfit... Hold-out cross-validation: 1. Split $\{1,2,\cdots,n\}$ into $T$ (training) and $V$ (validation). 2. Estimate $m$ on the sample $(y_i,x_i)$, $i\in T$: $$m_T=\underset{m\in\mathcal{M}}{\text{argmin}}\ \frac{1}{|T|}\sum_{i\in T}\ell\big(y_i,m(x_{1,i},\cdots,x_{p,i})\big).$$ 3. Compute $$\frac{1}{|V|}\sum_{i\in V}\ell\big(y_i,m_T(x_{1,i},\cdots,x_{p,i})\big).$$
  28. Statistics & ML Toolbox: loss and risk. Source: Hastie et al. (2009, The Elements of Statistical Learning).
  29. Statistics & ML Toolbox: loss and risk. Leave-one-out cross-validation: 1. Estimate $n$ models: estimate $m_{-j}$ on the sample $(y_i,x_i)$, $i\in\{1,\cdots,n\}\setminus\{j\}$, $$m_{-j}=\underset{m\in\mathcal{M}}{\text{argmin}}\ \frac{1}{n-1}\sum_{i\neq j}\ell\big(y_i,m(x_{1,i},\cdots,x_{p,i})\big).$$ 2. Compute $$\frac{1}{n}\sum_{i=1}^n\ell\big(y_i,m_{-i}(x_i)\big).$$
  30. Statistics & ML Toolbox: loss and risk. K-fold cross-validation: 1. Split $\{1,2,\cdots,n\}$ into $K$ groups $V_1,\cdots,V_K$. 2. Estimate $K$ models: estimate $m_k$ on the sample $(y_i,x_i)$, $i\in\{1,\cdots,n\}\setminus V_k$. 3. Compute $$\frac{1}{K}\sum_{k=1}^K\frac{1}{|V_k|}\sum_{i\in V_k}\ell\big(y_i,m_k(x_{1,i},\cdots,x_{p,i})\big).$$
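A minimal K-fold cross-validation sketch in Python (illustrative only; the simulated data, the linear model and $K=5$ are arbitrary choices):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=300)

K = 5
fold_risks = []
for train_idx, valid_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
    m_k = LinearRegression().fit(X[train_idx], y[train_idx])      # fit on {1,...,n} \ V_k
    fold_risks.append(mean_squared_error(y[valid_idx], m_k.predict(X[valid_idx])))

print("K-fold estimate of the risk:", np.mean(fold_risks))
```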
  31. Statistics & ML Toolbox: loss and risk. Bootstrap cross-validation: 1. Generate $B$ bootstrap samples from $\{1,2,\cdots,n\}$, $I_1,\cdots,I_B$. 2. Estimate $B$ models: $m_b$ on the sample $(y_i,x_i)$, $i\in I_b$. 3. Compute $$\frac{1}{B}\sum_{b=1}^B\frac{1}{n-|I_b|}\sum_{i\notin I_b}\ell\big(y_i,m_b(x_{1,i},\cdots,x_{p,i})\big).$$ The probability that $i\notin I_b$ is $\left(1-\frac{1}{n}\right)^n\sim e^{-1}$ (≈ 36.8%): at stage $b$, we validate on roughly 36.8% of the dataset.
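The $\left(1-\frac{1}{n}\right)^n\sim e^{-1}$ out-of-bag argument can be checked numerically; the following short sketch (with arbitrary $n$ and $B$) is only an illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n, B = 1000, 2000

# empirical out-of-bag proportion over B bootstrap samples of {0, ..., n-1}
oob = [1 - len(np.unique(rng.integers(0, n, size=n))) / n for _ in range(B)]
print(np.mean(oob), np.exp(-1))   # both close to 0.368
```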
  32. Statistics & ML Toolbox: loss and risk. For a sample $y_1,\cdots,y_n$, $$\bar{y}=\underset{m\in\mathbb{R}}{\text{argmin}}\left\{\sum_{i=1}^n\frac{1}{n}\big(\underbrace{y_i-m}_{\varepsilon_i}\big)^2\right\}$$ is the empirical version of $$\mathbb{E}[Y]=\underset{m\in\mathbb{R}}{\text{argmin}}\left\{\int\big(\underbrace{y-m}_{\varepsilon}\big)^2\,dF(y)\right\}=\underset{m\in\mathbb{R}}{\text{argmin}}\left\{\mathbb{E}\Big[\big\|\underbrace{Y-m}_{\varepsilon}\big\|_{\ell_2}^2\Big]\right\},$$ where $Y$ is a random variable. Thus, $$\underset{m:\mathbb{R}^k\to\mathbb{R}}{\text{argmin}}\left\{\sum_{i=1}^n\frac{1}{n}\big(\underbrace{y_i-m(x_i)}_{\varepsilon_i}\big)^2\right\}$$ is the empirical version of $\mathbb{E}[Y\mid X=x]$. See Legendre (1805, Nouvelles méthodes pour la détermination des orbites des comètes) and Gauß (1809, Theoria motus corporum coelestium in sectionibus conicis solem ambientium).
  33. Statistics & ML Toolbox: loss and risk. $$\text{med}[y]\in\underset{m\in\mathbb{R}}{\text{argmin}}\left\{\sum_{i=1}^n\frac{1}{n}\big|\underbrace{y_i-m}_{\varepsilon_i}\big|\right\}$$ is the empirical version of $$\text{med}[Y]\in\underset{m\in\mathbb{R}}{\text{argmin}}\left\{\int\big|\underbrace{y-m}_{\varepsilon}\big|\,dF(y)\right\}=\underset{m\in\mathbb{R}}{\text{argmin}}\left\{\mathbb{E}\Big[\big\|\underbrace{Y-m}_{\varepsilon}\big\|_{\ell_1}\Big]\right\},$$ where $\mathbb{P}[Y\leq\text{med}[Y]]\geq\frac{1}{2}$ and $\mathbb{P}[Y\geq\text{med}[Y]]\geq\frac{1}{2}$. $$\underset{m:\mathbb{R}^k\to\mathbb{R}}{\text{argmin}}\left\{\sum_{i=1}^n\frac{1}{n}\big|\underbrace{y_i-m(x_i)}_{\varepsilon_i}\big|\right\}$$ is the empirical version of $\text{med}[Y\mid X=x]$. See Boscovich (1757, De Litteraria expeditione per pontificiam ditionem ad dimetiendos duos meridiani) and Laplace (1793, Sur quelques points du système du monde).
  34. Statistics & ML Toolbox: loss and risk. For least squares ($\ell_2$ norm) we have a strictly convex problem, so classical optimization routines can be used (e.g. gradient descent). Things are more complicated with the $\ell_1$ norm. Consider a sample $\{y_1,\cdots,y_n\}$. To compute the median, solve $$\min_{\mu}\ \sum_{i=1}^n|y_i-\mu|,$$ which can be solved using linear-programming techniques, see Dantzig (1963, Linear Programming). More precisely, this problem is equivalent to $$\min_{\mu,a,b}\ \sum_{i=1}^n a_i+b_i \quad\text{with } a_i,b_i\geq 0 \text{ and } y_i-\mu=a_i-b_i,\ \forall i=1,\cdots,n.$$
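As a sanity check of the linear-programming formulation of the median, here is a sketch using scipy.optimize.linprog; the decision-variable layout $[\mu, a, b]$ and the sample are illustrative choices.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(4)
y = rng.normal(size=11)
n = len(y)

# decision variables: [mu, a_1..a_n, b_1..b_n]; minimize sum(a) + sum(b)
c = np.concatenate([[0.0], np.ones(n), np.ones(n)])

# equality constraints: mu + a_i - b_i = y_i  (i.e. y_i - mu = a_i - b_i)
A_eq = np.hstack([np.ones((n, 1)), np.eye(n), -np.eye(n)])
b_eq = y

bounds = [(None, None)] + [(0, None)] * (2 * n)   # mu is free, a_i, b_i >= 0
res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")

print(res.x[0], np.median(y))   # the LP solution recovers the sample median
```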
  35. Statistics & ML Toolbox: loss and risk. For a quantile, the linear program is $$\min_{\mu,a,b}\ \sum_{i=1}^n\tau a_i+(1-\tau)b_i \quad\text{with } a_i,b_i\geq 0 \text{ and } y_i-\mu=a_i-b_i,\ \forall i=1,\cdots,n.$$ This extends to the quantile regression, $$\min_{\beta_\tau,a,b}\ \sum_{i=1}^n\tau a_i+(1-\tau)b_i \quad\text{with } a_i,b_i\geq 0 \text{ and } y_i-x_i^\top\beta_\tau=a_i-b_i,\ \forall i=1,\cdots,n.$$ But is the following problem really the one we should care about? $$\widehat{m}=\underset{m\in\mathcal{M}}{\text{argmin}}\ \sum_{i=1}^n\ell\big(y_i,m(x_i)\big)$$
  36. Statistics & ML Toolbox: Penalization. Usually there are tuning parameters $p$. To avoid overfitting, 1. use cross-validation, $$\widehat{m}_{\widehat{p}},\quad\text{where } \widehat{p}=\underset{p}{\text{argmin}}\left\{\sum_{i\in\text{validation}}\ell\big(y_i,\widehat{m}_p(x_i)\big)\right\} \text{ and } \widehat{m}_p=\underset{m\in\mathcal{M}_p}{\text{argmin}}\left\{\sum_{i\in\text{training}}\ell\big(y_i,m(x_i)\big)\right\},$$ or 2. use penalization, $$\widehat{m}=\underset{m\in\mathcal{M}}{\text{argmin}}\ \sum_{i=1}^n\ell\big(y_i,m(x_i)\big)+\text{penalization}(m).$$
  37. Statistics & ML Toolbox: Penalization. Penalization often yields bias... but it might be interesting if the risk is the mean squared error. Consider a parametric model with true (unknown) parameter $\theta$; then $$\text{mse}(\widehat{\theta})=\mathbb{E}\big[(\widehat{\theta}-\theta)^2\big]=\underbrace{\mathbb{E}\big[(\widehat{\theta}-\mathbb{E}[\widehat{\theta}])^2\big]}_{\text{variance}}+\underbrace{\big(\mathbb{E}[\widehat{\theta}]-\theta\big)^2}_{\text{bias}^2}.$$ One can think of shrinking an unbiased estimator: let $\widetilde{\theta}$ denote an unbiased estimator of $\theta$; then $$\widehat{\theta}=\frac{\theta^2}{\theta^2+\text{mse}(\widetilde{\theta})}\cdot\widetilde{\theta}\ \big(\leq\widetilde{\theta}\big) \quad\text{satisfies}\quad \text{mse}(\widehat{\theta})\leq\text{mse}(\widetilde{\theta}).$$
  38. Statistics & ML Toolbox: Ridge, as in Wieringen (2018, Lecture Notes on Ridge Regression). Ridge estimator (OLS): $$\widehat{\beta}^{\text{ridge}}_\lambda=\text{argmin}\left\{\sum_{i=1}^n\big(y_i-x_i^\top\beta\big)^2+\lambda\sum_{j=1}^p\beta_j^2\right\}, \qquad \widehat{\beta}^{\text{ridge}}_\lambda=\big(X^\top X+\lambda\mathbb{I}\big)^{-1}X^\top y.$$ Ridge estimator (GLM): $$\widehat{\beta}^{\text{ridge}}_\lambda=\text{argmin}\left\{-\sum_{i=1}^n\log f\big(y_i\mid\mu_i=g^{-1}(x_i^\top\beta)\big)+\frac{\lambda}{2}\sum_{j=1}^p\beta_j^2\right\}.$$
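The closed-form ridge estimator $(X^\top X+\lambda\mathbb{I})^{-1}X^\top y$ can be checked directly; in the sketch below (simulated data, arbitrary $\lambda$) it is compared with scikit-learn's Ridge, which minimizes the same penalized sum of squares when the intercept is dropped.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 4))
y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.3, size=100)
lam = 1.0

# closed-form ridge estimator (no intercept, as in the formula above)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# sklearn's Ridge with alpha = lambda and no intercept matches the closed form
print(beta_ridge)
print(Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_)
```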
  39. Statistics & ML Toolbox: Ridge. Observe that if $x_{j_1}\perp x_{j_2}$, then $$\widehat{\beta}^{\text{ridge}}_\lambda=[1+\lambda]^{-1}\widehat{\beta}^{\text{ols}},$$ which explains the relationship with shrinkage, but this does not hold in the general case... Smaller mse: there exists $\lambda$ such that $\text{mse}\big[\widehat{\beta}^{\text{ridge}}_\lambda\big]\leq\text{mse}\big[\widehat{\beta}^{\text{ols}}\big]$.
  40. Statistics & ML Toolbox: Ridge. From a Bayesian perspective, $$\underbrace{\mathbb{P}[\theta\mid y]}_{\text{posterior}}\propto\underbrace{\mathbb{P}[y\mid\theta]}_{\text{likelihood}}\cdot\underbrace{\mathbb{P}[\theta]}_{\text{prior}}, \quad\text{i.e.}\quad \log\mathbb{P}[\theta\mid y]=\underbrace{\log\mathbb{P}[y\mid\theta]}_{\text{log-likelihood}}+\underbrace{\log\mathbb{P}[\theta]}_{\text{penalty}}$$ (up to an additive constant). If $\beta$ has a $\mathcal{N}(0,\tau^2\mathbb{I})$ prior distribution, then its posterior distribution has mean $$\mathbb{E}[\beta\mid y,X]=\left(X^\top X+\frac{\sigma^2}{\tau^2}\mathbb{I}\right)^{-1}X^\top y.$$
  41. Statistics & ML Toolbox: lasso. Lasso estimator (OLS): $$\widehat{\beta}^{\text{lasso}}_\lambda=\text{argmin}\left\{\sum_{i=1}^n\big(y_i-x_i^\top\beta\big)^2+\lambda\sum_{j=1}^p|\beta_j|\right\}.$$ Lasso estimator (GLM): $$\widehat{\beta}^{\text{lasso}}_\lambda=\text{argmin}\left\{-\sum_{i=1}^n\log f\big(y_i\mid\mu_i=g^{-1}(x_i^\top\beta)\big)+\frac{\lambda}{2}\sum_{j=1}^p|\beta_j|\right\}.$$
  42. Statistics & ML Toolbox: sparsity. In several applications $k$ can be (very) large, but a lot of features are just noise: $\beta_j=0$ for many $j$'s. Let $s$ denote the number of relevant features, with $s\ll k$, cf. Hastie, Tibshirani & Wainwright (2015, Statistical Learning with Sparsity): $$s=\text{card}\{\mathcal{S}\}, \quad\text{where } \mathcal{S}=\{j;\ \beta_j\neq 0\}.$$ The model is now $y=X_{\mathcal{S}}\beta_{\mathcal{S}}+\varepsilon$, where $X_{\mathcal{S}}^\top X_{\mathcal{S}}$ is a full-rank matrix.
  43. Statistics & ML Toolbox: sparsity. $\ell_0$-regularization estimator (OLS): $$\widehat{\beta}_\lambda=\text{argmin}\left\{\sum_{i=1}^n\big(y_i-x_i^\top\beta\big)^2+\lambda\|\beta\|_{\ell_0}\right\}, \quad\text{where } \|a\|_{\ell_0}=\sum_j\mathbf{1}(|a_j|>0)$$ counts the non-zero coordinates (at most $\dim(\beta)$). More generally, the $\ell_p$-regularization estimator (OLS) is $$\widehat{\beta}_\lambda=\text{argmin}\left\{\sum_{i=1}^n\big(y_i-x_i^\top\beta\big)^2+\lambda\|\beta\|_{\ell_p}\right\}.$$ • Sparsity is obtained when $p\leq 1$. • Convexity is obtained when $p\geq 1$.
  44. Statistics & ML Toolbox: sparsity. Source: Hastie et al. (2009, The Elements of Statistical Learning).
  45. Statistics & ML Toolbox: lasso. In the orthogonal case, $X^\top X=\mathbb{I}$, $$\widehat{\beta}^{\text{lasso}}_{k,\lambda}=\text{sign}\big(\widehat{\beta}^{\text{ols}}_k\big)\left(\big|\widehat{\beta}^{\text{ols}}_k\big|-\frac{\lambda}{2}\right)_+,$$ i.e. the lasso estimate is related to the soft-threshold function...
  46. Numerical Toolbox: LASSO Coordinate Descent Algorithm. 1. Start from an initial value $\beta^{(0)}$. 2. For $k=1,\cdots$: for $j=1,\cdots,p$, (i) compute $R_j=x_j^\top\big(y-X_{(-j)}\beta^{(k-1)}_{(-j)}\big)$; (ii) set $$\beta_{k,j}=R_j\left(1-\frac{\lambda}{2|R_j|}\right)_+.$$ 3. The final estimate $\beta^{(\kappa)}$ is $\widehat{\beta}_\lambda$. The softplus function $\log(1+e^{x-s})$ can be used as a smooth version of $(x-s)_+$.
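A minimal coordinate-descent sketch for the lasso, assuming unit-norm columns of $X$ so that the update is exactly the soft-thresholding step above; the data, $\lambda$ and the fixed number of sweeps are illustrative.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso coordinate descent, assuming the columns of X have unit norm."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual, excluding feature j
            r_j = y - X @ beta + X[:, j] * beta[j]
            R_j = X[:, j] @ r_j
            # soft-thresholding update: sign(R_j) * (|R_j| - lam/2)_+
            beta[j] = np.sign(R_j) * max(abs(R_j) - lam / 2, 0.0)
    return beta

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 10))
X /= np.linalg.norm(X, axis=0)                 # unit-norm columns
beta_true = np.array([3, -2, 0, 0, 1.5, 0, 0, 0, 0, 0], dtype=float)
y = X @ beta_true + rng.normal(scale=0.1, size=200)

print(lasso_cd(X, y, lam=0.5).round(2))        # irrelevant coefficients shrink toward 0
```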
  47. Statistics & ML Toolbox: trees. Use recursive binary splitting to grow a tree. Trees (CART): each interior node corresponds to one of the input variables; there are edges to children for each of the possible values of that input variable. Each leaf represents a value of the target variable, given the values of the input variables represented by the path from the root to the leaf.
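A short sketch of recursive binary splitting with scikit-learn's CART implementation; the simulated data and the depth limit are arbitrary, and export_text simply prints the fitted splits.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(7)
X = rng.uniform(size=(200, 2))
y = 1.0 * (X[:, 0] > 0.5) + 0.5 * (X[:, 1] > 0.3) + rng.normal(scale=0.1, size=200)

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["x1", "x2"]))   # the recursive binary splits
```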
  48. Statistics & ML Toolbox: forests. Bagging (Bootstrap + Aggregation): 1. For $k=1,\cdots,\kappa$: (i) draw a bootstrap sample from the $(y_i,x_i)$'s; (ii) estimate a model $m_k$ on that sample. 2. The final model is $$m^\star(\cdot)=\frac{1}{\kappa}\sum_{k=1}^\kappa m_k(\cdot).$$
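A hedged bagging sketch (bootstrap the observations, fit a tree on each resample, average the predictions); the data, the tree depth and $\kappa=100$ are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(8)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

kappa = 100
models = []
for _ in range(kappa):
    idx = rng.integers(0, len(X), size=len(X))       # bootstrap sample of indices
    models.append(DecisionTreeRegressor(max_depth=4).fit(X[idx], y[idx]))

def m_bagged(x_new):
    # aggregate: average the kappa bootstrap models
    return np.mean([m.predict(x_new) for m in models], axis=0)

print(m_bagged(np.array([[0.0], [1.5]])))            # close to sin(0) and sin(1.5)
```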
  49. Statistics & ML Toolbox: boosting. Boosting & sequential learning: $$m_k(\cdot)=m_{k-1}(\cdot)+\underset{h\in\mathcal{H}}{\text{argmin}}\left\{\sum_{i=1}^n\ell\big(\underbrace{y_i-m_{k-1}(x_i)}_{\varepsilon_i},\,h(x_i)\big)\right\}$$
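A sketch of sequential learning on residuals, with regression stumps as the weak learners $h\in\mathcal{H}$; the shrinkage factor nu is a common practical addition that is not in the formula above, and the data are simulated.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(9)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)

prediction = np.zeros(len(y))          # m_0 = 0
learners, nu = [], 0.3                 # nu: shrinkage (learning rate), an added tweak
for k in range(200):
    residuals = y - prediction                                 # epsilon_i = y_i - m_{k-1}(x_i)
    h = DecisionTreeRegressor(max_depth=1).fit(X, residuals)   # weak learner (stump)
    prediction += nu * h.predict(X)                            # m_k = m_{k-1} + nu * h
    learners.append(h)

print(np.mean((y - prediction) ** 2))  # training error decreases over the iterations
```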
  50. Statistics & ML Toolbox: Neural Nets. Very popular for pictures... A picture $x_i$ is • an $n\times n$ matrix in $\{0,1\}^{n^2}$ for black & white, • an $n\times n$ matrix in $[0,1]^{n^2}$ for grey-scale, • a $3\times n\times n$ array in $([0,1]^3)^{n^2}$ for colour, • a $T\times 3\times n\times n$ tensor in $(([0,1]^3)^{n^2})^T$ for video. Here $y$ is the label ("8", "9", "6", etc). Suppose we want to recognize a "6" on a picture: $$m(x)=\begin{cases}+1 & \text{if } x \text{ is a "6"}\\ -1 & \text{otherwise.}\end{cases}$$
  51. Statistics & ML Toolbox: Neural Nets. Consider some specific pixels, and associate weights $\omega$ such that $$m(x)=\text{sign}\left(\sum_{i,j}\omega_{i,j}x_{i,j}\right), \quad\text{where } x_{i,j}=\begin{cases}+1 & \text{if pixel } (i,j) \text{ is black}\\ -1 & \text{if pixel } (i,j) \text{ is white,}\end{cases}$$ for some weights $\omega_{i,j}$ (that can be negative...).
  52. Statistics & ML Toolbox: Neural Nets. A deep network is a network with a lot of layers, $$m(x)=\text{sign}\left(\sum_i\omega_i m_i(x)\right),$$ where the $m_i$'s are outputs of previous neural nets. Those layers can capture shapes in some areas, nonlinearities, cross-dependence, etc.
  53. Binary Threshold Neuron & Perceptron. If $x\in\{0,1\}^p$, McCulloch & Pitts (1943, A Logical Calculus of the Ideas Immanent in Nervous Activity) suggested a simple model, with threshold $b$, $$y_i=f\left(\sum_{j=1}^p x_{j,i}\right), \quad\text{where } f(x)=\mathbf{1}(x\geq b),$$ i.e. $\mathbf{1}(x\geq b)=1$ if $x\geq b$ and $0$ otherwise, or (equivalently) $$y_i=f\left(\omega+\sum_{j=1}^p x_{j,i}\right), \quad\text{where } f(x)=\mathbf{1}(x\geq 0),$$ with weight $\omega=-b$. The trick of adding 1 as an input was very important! With $\omega=-1$ we get the or logical operator: $y_i=1$ if $\exists j$ such that $x_{j,i}=1$. With $\omega=-p$ we get the and logical operator: $y_i=1$ if $\forall j$, $x_{j,i}=1$.
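The or/and constructions can be checked with a two-line threshold neuron; this sketch (with p = 2 binary inputs) is only illustrative.

```python
import numpy as np

def threshold_neuron(x, omega):
    # McCulloch-Pitts unit: output 1(omega + sum_j x_j >= 0)
    return int(omega + np.sum(x) >= 0)

inputs = [np.array(v) for v in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print([threshold_neuron(x, omega=-1) for x in inputs])   # OR  gate: [0, 1, 1, 1]
print([threshold_neuron(x, omega=-2) for x in inputs])   # AND gate (p = 2): [0, 0, 0, 1]
```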
  54. Binary Threshold Neuron & Perceptron. But this is not possible for the xor logical operator: $y_i=1$ if $x_{1,i}\neq x_{2,i}$. Rosenblatt (1961, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms) considered the extension where the $x$'s are real-valued, with weights $\omega\in\mathbb{R}^p$, $$y_i=f\left(\sum_{j=1}^p\omega_j x_{j,i}\right), \quad\text{where } f(x)=\mathbf{1}(x\geq b),$$ i.e. $\mathbf{1}(x\geq b)=1$ if $x\geq b$ and $0$ otherwise, or (equivalently) $$y_i=f\left(\omega_0+\sum_{j=1}^p\omega_j x_{j,i}\right), \quad\text{where } f(x)=\mathbf{1}(x\geq 0),$$ with weights $\omega\in\mathbb{R}^{p+1}$.
  55. Binary Threshold Neuron & Perceptron. Minsky & Papert (1969, Perceptrons: An Introduction to Computational Geometry) proved that the perceptron is a linear separator, and so not very powerful. Define the sigmoid function $$f(x)=\frac{1}{1+e^{-x}}=\frac{e^x}{e^x+1}$$ (the logistic function). This function $f$ is called the activation function. If $y\in\{-1,+1\}$, one can consider the hyperbolic tangent $$f(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}$$ or the inverse tangent (arctan) function (see wikipedia).
  56. Statistics & ML Toolbox: Neural Nets. So here, for a classification problem, $$y_i=f\left(\omega_0+\sum_{j=1}^p\omega_j x_{j,i}\right)=p(x_i).$$ The error of that model can be computed using the quadratic loss function, $$\sum_{i=1}^n\big(y_i-p(x_i)\big)^2,$$ or the cross-entropy, $$-\sum_{i=1}^n\Big(y_i\log p(x_i)+[1-y_i]\log[1-p(x_i)]\Big).$$
  57. Statistics & ML Toolbox: Neural Nets. Consider a single hidden layer with 3 different neurons, so that $$p(x)=f\left(\omega_0+\sum_{h=1}^3\omega_h f_h\Big(\omega_{h,0}+\sum_{j=1}^p\omega_{h,j}x_j\Big)\right), \quad\text{or}\quad p(x)=f\left(\omega_0+\sum_{h=1}^3\omega_h f_h\big(\omega_{h,0}+x^\top\omega_h\big)\right).$$
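A forward-pass sketch of this single-hidden-layer network with 3 neurons; the weights here are random placeholders rather than trained values, and the sigmoid activation is one possible choice for $f$ and the $f_h$.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(10)
p = 4                                   # number of features (arbitrary)
W_hidden = rng.normal(size=(3, p))      # omega_{h,j}: weights of the 3 hidden neurons
b_hidden = rng.normal(size=3)           # omega_{h,0}: hidden biases
w_out = rng.normal(size=3)              # omega_h: output weights
b_out = rng.normal()                    # omega_0: output bias

def p_hat(x):
    # single hidden layer with 3 sigmoid neurons, sigmoid output
    hidden = sigmoid(b_hidden + W_hidden @ x)          # f_h(omega_{h,0} + x' omega_h)
    return sigmoid(b_out + w_out @ hidden)             # f(omega_0 + sum_h omega_h f_h(...))

x = rng.normal(size=p)
print(p_hat(x))     # a value in (0, 1); the weights are random, not trained
```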
