
Women in Data Science 2018 Slides--Small Samples, Subgroups, and Topology


A lot of data science coverage in the media focuses on big data—storage systems, deep learning, and analyzing data with billions or trillions of observations. However, there’s an equally pressing problem in many industries and smaller companies today: small sample sizes or small subgroups within larger datasets. Machine learning algorithms fail to converge. Statistical methods break down completely. And valuable insight is lost.

However, recent advances in a branch of machine learning called topological data analysis (TDA), along with novel applications of topology to existing statistical methods, have provided a toolset suited to the challenges of small data. These methods have great potential as the field of data science moves from quantity to quality of data. This talk overviews several of TDA’s major tools, as well as their applications to three projects in which traditional methods fail.

I will link to the video when it is made available :)

Published in: Data & Analytics


  1. Human Behavior, Small Samples, and the Problem of Subgroups: The Power of Topology (3/5/2018)
  2. Introduction
     - Big data hype
     - Less publicized but very important types of data:
       1. Small data
       2. Data with distinct subgroups
     - Industries where these are common:
       - Education
       - Insurance
       - Biotechnology/pharmaceuticals
       - Industrial psychology
  3. Problems Unique to Small Data
     - Types of small data problems:
       - Rare diseases (100 cases worldwide with unknown genetic causes)
       - Pilot studies (tens or hundreds of participants)
       - Small educational programs (tens of students enrolled in the previous year)
     - Main issues:
       - Statistical models require minimum sample sizes to estimate effects without computational issues or wide confidence intervals (singularities, p >> n problems).
       - Machine learning algorithms need to converge for stable estimates and models.
       - Small samples can induce sparsity in the data space, which is problematic for clustering and general data mining techniques.
  4. Problems Unique to Subgroups in Data
     - Types of data in which subgroups are common:
       - Medicine (diverse causes of a given disease, subtypes of disease)
       - Education (different types of students and risk types for failure)
       - Industrial psychology (different personality trait patterns)
     - Main issues:
       - Washing out of effects in a full model
         - Examples:
           - Small subgroup defined by extremely high extraversion and openness related to public speaking outcome
           - Rare genetic variant combination predicting high likelihood of response to a drug within a disease population
       - Defining robust partitions within a piecewise regression model to deal with this phenomenon
         - Mixed results from many methods employing this strategy
         - Difficult with small sample sizes for most piecewise regression models
  5. Unique Solutions: Topology
     - Branch of mainly pure mathematics
     - Study of changes in function behavior on different shapes
     - Identify invariant properties of shapes
     - Classify similarities/differences between shapes
     - Deep connections to physics and differential equations
  6. Topological Data Analysis
     - Data as discrete point clouds
     - Topological spaces, called simplicial complexes, built from these:
       - Connect points within a certain distance of each other
       - Topologically similar to a graph
     - Tools using simplicial complexes for data analysis are called topological data analysis (TDA)
     - Figure: 2-d neighborhoods are defined by Euclidean distance. Points within a given circle are mutually connected, forming a simplex.
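The construction on slide 6 (connect points that lie within a certain distance of each other) can be sketched in a few lines. This is a minimal illustration assuming the Vietoris-Rips convention, where a simplex is added whenever all of its points are pairwise within `eps`. The function name `rips_complex`, the threshold `eps`, and the toy points are assumptions made for this example, not taken from the talk.

```python
from itertools import combinations
from math import dist


def rips_complex(points, eps):
    """Return (edges, triangles) of a Vietoris-Rips complex at scale eps.

    Two points are joined by an edge when their Euclidean distance is at
    most eps; three points form a triangle when all three edges are present.
    """
    n = len(points)
    edges = {
        (i, j)
        for i, j in combinations(range(n), 2)
        if dist(points[i], points[j]) <= eps
    }
    triangles = {
        (i, j, k)
        for i, j, k in combinations(range(n), 3)
        if (i, j) in edges and (i, k) in edges and (j, k) in edges
    }
    return edges, triangles


# Toy point cloud (made up for illustration): a tight triangle plus one far point.
points = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.8), (5.0, 5.0)]
edges, triangles = rips_complex(points, eps=1.2)
print(sorted(edges))
print(sorted(triangles))
```

The far point stays isolated at this scale, while the three nearby points are mutually connected and form a single 2-simplex.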
  7. Tool 1: Persistent Homology
     - Filtration
       - Series of simplicial complexes based on varying distance thresholds
       - Features appear and disappear as lens changes
       - Nested sequence of features with deep algebraic properties
     - Persistence as length of feature existence in the sequences (plotted as persistence diagrams)
       - Termed persistent homology
     - A bit like an MRI-type examination of data
       - Persistence as organ size and type
       - Gives a comprehensive view of data
     - Persistent homology related to hierarchical clustering
     - Statistical methods to compare datasets
     - Figure: persistence diagram (axes: Birth, Death) and barcode plot (axis: time)
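The filtration idea on slide 7 can be illustrated for the simplest kind of feature, connected components (dimension 0). The sketch below is an assumption-laden toy, not the software used in the talk: it grows the distance threshold edge by edge and records when components merge, using a union-find structure. The merge thresholds are the same heights at which single-linkage hierarchical clustering joins clusters, which is one way to see the link to hierarchical clustering mentioned on the slide. The function name and the data are hypothetical.

```python
from itertools import combinations
from math import dist


def zero_dim_persistence(points):
    """Return (birth, death) pairs for connected components (H0).

    Every point is born as its own component at threshold 0. Edges are
    added in order of length; when an edge joins two components, one of
    them dies at that length. The last surviving component never dies.
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    pairs = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            pairs.append((0.0, length))
    pairs.append((0.0, float("inf")))
    return pairs


# Toy 1-D data (made up): two tight clusters separated by a wide gap.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,)]
pairs = zero_dim_persistence(points)
for birth, death in pairs:
    print(birth, death)
```

Short-lived components (death at 1.0) correspond to points inside a cluster, while the component that survives until 8.0 reflects the gap between the two clusters.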
  8. Tool 2: Morse-Smale Clustering
     - Multivariate technique from TDA similar to mode clustering
     - Find peaks and valleys in data by filtering on a defined function:
       - A watershed on mountains
       - Dribbling a soccer ball across a field of hills
     - Separate data based on shared peaks and valleys
     - Many nice developments on convergence and theoretical properties
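The "shared peaks and valleys" idea on slide 8 can be sketched with a simplified discrete approximation: build a nearest-neighbor graph, let each point walk uphill to a local peak and downhill to a local valley of the function, and group points that end at the same pair. This is only a rough sketch under stated assumptions, not the full Morse-Smale complex computation from the cited papers. The function names, the choice `k=2`, and the toy data are hypothetical.

```python
from math import dist


def knn_graph(points, k):
    """Symmetric k-nearest-neighbor graph as a list of neighbor sets."""
    n = len(points)
    nbrs = [set() for _ in range(n)]
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist(points[i], points[j]))
        for j in order[:k]:
            nbrs[i].add(j)
            nbrs[j].add(i)
    return nbrs


def follow(i, values, nbrs, sign):
    """Walk from i to the best-valued neighbor until no neighbor improves.

    sign=+1 climbs to a local peak, sign=-1 descends to a local valley.
    """
    while True:
        best = max(nbrs[i], key=lambda j: sign * values[j])
        if sign * values[best] <= sign * values[i]:
            return i
        i = best


def morse_smale_labels(points, values, k=2):
    """Label each point by the (peak, valley) pair its two walks end at."""
    nbrs = knn_graph(points, k)
    return [(follow(i, values, nbrs, +1), follow(i, values, nbrs, -1))
            for i in range(len(points))]


# Toy 1-D data (made up): a function with two peaks sampled at 7 locations.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,), (6.0,)]
values = [0.0, 2.0, 1.0, 0.5, 1.5, 3.0, 1.0]
labels = morse_smale_labels(points, values)
print(labels)
```

Points that share both a peak and a valley receive the same label, so the labels partition the data into regions where the function behaves monotonically.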
  9. Tool 3: Homotopy and Path Equivalence
     - Homotopy arrow example
       - Red to blue by wiggling start to finish path
       - Yellow arrow and hole problem
     - Homotopy method in LASSO
       - Wiggles easy regression path to optimal regression path
       - Recent success solving ordinary differential equations
       - Avoids local optima that can trap other regression estimators
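The idea on slide 9 of deforming an easy problem into the target problem has a simple numerical relative: continuation along the LASSO penalty. The sketch below starts at the penalty where the solution is trivially zero and relaxes it step by step, reusing each solution as the starting point for the next. It uses plain coordinate descent with warm starts as a stand-in; it is not the homotopy LASSO or DGLARS algorithm referred to in the talk. All names, the penalty grid, and the data are assumptions for illustration.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t (the proximal step for the L1 penalty)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, beta, sweeps=200):
    """Coordinate descent for (1/2n)*||y - X b||^2 + lam*||b||_1, started at beta."""
    n, p = len(X), len(X[0])
    beta = list(beta)
    for _ in range(sweeps):
        for j in range(p):
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            norm = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / norm
    return beta


def lasso_path(X, y, steps=5):
    """Follow solutions from lam_max (all-zero solution) toward smaller penalties."""
    n, p = len(X), len(X[0])
    lam_max = max(abs(sum(X[i][j] * y[i] for i in range(n))) / n for j in range(p))
    beta, path = [0.0] * p, []
    for s in range(steps + 1):
        lam = lam_max * (1 - s / steps)
        beta = lasso_cd(X, y, lam, beta)  # warm start from the previous solution
        path.append((lam, beta))
    return path


# Toy data (made up): orthogonal columns, y = 2*x1 + 0.5*x2 with no noise.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.5, 1.5, -1.5, -2.5]
path = lasso_path(X, y, steps=4)
for lam, beta in path:
    print(lam, beta)
```

On this toy problem the first coefficient enters the model as soon as the penalty drops below its maximum, while the second stays at zero until the penalty is almost fully relaxed.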
  10. Case Study 1: Small Educational Samples
      - Problem set-up
        1. Understand subgroups of profoundly gifted students (IQ > 160)
        2. Explore impact of educational interventions on early career awards
      - Sample
        1. 17 profoundly gifted students:
           - Gross's 2003 sample
           - Intelligence testing data available
           - Early achievement testing (verbal, math) available
        2. 16 of these same students with follow-up data related to:
           - Educational intervention data
           - Early career recognition/awards
  11. Data Mining
      - Figure: Intelligence and Achievement Dendrogram, hclust with "complete" linkage on dist(mydata[, 2:4]); vertical axis: Height (0 to 60); leaf order: 9, 3, 13, 5, 1, 7, 8, 14, 10, 11, 12, 6, 16, 15, 17, 2, 4
      - Distinct population that separates out very early in the filtration (box)
      - Students with an IQ > 200 and achievement scores 5+ grades ahead for math and verbal (multivariate outliers)
      - Corroborates previous evidence of a "high flat" profile distinct from other types of profound giftedness
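The dendrogram on slide 11 comes from complete-linkage hierarchical clustering. As a rough sketch of that technique (not a reproduction of the study, whose data are not in this transcript), the code below repeatedly merges the two clusters whose farthest pair of members is closest. The function name and the toy values are made up for illustration.

```python
from itertools import combinations
from math import dist


def complete_linkage(points):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, height).

    The distance between two clusters is the largest distance between any
    pair of their members (complete linkage).
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = max(dist(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is still valid
    return merges


# Toy 1-D data (made up): two pairs of similar points and one extreme outlier.
points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
merges = complete_linkage(points)
for left, right, height in merges:
    print(left, right, height)
```

The outlier joins the rest only at the final, tallest merge, which is how a distinct subgroup shows up as a branch that splits off near the top of a dendrogram.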
  12. Logistic Regression Coefficient Comparison
      - Comparison of 2 machine learning models and 2 topologically-based models
        - Too few observations for traditional logistic regression
        - Multivariate adaptive regression splines (MARS): inadequate fit (R^2 = 0.27)
        - Bayesian model averaging (BMA): extremely large confidence interval
        - DGLARS and HLASSO (topologically-based): good fit, small confidence intervals, consistent results across replication
      - Coefficient table, columns: IQ Score, Early English, Early Math, Early Entry, Grade Skip, Subject Acceleration, Radical Acceleration (empty cells were lost in the transcript, so values are listed in slide order without column assignment):
        - MARS: 0.44
        - BMA: -6.25, 5.79, 0.97, 1.38, 2.41, 33.10
        - DGLARS: 2.20, 4.66
        - HLASSO: 0.02, -0.26, 1.44, 3.27
  13. Case Study 2: Actuarial Modeling with Subgroups
      - Problem set-up
        - Understand risk factors associated with auto insurance claims
        - Understand subgroups with different types of risk
      - Sample
        - Open-source Swedish automobile claims dataset from 1977
        - 2182 claims, 6 predictors
  14. Risk Clusters
      - Group 1: relatively high dependence on make and number of claims
      - Group 2: relatively high dependence on bonus and number of years insured
      - Group 3: almost solely dependent on number of claims and geographic zone
      - Three distinct subgroups with varying risk type
  15. Case Study 3: Psychometric Test Design
      - Set-up:
        - Explore/validate survey measuring identity importance/expression across social contexts
        - Create subscales within the survey
      - Sample:
        - 406 participants in a pilot study
        - 91 test items
        - Random samples of 130 participants taken with replacement as validation samples
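The validation scheme on slide 15 (random samples of 130 of the 406 participants, drawn with replacement) can be sketched as follows. The participant and sample counts come from the slide; the number of samples, the seed, and the function name are assumptions, since the transcript does not state them.

```python
import random


def validation_samples(n_participants, sample_size, n_samples, seed=0):
    """Draw index samples with replacement from range(n_participants)."""
    rng = random.Random(seed)
    return [
        rng.choices(range(n_participants), k=sample_size)
        for _ in range(n_samples)
    ]


# 406 participants and samples of 130 come from the slide; the number of
# samples (3) and the seed are illustrative assumptions.
samples = validation_samples(406, 130, n_samples=3)
for s in samples:
    # Sample size, then the number of distinct participants in the sample.
    print(len(s), len(set(s)))
```

Because sampling is with replacement, a participant can appear more than once within a sample, so the number of distinct participants is usually below 130.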
  16. Advantages of Topology Over Factor Analysis
      - Loss of information with each projection to a lower-dimensional space (errors)
      - Topological methods work by partitioning existing space into homogeneous components (no maps, no error)
      - Figure: 2D example
  17. Exploratory Analysis
      - Figure: plot of the survey items with a value scale from -0.2 to 1. Items are labeled ILLCa_<identity>_<context>, with identities school_success, gender, age, sexual_or, beauty, sport, religion, politics, tribe, look, music, race, status and contexts family, school, dating, freetime, religion, neighborhood, group.
  18. Insights Gained
      - Some aspects of identity are fluid, others are fixed
        - Political and racial/ethnic identity fixed
        - Other types, such as athletic or gender, fairly fluid
      - No statistically significant differences between samples
        - Subscales consistent
        - Validates measure
  19. Conclusions
      - Unique challenges in data science
        - Subgroups
        - Small samples
        - Failure of statistical and machine learning algorithms
      - Topological data analysis as a robust solution
  20. Reference Papers
      - Carlsson, G. (2009). Topology and data. Bulletin of the American Mathematical Society, 46(2), 255-308.
      - Edelsbrunner, H., & Harer, J. (2008). Persistent homology - a survey. Contemporary Mathematics, 453, 257-282.
      - Farrelly, C. M. (2017). Extensions of Morse-Smale Regression with Application to Actuarial Science. arXiv preprint arXiv:1708.05712. Accepted as new model by Casualty Actuarial Society, December 2017.
      - Farrelly, C. M. (2018). Topology and Geometry for Small Sample Sizes: An Application to Research on the Profoundly Gifted.
      - Farrelly, C. M., Schwartz, S. J., Amodeo, A. L., Feaster, D. J., Steinley, D. L., Meca, A., & Picariello, S. (2017). The analysis of bridging constructs with hierarchical clustering methods: An application to identity. Journal of Research in Personality, 70, 93-106.
      - Gerber, S., Rübel, O., Bremer, P. T., Pascucci, V., & Whitaker, R. T. (2013). Morse-Smale regression. Journal of Computational and Graphical Statistics, 22(1), 193-214.