Zurich R User group: Desc tools


Screening data is still a laborious task in R. Calculating summary statistics for all variables, listing the occurrence of missing data, and producing some kind of graphics is a three-click process in SPSS, but base R contains no higher-level functions for describing larger datasets quickly in a more or less automated way. The R package DescTools addresses three problem areas. First, it provides functions that facilitate the construction of univariate and bivariate descriptive tables for several variable types. Second, it improves the connectivity between R and MS Office through an easy interface to Word and Excel, so that generating reports directly in Word and importing data directly from Excel become easy tasks. Finally, a considerable number of basic functions (operators, string and date functions, statistics, tests, several plot types) that are not present in base R have been collected from other packages and internet sources. The goal is to consolidate them in ONE package instead of dozens and to give them a common, consistent interface with respect to function and argument naming, NA handling, recycling rules, etc.
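To make the screening problem concrete, here is a minimal sketch of what a hand-rolled screening pass looks like in base R: one row per variable with counts of valid values, missing values, unique values, and zeros. The helper names `screen_column` and `screen_data` are hypothetical and are not part of DescTools; the built-in `airquality` dataset is used only because it ships with R and contains missing values.

```r
# Hand-rolled screening of a data frame in base R (no DescTools required).
# `screen_column` and `screen_data` are hypothetical helper names.
screen_column <- function(x) {
  is_num <- is.numeric(x)
  data.frame(
    class  = class(x)[1],
    length = length(x),
    n      = sum(!is.na(x)),                       # valid (non-missing) values
    NAs    = sum(is.na(x)),                        # missing values
    unique = length(unique(x[!is.na(x)])),         # distinct non-missing values
    zeros  = if (is_num) sum(x == 0, na.rm = TRUE) else NA_integer_,
    mean   = if (is_num) mean(x, na.rm = TRUE) else NA_real_,
    median = if (is_num) median(x, na.rm = TRUE) else NA_real_,
    stringsAsFactors = FALSE
  )
}

screen_data <- function(d) {
  # one summary row per column, stacked into a single data frame
  res <- do.call(rbind, lapply(d, screen_column))
  cbind(variable = names(d), res, row.names = NULL)
}

screening <- screen_data(airquality)   # airquality ships with base R
print(screening)
```

Even this small sketch needs type checks and NA handling for every statistic, and it produces no plots. That is the kind of repetitive code the abstract describes as laborious.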


  1. R-Package DescTools: Why and where to go? Andri Signorell, Helsana Health Sciences, Zurich R-Group, 21.01.2016
  2. Randomized clinical trials (RCT) do not represent the reality in health care
     • The population included in an RCT does not correspond to the population finally receiving the treatment
     • Only 1/3 of the ultimately treated people would fulfill the inclusion criteria at all
     • The elderly are underrepresented in clinical trials
     Andri Signorell, 21.01.2016
  3. Medication of one patient… Is this evidence-based medicine? Real example from our database: in 2013, Mrs. G. H. in G. received drugs with 101 different agents (ATC codes) in a total of 533 prescriptions.
  4. Unnecessary cardiac catheterizations in Switzerland. ni. A cardiac catheter can be used to detect and remove dangerous blockages in a patient's coronary arteries. But because the examination is expensive, invasive, and not free of complications, it should only be performed when there is a well-founded suspicion of narrowings; that is what the guidelines stipulate. Is this followed in Switzerland? Researchers investigated this question in a study. Their results, recently published in «Plos One», suggest that three out of ten cardiac catheterizations are unnecessary. (NZZ, 5.3.2015) Illustration: Felix Schaad
  5. Orders of magnitude
     • Analytical data warehouse (Teradata), updated daily and kept in a bitemporal history
     • 492 tables and 7494 attributes
     • 1'468'893 insured in 2014
     • complete treatment information since ~ 2005
     • 201'875'131 claims with a total of 949'392'044 detailed positions
     • Analysed with [logo image]
  6. Where's the pain point? Cross-Industry Standard Process for Data Mining (CRISP-DM). Shearer C., The CRISP-DM model: the new blueprint for data mining, J Data Warehousing (2000); 5:13-22. 80% of the analysts' resources are lost to data understanding and preparation … and no one is doing anything about it!
  7. Users, even expert statisticians, do not always screen the data. B. D. Ripley, Robust statistics (2004)
  8. Get the Right Tool for the Job!
     • Datasets with 150 variables and 500'000 rows are not unusual
     • R might not always be optimal for this order of magnitude (performance, RAM)
     • The programming paradigm lets the screening code grow and makes it confusing!
  9. DescTools focus
     • provide elaborate descriptive routines
       – numeric, factor, logical, table, numeric ~ factor, ...
       – data.frame, formula interface
     • integrate descriptive plots
     • easy output to an MS Word document

         > Desc(d.pizza$temperature)     # describe single variable
         > wrd <- GetNewWrd()
         > Desc(d.pizza, wrd=wrd)        # describe data.frame and send it directly to Word
         > Desc(. ~ driver, d.pizza)
         > Desc(driver ~ ., d.pizza)
  10. Describe numeric

          > summary(d.pizza$temperature)         # base R
             Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
            19.30   42.22   50.00   47.94   55.30   64.80      40

          > describe(d.pizza$temperature)        # library(Hmisc)
          d.pizza$temperature
                n missing  unique    Info    Mean     .05     .10     .25     .50     .75     .90     .95
             1170      39     375       1   47.94   26.70   33.29   42.23   50.00   55.30   58.80   60.50

          lowest : 19.30 19.40 20.00 20.20 20.35, highest: 63.80 64.10 64.60 64.70 64.80

          > Desc(d.pizza$temperature)            # library(DescTools)
          --------------------------------------------------
          d.pizza$temperature (numeric)

            length       n     NAs  unique      0s    mean  meanSE
             1'210   1'170      40     375       0  47.937   0.291

               .05     .10     .25  median     .75     .90     .95
            26.700  33.290  42.225      50  55.300  58.800  60.500

               rng      sd   vcoef     mad     IQR    skew    kurt
            45.500   9.938   0.207   9.192  13.075  -0.842   0.051

          lowest : 19.3, 19.4, 20, 20.2 (2), 20.35
          highest: 63.8, 64.1, 64.6, 64.7, 64.8

      Screening questions:
      • What happens at the edges?
      • Are there missing values?
      • Are all elements unique?
      • Has 0 been misused as NA?
  11. • Base R: plot(d.pizza$temperature)
      • DescTools: plot(Desc(d.pizza$temperature))

      Visualization excellence …
      … is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space.
      … requires telling the truth about the data.
      Edward Tufte, The Visual Display of Quantitative Information and Envisioning Information, Graphics Press, PO Box 430, Cheshire, CT 06410.
  12. Describe table

          > tab <- table(d.pizza$driver, d.pizza$area)
          > summary(tab)
          Number of cases in table: 1194
          Number of factors: 2
          Test for independence of all factors:
                  Chisq = 1009.5, df = 12, p-value = 1.697e-208

          > describe(tab)
          tab

           3  Variables      7  Observations
          ----------------------------------------------------
          Brent
                n missing  unique    Info    Mean
                7       0       7       1   67.57

                      6  19  29  42  72 128 177
          Frequency   1   1   1   1   1   1   1
          %          14  14  14  14  14  14  14
          ----------------------------------------------------
          Camden
                n missing  unique    Info    Mean
                7       0       7       1   48.71

                      1   4  19  41  47  87 142
          Frequency   1   1   1   1   1   1   1
          %          14  14  14  14  14  14  14
          ----------------------------------------------------
          ...

      • base R: reduced to the bare minimum…
      • Hmisc: Oops! Misinterpreted…
  13. Describe table with DescTools

          > tab <- as.table(apply(HairEyeColor, c(1,2), sum))[
          +          , c("Brown","Hazel","Green","Blue")]
          > (z <- Desc(tab, row.vars=c(3, 1), rfrq="011", plotit=FALSE, main="Hair ~ Eye"))

          Hair ~ Eye

          Summary:
          n: 592, rows: 4, columns: 4

          Pearson's Chi-squared test:
            X-squared = 138.29, df = 9, p-value < 2.2e-16
          Likelihood Ratio:
            X-squared = 146.44, df = 9, p-value < 2.2e-16
          Mantel-Haenszel Chi-squared:
            X-squared = 109.64, df = 1, p-value < 2.2e-16

          Phi-Coefficient        0.483
          Contingency Coeff.     0.435
          Cramer's V             0.279

                        Eye   Brown   Hazel   Green    Blue     Sum
                Hair
          freq  Black            68      15       5      20     108
                Brown           119      54      29      84     286
                Red              26      14      14      17      71
                Blond             7      10      16      94     127
                Sum             220      93      64     215     592
          p.row Black           63%   13.9%    4.6%   18.5%       .
                Brown         41.6%   18.9%   10.1%   29.4%       .
                Red           36.6%   19.7%   19.7%   23.9%       .
                Blond          5.5%    7.9%   12.6%     74%       .
                Sum           37.2%   15.7%   10.8%   36.3%       .
          p.col Black         30.9%   16.1%    7.8%    9.3%   18.2%
                Brown         54.1%   58.1%   45.3%   39.1%   48.3%
                Red           11.8%   15.1%   21.9%    7.9%     12%
                Blond          3.2%   10.8%     25%   43.7%   21.5%
                Sum               .       .       .       .       .

          > # do the plot by hand, while setting the colours
          > cols1 <- SetAlpha(c("sienna4", "burlywood", "chartreuse3", "slategray1"), 0.6)
          > cols2 <- SetAlpha(c("moccasin", "salmon1", "wheat3", "gray32"), 0.8)
          > plot(z, col1=cols1, col2=cols2, horiz=FALSE)
  14. Describe factors in Word: Desc(d.pizza$driver, wrd=GetNewWrd())
  15. Describe factor ~ numeric

          > Desc(diabetes ~ age, data=d.pima, digits=2, breaks=5, margin=TRUE, conf.level=0.90)

          Summary:
          n pairs: 768, valid: 768 (100%), missings: 0 (0%), groups: 2

                      neg     pos   Total
          mean      31.19   37.07   33.24
          median    27.00   36.00   29.00
          sd        11.67   10.97   11.76
          IQR       14.00   16.00   17.00
          n           500     268     768
          np        65.1%   34.9%    100%
          NAs           0       0       0
          0s            0       0       0

          Kruskal-Wallis rank sum test:
            Kruskal-Wallis chi-squared = 73.253, df = 1, p-value < 2.2e-16

          Proportions of diabetes in the quantiles of age:
                    Q1      Q2      Q3      Q4      Q5
          neg    86.7%   76.1%     57%   54.3%   46.8%
          pos    13.3%   23.9%     43%   45.7%   53.2%

      Further combinations: factor ~ factor, numeric ~ factor, numeric ~ numeric
  16. + ~ 440 Functions
      • Statistical functions and confidence intervals: Skew, Kurt, CramerV, SomersDelta, CohenKappa, HuberM, MeanCI, BinomCI, …
      • Additional tests not found in base R: HotellingsT2Test, JarqueBeraTest, BreslowDayTest, DurbinWatsonTest, LeveneTest, ScheffeTest, …
      • Date functions: Today, AddMonths, Day, Month, Year, Weekday, IsWeekend, Zodiac, …
      • String functions: StrAlign, StrTrim, StrDist, StrCountW, StrVal, …
      • Operators and other: %()%, Untable, CollapseTable, Dummy, Large, Small, …
  17. Pain Point «Speed»

          > x <- runif(1e8)
          > system.time(e1071::kurtosis(x))
             user  system elapsed
             5.67    0.55    6.21
          > system.time(DescTools::Kurt(x))
             user  system elapsed
             0.47    0.00    0.47

      http://www.noamross.net/blog/2013/4/25/faster-talk.html -> Get a Bigger Computer
  19. Pain Point «Import». R Data Import/Export: "This is a guide to importing and exporting data to and from R. This manual is for R, version 3.1.2 (2014-10-31). Copyright © 2000–2014 R Core Team"
  20. DescTools::XLGetRange()
      • Import directly from Excel
  21. Pain Point «User Interface»

      Can one be a good data analyst without being a half-good programmer? The short answer to that is, 'No.' The long answer to that is, 'No.'
      -- Frank Harrell, 1999 S-PLUS User Conference, New Orleans (October 1999)

      Could you spontaneously produce the R code needed to present today's date as "Donnerstag, 21. Januar 2016"?
      • Solution in base R:
          > format(Sys.Date(), "%A, %d. %B %Y")
          [1] "Donnerstag, 21. Januar 2016"
      • Solution in DescTools:
          > Format(Today(), fmt="dddd, dd. mmmm yyyy")
          [1] "Donnerstag, 21. Januar 2016"
  22. The reasonable man adapts himself to the world; the unreasonable one persists in trying to adapt the world to himself. Therefore, all progress depends on the unreasonable man. (George Bernard Shaw) Be unreasonable and contact me with feedback or feature ideas! andri@signorell.net
  23. Thanks to
      • All the R-Core members and R contributors
      • Frank E Harrell Jr, with contributions from Charles Dupont and many others (2014). Hmisc: Harrell Miscellaneous. R package version 3.14-6. http://CRAN.R-project.org/package=Hmisc
      • Revelle, W. (2015). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston, Illinois, USA. R package version 1.5.1. http://CRAN.R-project.org/package=psych
      • Lemon, J. (2006). Plotrix: a package in the red light district of R. R-News, 6(4): 8-12.
      • Hans Peter Wolf and Uni Bielefeld (2014). aplpack: Another Plot PACKage: stem.leaf, bagplot, faces, spin3R, plotsummary, plothulls, and some slider functions. R package version 1.3.0. http://CRAN.R-project.org/package=aplpack
      • Martin Maechler et al. (2015). sfsmisc: Utilities from Seminar fuer Statistik ETH Zurich. R package version 1.0-27. http://CRAN.R-project.org/package=sfsmisc
      • Christian W. Hoffmann <http://www.echoffmann.ch> (2014). cwhmisc: Miscellaneous Functions for math, plotting, printing, statistics, strings, and tools. R package version 5.0. http://CRAN.R-project.org/package=cwhmisc
      • And many more! See the DescTools authors list!
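
The timing comparison on the «Speed» slide can be tried at a smaller scale. The following is only a sketch: `excess_kurtosis` is a hypothetical base R helper that computes the moment-based excess kurtosis (m4 / m2^2 - 3), which is not necessarily the estimator that `e1071::kurtosis` or `DescTools::Kurt` use by default. The DescTools timing runs only if the package happens to be installed, and timings will differ from those on the slide.

```r
# Excess kurtosis from central moments in vectorised base R.
# `excess_kurtosis` is a hypothetical helper, not a DescTools function.
excess_kurtosis <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  d  <- x - mean(x)      # centred values
  m2 <- mean(d^2)        # second central moment
  m4 <- mean(d^4)        # fourth central moment
  m4 / m2^2 - 3
}

set.seed(1)
x <- runif(1e6)          # smaller than the 1e8 used on the slide

k <- excess_kurtosis(x)  # the theoretical value for a uniform distribution is -1.2
print(k)
print(system.time(excess_kurtosis(x)))

# Optional comparison, only if DescTools is installed
if (requireNamespace("DescTools", quietly = TRUE)) {
  print(system.time(DescTools::Kurt(x)))
}
```

Keeping the computation vectorised, with no explicit loop over the elements, is what makes even the plain R version reasonably fast on a million values.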
