Screening data is still a laborious task in R. Calculating summary statistics for all variables, listing the occurrence of missing data and producing some kind of graphics is a three-click process in SPSS, but base R does not contain higher-level functions for quickly describing bigger datasets in a more or less automated way. The R package DescTools addresses three problem areas. First, it provides functions that facilitate the construction of univariate and bivariate descriptive tables for several variable types. Second, it enhances the connectivity between R and MS Office by providing an easy interface to Word and Excel, so that generating reports directly in Word and importing data directly from Excel become easy tasks. Finally, a considerable number of basic functions (operators, string and date functions, statistics, tests, several plot types) not present in base R have been collected from other packages and internet sources, with the goal of consolidating them in ONE package instead of dozens and of giving them a common and consistent interface as far as function and argument naming, NA handling, recycling rules etc. are concerned.
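As a rough illustration of the gap described above, a per-variable overview can be put together by hand in base R. This is only a sketch under stated assumptions, not how DescTools works internally: the helper name `screen_df` is hypothetical, and the built-in `airquality` data set stands in for a real table.

```r
# Hypothetical helper (not part of DescTools): one row per variable with
# its class, number of non-missing values, NAs, distinct values and zeros.
screen_df <- function(d) {
  data.frame(
    class  = vapply(d, function(x) class(x)[1], character(1)),
    n      = vapply(d, function(x) sum(!is.na(x)), integer(1)),
    NAs    = vapply(d, function(x) sum(is.na(x)), integer(1)),
    unique = vapply(d, function(x) length(unique(x[!is.na(x)])), integer(1)),
    zeros  = vapply(d, function(x) sum(x == 0, na.rm = TRUE), integer(1)),
    row.names = names(d),
    stringsAsFactors = FALSE
  )
}

overview <- screen_df(airquality)  # airquality ships with base R
print(overview)
```

Even this small helper has to decide how to treat NAs and zeros for every column, which is the kind of repetitive code the package aims to replace.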
2. Randomized clinical trials (RCT)
do not represent the reality in health care
• Population included in RCT does not correspond to
the population finally receiving the treatment
Andri Signorell, 21.01.2016
• Only 1/3 of the ultimately treated people would fulfill the inclusion criteria at all
• The elderly are underrepresented in clinical trials
3. Medication of one patient…
Is this evidence-based medicine?
Real example from our database:
In 2013, Mrs. G. H. in G. received drugs with 101 different agents (ATC codes), in total 533 prescriptions.
4. Unnecessary cardiac catheterizations in Switzerland
ni. A cardiac catheter can be used to detect and clear dangerous blockages in a patient's coronary arteries. But because the examination is expensive, invasive and not free of complications, it should only be carried out when there is a well-founded suspicion of narrowing; this is what the guidelines prescribe. Is this followed in Switzerland? Researchers investigated this question in a study. Their results, recently published in «Plos One», suggest that three out of ten cardiac catheterizations are unnecessary. (NZZ, 5.3.2015)
Drawing: Felix Schaad
5. Orders of magnitude
• Analytical data warehouse (Teradata), updated daily, with a bitemporal history
• 492 tables and 7494 attributes
• 1'468'893 insured in 2014
• complete treatment information since ~ 2005
• 201'875'131 claims with a total of 949'392'044 detailed positions
• Analysed with
6. Where's the pain point?
Cross-Industry Standard Process for Data Mining (CRISP-DM)
Shearer C., The CRISP-DM model: the new blueprint for data mining, J Data Warehousing (2000); 5:13-22.
80% of the analyst's resources are lost to data understanding and preparation … and no one is doing anything about it!
7. Users, even expert statisticians, do not always
screen the data.
B. D. Ripley, Robust statistics (2004)
8. Get the Right Tool for the Job!
• Datasets with 150 variables and 500'000 rows are not unusual
• R might not always be optimal for this order of magnitude (performance, RAM)
• The programming paradigm lets the screening code grow and makes it confusing!
Andri Signorell, 21.01.2016
9. DescTools focus
• provide elaborated descriptive routines
– numeric, factor, logical, table, numeric ~ factor, ...
– data.frame, formula interface
• integrate descriptive plots
• easy output to MS-Word document
> Desc(d.pizza$temperature) # describe single variable
> wrd <- GetNewWrd()
> Desc(d.pizza, wrd=wrd) # describe data.frame and send
# it directly to Word
> Desc(. ~ driver, d.pizza) # formula interface: all variables by driver
> Desc(driver ~ ., d.pizza) # formula interface: driver by all other variables
10. Describe numeric
> summary(d.pizza$temperature) # base R
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   19.30   42.22   50.00   47.94   55.30   64.80      40
> describe(d.pizza$temperature) # library(Hmisc)
d.pizza$temperature
       n missing  unique    Info    Mean     .05     .10     .25     .50     .75     .90     .95
    1170      39     375       1   47.94   26.70   33.29   42.23   50.00   55.30   58.80   60.50

lowest : 19.30 19.40 20.00 20.20 20.35, highest: 63.80 64.10 64.60 64.70 64.80
> Desc(d.pizza$temperature) # library(DescTools)
--------------------------------------------------
d.pizza$temperature (numeric)

  length       n     NAs  unique      0s    mean  meanSE
   1'210   1'170      40     375       0  47.937   0.291

     .05     .10     .25  median     .75     .90     .95
  26.700  33.290  42.225      50  55.300  58.800  60.500

     rng      sd   vcoef     mad     IQR    skew    kurt
  45.500   9.938   0.207   9.192  13.075  -0.842   0.051

lowest : 19.3, 19.4, 20, 20.2 (2), 20.35
highest: 63.8, 64.1, 64.6, 64.7, 64.8
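Several figures in the Desc() output are simple functions of the others, which makes a quick plausibility check possible. The sketch below recomputes four of them from the printed values using the usual textbook definitions; the variable names are made up for this illustration, and how DescTools computes them internally is not shown in the source.

```r
# Values copied from the Desc() output above
n       <- 1170
mean_x  <- 47.937
sd_x    <- 9.938
q25     <- 42.225
q75     <- 55.300
lowest  <- 19.3
highest <- 64.8

mean_se <- sd_x / sqrt(n)     # standard error of the mean
vcoef   <- sd_x / mean_x      # coefficient of variation
rng     <- highest - lowest   # range
iqr     <- q75 - q25          # interquartile range

round(c(meanSE = mean_se, vcoef = vcoef, rng = rng, IQR = iqr), 3)
```

Rounded to three decimals, the results agree with the meanSE, vcoef, rng and IQR columns printed by Desc().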
Screening questions:
• What happens at the edges?
• Are there missing values?
• Are all elements unique?
• Has 0 been misused as NA?
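The four screening questions can also be answered by hand for a single numeric vector. The following base R sketch shows one way to do so; `screen_numeric` is a hypothetical helper, and the example vector is made up for illustration.

```r
# Hypothetical helper answering the four screening questions for one
# numeric vector; k controls how many extreme values are shown.
screen_numeric <- function(x, k = 5) {
  v <- sort(unique(x[!is.na(x)]))
  list(
    lowest     = head(v, k),                        # what happens at the edges?
    highest    = tail(v, k),
    n_missing  = sum(is.na(x)),                     # are there missing values?
    all_unique = anyDuplicated(x[!is.na(x)]) == 0,  # are all elements unique?
    n_zeros    = sum(x == 0, na.rm = TRUE)          # has 0 been misused as NA?
  )
}

# Made-up data: two zeros that might stand in for missing values
x <- c(0, 12.5, 13.1, NA, 0, 14.8, 55.0, 13.1, NA)
res <- screen_numeric(x, k = 3)
str(res)
```

A suspiciously high zero count next to a low NA count is the pattern that hints at zeros having been entered in place of missing values.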
11. • Base R
plot(d.pizza$temperature)
• DescTools
plot(Desc(d.pizza$temperature))
Visualization excellence …
… is that which gives to the viewer the greatest number of ideas in the shortest
time with the least ink in the smallest space.
… requires telling the truth about the data.
Edward Tufte, The Visual Display of Quantitative Information and Envisioning Information, Graphics Press, PO Box 430, Cheshire, CT 06410.
12. Describe table
> tab <- table(d.pizza$driver, d.pizza$area)
> summary(tab)
Number of cases in table: 1194
Number of factors: 2
Test for independence of all factors:
Chisq = 1009.5, df = 12, p-value = 1.697e-208
> describe(tab)
tab
3 Variables 7 Observations
----------------------------------------------------
Brent
n missing unique Info Mean
7 0 7 1 67.57
6 19 29 42 72 128 177
Frequency 1 1 1 1 1 1 1
% 14 14 14 14 14 14 14
----------------------------------------------------
Camden
n missing unique Info Mean
7 0 7 1 48.71
1 4 19 41 47 87 142
Frequency 1 1 1 1 1 1 1
% 14 14 14 14 14 14 14
----------------------------------------------------
...
base R: reduced to the bare minimum…
Hmisc: Oops! Misinterpreted…
13.
> tab <- as.table(apply(HairEyeColor, c(1,2), sum))[
+          , c("Brown","Hazel","Green","Blue")]
> (z <- Desc(tab, row.vars=c(3, 1), rfrq="011",
+            plotit=FALSE, main="Hair ~ Eye"))
Hair ~ Eye
Summary:
n: 592, rows: 4, columns: 4
Pearson's Chi-squared test:
X-squared = 138.29, df = 9, p-value < 2.2e-16
Likelihood Ratio:
X-squared = 146.44, df = 9, p-value < 2.2e-16
Mantel-Haenszel Chi-squared:
X-squared = 109.64, df = 1, p-value < 2.2e-16
Phi-Coefficient 0.483
Contingency Coeff. 0.435
Cramer's V 0.279
              Eye
              Brown  Hazel  Green   Blue    Sum
Hair
freq   Black     68     15      5     20    108
       Brown    119     54     29     84    286
       Red       26     14     14     17     71
       Blond      7     10     16     94    127
       Sum      220     93     64    215    592
p.row  Black    63%  13.9%   4.6%  18.5%      .
       Brown  41.6%  18.9%  10.1%  29.4%      .
       Red    36.6%  19.7%  19.7%  23.9%      .
       Blond   5.5%   7.9%  12.6%    74%      .
       Sum    37.2%  15.7%  10.8%  36.3%      .
p.col  Black  30.9%  16.1%   7.8%   9.3%  18.2%
       Brown  54.1%  58.1%  45.3%  39.1%  48.3%
       Red    11.8%  15.1%  21.9%   7.9%    12%
       Blond   3.2%  10.8%    25%  43.7%  21.5%
       Sum        .      .      .      .      .
> # do the plot by hand, while setting the colours
> cols1 <- SetAlpha(c("sienna4", "burlywood",
+                     "chartreuse3", "slategray1"), 0.6)
> cols2 <- SetAlpha(c("moccasin", "salmon1", "wheat3",
+                     "gray32"), 0.8)
> plot(z, col1=cols1, col2=cols2, horiz=FALSE)
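The test statistics and association measures in the output above can be reproduced with a few lines of base R. This is a sketch using the standard textbook formulas, not the code DescTools runs internally; the variable names are chosen for this illustration only.

```r
# Same 4 x 4 Hair x Eye table as on the slide, built from HairEyeColor
tab <- apply(HairEyeColor, c(1, 2), sum)
tab <- tab[, c("Brown", "Hazel", "Green", "Blue")]

n        <- sum(tab)
expected <- outer(rowSums(tab), colSums(tab)) / n  # counts under independence

x2 <- sum((tab - expected)^2 / expected)   # Pearson chi-squared
g2 <- 2 * sum(tab * log(tab / expected))   # likelihood ratio (no zero cells here)

phi     <- sqrt(x2 / n)                          # phi coefficient
cont    <- sqrt(x2 / (x2 + n))                   # contingency coefficient
cramerv <- sqrt(x2 / (n * (min(dim(tab)) - 1)))  # Cramer's V

round(c(X2 = x2, G2 = g2, Phi = phi, ContCoef = cont, CramerV = cramerv), 3)
```

The rounded values agree with the Pearson, likelihood ratio, Phi, contingency coefficient and Cramer's V lines printed by Desc().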
14. Describe factors in Word
Desc(d.pizza$driver, wrd=GetNewWrd())
16. + ~ 440 Functions
• Statistical functions and Confidence Intervals
Skew, Kurt, CramerV, SomersDelta, CohenKappa, HuberM, MeanCI,
BinomCI, …
• Additional Tests not found in base R
HotellingsT2Test, JarqueBeraTest, BreslowDayTest, DurbinWatsonTest,
LeveneTest, ScheffeTest, …
• Date functions
Today, AddMonths, Day, Month, Year, Weekday, IsWeekend, Zodiac, …
• String functions
StrAlign, StrTrim, StrDist, StrCountW, StrVal, …
• Operators and other
%()%, Untable, CollapseTable, Dummy, Large, Small, …
17. Pain Point «Speed»
> x <- runif(1e8)
> system.time(e1071::kurtosis(x))
user system elapsed
5.67 0.55 6.21
> system.time(DescTools::Kurt(x))
user system elapsed
0.47 0.00 0.47
http://www.noamross.net/blog/2013/4/25/faster-talk.html
-> Get a Bigger Computer
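For reference, a plain vectorised version of the statistic being timed could look like the sketch below. It is neither the e1071 nor the DescTools implementation: `excess_kurtosis` is a hypothetical helper that uses the uncorrected moment estimator, and it ignores the bias-corrected variants that real packages may offer.

```r
# Hypothetical helper: moment-based excess kurtosis, m4 / m2^2 - 3,
# using the uncorrected (divide-by-n) central moments.
excess_kurtosis <- function(x, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]
  d  <- x - mean(x)
  m2 <- mean(d^2)
  m4 <- mean(d^4)
  m4 / m2^2 - 3
}

set.seed(1)
x <- runif(1e6)                       # smaller than the 1e8 used on the slide
system.time(k <- excess_kurtosis(x))  # timings depend on the machine
k                                     # theoretical value for a uniform: -1.2
```

Timings from this sketch are not comparable with the figures on the slide, since both the vector length and the hardware differ.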
21. Can one be a good data analyst without being a half-good programmer?
The short answer to that is, 'No.' The long answer to that is, 'No.'
-- Frank Harrell 1999 S-PLUS User Conference, New Orleans (October 1999)
Could you spontaneously produce the R code needed to present today's date?
“Donnerstag, 21. Januar 2016”
• Solution Base R*):
> format(Sys.Date(), "%A, %d. %B %Y")
[1] "Donnerstag, 21. Januar 2016"
• Solution DescTools:
> Format(Today(), fmt="dddd, dd. mmmm yyyy")
[1] "Donnerstag, 21. Januar 2016"
Pain Point «User Interface»
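One detail of the base R solution is worth spelling out: the day and month names produced by `%A` and `%B` depend on the session's time locale, so the German output shown above presumably came from a German locale. The sketch below uses the fixed date from the slide instead of `Sys.Date()`; the variable names are made up for illustration.

```r
d <- as.Date("2016-01-21")   # the date shown on the slide

# Purely numeric formats do not depend on the locale
short_fmt <- format(d, "%d.%m.%Y")

# %A (weekday name) and %B (month name) follow the LC_TIME locale:
# a German setting gives German names, an English or C setting English ones
long_fmt <- format(d, "%A, %d. %B %Y")

short_fmt
long_fmt
Sys.getlocale("LC_TIME")     # shows which locale produced long_fmt
```

If a specific language is required regardless of the machine, the locale has to be set explicitly, and whether a given locale name is available depends on the operating system.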
22. The reasonable man adapts himself to the world; the
unreasonable one persists in trying to adapt the world
to himself.
Therefore, all progress depends on the unreasonable
man.
George Bernard Shaw
Be unreasonable and contact me
with feedback or feature ideas!
andri@signorell.net
23. Thanks to
• All the R Core members and R contributors
• Frank E Harrell Jr, with contributions from Charles Dupont and many others. (2014). Hmisc:
Harrell Miscellaneous. R package version 3.14-6. http://CRAN.R-project.org/package=Hmisc
• Revelle, W. (2015) psych: Procedures for Personality and Psychological Research,
Northwestern University, Evanston, Illinois, USA, http://CRAN.R-project.org/package=psych
Version = 1.5.1.
• Lemon, J. (2006) Plotrix: a package in the red light district of R. R-News, 6(4): 8-12.
• Hans Peter Wolf and Uni Bielefeld (2014). aplpack: Another Plot PACKage: stem.leaf, bagplot,
faces, spin3R, plotsummary, plothulls, and some slider functions. R package version 1.3.0.
http://CRAN.R-project.org/package=aplpack
• Martin Maechler et al. (2015). sfsmisc: Utilities from Seminar fuer Statistik ETH Zurich. R
package version 1.0-27. http://CRAN.R-project.org/package=sfsmisc
• Christian W. Hoffmann <http://www.echoffmann.ch> (2014). cwhmisc: Miscellaneous
Functions for math, plotting, printing, statistics, strings, and tools. R package version 5.0.
http://CRAN.R-project.org/package=cwhmisc
• And many more! See DescTools’ authors list!