SlideShare ist ein Scribd-Unternehmen logo
1 von 31
Downloaden Sie, um offline zu lesen
Speeding up
(big) data manipulation with
data.table package
Vasily Tolkachev
Zurich University of Applied Sciences (ZHAW)
Institute for Data Analysis and Process Design (IDP)
vasily.tolkachev@gmail.com
www.idp.zhaw.ch
21.01.2016
2
About me
 Sep. 2014 – Present: Research Assistant in Statistics at
 Sep. 2011 – Aug. 2014: MSc Statistics & Research Assistant at
 Sep. 2008 – Aug. 2011: BSc Mathematical Economics at
 https://ch.linkedin.com/in/vasily-tolkachev-20130b35
Motivating example from stackoverflow
3
dt = data.table(nuc, key="gene_id")
dt[,list(A = min(start),
B = max(end),
C = mean(pctAT),
D = mean(pctGC),
E = sum(length)),
by = key(dt)]
# gene_id A B C D E
# 1: NM_032291 67000042 67108547 0.5582567 0.4417433 283
# 2: ZZZ 67000042 67108547 0.5582567 0.4417433 283
4
data.table solution
 takes ~ 3 seconds to run !
 easy to program
 easy to understand
5
Huge advantages of data.table
 easier & faster to write the code (no need to write data frame name
multiple times)
 easier to read & understand the code
 shorter code
 fast split-apply-combine operations on large data, e.g. 100GB in RAM
(up to 231 ≈ 2 billion rows in current R version, provided that you have
the RAM)
 fast add/modify/delete columns by reference by group without copies
 fast and smart file reading function ( fread )
 flexible syntax
 easier & faster than other advanced data manipulation packages like
dplyr, plyr, readr.
 backward-compatible with code using data.frame
 named one of the success factors by Kaggle competition winners
6
How to get a lot of RAM
 240 GB of RAM: https://aws.amazon.com/ec2/details/
 6 TB of RAM: http://www.dell.com/us/business/p/poweredge-r920/pd
7
Some limitations of data.table
 although merge.data.table is faster than merge.data.frame, it
requires the key variables to have the same names
 (In my experience) it may not be compatible with sp and other spatial
data packages, as converting sp object to data.table looses the polygons
sublist
 Used to be excellent when combined with dplyr’s pipeline (%>%)
operator for nested commands, but now a bit slower.
 file reading function (fread) currently does not support some
compressed data format (e.g. .gz, .bz2 )
 It’s still limited by your computer’s and R limits, but exploits them
maximally
8
General Syntax
DT[i, j, by]
rows or logical rule
to subset obs.
some function
of the data
To which groups
apply the function
Take DT,
subset rows using i,
then calculate j
grouped by by
9
Examples 1
 Let’s take a small dataset Boston from package MASS as a starting point.
 Accidentally typing the name of a large data table doesn’t crush R.
 It’s still a data frame, but if you prefer to use it in the code with data.frames,
convertion to data.frame is necessary:
 Converting a data frame to data.table:
 To get a data.table, use list()  The usual data.frame style is done
with with = FALSE
10
Examples 2
 Subset rows from 11 to 20:
 In this case the result is a vector:
comma not needed when subsetting rows
quotes not needed for variable names
11
Examples 3
 Find all rows where tax variable is equal to 216:
 Find the range of crim (criminality) variable:
 Display values of rad (radius) variable:
 Add a new variable with :=
 i.e. we defined a new factor variable(rad.f) in the data table from the integer
variable radius (rad), which describes accessibility to radial highways.
Tip: with with = FALSE,
you could also select
all columns between some two:
12
Examples 4
 Compute mean of house prices for every level of rad.f:
 Recall that j argument is a function, so in this case it’s a function calling a
variable medv:
 Below it’s a function which is equal to 5:
Here’s the standard way
to select 5th variable
13
Examples 5
 Select several variables Or equivalently:
(result is a data.table)
 Compute several functions:
 Compute these functions for groups (levels) of rad.f:
14
Examples 6
 Compute functions for every level of rad.f and return a data.table with column
names:
 Add many new variables with `:=`().
If a variable attains only a single value, copy it for each observation:
 Updating or deletion of old variables/columns is done the same way
15
Examples 7
 Compute a more complicated function for groups. It’s a weighted mean of house
prices, with dis (distances to Boston employment centers) as weights:
 Dynamic variable creation. Now let’s create a variable of weighted means
(mean_w), and then use it to create a variable for weighted standard deviation
(std_w).
16
Examples 8
 What if variable names are too long and you have a non-standard function where
they are used multiple times?
 Of course, it’s possible to change variable names, do the analysis and then return
to the original names, but if this isn’t an option, one needs to use a list for variable
names .SD, and the variables are specified in.SDcols:
give variable names here
use these instead of variable names
17
Examples 9
 Multiple expressions in j could be handled with { }:
18
Examples 10
 Hence, a separate data.table with dynamically created variables can be done by
 Changing a subset of observations. Let’s create another factor variable crim.f
with 3 levels standing for low, medium and severe crime rates per capita:
19
Examples 11. Chaining
DT[i, j, by][i, j, by]
 It’s a very powerful way of doing multiple operations in one command
 The command for crim.f on the previous slide can thus be done by
 Or in one go:
data[…, …][…, …][…, …][…, …]
20
Examples 12
 Now that we have 2 factor variables, crim.f and rad.f, we can also apply
functions in j on two groups:
 It appears that there is one remote district with severe crime rates.
 .N function counts the number observations in a group:
21
Examples 13
 Another useful function is .SD which contains values of all variables except the
one used for grouping:
 Use setnames() and setcolorder() functions to change column names or
reorder them:
22
Examples 14. Key on one variable
 The reason why data.table works so fast is the use of keys. All observations are
internally indexed by the way they are stored in RAM and sorted using Radix sort.
 Any column can be set as a key (list & complex number classes not supported),
and duplicate entries are allowed.
 setkey(DT, colA)introduces an index for column A and sorts the data.table by
it increasingly. In contrast to data.frame style, this is done without extra copies and
with a very efficient memory use.
 After that it’s possible to use
binary search by providing index values directly data[“1”], which is 100-
1000… times faster than
vector scan data[rad.f == “1”]
 Setting keys is necessary for joins and significantly speeds up things for big data.
However, it’s not necessary for by = aggregation.
23
Examples 15. Keys on multiple variables
 Any number of columns can be set as key using setkey(). This way rows can
be selected on 2 keys.
 setkey(DT, colA, colB)introduces indexes for both columns and sorts the
data.table by column A, then by column B within each group of column A:
 Then binary search on two keys is
24
Vector Scan vs. Binary Search
 The reason vector scan is so inefficient is that is searches first for entries “7” in
rad.f variable row-by-row, then does the same for crim.f, then takes element-
wise intersection of logical vectors.
 Binary search, on the other hand, searches already on sorted variables, and
hence cuts the number of observations by half at each step.
 Since rows of each column of data.tables have corresponding locations in RAM
memory, the operations are performed in a very cache efficient manner.
 In addition, since the matching row indices are obtained directly without having to
create huge logical vectors (equal to the number of rows in a data.table), it is quite
memory efficient as well.
Vector Scan Binary search
data[rad.f ==“7” & crim.f == “low”]
setkey(data, rad.f, crim.f)
data[ .(“7”, “low")]
𝑂(𝑛) 𝑂(log⁡( 𝑛))
25
What to avoid
 Avoid read.csv function which takes hours to read in files > 1 Gb. Use fread
instead. It’s a lot smarter and more efficient, e.g. it can guess the separator.
 Avoid rbind which is again notoriously slow. Use rbindlist instead.
 Avoid using data.frame’s vector scan inside data.table:
data[ data$rad.f == "7" & data$crim.f == "low", ]
(even though data.table’s vector scan is faster than data.frame’s vector scan, this
slows it down.)
 In general, avoid using $ inside the data.table, whether it’s for subsetting, or
updating some subset of the observations:
data[ data$rad.f == "7", ] = data[ data$rad.f == "7", ] + 1
 For speed use := by group, don't transform() by group or cbind() afterwards
 data.table used to work with dplyr well, but now it is usually slow:
data %>% filter(rad == 1)
26
Speed comparison
 Create artificial data which is randomly ordered. No pre-sort. No indexes. No key.
 5 simple queries are run: large groups and small groups on different columns of
different types. Similar to what a data analyst might do in practice; i.e., various ad
hoc aggregations as the data is explored and investigated.
 Each package is tested separately in its own fresh session.
 Each query is repeated once more, immediately. This is to isolate cache effects
and confirm the first timing. The first and second times are plotted. The total
runtime of all 5 tests is also displayed.
 The results are compared and checked allowing for numeric tolerance and column
name differences.
 It is the toughest test the developers could think of but happens to be realistic and
very common.
27
Speed comparison. Data
 The artificial dataset looks like:
28
Speed comparison
29
Speed comparison
30
References
 Matt Dowle’s data.table git account with newest vignettes.
https://github.com/Rdatatable/data.table/wiki
 Matt Dowle’s presentations & conference videos.
https://github.com/Rdatatable/data.table/wiki/Presentations
 Official introduction to data.table:
https://github.com/Rdatatable/data.table/wiki/Getting-started
 Why to set keys in data.table:
http://stackoverflow.com/questions/20039335/what-is-the-purpose-of-setting-a-
key-in-data-table
 Performance comparisons to other packages:
https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping
 Comprehensive data.table summary sheet:
https://s3.amazonaws.com/assets.datacamp.com/img/blog/data+table+cheat+she
et.pdf
 An unabridged comparison of dplyr and data.table:
http://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-
something-well-the-other-cant-or-does-poorly/27840349#27840349
Thanks a lot for
your attention and interest!

Weitere ähnliche Inhalte

Was ist angesagt?

Grouping & Summarizing Data in R
Grouping & Summarizing Data in RGrouping & Summarizing Data in R
Grouping & Summarizing Data in RJeffrey Breen
 
pandas - Python Data Analysis
pandas - Python Data Analysispandas - Python Data Analysis
pandas - Python Data AnalysisAndrew Henshaw
 
4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply Function4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply FunctionSakthi Dasans
 
R programming & Machine Learning
R programming & Machine LearningR programming & Machine Learning
R programming & Machine LearningAmanBhalla14
 
R data-import, data-export
R data-import, data-exportR data-import, data-export
R data-import, data-exportFAO
 
Best corporate-r-programming-training-in-mumbai
Best corporate-r-programming-training-in-mumbaiBest corporate-r-programming-training-in-mumbai
Best corporate-r-programming-training-in-mumbaiUnmesh Baile
 
R basics
R basicsR basics
R basicsFAO
 
Python for R Users
Python for R UsersPython for R Users
Python for R UsersAjay Ohri
 
R Programming: Learn To Manipulate Strings In R
R Programming: Learn To Manipulate Strings In RR Programming: Learn To Manipulate Strings In R
R Programming: Learn To Manipulate Strings In RRsquared Academy
 
Presentation R basic teaching module
Presentation R basic teaching modulePresentation R basic teaching module
Presentation R basic teaching moduleSander Timmer
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query LanguageJulian Hyde
 

Was ist angesagt? (20)

Grouping & Summarizing Data in R
Grouping & Summarizing Data in RGrouping & Summarizing Data in R
Grouping & Summarizing Data in R
 
R Language Introduction
R Language IntroductionR Language Introduction
R Language Introduction
 
pandas - Python Data Analysis
pandas - Python Data Analysispandas - Python Data Analysis
pandas - Python Data Analysis
 
Data Management in Python
Data Management in PythonData Management in Python
Data Management in Python
 
Programming in R
Programming in RProgramming in R
Programming in R
 
4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply Function4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply Function
 
R programming & Machine Learning
R programming & Machine LearningR programming & Machine Learning
R programming & Machine Learning
 
R data-import, data-export
R data-import, data-exportR data-import, data-export
R data-import, data-export
 
Best corporate-r-programming-training-in-mumbai
Best corporate-r-programming-training-in-mumbaiBest corporate-r-programming-training-in-mumbai
Best corporate-r-programming-training-in-mumbai
 
R basics
R basicsR basics
R basics
 
R programming by ganesh kavhar
R programming by ganesh kavharR programming by ganesh kavhar
R programming by ganesh kavhar
 
Datamining with R
Datamining with RDatamining with R
Datamining with R
 
User biglm
User biglmUser biglm
User biglm
 
Python for R Users
Python for R UsersPython for R Users
Python for R Users
 
R Programming: Learn To Manipulate Strings In R
R Programming: Learn To Manipulate Strings In RR Programming: Learn To Manipulate Strings In R
R Programming: Learn To Manipulate Strings In R
 
Presentation R basic teaching module
Presentation R basic teaching modulePresentation R basic teaching module
Presentation R basic teaching module
 
Machine Learning in R
Machine Learning in RMachine Learning in R
Machine Learning in R
 
R programming language
R programming languageR programming language
R programming language
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query Language
 
R language
R languageR language
R language
 

Ähnlich wie January 2016 Meetup: Speeding up (big) data manipulation with data.table package

Python-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxPython-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxParveenShaik21
 
PPT on Data Science Using Python
PPT on Data Science Using PythonPPT on Data Science Using Python
PPT on Data Science Using PythonNishantKumar1179
 
Abstract Data Types (a) Explain briefly what is meant by the ter.pdf
Abstract Data Types (a) Explain briefly what is meant by the ter.pdfAbstract Data Types (a) Explain briefly what is meant by the ter.pdf
Abstract Data Types (a) Explain briefly what is meant by the ter.pdfkarymadelaneyrenne19
 
Star Transformation, 12c Adaptive Bitmap Pruning and In-Memory option
Star Transformation, 12c Adaptive Bitmap Pruning and In-Memory optionStar Transformation, 12c Adaptive Bitmap Pruning and In-Memory option
Star Transformation, 12c Adaptive Bitmap Pruning and In-Memory optionFranck Pachot
 
fINAL Lesson_5_Data_Manipulation_using_R_v1.pptx
fINAL Lesson_5_Data_Manipulation_using_R_v1.pptxfINAL Lesson_5_Data_Manipulation_using_R_v1.pptx
fINAL Lesson_5_Data_Manipulation_using_R_v1.pptxdataKarthik
 
CS 542 -- Query Execution
CS 542 -- Query ExecutionCS 542 -- Query Execution
CS 542 -- Query ExecutionJ Singh
 
Bsc cs ii dfs u-1 introduction to data structure
Bsc cs ii dfs u-1 introduction to data structureBsc cs ii dfs u-1 introduction to data structure
Bsc cs ii dfs u-1 introduction to data structureRai University
 
Bca ii dfs u-1 introduction to data structure
Bca ii dfs u-1 introduction to data structureBca ii dfs u-1 introduction to data structure
Bca ii dfs u-1 introduction to data structureRai University
 
Data structures and algorithms short note (version 14).pd
Data structures and algorithms short note (version 14).pdData structures and algorithms short note (version 14).pd
Data structures and algorithms short note (version 14).pdNimmi Weeraddana
 
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Yao Yao
 
The best ETL questions in a nut shell
The best ETL questions in a nut shellThe best ETL questions in a nut shell
The best ETL questions in a nut shellSrinimf-Slides
 
Fusing Transformations of Strict Scala Collections with Views
Fusing Transformations of Strict Scala Collections with ViewsFusing Transformations of Strict Scala Collections with Views
Fusing Transformations of Strict Scala Collections with ViewsPhilip Schwarz
 
22827361 ab initio-fa-qs
22827361 ab initio-fa-qs22827361 ab initio-fa-qs
22827361 ab initio-fa-qsCapgemini
 
Mca ii dfs u-1 introduction to data structure
Mca ii dfs u-1 introduction to data structureMca ii dfs u-1 introduction to data structure
Mca ii dfs u-1 introduction to data structureRai University
 

Ähnlich wie January 2016 Meetup: Speeding up (big) data manipulation with data.table package (20)

Lecture 9.pptx
Lecture 9.pptxLecture 9.pptx
Lecture 9.pptx
 
Python-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxPython-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptx
 
interenship.pptx
interenship.pptxinterenship.pptx
interenship.pptx
 
PPT on Data Science Using Python
PPT on Data Science Using PythonPPT on Data Science Using Python
PPT on Data Science Using Python
 
Abstract Data Types (a) Explain briefly what is meant by the ter.pdf
Abstract Data Types (a) Explain briefly what is meant by the ter.pdfAbstract Data Types (a) Explain briefly what is meant by the ter.pdf
Abstract Data Types (a) Explain briefly what is meant by the ter.pdf
 
Star Transformation, 12c Adaptive Bitmap Pruning and In-Memory option
Star Transformation, 12c Adaptive Bitmap Pruning and In-Memory optionStar Transformation, 12c Adaptive Bitmap Pruning and In-Memory option
Star Transformation, 12c Adaptive Bitmap Pruning and In-Memory option
 
fINAL Lesson_5_Data_Manipulation_using_R_v1.pptx
fINAL Lesson_5_Data_Manipulation_using_R_v1.pptxfINAL Lesson_5_Data_Manipulation_using_R_v1.pptx
fINAL Lesson_5_Data_Manipulation_using_R_v1.pptx
 
CS 542 -- Query Execution
CS 542 -- Query ExecutionCS 542 -- Query Execution
CS 542 -- Query Execution
 
Python - Lecture 12
Python - Lecture 12Python - Lecture 12
Python - Lecture 12
 
Bsc cs ii dfs u-1 introduction to data structure
Bsc cs ii dfs u-1 introduction to data structureBsc cs ii dfs u-1 introduction to data structure
Bsc cs ii dfs u-1 introduction to data structure
 
Bca ii dfs u-1 introduction to data structure
Bca ii dfs u-1 introduction to data structureBca ii dfs u-1 introduction to data structure
Bca ii dfs u-1 introduction to data structure
 
Data structures and algorithms short note (version 14).pd
Data structures and algorithms short note (version 14).pdData structures and algorithms short note (version 14).pd
Data structures and algorithms short note (version 14).pd
 
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
 
An Introduction to MATLAB with Worked Examples
An Introduction to MATLAB with Worked ExamplesAn Introduction to MATLAB with Worked Examples
An Introduction to MATLAB with Worked Examples
 
The best ETL questions in a nut shell
The best ETL questions in a nut shellThe best ETL questions in a nut shell
The best ETL questions in a nut shell
 
Isam2_v1_2
Isam2_v1_2Isam2_v1_2
Isam2_v1_2
 
Fusing Transformations of Strict Scala Collections with Views
Fusing Transformations of Strict Scala Collections with ViewsFusing Transformations of Strict Scala Collections with Views
Fusing Transformations of Strict Scala Collections with Views
 
Oct.22nd.Presentation.Final
Oct.22nd.Presentation.FinalOct.22nd.Presentation.Final
Oct.22nd.Presentation.Final
 
22827361 ab initio-fa-qs
22827361 ab initio-fa-qs22827361 ab initio-fa-qs
22827361 ab initio-fa-qs
 
Mca ii dfs u-1 introduction to data structure
Mca ii dfs u-1 introduction to data structureMca ii dfs u-1 introduction to data structure
Mca ii dfs u-1 introduction to data structure
 

Mehr von Zurich_R_User_Group

Anomaly detection - database integrated
Anomaly detection - database integratedAnomaly detection - database integrated
Anomaly detection - database integratedZurich_R_User_Group
 
R at Sanitas - Workflow, Problems and Solutions
R at Sanitas - Workflow, Problems and SolutionsR at Sanitas - Workflow, Problems and Solutions
R at Sanitas - Workflow, Problems and SolutionsZurich_R_User_Group
 
Visualizing the frequency of transit delays using QGIS and the Leaflet javasc...
Visualizing the frequency of transit delays using QGIS and the Leaflet javasc...Visualizing the frequency of transit delays using QGIS and the Leaflet javasc...
Visualizing the frequency of transit delays using QGIS and the Leaflet javasc...Zurich_R_User_Group
 
Introduction to Renjin, the alternative engine for R
Introduction to Renjin, the alternative engine for R Introduction to Renjin, the alternative engine for R
Introduction to Renjin, the alternative engine for R Zurich_R_User_Group
 
How to use R in different professions: R for Car Insurance Product (Speaker: ...
How to use R in different professions: R for Car Insurance Product (Speaker: ...How to use R in different professions: R for Car Insurance Product (Speaker: ...
How to use R in different professions: R for Car Insurance Product (Speaker: ...Zurich_R_User_Group
 
How to use R in different professions: R In Finance (Speaker: Gabriel Foix, M...
How to use R in different professions: R In Finance (Speaker: Gabriel Foix, M...How to use R in different professions: R In Finance (Speaker: Gabriel Foix, M...
How to use R in different professions: R In Finance (Speaker: Gabriel Foix, M...Zurich_R_User_Group
 
Where South America is Swinging to the Right: An R-Driven Data Journalism Pr...
Where South America is Swinging to the Right:  An R-Driven Data Journalism Pr...Where South America is Swinging to the Right:  An R-Driven Data Journalism Pr...
Where South America is Swinging to the Right: An R-Driven Data Journalism Pr...Zurich_R_User_Group
 
Visualization Challenge: Mapping Health During Travel
Visualization Challenge: Mapping Health During TravelVisualization Challenge: Mapping Health During Travel
Visualization Challenge: Mapping Health During TravelZurich_R_User_Group
 
Zurich R User group: Desc tools
Zurich R User group: Desc tools Zurich R User group: Desc tools
Zurich R User group: Desc tools Zurich_R_User_Group
 
December 2015 Meetup - Shiny: Make Your R Code Interactive - Craig Wang
December 2015 Meetup - Shiny: Make Your R Code Interactive - Craig WangDecember 2015 Meetup - Shiny: Make Your R Code Interactive - Craig Wang
December 2015 Meetup - Shiny: Make Your R Code Interactive - Craig WangZurich_R_User_Group
 

Mehr von Zurich_R_User_Group (11)

Anomaly detection - database integrated
Anomaly detection - database integratedAnomaly detection - database integrated
Anomaly detection - database integrated
 
R at Sanitas - Workflow, Problems and Solutions
R at Sanitas - Workflow, Problems and SolutionsR at Sanitas - Workflow, Problems and Solutions
R at Sanitas - Workflow, Problems and Solutions
 
Modeling Bus Bunching
Modeling Bus BunchingModeling Bus Bunching
Modeling Bus Bunching
 
Visualizing the frequency of transit delays using QGIS and the Leaflet javasc...
Visualizing the frequency of transit delays using QGIS and the Leaflet javasc...Visualizing the frequency of transit delays using QGIS and the Leaflet javasc...
Visualizing the frequency of transit delays using QGIS and the Leaflet javasc...
 
Introduction to Renjin, the alternative engine for R
Introduction to Renjin, the alternative engine for R Introduction to Renjin, the alternative engine for R
Introduction to Renjin, the alternative engine for R
 
How to use R in different professions: R for Car Insurance Product (Speaker: ...
How to use R in different professions: R for Car Insurance Product (Speaker: ...How to use R in different professions: R for Car Insurance Product (Speaker: ...
How to use R in different professions: R for Car Insurance Product (Speaker: ...
 
How to use R in different professions: R In Finance (Speaker: Gabriel Foix, M...
How to use R in different professions: R In Finance (Speaker: Gabriel Foix, M...How to use R in different professions: R In Finance (Speaker: Gabriel Foix, M...
How to use R in different professions: R In Finance (Speaker: Gabriel Foix, M...
 
Where South America is Swinging to the Right: An R-Driven Data Journalism Pr...
Where South America is Swinging to the Right:  An R-Driven Data Journalism Pr...Where South America is Swinging to the Right:  An R-Driven Data Journalism Pr...
Where South America is Swinging to the Right: An R-Driven Data Journalism Pr...
 
Visualization Challenge: Mapping Health During Travel
Visualization Challenge: Mapping Health During TravelVisualization Challenge: Mapping Health During Travel
Visualization Challenge: Mapping Health During Travel
 
Zurich R User group: Desc tools
Zurich R User group: Desc tools Zurich R User group: Desc tools
Zurich R User group: Desc tools
 
December 2015 Meetup - Shiny: Make Your R Code Interactive - Craig Wang
December 2015 Meetup - Shiny: Make Your R Code Interactive - Craig WangDecember 2015 Meetup - Shiny: Make Your R Code Interactive - Craig Wang
December 2015 Meetup - Shiny: Make Your R Code Interactive - Craig Wang
 

Kürzlich hochgeladen

Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...Health
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 

Kürzlich hochgeladen (20)

Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 

January 2016 Meetup: Speeding up (big) data manipulation with data.table package

  • 1. Speeding up (big) data manipulation with data.table package Vasily Tolkachev Zurich University of Applied Sciences (ZHAW) Institute for Data Analysis and Process Design (IDP) vasily.tolkachev@gmail.com www.idp.zhaw.ch 21.01.2016
  • 2. 2 About me  Sep. 2014 – Present: Research Assistant in Statistics at  Sep. 2011 – Aug. 2014: MSc Statistics & Research Assistant at  Sep. 2008 – Aug. 2011: BSc Mathematical Economics at  https://ch.linkedin.com/in/vasily-tolkachev-20130b35
  • 3. Motivating example from stackoverflow 3
  • 4. dt = data.table(nuc, key="gene_id") dt[,list(A = min(start), B = max(end), C = mean(pctAT), D = mean(pctGC), E = sum(length)), by = key(dt)] # gene_id A B C D E # 1: NM_032291 67000042 67108547 0.5582567 0.4417433 283 # 2: ZZZ 67000042 67108547 0.5582567 0.4417433 283 4 data.table solution  takes ~ 3 seconds to run !  easy to program  easy to understand
  • 5. 5 Huge advantages of data.table  easier & faster to write the code (no need to write data frame name multiple times)  easier to read & understand the code  shorter code  fast split-apply-combine operations on large data, e.g. 100GB in RAM (up to 231 ≈ 2 billion rows in current R version, provided that you have the RAM)  fast add/modify/delete columns by reference by group without copies  fast and smart file reading function ( fread )  flexible syntax  easier & faster than other advanced data manipulation packages like dplyr, plyr, readr.  backward-compatible with code using data.frame  named one of the success factors by Kaggle competition winners
  • 6. 6 How to get a lot of RAM  240 GB of RAM: https://aws.amazon.com/ec2/details/  6 TB of RAM: http://www.dell.com/us/business/p/poweredge-r920/pd
  • 7. 7 Some limitations of data.table  although merge.data.table is faster than merge.data.frame, it requires the key variables to have the same names  (In my experience) it may not be compatible with sp and other spatial data packages, as converting sp object to data.table looses the polygons sublist  Used to be excellent when combined with dplyr’s pipeline (%>%) operator for nested commands, but now a bit slower.  file reading function (fread) currently does not support some compressed data format (e.g. .gz, .bz2 )  It’s still limited by your computer’s and R limits, but exploits them maximally
  • 8. 8 General Syntax DT[i, j, by] rows or logical rule to subset obs. some function of the data To which groups apply the function Take DT, subset rows using i, then calculate j grouped by by
  • 9. 9 Examples 1  Let’s take a small dataset Boston from package MASS as a starting point.  Accidentally typing the name of a large data table doesn’t crush R.  It’s still a data frame, but if you prefer to use it in the code with data.frames, convertion to data.frame is necessary:  Converting a data frame to data.table:
  • 10.  To get a data.table, use list()  The usual data.frame style is done with with = FALSE 10 Examples 2  Subset rows from 11 to 20:  In this case the result is a vector: comma not needed when subsetting rows quotes not needed for variable names
  • 11. 11 Examples 3  Find all rows where tax variable is equal to 216:  Find the range of crim (criminality) variable:  Display values of rad (radius) variable:  Add a new variable with :=  i.e. we defined a new factor variable(rad.f) in the data table from the integer variable radius (rad), which describes accessibility to radial highways. Tip: with with = FALSE, you could also select all columns between some two:
  • 12. 12 Examples 4  Compute mean of house prices for every level of rad.f:  Recall that j argument is a function, so in this case it’s a function calling a variable medv:  Below it’s a function which is equal to 5: Here’s the standard way to select 5th variable
  • 13. 13 Examples 5  Select several variables Or equivalently: (result is a data.table)  Compute several functions:  Compute these functions for groups (levels) of rad.f:
  • 14. 14 Examples 6  Compute functions for every level of rad.f and return a data.table with column names:  Add many new variables with `:=`(). If a variable attains only a single value, copy it for each observation:  Updating or deletion of old variables/columns is done the same way
  • 15. 15 Examples 7  Compute a more complicated function for groups. It’s a weighted mean of house prices, with dis (distances to Boston employment centers) as weights:  Dynamic variable creation. Now let’s create a variable of weighted means (mean_w), and then use it to create a variable for weighted standard deviation (std_w).
  • 16. 16 Examples 8  What if variable names are too long and you have a non-standard function where they are used multiple times?  Of course, it’s possible to change variable names, do the analysis and then return to the original names, but if this isn’t an option, one needs to use a list for variable names .SD, and the variables are specified in.SDcols: give variable names here use these instead of variable names
  • 17. 17 Examples 9  Multiple expressions in j could be handled with { }:
  • 18. 18 Examples 10  Hence, a separate data.table with dynamically created variables can be done by  Changing a subset of observations. Let’s create another factor variable crim.f with 3 levels standing for low, medium and severe crime rates per capita:
  • 19. 19 Examples 11. Chaining DT[i, j, by][i, j, by]  It’s a very powerful way of doing multiple operations in one command  The command for crim.f on the previous slide can thus be done by  Or in one go: data[…, …][…, …][…, …][…, …]
  • 20. 20 Examples 12  Now that we have 2 factor variables, crim.f and rad.f, we can also apply functions in j on two groups:  It appears that there is one remote district with severe crime rates.  .N function counts the number observations in a group:
  • 21. 21 Examples 13  Another useful function is .SD which contains values of all variables except the one used for grouping:  Use setnames() and setcolorder() functions to change column names or reorder them:
  • 22. 22 Examples 14. Key on one variable  The reason why data.table works so fast is the use of keys. All observations are internally indexed by the way they are stored in RAM and sorted using Radix sort.  Any column can be set as a key (list & complex number classes not supported), and duplicate entries are allowed.  setkey(DT, colA)introduces an index for column A and sorts the data.table by it increasingly. In contrast to data.frame style, this is done without extra copies and with a very efficient memory use.  After that it’s possible to use binary search by providing index values directly data[“1”], which is 100- 1000… times faster than vector scan data[rad.f == “1”]  Setting keys is necessary for joins and significantly speeds up things for big data. However, it’s not necessary for by = aggregation.
  • 23. 23 Examples 15. Keys on multiple variables  Any number of columns can be set as key using setkey(). This way rows can be selected on 2 keys.  setkey(DT, colA, colB)introduces indexes for both columns and sorts the data.table by column A, then by column B within each group of column A:  Then binary search on two keys is
  • 24. 24 Vector Scan vs. Binary Search  The reason vector scan is so inefficient is that is searches first for entries “7” in rad.f variable row-by-row, then does the same for crim.f, then takes element- wise intersection of logical vectors.  Binary search, on the other hand, searches already on sorted variables, and hence cuts the number of observations by half at each step.  Since rows of each column of data.tables have corresponding locations in RAM memory, the operations are performed in a very cache efficient manner.  In addition, since the matching row indices are obtained directly without having to create huge logical vectors (equal to the number of rows in a data.table), it is quite memory efficient as well. Vector Scan Binary search data[rad.f ==“7” & crim.f == “low”] setkey(data, rad.f, crim.f) data[ .(“7”, “low")] 𝑂(𝑛) 𝑂(log⁡( 𝑛))
  • 25. 25 What to avoid  Avoid read.csv function which takes hours to read in files > 1 Gb. Use fread instead. It’s a lot smarter and more efficient, e.g. it can guess the separator.  Avoid rbind which is again notoriously slow. Use rbindlist instead.  Avoid using data.frame’s vector scan inside data.table: data[ data$rad.f == "7" & data$crim.f == "low", ] (even though data.table’s vector scan is faster than data.frame’s vector scan, this slows it down.)  In general, avoid using $ inside the data.table, whether it’s for subsetting, or updating some subset of the observations: data[ data$rad.f == "7", ] = data[ data$rad.f == "7", ] + 1  For speed use := by group, don't transform() by group or cbind() afterwards  data.table used to work with dplyr well, but now it is usually slow: data %>% filter(rad == 1)
  • 26. 26 Speed comparison  Create artificial data which is randomly ordered. No pre-sort. No indexes. No key.  5 simple queries are run: large groups and small groups on different columns of different types. Similar to what a data analyst might do in practice; i.e., various ad hoc aggregations as the data is explored and investigated.  Each package is tested separately in its own fresh session.  Each query is repeated once more, immediately. This is to isolate cache effects and confirm the first timing. The first and second times are plotted. The total runtime of all 5 tests is also displayed.  The results are compared and checked allowing for numeric tolerance and column name differences.  It is the toughest test the developers could think of but happens to be realistic and very common.
  • 27. 27 Speed comparison. Data  The artificial dataset looks like:
  • 30. 30 References  Matt Dowle’s data.table git account with newest vignettes. https://github.com/Rdatatable/data.table/wiki  Matt Dowle’s presentations & conference videos. https://github.com/Rdatatable/data.table/wiki/Presentations  Official introduction to data.table: https://github.com/Rdatatable/data.table/wiki/Getting-started  Why to set keys in data.table: http://stackoverflow.com/questions/20039335/what-is-the-purpose-of-setting-a- key-in-data-table  Performance comparisons to other packages: https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping  Comprehensive data.table summary sheet: https://s3.amazonaws.com/assets.datacamp.com/img/blog/data+table+cheat+she et.pdf  An unabridged comparison of dplyr and data.table: http://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do- something-well-the-other-cant-or-does-poorly/27840349#27840349
  • 31. Thanks a lot for your attention and interest!