Data Aggregation in R


Examples of using different functions for data aggregation and comparison of their performance:

> #data simulation
> set.seed(2211984)
> DF <- data.frame(x=sample(1:3, 10000000, replace=TRUE), y=runif(10000000,1,100),
+                  z=rnorm(10000000,10,2))
>
> #tapply - basic function
> tapply(DF$y, DF$x, mean)
        1         2       3
50.50248 50.52115 50.50778
> tapply(DF$z, DF$x, sd)
        1         2       3
1.999511 1.998869 1.998408
>
> #aggregate - basic function
> aggregate(DF$y, list(group=DF$x), FUN=mean)
  group         x
1     1 50.50248
2     2 50.52115
3     3 50.50778
> aggregate(DF$z, list(group=DF$x), FUN=sd)
  group         x
1     1 1.999511
2     2 1.998869
3     3 1.998408
>
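The two aggregate calls above return a column named x that actually holds the summary value, which is easy to misread. A minimal sketch of an alternative, assuming the formula interface of aggregate is acceptable here: it keeps the original column names, and the two summaries can then be merged into one table. The data frame `small` and the result names `avg_y`, `sd_z` and `res` are illustrative and not part of the original session.

```r
# Small simulated data set with the same column layout as DF
# (assumption: 1,000 rows instead of 10,000,000 so it runs instantly)
set.seed(1)
small <- data.frame(x = sample(1:3, 1000, replace = TRUE),
                    y = runif(1000, 1, 100),
                    z = rnorm(1000, 10, 2))

# Formula interface: the result columns keep the names x, y and z
avg_y <- aggregate(y ~ x, data = small, FUN = mean)
sd_z  <- aggregate(z ~ x, data = small, FUN = sd)

# Combine both summaries into one table keyed by the grouping column
res <- merge(avg_y, sd_z, by = "x")
names(res) <- c("x", "avg_y", "sd_z")
res
```

The merged result has one row per group, in the same layout as the ddply and data.table outputs in this session.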
> #ddply - plyr library
> library(plyr)
> ddply(DF, .(x), summarise, avg_y=mean(y), sd_z=sd(z))
  x     avg_y      sd_z
1 1 50.50248 1.999511
2 2 50.52115 1.998869
3 3 50.50778 1.998408
>
> #sql query - sqldf library
> library(sqldf)
> sqldf("select avg(y) as avg_y, stdev(z) as sd_z from DF group by x")
     avg_y      sd_z
1 50.50248 1.999511
2 50.52115 1.998869
3 50.50778 1.998408
>
> #data.table objects - data.table library
> library(data.table)
data.table 1.8.2 For help type: help("data.table")
> DT <- data.table(DF)
> DT
          x         y         z
       1: 1 12.133576 11.947320
       2: 2 44.485393  6.290101
       3: 2 71.566670 10.280873
       4: 2 88.883879 11.121398
       5: 1  3.952848  8.688182
      ---
 9999996: 3 17.317273 10.085156
 9999997: 3 64.856928  8.250676
 9999998: 2  6.489453  8.812301
 9999999: 3 94.344257  8.203418
10000000: 3  3.267286  6.688272
> identical(DT$x,DF$x)
[1] TRUE
> identical(DT$y,DF$y)
[1] TRUE
>
> DT[, sum(y), by=x]
    x          V1
1: 1 168310537
2: 2 168424040
3: 3 168370154
> DT[,list(avg_y=mean(y), sd_z=sd(z)), by=x]
    x      avg_y     sd_z
1: 1 50.50248 1.999511
2: 2 50.52115 1.998869
3: 3 50.50778 1.998408
>
> #function performance - system time
> system.time(tapply(DF$y, DF$x, mean)) + system.time(tapply(DF$z, DF$x, sd))
    user system elapsed
   13.18      0.52   13.68
> system.time(aggregate(DF$y, list(group=DF$x), FUN=mean)) +
+   system.time(aggregate(DF$z, list(group=DF$x), FUN=sd))
    user system elapsed
   29.65      1.03   30.76
> system.time(ddply(DF, .(x), summarise, avg_y=mean(y), sd_z=sd(z)))
    user system elapsed
    2.23      0.86    3.09
> system.time(sqldf("select sum(y) as avg_y, stdev(z) as sd_z from DF group by x"))
    user system elapsed
   33.83      2.85   37.11
> system.time(DT[,list(avg_y=mean(y), sd_z=sd(z)), by=x])
    user system elapsed
      0.7      0.0     0.7
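Each timing above comes from a single system.time call, so it can vary from run to run. A rough sketch of a more stable measurement, assuming base R only and a much smaller data set: repeat each call several times and take the median elapsed time. The helper `time_median`, the data frame `small` and the row count are hypothetical and not part of the original session.

```r
# Smaller data set so the sketch runs quickly
# (assumption: 100,000 rows instead of 10,000,000)
set.seed(1)
n <- 100000
small <- data.frame(x = sample(1:3, n, replace = TRUE),
                    y = runif(n, 1, 100))

# Hypothetical helper: median elapsed time of f() over several repetitions
time_median <- function(f, reps = 5) {
  elapsed <- vapply(seq_len(reps),
                    function(i) system.time(f())[["elapsed"]],
                    numeric(1))
  median(elapsed)
}

# Time two of the base R approaches from this session
timings <- c(
  tapply    = time_median(function() tapply(small$y, small$x, mean)),
  aggregate = time_median(function() aggregate(y ~ x, data = small, FUN = mean))
)
timings
```

The absolute numbers depend on the machine and R version, so only the relative ordering is worth comparing.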

> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] tcltk     stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.8.2      sqldf_0.4-6.4         RSQLite.extfuns_0.0.1 RSQLite_0.11.2
[5] chron_2.3-42          gsubfn_0.6-5          proto_0.3-9.2         DBI_0.2-5
[9] plyr_1.7.1

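Conclusion: data.table rocks. Measured by elapsed time on this 10-million-row data set, it is more than 4 times faster than ddply, about 19 times faster than tapply, 44 times faster than aggregate and 53 times faster than sqldf. Note that the timed sqldf query computes sum(y) rather than avg(y), so it is not exactly the query shown earlier.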
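The factors in the conclusion follow from dividing each elapsed time by the data.table elapsed time. A short sketch of that arithmetic, using only the elapsed values reported in the session above (the vector names `elapsed` and `speedup` are illustrative):

```r
# Elapsed times in seconds, copied from the system.time output above
elapsed <- c(data.table = 0.7, ddply = 3.09, tapply = 13.68,
             aggregate = 30.76, sqldf = 37.11)

# How many times longer each approach takes than data.table
speedup <- elapsed / elapsed[["data.table"]]
round(speedup, 1)
```

This gives roughly 4.4 for ddply, 19.5 for tapply, 43.9 for aggregate and 53.0 for sqldf, matching the rounded factors quoted in the conclusion.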
