Scalable Strategies for Computing with Massive Data:
The Bigmemory Project

Michael J. Kane and John W. Emerson
http://www.bigmemory.org/

Michael J. Kane (http://www.stat.yale.edu/~mjk56/)
Associate Research Scientist, Yale University

John W. Emerson (http://www.stat.yale.edu/~jay/)
Associate Professor of Statistics, Yale University

Note: This is a modified version of a talk that was written by Jay.






Outline



  1     Motivation and Overview


  2     Strategies for computing with massive data


  3     Modeling with Massive Data


  4     Conclusion






A new era

  The analysis of very large data sets has recently become an active
  area of research in statistics and machine learning. Many new
  computational challenges arise when managing, exploring, and
  analyzing these data sets, challenges that effectively put the data
  beyond the reach of researchers who lack specialized software
  development skills or expensive hardware.


          “We have entered an era of massive scientific data collection,
          with a demand for answers to large-scale inference problems
          that lie beyond the scope of classical statistics.” – Efron (2005)
          “classical statistics” should include “mainstream computational
          statistics.” – Kane, Emerson, and Weston (in preparation, in
          reference to Efron’s quote)




Small data approaches don’t scale...



          Data analysts who want to explore massive data must
                   be aware of the various pitfalls;
                   adopt new approaches to avoid them.


          This talk will
                   illustrate common challenges for dealing with massive data;
                   provide general solutions for avoiding the pitfalls.






Example data sets

          Airline on-time data
                   2009 JSM Data Expo
                   About 120 million commercial US airline flights over 20 years
                   29 variables, integer-valued or categorical (recoded as integer)
                   About 12 gigabytes (GB)
                   http://stat-computing.org/dataexpo/2009/


          Netflix data
                   About 100 million ratings from 500,000 customers for 17,000
                   movies
                   About 2 GB stored as integers
                   No statisticians on the winning team; hard to find statisticians on
                   the leaderboard
                   Top teams: access to expensive hardware; professional computer
                   science and programming expertise
                   http://www.netflixprize.com/



R as the language for computing with data



  R is the lingua franca of statistics:
          The syntax is simple and well-suited for data exploration and
          analysis.
          It has excellent graphical capabilities.
          It is extensible, with over 2500 packages available on CRAN
          alone.
          It is open source and freely available for Windows/MacOS/Linux
          platforms.






The Bigmemory Project for managing and computing
with massive matrices



          Set of packages (bigmemory, bigtabulate, biganalytics,
          synchronicity, and bigalgebra) that extends the R
          programming environment
          Core data structures are written in C++
          Could be used to extend other high-level computing
          environments






The Bigmemory Project: http://www.bigmemory.org/

  The project's packages and their neighbors (the original slide is a
  dependency diagram; its contents are listed here):

          bigmemory: core big.matrix creation and manipulation
          biganalytics: statistical analyses with big.matrices
          bigalgebra: linear algebra for (big.)matrices
          bigtabulate: fast tabulation and summaries
          synchronicity: mutual exclusions
          biglm: regressions for data too big to fit in memory
          NMF: non-negative matrix factorization
          DNAcopy: DNA copy number data analysis
          irlba: truncated SVDs on (big.)matrices

  Low-level parallel support:
          foreach: concurrent-enabled loops, with parallel backends
          doMC (SMP unix), doNWS (NetworkSpaces), doSnow (snow),
          doSMP (SMP machines), and doRedis (redis)

  R base packages (not all are shown): methods (generic functions),
  stats (statistical functions), utils (utility functions)



Importing and managing massive data
  The Netflix data ∼ 2 GB: read.table() uses 3.7 GB total in the
  process of importing the data (taking 140 seconds):
  > net <- read.table("netflix.txt", header=FALSE,
  +                   sep="\t", colClasses = "integer")
  > object.size(net)
  1981443344 bytes
  With bigmemory, only the 2 GB is needed (no memory overhead,
  taking 130 seconds):
  > net <- read.big.matrix("netflix.txt", header=FALSE,
  +                        sep="\t", type = "integer")
  > net[1:3,]
       [,1] [,2] [,3] [,4] [,5]
  [1,]    1    1    3 2005    9
  [2,]    1    2    5 2005    5
  [3,]    1    3    4 2005   10


Importing and managing massive data

          read.table():
                   memory overhead of as much as 100% of the size of the data
                   data.frame and matrix objects not available in shared memory
                   resulting data.frame is limited in size by available RAM;
                   recommended maximum is 10%-20% of RAM


          read.big.matrix():
                   matrix-like data (not data frames)
                   no memory overhead
                   faster than read.table()
                   resulting big.matrix supports shared memory for efficient
                   parallel programming (sketched below)
                   resulting big.matrix supports file-backed objects for data
                   larger than RAM
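  To make the last two bullets concrete, a minimal sketch (the file
  names here are illustrative, not from the deck): a big.matrix is
  created once with a file backing, and any other R process can attach
  to the same data through its descriptor file, without copying.

  library(bigmemory)

  ## Create a file-backed big.matrix; the data live on disk and in the
  ## OS cache, not in R's heap.
  x <- big.matrix(nrow = 1e6, ncol = 3, type = "integer",
                  backingfile = "demo.bin", descriptorfile = "demo.desc")
  x[, 1] <- 1:1e6

  ## Any other R process (e.g. a parallel worker) attaches to the SAME
  ## data via the descriptor file -- no copy is made.
  y <- attach.big.matrix("demo.desc")
  y[1:3, 1]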




Importing and managing massive data
  With the full Airline data (∼ 12 GB), bigmemory’s file-backing allows
  you to work with the data even with only 4 GB of RAM, for example:
  > x <- read.big.matrix("airline.csv", header=TRUE,
  +                       backingfile="airline.bin",
  +                       descriptorfile="airline.desc",
  +                       type="integer")
  > x
  An object of class "big.matrix"
  Slot "address":
  <pointer: 0x3031fc0>
  > rm(x)
  > x <- attach.big.matrix("airline.desc")
  > dim(x)
  [1] 123534969        29




Exploring massive data


  > summary(x[, "DepDelay"])
         Min.      1st Qu.      Median        Mean
    -1410.000       -2.000       0.000       8.171
       3rd Qu.         Max.       NA's
         6.000    2601.000 2302136.000
  >
  > quantile(x[, "DepDelay"],
  +           probs=c(0.5, 0.9, 0.99), na.rm=TRUE)
  50% 90% 99%
    0 27 128






Example: caching in action

  In a fresh R session on this laptop, newly rebooted:
  > library(bigmemory)
  > library(biganalytics)
  > xdesc <- dget("airline.desc")
  > x <- attach.big.matrix(xdesc)
  > system.time( numplanes <- colmax(x, "TailNum",
  +                                  na.rm=TRUE) )
     user system elapsed
    0.770   0.550   6.144
  > system.time( numplanes <- colmax(x, "TailNum",
  +                                  na.rm=TRUE) )
     user system elapsed
    0.320   0.000   0.355




Split-apply-combine

          Many computational problems in statistics are solved by
          performing the same calculation repeatedly on independent sets
          of data. These problems can be solved by
                   partitioning the data (the split)
                   performing a single calculation on each partition (the apply)
                   returning the results in a specified format (the combine)
          This approach has received recent attention because it can be
          particularly efficient and lends itself naturally to parallel
          computing
          The name “split-apply-combine” was coined by Hadley Wickham,
          but the approach has been supported in a number of different
          environments for some time under different names (a minimal
          base-R example follows this list):
                   SAS: BY-group processing
                   Google: MapReduce
                   Apache: Hadoop
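  In base R the whole pattern fits in one line; a toy example (ours,
  not from the deck), with a numeric vector of delays and a grouping
  vector of days:

  delay <- c(5, 0, 12, -2, 7, 3)
  day   <- c("Mon", "Mon", "Tue", "Tue", "Wed", "Wed")

  ## split delay by day, apply mean to each piece, combine the results
  tapply(delay, day, mean)
  ##  Mon  Tue  Wed
  ##  2.5  5.0  5.0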




Aside: split()

  > x <- matrix(c(rnorm(5), sample(c(1, 2), 5,
  +     replace = T)), 5, 2)
  > x
              [,1] [,2]
  [1,] -0.89691455    2
  [2,] 0.18484918     1
  [3,] 1.58784533     2
  [4,] -1.13037567    1
  [5,] -0.08025176    1
  > split(x[, 1], x[, 2])
  $`1`
  [1] 0.18484918 -1.13037567 -0.08025176

  $`2`
  [1] -0.8969145  1.5878453



Exploring massive data
  >    GetDepQuantiles <- function(rows, data) {
  +      return(quantile(data[rows, "DepDelay"],
  +               probs=c(0.5, 0.9, 0.99), na.rm=TRUE))
  +    }
  >
  >    groups <- split(1:nrow(x), x[,"DayOfWeek"])
  >
  >    qs <- sapply(groups, GetDepQuantiles, data=x)
  >
  > colnames(qs) <- c("Mon", "Tue", "Wed", "Thu",
  +                    "Fri", "Sat", "Sun")
  > qs
      Mon Tue Wed Thu Fri Sat Sun
  50%   0   0   0   0   0   0   0
  90%  25  23  25  30  33  23  27
  99% 127 119 125 136 141 116 130


Parallelizing the apply step




          The apply step runs on disjoint subsets of a data set: it is
          “embarrassingly parallel”
          We should therefore be able to run the apply step in parallel,
          but embarrassingly parallel doesn’t mean easy to parallelize:
          each worker still needs efficient access to its share of the
          data (one pattern is sketched below)
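  A minimal sketch of the shared-memory pattern used later in this
  talk, assuming a big.matrix x and the GetDepQuantiles() function
  defined earlier; the key point is that workers attach to the shared
  data via a small descriptor object instead of receiving copies:

  library(bigmemory)
  library(foreach)
  library(doMC)
  registerDoMC(cores = 2)

  xdesc  <- describe(x)                   # tiny object, cheap to send
  groups <- split(1:nrow(x), x[, "DayOfWeek"])

  qs <- foreach(g = groups, .combine = rbind) %dopar% {
    worker.x <- attach.big.matrix(xdesc)  # attach, don't copy
    GetDepQuantiles(g, data = worker.x)
  }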






foreach for easy-to-parallelize, embarrassingly
parallel problems

  >    library(foreach)
  >
  >    library(doMC)
  >    registerDoMC(2)
  >
  >    ans <- foreach(i = 1:10, .combine = c) %dopar%
  +    {
  +      i^2
  +    }
  >
  >    ans
   [1]   1   4   9  16  25  36  49  64  81 100




Concurrent programming with foreach (1995 only)



  > library(foreach)
  > library(doSNOW)
  > cl <- makeSOCKcluster(4)
  > registerDoSNOW(cl)
  >
  > x <- read.csv("1995.csv")
  > dim(x)
  [1] 5327435       29






Concurrent programming with foreach (1995 only)
  4 cores: 37 seconds; 3 cores: 23 seconds; 2 cores: 15 seconds;
  1 core: 8 seconds. Memory overhead of 4 cores: > 2.5 GB!
  >    groups <- split(1:nrow(x), x[,"DayOfWeek"])
  >
  >    qs <- foreach(g=groups, .combine=rbind) %dopar% {
  +      GetDepQuantiles(g, data=x)
  +    }
  >
  > daysOfWeek <- c("Mon", "Tues", "Wed", "Thu",
  +                 "Fri", "Sat", "Sun")
  > rownames(qs) <- daysOfWeek
  > t(qs)
      Mon Tues Wed Thu Fri Sat Sun
  50%   0    0   0   1   1   0   1
  90%  21   21  26  27  29  23  23
  99% 102  107 116 117 115 104 102


Memory usage: data.frame vs. big.matrix
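
  The figure on this slide does not survive text extraction. A minimal
  sketch (ours, reusing the Netflix file from earlier) of how such a
  comparison can be run; big.matrix data live outside R's heap, so
  gc() and object.size() tell the story:

  library(bigmemory)

  gc(reset = TRUE)
  df <- read.table("netflix.txt", header = FALSE,
                   sep = "\t", colClasses = "integer")
  gc()             # peak R heap usage during the read.table() import

  rm(df); gc(reset = TRUE)
  bm <- read.big.matrix("netflix.txt", header = FALSE,
                        sep = "\t", type = "integer")
  gc()             # R heap barely grows: data live outside the heap
  object.size(bm)  # reports only the small handle object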






Execution time: data.frame vs big.matrix
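
  Again the figure is lost in extraction; the corresponding timing
  comparison can be sketched with system.time(), using the figures
  quoted earlier in the talk as a reference point:

  system.time(
    df <- read.table("netflix.txt", header = FALSE,
                     sep = "\t", colClasses = "integer")
  )   # about 140 seconds on the authors' hardware

  system.time(
    bm <- read.big.matrix("netflix.txt", header = FALSE,
                          sep = "\t", type = "integer")
  )   # about 130 seconds, with no memory overhead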






Plane ages: Is split-apply-combine really faster?
  10 planes only (there are > 13,000 total planes).
         With 1 core: 74 seconds. With 2 cores: 38 seconds.
  All planes, 4 cores: ∼ 2 hours.
  >    library(foreach)
  >    library(doMC)
  >    registerDoMC(cores=2)
  >    tailNums <- unique(x[,"TailNum"])
  >    planeStart <- foreach(i=1:length(tailNums),
  +      .combine=c) %dopar% {
  +      x <- attach.big.matrix(xdesc)
  +      yearInds <- which(x[,"TailNum"] == i)
  +      y <- x[yearInds,c("Year", "Month")]
  +      minYear <- min(y[,"Year"], na.rm=TRUE)
  +      these <- which(y[,"Year"]==minYear)
  +      minMonth <- min(y[these,"Month"], na.rm=TRUE)
  +      return(12*minYear + minMonth)
  +    }


A better solution: split-apply-combine!
  >    birthmonth <- function(y) {
  +      minYear <- min(y[,'Year'], na.rm=TRUE)
  +      these <- which(y[,'Year']==minYear)
  +      minMonth <- min(y[these,'Month'], na.rm=TRUE)
  +      return(12*minYear + minMonth)
  +    }
  >
  >    time.0 <- system.time( {
  +      planemap <- split(1:nrow(x), x[,"TailNum"])
  +      planeStart <- sapply( planemap,
  +        function(i) birthmonth(x[i, c('Year','Month'),
  +                                 drop=FALSE]) )
  +    } )
  >
  >    time.0
        user system elapsed
      53.520  2.020 78.925


Parallel split-apply-combine
  Using 4 cores, we can reduce the time to ∼ 20 seconds (not including
  the read.big.matrix(), which is repeated here for a special reason):


  x <- read.big.matrix("airline.csv", header = TRUE,
                       backingfile = "airline.bin",
                       descriptorfile = "airline.desc",
                       type = "integer",
                       extraCols = "Age")
  planeindices <- split(1:nrow(x), x[, "TailNum"])
  planeStart <- foreach(i = planeindices,
                        .combine = c) %dopar% {
    birthmonth(x[i, c("Year", "Month"), drop = FALSE])
  }
  x[, "Age"] <- x[, "Year"] * as.integer(12) +
                x[, "Month"] -
                as.integer(planeStart[x[, "TailNum"]])


Massive linear models via biglm


  > library(biganalytics)
  > x <- attach.big.matrix("airline.desc")
  > blm <- biglm.big.matrix(DepDelay ~ Age, data = x)
  > summary(blm)
  Large data regression model: biglm(formula = formula,
    data = data, ...)
  Sample size = 84406323
                Coef   (95%     CI)      SE  p
  (Intercept) 8.5889  8.5786  8.5991  0.0051 0
  Age         0.0053  0.0051  0.0055  0.0001 0






Netflix: a truncated singular value decomposition

  [Figure: Netflix movies projected onto leading singular vectors;
  similar titles cluster together. The series abbreviated in the
  original plot as ST, SW (including SW: I and SW: II), HP, AP, and
  LOTR (including LOTR(ex)) group in one region; the seasons of The
  West Wing form another cluster; mainstream titles such as Titanic,
  Men in Black, Men in Black II, The Rock, Braveheart, Rain Man,
  Waterworld, Mission: Impossible, The Shawshank Redemption, The Green
  Mile, and Independence Day form a third; Horror Hospital and The
  Horror Within sit far from the rest.]
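
  The deck shows only the plot; a minimal sketch (ours) of how such a
  truncated SVD can be computed with the irlba package, here on an
  ordinary dense toy matrix rather than the actual ratings big.matrix:

  library(irlba)

  ## toy ratings-like matrix: 100 "customers" by 20 "movies"
  set.seed(1)
  A <- matrix(rpois(2000, lambda = 3), nrow = 100, ncol = 20)

  ## only the two leading singular triplets are computed
  s <- irlba(A, nv = 2)

  ## project movies onto the first two right singular vectors;
  ## similar columns land near each other, as in the Netflix plot
  plot(s$v, xlab = "First singular vector",
       ylab = "Second singular vector")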





Comparing the Bigmemory Project with a Database

  What distinguishes the Bigmemory Project from a database?
          both manage large/massive data
          both can be used for efficient subset selection
  A database is a general tool for managing data
          databases support addition and deletion of data
          databases can manage complex relationships between data
  A big.matrix object is a numerical object (illustrated below)
          underlying representation is compatible with BLAS/LAPACK
          libraries
          intended as a platform for development of numerical techniques
          for large/massive data
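
  A small illustration (ours, not from the deck) of what “numerical
  object” means in practice: a big.matrix of type double stores a
  plain column-major array, so it can be filled and subset like a base
  matrix, and numerical routines can run on (copies of) its contents:

  library(bigmemory)

  z <- big.matrix(nrow = 4, ncol = 4, type = "double", init = 0)
  z[, ] <- rnorm(16)            # fill like an ordinary matrix

  ## pull the data back into R as a base matrix and use any numerical
  ## routine; packages such as bigalgebra and irlba can work on the
  ## big.matrix itself, without this copy
  m <- z[, ]
  solve(m %*% t(m) + diag(4))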





Summary

  Four new ways to work with very large sets of data:
          memory- and file-mapped data structures, which provide access
          to arbitrarily large sets of data while retaining a look and feel
          that is familiar to data analysts;
          data structures that are shared across processor cores on a
          single computer, in order to support efficient parallel computing
          techniques when multiple processors are used;
          file-mapped data structures that allow concurrent access by the
          different nodes in a cluster of computers; and
          a technology-agnostic means for implementing embarrassingly
          parallel loops.
 These techniques are currently implemented for R, but are intended
 to provide a flexible framework for future developments in the field of
 statistical computing.



Thanks! http://www.bigmemory.org/


          Dirk Eddelbuettel, Bryan Lewis, Steve Weston, and Martin
          Schultz, for their feedback and advice over the last three years
          Bell Laboratories (Rick Becker, John Chambers and Allan Wilks),
          for development of the S language
          Ross Ihaka and Robert Gentleman, for their work and unselfish
          vision for R
          The R Core team
          David Pollard, for pushing us to better communicate the
          contributions of the project to statisticians






OTHER SLIDES



 http://www.bigmemory.org/

 http://www.stat.yale.edu/~jay/

 and

 ...yale.edu/~jay/Brazil/SP/bigmemory/






Overview: the Bigmemory Project
          Problems and challenges:
                   R frequently makes copies of objects, which can be costly
                   (see the copy-semantics sketch at the end of this slide)
                   Guideline: R’s performance begins to degrade with objects more
                   than about 10% of the address space, or when total objects
                   consume more than about 1/3 of RAM.
                   swapping: not sufficient
                   chunking: inefficient, customized coding
                   parallel programming: memory problems, non-portable solutions
                   shared memory: essentially inaccessible to non-experts


          Key parts of the solutions:
                   operating system caching
                   shared memory
                   file-backing for larger-than-RAM data
                   a framework for platform-independent parallel programming (credit
                   Steve Weston, independently of the BMP)
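
  To make the copying problem concrete, a small sketch (ours) of
  bigmemory's copy semantics: assigning a big.matrix copies only the
  handle, and a real copy is made only on request with deepcopy():

  library(bigmemory)

  a <- big.matrix(3, 3, type = "double", init = 1)
  b <- a              # copies only the handle: a and b share the data
  b[1, 1] <- 42
  a[1, 1]             # 42 -- no hidden copy was made

  cc <- deepcopy(a)   # an explicit copy, when one is really wanted
  cc[1, 1] <- 0
  a[1, 1]             # still 42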



Extending capabilities versus extending the language

          Extending R’s capabilities:
                   Providing a new algorithm, statistical analysis, or data structure
                   Examples: lm(), or package bcp
                   Most of the 2500+ packages on CRAN and elsewhere


          Extending the language:
                   Example: grDevices, a low-level interface to create and
                   manipulate graphics devices
                   Example: grid graphics
                   The packages of the Bigmemory Project:
                           a developer’s interface to underlying operating system functionality,
                           which could not have been written in R itself
                           a higher-level interface designed to mimic R’s matrix objects, so that
                           statisticians can use their current knowledge of computing with data
                           sets as a starting point for dealing with massive ones (illustrated
                           below)
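
  A brief illustration (ours) of that higher-level interface: a
  big.matrix answers to the same idioms as a base matrix:

  library(bigmemory)

  x <- as.big.matrix(matrix(1:6, nrow = 2), type = "integer")

  dim(x)                  # 2 3
  x[1, ]                  # first row, exactly as with a base matrix
  x[, 2] <- c(10L, 20L)   # assignment works the same way
  x[, ]                   # the whole thing back as a base matrix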



  Michael J. Kane and John W. Emerson http://www.bigmemory.org/     Scalable Strategies for Computing with Massive Data: The Bigmemory Project

Weitere ähnliche Inhalte

Was ist angesagt?

Big data and data science
Big data and data scienceBig data and data science
Big data and data scienceSong Xue
 
International Collaboration Networks in the Emerging (Big) Data Science
International Collaboration Networks in the Emerging (Big) Data ScienceInternational Collaboration Networks in the Emerging (Big) Data Science
International Collaboration Networks in the Emerging (Big) Data Sciencedatasciencekorea
 
Tutorial Data Management and workflows
Tutorial Data Management and workflowsTutorial Data Management and workflows
Tutorial Data Management and workflowsSSSW
 
Decoding Data Science
Decoding Data ScienceDecoding Data Science
Decoding Data ScienceMatt Fornito
 
An Ecosystem for Linked Humanities Data
An Ecosystem for Linked Humanities DataAn Ecosystem for Linked Humanities Data
An Ecosystem for Linked Humanities DataRinke Hoekstra
 
How To Become a Data Scientist in Iran Marketplace
How To Become a Data Scientist in Iran Marketplace How To Become a Data Scientist in Iran Marketplace
How To Become a Data Scientist in Iran Marketplace Mohamadreza Mohtat
 
From Web Data to Knowledge: on the Complementarity of Human and Artificial In...
From Web Data to Knowledge: on the Complementarity of Human and Artificial In...From Web Data to Knowledge: on the Complementarity of Human and Artificial In...
From Web Data to Knowledge: on the Complementarity of Human and Artificial In...Stefan Dietze
 
Graph Analysis Trends and Opportunities -- CMG Performance and Capacity 2014
Graph Analysis Trends and Opportunities -- CMG Performance and Capacity 2014Graph Analysis Trends and Opportunities -- CMG Performance and Capacity 2014
Graph Analysis Trends and Opportunities -- CMG Performance and Capacity 2014Jason Riedy
 
Why L-3 Data Tactics Data Science?
Why L-3 Data Tactics Data Science?Why L-3 Data Tactics Data Science?
Why L-3 Data Tactics Data Science?Rich Heimann
 
Data science
Data scienceData science
Data science9diov
 
Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)
Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)
Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)Toshiyuki Shimono
 
AI, Knowledge Representation and Graph Databases -
 Key Trends in Data Science
AI, Knowledge Representation and Graph Databases -
 Key Trends in Data ScienceAI, Knowledge Representation and Graph Databases -
 Key Trends in Data Science
AI, Knowledge Representation and Graph Databases -
 Key Trends in Data ScienceOptum
 
Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)Rich Heimann
 
A Blended Approach to Analytics at Data Tactics Corporation
A Blended Approach to Analytics at Data Tactics CorporationA Blended Approach to Analytics at Data Tactics Corporation
A Blended Approach to Analytics at Data Tactics CorporationRich Heimann
 
ArXiv Literature Exploration using Social Network Analysis
ArXiv Literature Exploration using Social Network AnalysisArXiv Literature Exploration using Social Network Analysis
ArXiv Literature Exploration using Social Network AnalysisTanat Iempreedee
 
Graph Databases and Graph Data Science in Neo4j
Graph Databases and Graph Data Science in Neo4jGraph Databases and Graph Data Science in Neo4j
Graph Databases and Graph Data Science in Neo4jijtsrd
 
Managing Metadata for Science and Technology Studies: the RISIS case
Managing Metadata for Science and Technology Studies: the RISIS caseManaging Metadata for Science and Technology Studies: the RISIS case
Managing Metadata for Science and Technology Studies: the RISIS caseRinke Hoekstra
 

Was ist angesagt? (20)

Big data and data science
Big data and data scienceBig data and data science
Big data and data science
 
Introduction to Big Data Technologies
Introduction to Big Data TechnologiesIntroduction to Big Data Technologies
Introduction to Big Data Technologies
 
International Collaboration Networks in the Emerging (Big) Data Science
International Collaboration Networks in the Emerging (Big) Data ScienceInternational Collaboration Networks in the Emerging (Big) Data Science
International Collaboration Networks in the Emerging (Big) Data Science
 
Tutorial Data Management and workflows
Tutorial Data Management and workflowsTutorial Data Management and workflows
Tutorial Data Management and workflows
 
Decoding Data Science
Decoding Data ScienceDecoding Data Science
Decoding Data Science
 
An Ecosystem for Linked Humanities Data
An Ecosystem for Linked Humanities DataAn Ecosystem for Linked Humanities Data
An Ecosystem for Linked Humanities Data
 
How To Become a Data Scientist in Iran Marketplace
How To Become a Data Scientist in Iran Marketplace How To Become a Data Scientist in Iran Marketplace
How To Become a Data Scientist in Iran Marketplace
 
From Web Data to Knowledge: on the Complementarity of Human and Artificial In...
From Web Data to Knowledge: on the Complementarity of Human and Artificial In...From Web Data to Knowledge: on the Complementarity of Human and Artificial In...
From Web Data to Knowledge: on the Complementarity of Human and Artificial In...
 
Graph Analysis Trends and Opportunities -- CMG Performance and Capacity 2014
Graph Analysis Trends and Opportunities -- CMG Performance and Capacity 2014Graph Analysis Trends and Opportunities -- CMG Performance and Capacity 2014
Graph Analysis Trends and Opportunities -- CMG Performance and Capacity 2014
 
Why L-3 Data Tactics Data Science?
Why L-3 Data Tactics Data Science?Why L-3 Data Tactics Data Science?
Why L-3 Data Tactics Data Science?
 
Data science
Data scienceData science
Data science
 
Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)
Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)
Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)
 
AI, Knowledge Representation and Graph Databases -
 Key Trends in Data Science
AI, Knowledge Representation and Graph Databases -
 Key Trends in Data ScienceAI, Knowledge Representation and Graph Databases -
 Key Trends in Data Science
AI, Knowledge Representation and Graph Databases -
 Key Trends in Data Science
 
BDACA - Lecture7
BDACA - Lecture7BDACA - Lecture7
BDACA - Lecture7
 
Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)
 
A Blended Approach to Analytics at Data Tactics Corporation
A Blended Approach to Analytics at Data Tactics CorporationA Blended Approach to Analytics at Data Tactics Corporation
A Blended Approach to Analytics at Data Tactics Corporation
 
ArXiv Literature Exploration using Social Network Analysis
ArXiv Literature Exploration using Social Network AnalysisArXiv Literature Exploration using Social Network Analysis
ArXiv Literature Exploration using Social Network Analysis
 
Graph Databases and Graph Data Science in Neo4j
Graph Databases and Graph Data Science in Neo4jGraph Databases and Graph Data Science in Neo4j
Graph Databases and Graph Data Science in Neo4j
 
DS4G
DS4GDS4G
DS4G
 
Managing Metadata for Science and Technology Studies: the RISIS case
Managing Metadata for Science and Technology Studies: the RISIS caseManaging Metadata for Science and Technology Studies: the RISIS case
Managing Metadata for Science and Technology Studies: the RISIS case
 

Andere mochten auch

eRum2016 -RevoScaleR - Performance and Scalability R
eRum2016 -RevoScaleR - Performance and Scalability ReRum2016 -RevoScaleR - Performance and Scalability R
eRum2016 -RevoScaleR - Performance and Scalability RŁukasz Grala
 
FIWARE Complex Event Processing
FIWARE Complex Event ProcessingFIWARE Complex Event Processing
FIWARE Complex Event ProcessingMiguel González
 
Complex Event Processing
Complex Event ProcessingComplex Event Processing
Complex Event ProcessingJohn Plummer
 
R in finance: Introduction to R and Its Applications in Finance
R in finance: Introduction to R and Its Applications in FinanceR in finance: Introduction to R and Its Applications in Finance
R in finance: Introduction to R and Its Applications in FinanceLiang C. Zhang (張良丞)
 
Grouping & Summarizing Data in R
Grouping & Summarizing Data in RGrouping & Summarizing Data in R
Grouping & Summarizing Data in RJeffrey Breen
 
WSO2Con USA 2017: Scalable Real-time Complex Event Processing at Uber
WSO2Con USA 2017: Scalable Real-time Complex Event Processing at UberWSO2Con USA 2017: Scalable Real-time Complex Event Processing at Uber
WSO2Con USA 2017: Scalable Real-time Complex Event Processing at UberWSO2
 
GAFAnomics: New Economy, New Rules
GAFAnomics: New Economy, New RulesGAFAnomics: New Economy, New Rules
GAFAnomics: New Economy, New RulesFabernovel
 
Tips for data science competitions
Tips for data science competitionsTips for data science competitions
Tips for data science competitionsOwen Zhang
 

Andere mochten auch (9)

eRum2016 -RevoScaleR - Performance and Scalability R
eRum2016 -RevoScaleR - Performance and Scalability ReRum2016 -RevoScaleR - Performance and Scalability R
eRum2016 -RevoScaleR - Performance and Scalability R
 
FIWARE Complex Event Processing
FIWARE Complex Event ProcessingFIWARE Complex Event Processing
FIWARE Complex Event Processing
 
Complex Event Processing
Complex Event ProcessingComplex Event Processing
Complex Event Processing
 
R in finance: Introduction to R and Its Applications in Finance
R in finance: Introduction to R and Its Applications in FinanceR in finance: Introduction to R and Its Applications in Finance
R in finance: Introduction to R and Its Applications in Finance
 
Decision tree and random forest
Decision tree and random forestDecision tree and random forest
Decision tree and random forest
 
Grouping & Summarizing Data in R
Grouping & Summarizing Data in RGrouping & Summarizing Data in R
Grouping & Summarizing Data in R
 
WSO2Con USA 2017: Scalable Real-time Complex Event Processing at Uber
WSO2Con USA 2017: Scalable Real-time Complex Event Processing at UberWSO2Con USA 2017: Scalable Real-time Complex Event Processing at Uber
WSO2Con USA 2017: Scalable Real-time Complex Event Processing at Uber
 
GAFAnomics: New Economy, New Rules
GAFAnomics: New Economy, New RulesGAFAnomics: New Economy, New Rules
GAFAnomics: New Economy, New Rules
 
Tips for data science competitions
Tips for data science competitionsTips for data science competitions
Tips for data science competitions
 

Ähnlich wie Scalable Strategies for Computing with Massive Data: The Bigmemory Project

Data Mining: Future Trends and Applications
Data Mining: Future Trends and ApplicationsData Mining: Future Trends and Applications
Data Mining: Future Trends and ApplicationsIJMER
 
Data minig with Big data analysis
Data minig with Big data analysisData minig with Big data analysis
Data minig with Big data analysisPoonam Kshirsagar
 
Machine Learning meets Granular Computing
Machine Learning meets Granular ComputingMachine Learning meets Granular Computing
Machine Learning meets Granular ComputingJenny Midwinter
 
Big Data, Big Deal: For Future Big Data Scientists
Big Data, Big Deal: For Future Big Data ScientistsBig Data, Big Deal: For Future Big Data Scientists
Big Data, Big Deal: For Future Big Data ScientistsWay-Yen Lin
 
1. Web Mining – Web mining is an application of data mining for di.docx
1. Web Mining – Web mining is an application of data mining for di.docx1. Web Mining – Web mining is an application of data mining for di.docx
1. Web Mining – Web mining is an application of data mining for di.docxbraycarissa250
 
Lecture1 introduction to big data
Lecture1 introduction to big dataLecture1 introduction to big data
Lecture1 introduction to big datahktripathy
 
Optimizing Bigdata Processing by using Hybrid Hierarchically Distributed Data...
Optimizing Bigdata Processing by using Hybrid Hierarchically Distributed Data...Optimizing Bigdata Processing by using Hybrid Hierarchically Distributed Data...
Optimizing Bigdata Processing by using Hybrid Hierarchically Distributed Data...IJCSIS Research Publications
 
The Science of Data Science
The Science of Data Science The Science of Data Science
The Science of Data Science James Hendler
 
Pemanfaatan Big Data Dalam Riset 2023.pptx
Pemanfaatan Big Data Dalam Riset 2023.pptxPemanfaatan Big Data Dalam Riset 2023.pptx
Pemanfaatan Big Data Dalam Riset 2023.pptxelisarosa29
 
Map Reduce in Big fata
Map Reduce in Big fataMap Reduce in Big fata
Map Reduce in Big fataSuraj Sawant
 
Big data: Challenges, Practices and Technologies
Big data: Challenges, Practices and TechnologiesBig data: Challenges, Practices and Technologies
Big data: Challenges, Practices and TechnologiesNavneet Randhawa
 
Predictive Analytics: Context and Use Cases
Predictive Analytics: Context and Use CasesPredictive Analytics: Context and Use Cases
Predictive Analytics: Context and Use CasesKimberley Mitchell
 
Machine Learning and Data Analytics in Semiconductor Yield Management.pptx
Machine Learning and Data Analytics in Semiconductor Yield Management.pptxMachine Learning and Data Analytics in Semiconductor Yield Management.pptx
Machine Learning and Data Analytics in Semiconductor Yield Management.pptxyieldWerx Semiconductor
 
NoSQL & Big Data Analytics: History, Hype, Opportunities
NoSQL & Big Data Analytics: History, Hype, OpportunitiesNoSQL & Big Data Analytics: History, Hype, Opportunities
NoSQL & Big Data Analytics: History, Hype, OpportunitiesVishy Poosala
 

Ähnlich wie Scalable Strategies for Computing with Massive Data: The Bigmemory Project (20)

Data Mining: Future Trends and Applications
Data Mining: Future Trends and ApplicationsData Mining: Future Trends and Applications
Data Mining: Future Trends and Applications
 
Data minig with Big data analysis
Data minig with Big data analysisData minig with Big data analysis
Data minig with Big data analysis
 
Big data analytics
Big data analyticsBig data analytics
Big data analytics
 
Machine Learning meets Granular Computing
Machine Learning meets Granular ComputingMachine Learning meets Granular Computing
Machine Learning meets Granular Computing
 
Big Data, Big Deal: For Future Big Data Scientists
Big Data, Big Deal: For Future Big Data ScientistsBig Data, Big Deal: For Future Big Data Scientists
Big Data, Big Deal: For Future Big Data Scientists
 
Research Proposal
Research ProposalResearch Proposal
Research Proposal
 
1. Web Mining – Web mining is an application of data mining for di.docx
1. Web Mining – Web mining is an application of data mining for di.docx1. Web Mining – Web mining is an application of data mining for di.docx
1. Web Mining – Web mining is an application of data mining for di.docx
 
Lecture1 introduction to big data
Lecture1 introduction to big dataLecture1 introduction to big data
Lecture1 introduction to big data
 
Optimizing Bigdata Processing by using Hybrid Hierarchically Distributed Data...
Optimizing Bigdata Processing by using Hybrid Hierarchically Distributed Data...Optimizing Bigdata Processing by using Hybrid Hierarchically Distributed Data...
Optimizing Bigdata Processing by using Hybrid Hierarchically Distributed Data...
 
ICMCSI 2023 PPT 1074.pptx
ICMCSI 2023 PPT 1074.pptxICMCSI 2023 PPT 1074.pptx
ICMCSI 2023 PPT 1074.pptx
 
The Science of Data Science
The Science of Data Science The Science of Data Science
The Science of Data Science
 
Pemanfaatan Big Data Dalam Riset 2023.pptx
Pemanfaatan Big Data Dalam Riset 2023.pptxPemanfaatan Big Data Dalam Riset 2023.pptx
Pemanfaatan Big Data Dalam Riset 2023.pptx
 
Big Data & DS Analytics for PAARL
Big Data & DS Analytics for PAARLBig Data & DS Analytics for PAARL
Big Data & DS Analytics for PAARL
 
Map Reduce in Big fata
Map Reduce in Big fataMap Reduce in Big fata
Map Reduce in Big fata
 
Big data: Challenges, Practices and Technologies
Big data: Challenges, Practices and TechnologiesBig data: Challenges, Practices and Technologies
Big data: Challenges, Practices and Technologies
 
Predictive Analytics: Context and Use Cases
Predictive Analytics: Context and Use CasesPredictive Analytics: Context and Use Cases
Predictive Analytics: Context and Use Cases
 
Big data_郭惠民
Big data_郭惠民Big data_郭惠民
Big data_郭惠民
 
Machine Learning and Data Analytics in Semiconductor Yield Management.pptx
Machine Learning and Data Analytics in Semiconductor Yield Management.pptxMachine Learning and Data Analytics in Semiconductor Yield Management.pptx
Machine Learning and Data Analytics in Semiconductor Yield Management.pptx
 
NoSQL & Big Data Analytics: History, Hype, Opportunities
NoSQL & Big Data Analytics: History, Hype, OpportunitiesNoSQL & Big Data Analytics: History, Hype, Opportunities
NoSQL & Big Data Analytics: History, Hype, Opportunities
 
An introduction to data mining
An introduction to data miningAn introduction to data mining
An introduction to data mining
 

Scalable Strategies for Computing with Massive Data: The Bigmemory Project

  • 1. Motivation and Overview Strategies for computing with massive data Modeling with Massive Data Conclusion Scalable Strategies for Computing with Massive Data: The Bigmemory Project Michael J. Kane and John W. Emerson http://www.bigmemory.org/ http://www.stat.yale.edu/~mjk56/ Associate Research Scientist, Yale University http://www.stat.yale.edu/~jay/ Associate Professor of Statistics, Yale University Note: This is a modified version of a talk that was written by Jay. Michael J. Kane and John W. Emerson http://www.bigmemory.org/ Scalable Strategies for Computing with Massive Data: The Bigmemory Project
  • 2. Motivation and Overview Strategies for computing with massive data Modeling with Massive Data Conclusion Outline 1 Motivation and Overview 2 Strategies for computing with massive data 3 Modeling with Massive Data 4 Conclusion Michael J. Kane and John W. Emerson http://www.bigmemory.org/ Scalable Strategies for Computing with Massive Data: The Bigmemory Project
  • 3. Motivation and Overview Strategies for computing with massive data Modeling with Massive Data Conclusion A new era The analysis of very large data sets has recently become an active area of research in statistics and machine learning. Many new computational challenges arise when managing, exploring, and analyzing these data sets, challenges that effectively put the data beyond the reach of researchers who lack specialized software development skills of expensive hardware. “We have entered an era of massive scientific data collection, with a demand for answers to large-scale inference problems that lie beyond the scope of classical statistics.” – Efron (2005) “classical statistics” should include “mainstream computational statistics.” – Kane, Emerson, and Weston (in preparation, in reference to Efron’s quote) Michael J. Kane and John W. Emerson http://www.bigmemory.org/ Scalable Strategies for Computing with Massive Data: The Bigmemory Project
  • 4. Motivation and Overview Strategies for computing with massive data Modeling with Massive Data Conclusion Small data approaches don’t scale... Data analysts who want to explore massive data must be aware of the various pitfalls; adopt new approaches to avoid them. This talk will illustrate common challenges for dealing with massive data; provide general solutions for avoiding the pitfalls. Michael J. Kane and John W. Emerson http://www.bigmemory.org/ Scalable Strategies for Computing with Massive Data: The Bigmemory Project
  • 5. Motivation and Overview Strategies for computing with massive data Modeling with Massive Data Conclusion Example data sets Airline on-time data 2009 JSM Data Expo About 120 million commercial US airline flights over 20 years 29 variables, integer-valued or categorical (recoded as integer) About 12 gigabytes (GB) http://stat-computing.org/dataexpo/2009/ Netflix data About 100 million ratings from 500,000 customers for 17,000 movies About 2 GB stored as integers No statisticians on the winning team; hard to find statisticians on the leaderboard Top teams: access to expensive hardware; professional computer science and programming expertise http://www.netflixprize.com/ Michael J. Kane and John W. Emerson http://www.bigmemory.org/ Scalable Strategies for Computing with Massive Data: The Bigmemory Project
  • 6. Motivation and Overview Strategies for computing with massive data Modeling with Massive Data Conclusion R as the language for computing with data R is the lingua franca of statistics: The syntax is simple and well-suited for data exploration and analysis. It has excellent graphical capabilities. It is extensible, with over 2500 packages available on CRAN alone. It is open source and freely available for Windows/MacOS/Linux platforms. Michael J. Kane and John W. Emerson http://www.bigmemory.org/ Scalable Strategies for Computing with Massive Data: The Bigmemory Project
  • 7. Motivation and Overview Strategies for computing with massive data Modeling with Massive Data Conclusion The Bigmemory Project for managing and computing with massive matrices Set of packages (bigmemory, bigtabulate, biganalytics, synchronicity, and bigalgebra) that extends the R programming environment Core data structures are written in C++ Could be used to extend other high-level computing environments Michael J. Kane and John W. Emerson http://www.bigmemory.org/ Scalable Strategies for Computing with Massive Data: The Bigmemory Project
• 8. The Bigmemory Project: http://www.bigmemory.org/
  [Package dependency diagram (not all packages are shown). Core: bigmemory (big.matrix creation and manipulation), with bigtabulate (fast tabulation and summaries) and synchronicity (mutual exclusions) as utility packages. Built on the core: biganalytics (statistical analyses with big.matrices), bigalgebra (linear algebra for (big.)matrices), and biglm (generic regressions for data too big to fit in memory). Related packages: NMF (non-negative matrix factorization methods), DNAcopy (DNA copy number data analysis), irlba (truncated SVDs on (big.)matrices), and foreach (concurrent-enabled loops) with its parallel backends doMC (SMP unix), doNWS (NetworkSpaces), doSNOW (snow), doSMP (SMP machines), and doRedis (redis). Arrows indicate “depends on (or imports)” and “suggests” relationships; the low-level parallel support sits alongside the R base packages stats and utils.]
• 9. Importing and managing massive data
  The Netflix data ∼ 2 GB: read.table() uses 3.7 GB total in the process of importing the data (taking 140 seconds):
  > net <- read.table("netflix.txt", header=FALSE,
  +                   sep="\t", colClasses = "integer")
  > object.size(net)
  1981443344 bytes
  With bigmemory, only the 2 GB is needed (no memory overhead, taking 130 seconds):
  > net <- read.big.matrix("netflix.txt", header=FALSE,
  +                        sep="\t", type = "integer")
  > net[1:3,]
       [,1] [,2] [,3] [,4] [,5]
  [1,]    1    1    3 2005    9
  [2,]    1    2    5 2005    5
  [3,]    1    3    4 2005   10
• 10. Importing and managing massive data
  read.table():
    memory overhead as much as 100% of the size of the data
    data.frame and matrix objects are not available in shared memory
    the resulting data.frame is limited in size by available RAM (recommended maximum: 10%-20% of RAM)
  read.big.matrix():
    matrix-like data (not data frames)
    no memory overhead
    faster than read.table()
    the resulting big.matrix supports shared memory for efficient parallel programming
    the resulting big.matrix supports file-backed objects for larger-than-RAM data
• 11. Importing and managing massive data
  With the full Airline data (∼ 12 GB), bigmemory’s file-backing allows you to work with the data even with only 4 GB of RAM, for example:
  > x <- read.big.matrix("airline.csv", header=TRUE,
  +                      backingfile="airline.bin",
  +                      descriptorfile="airline.desc",
  +                      type="integer")
  > x
  An object of class "big.matrix"
  Slot "address":
  <pointer: 0x3031fc0>
  > rm(x)
  > x <- attach.big.matrix("airline.desc")
  > dim(x)
  [1] 123534969        29
• 12. Exploring massive data
  > summary(x[, "DepDelay"])
        Min.    1st Qu.     Median       Mean    3rd Qu.       Max.        NA's
  -1410.000     -2.000      0.000      8.171      6.000   2601.000 2302136.000
  > quantile(x[, "DepDelay"],
  +          probs=c(0.5, 0.9, 0.99), na.rm=TRUE)
  50% 90% 99%
    0  27 128
• 13. Example: caching in action
  In a fresh R session on this laptop, newly rebooted:
  > library(bigmemory)
  > library(biganalytics)
  > xdesc <- dget("airline.desc")
  > x <- attach.big.matrix(xdesc)
  > system.time( numplanes <- colmax(x, "TailNum",
  +                                  na.rm=TRUE) )
     user  system elapsed
    0.770   0.550   6.144
  > system.time( numplanes <- colmax(x, "TailNum",
  +                                  na.rm=TRUE) )
     user  system elapsed
    0.320   0.000   0.355
• 14. Split-apply-combine
  Many computational problems in statistics are solved by performing the same calculation repeatedly on independent sets of data. These problems can be solved by
    partitioning the data (the split)
    performing a single calculation on each partition (the apply)
    returning the results in a specified format (the combine)
  Recent attention: it can be particularly efficient and easily lends itself to parallel computing.
  The name “split-apply-combine” was coined by Hadley Wickham, but the approach has been supported in a number of different environments for some time under different names:
    SAS: BY-group processing
    Google: MapReduce
    Apache: Hadoop
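As a toy illustration of the pattern in base R (using the built-in mtcars data rather than the slide deck’s data sets): split the row indices by number of cylinders, apply a mean to each group, and combine the results into a named vector.

  # the split: row indices partitioned by cylinder count
  groups <- split(seq_len(nrow(mtcars)), mtcars$cyl)

  # the apply and the combine: one mean per partition, collected by sapply()
  sapply(groups, function(rows) mean(mtcars$mpg[rows]))
  #        4        6        8
  # 26.66364 19.74286 15.10000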
• 15. Aside: split()
  > x <- matrix(c(rnorm(5), sample(c(1, 2), 5,
  +               replace = T)), 5, 2)
  > x
              [,1] [,2]
  [1,] -0.89691455    2
  [2,]  0.18484918    1
  [3,]  1.58784533    2
  [4,] -1.13037567    1
  [5,] -0.08025176    1
  > split(x[, 1], x[, 2])
  $`1`
  [1]  0.18484918 -1.13037567 -0.08025176
  $`2`
  [1] -0.8969145  1.5878453
• 16. Exploring massive data
  > GetDepQuantiles <- function(rows, data) {
  +   return(quantile(data[rows, "DepDelay"],
  +                   probs=c(0.5, 0.9, 0.99), na.rm=TRUE))
  + }
  > groups <- split(1:nrow(x), x[,"DayOfWeek"])
  > qs <- sapply(groups, GetDepQuantiles, data=x)
  > colnames(qs) <- c("Mon", "Tue", "Wed", "Thu",
  +                   "Fri", "Sat", "Sun")
  > qs
      Mon Tue Wed Thu Fri Sat Sun
  50%   0   0   0   0   0   0   0
  90%  25  23  25  30  33  23  27
  99% 127 119 125 136 141 116 130
• 17. Parallelizing the apply step
  The apply step runs on disjoint subsets of a data set: “embarrassingly parallel”
  We should be able to run the apply step in parallel,
  but embarrassingly parallel doesn’t mean easy to parallelize.
• 18. foreach for easy-to-parallelize, embarrassingly parallel problems
  > library(foreach)
  > library(doMC)
  > registerDoMC(2)
  > ans <- foreach(i = 1:10, .combine = c) %dopar%
  + {
  +   i^2
  + }
  > ans
   [1]   1   4   9  16  25  36  49  64  81 100
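The loop body is not tied to the multicore backend. As a hedged sketch (assuming the doSNOW package and its snow dependency are available), the identical loop runs on a socket cluster simply by registering a different backend:

  library(foreach)
  library(doSNOW)

  cl <- makeSOCKcluster(2)   # two socket workers instead of two forked cores
  registerDoSNOW(cl)
  ans <- foreach(i = 1:10, .combine = c) %dopar% { i^2 }
  stopCluster(cl)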
• 19. Concurrent programming with foreach (1995 only)
  > library(foreach)
  > library(doSNOW)
  > cl <- makeSOCKcluster(4)
  > registerDoSNOW(cl)
  > x <- read.csv("1995.csv")
  > dim(x)
  [1] 5327435      29
• 20. Concurrent programming with foreach (1995 only)
  4 cores: 37 seconds; 3 cores: 23 seconds; 2 cores: 15 seconds; 1 core: 8 seconds.
  Memory overhead of 4 cores: > 2.5 GB! (Adding workers is slower here: each socket worker receives its own copy of the data. A shared-memory alternative is sketched below.)
  > groups <- split(1:nrow(x), x[,"DayOfWeek"])
  > qs <- foreach(g=groups, .combine=rbind) %dopar% {
  +   GetDepQuantiles(g, data=x)
  + }
  > daysOfWeek <- c("Mon", "Tues", "Wed", "Thu",
  +                 "Fri", "Sat", "Sun")
  > rownames(qs) <- daysOfWeek
  > t(qs)
      Mon Tues Wed Thu Fri Sat Sun
  50%   0    0   0   1   1   0   1
  90%  21   21  26  27  29  23  23
  99% 102  107 116 117 115 104 102
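A hedged sketch of that shared-memory alternative (assuming the file-backed airline big.matrix and its descriptor file from earlier are available): each worker attaches the same backing file through the descriptor, so no copy of the data is shipped to the workers.

  library(foreach)
  library(doSNOW)
  library(bigmemory)

  cl <- makeSOCKcluster(4)
  registerDoSNOW(cl)

  x <- attach.big.matrix("airline.desc")
  groups <- split(1:nrow(x), x[, "DayOfWeek"])

  qs <- foreach(g = groups, .combine = rbind,
                .packages = "bigmemory",
                .export = "GetDepQuantiles") %dopar% {
    # attaching is cheap: it maps the existing backing file, copying nothing
    x <- attach.big.matrix("airline.desc")
    GetDepQuantiles(g, data = x)
  }
  stopCluster(cl)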
• 21. Memory usage: data.frame vs. big.matrix
  [Figure: memory usage of the data.frame approach versus the big.matrix approach.]
• 22. Execution time: data.frame vs. big.matrix
  [Figure: execution time of the data.frame approach versus the big.matrix approach.]
• 23. Plane ages: Is split-apply-combine really faster?
  10 planes only (there are > 13,000 planes in total). With 1 core: 74 seconds. With 2 cores: 38 seconds. All planes, 4 cores: ∼ 2 hours.
  > library(foreach)
  > library(doMC)
  > registerDoMC(cores=2)
  > tailNums <- unique(x[,"TailNum"])
  > planeStart <- foreach(i=1:length(tailNums),
  +                       .combine=c) %dopar% {
  +   x <- attach.big.matrix(xdesc)
  +   yearInds <- which(x[,"TailNum"] == i)
  +   y <- x[yearInds, c("Year", "Month")]
  +   minYear <- min(y[,"Year"], na.rm=TRUE)
  +   these <- which(y[,"Year"]==minYear)
  +   minMonth <- min(y[these,"Month"], na.rm=TRUE)
  +   return(12*minYear + minMonth)
  + }
• 24. A better solution: split-apply-combine!
  > birthmonth <- function(y) {
  +   minYear <- min(y[,'Year'], na.rm=TRUE)
  +   these <- which(y[,'Year']==minYear)
  +   minMonth <- min(y[these,'Month'], na.rm=TRUE)
  +   return(12*minYear + minMonth)
  + }
  > time.0 <- system.time( {
  +   planemap <- split(1:nrow(x), x[,"TailNum"])
  +   planeStart <- sapply( planemap,
  +     function(i) birthmonth(x[i, c('Year','Month'),
  +                              drop=FALSE]) )
  + } )
  > time.0
     user  system elapsed
   53.520   2.020  78.925
• 25. Parallel split-apply-combine
  Using 4 cores, we can reduce the time to ∼ 20 seconds (not including the read.big.matrix(), repeated here for a special reason: the extraCols argument adds an empty "Age" column to be filled in):
  x <- read.big.matrix("airline.csv", header = TRUE,
                       backingfile = "airline.bin",
                       descriptorfile = "airline.desc",
                       type = "integer", extraCols = "Age")
  planeindices <- split(1:nrow(x), x[, "TailNum"])
  planeStart <- foreach(i = planeindices,
                        .combine = c) %dopar% {
    birthmonth(x[i, c("Year", "Month"), drop = FALSE])
  }
  x[, "Age"] <- x[, "Year"] * as.integer(12) +
    x[, "Month"] - as.integer(planeStart[x[, "TailNum"]])
• 26. Massive linear models via biglm
  > library(biganalytics)
  > x <- attach.big.matrix("airline.desc")
  > blm <- biglm.big.matrix(DepDelay ~ Age, data = x)
  > summary(blm)
  Large data regression model: biglm(formula = formula, data = data, ...)
  Sample size = 84406323
                 Coef   (95%     CI)     SE p
  (Intercept) 8.5889 8.5786  8.5991 0.0051 0
  Age         0.0053 0.0051  0.0055 0.0001 0
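In the same spirit, biganalytics also wraps bigglm() from Thomas Lumley's biglm package, which fits generalized linear models by processing the data in chunks. A hedged sketch (the choice of variables here is illustrative, not from the slides):

  library(biganalytics)

  x <- attach.big.matrix("airline.desc")

  # a generalized linear model over the full data set, processed chunkwise
  fit <- bigglm.big.matrix(ArrDelay ~ Age + Distance, data = x,
                           family = gaussian())
  summary(fit)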
• 27. Netflix: a truncated singular value decomposition
  [Figure: movies projected onto two leading singular vectors, with clusters of related titles: Lord of the Rings (regular and extended editions), Star Trek, Star Wars, Harry Potter, and Austin Powers films, plus titles such as The West Wing (Seasons 1-4), The Horror Within, Horror Hospital, The Shawshank Redemption, Waterworld, Rain Man, Braveheart, Mission: Impossible, The Green Mile, Men in Black, Men in Black II, Titanic, The Rock, and Independence Day.]
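As a small self-contained sketch of the technique behind such a plot (on a random matrix, not the Netflix data), the irlba package from the earlier package diagram computes a few leading singular vectors without forming the full decomposition:

  library(irlba)

  set.seed(1)
  m <- matrix(rnorm(100 * 20), 100, 20)

  # two leading singular triplets, without computing the full SVD
  s <- irlba(m, nv = 2)
  dim(s$u)   # 100 x 2: coordinates for plotting the rows in two dimensions
  s$d        # the two largest singular values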
• 28. Comparing the Bigmemory Project with a database
  What distinguishes the Bigmemory Project from a database?
    Both manage large/massive data.
    Both can be used for efficient subset selection (see the sketch below).
  A database is a general tool for managing data:
    databases support addition and deletion of data
    databases can manage complex relationships between data
  A big.matrix object is a numerical object:
    the underlying representation is compatible with BLAS/LAPACK libraries
    it is intended as a platform for developing numerical techniques for large/massive data
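On the subset-selection point, a hedged sketch (reusing the airline descriptor file from earlier): bigmemory's mwhich() scans the big.matrix in compiled code, avoiding the large intermediate logical vectors that ordinary which() comparisons would allocate.

  library(bigmemory)

  x <- attach.big.matrix("airline.desc")

  # rows from 1995 with a departure delay of more than two hours
  rows <- mwhich(x, cols  = c("Year", "DepDelay"),
                    vals  = list(1995, 120),
                    comps = list("eq", "gt"),
                    op    = "AND")
  length(rows)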
• 29. Summary
  Four new ways to work with very large sets of data:
    memory- and file-mapped data structures, which provide access to arbitrarily large sets of data while retaining a look and feel that is familiar to data analysts;
    data structures that are shared across processor cores on a single computer, in order to support efficient parallel computing techniques when multiple processors are used;
    file-mapped data structures that allow concurrent access by the different nodes in a cluster of computers;
    a technology-agnostic means for implementing embarrassingly parallel loops.
  These techniques are currently implemented for R, but are intended to provide a flexible framework for future developments in the field of statistical computing.
• 30. Thanks!
  http://www.bigmemory.org/
  Dirk Eddelbuettel, Bryan Lewis, Steve Weston, and Martin Schultz, for their feedback and advice over the last three years
  Bell Laboratories (Rick Becker, John Chambers, and Allan Wilks), for the development of the S language
  Ross Ihaka and Robert Gentleman, for their work and unselfish vision for R
  The R Core team
  David Pollard, for pushing us to better communicate the contributions of the project to statisticians
• 31. OTHER SLIDES
  http://www.bigmemory.org/
  http://www.stat.yale.edu/~jay/ and ...yale.edu/~jay/Brazil/SP/bigmemory/
• 32. Overview: the Bigmemory Project
  Problems and challenges:
    R frequently makes copies of objects, which can be costly (a small illustration follows below)
    Guideline: R’s performance begins to degrade with objects of more than about 10% of the address space, or when total objects consume more than about 1/3 of RAM
    swapping: not sufficient
    chunking: inefficient, customized coding
    parallel programming: memory problems, non-portable solutions
    shared memory: essentially inaccessible to non-experts
  Key parts of the solutions:
    operating system caching
    shared memory
    file-backing for larger-than-RAM data
    a framework for platform-independent parallel programming (credit Steve Weston, independently of the BMP)
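A minimal base-R illustration of the copying problem noted above (hedged: tracemem() requires an R build with memory profiling enabled, which the standard CRAN binaries have):

  x <- rnorm(1e7)    # roughly 80 MB of doubles
  tracemem(x)        # start reporting copies of this object

  y <- x             # no copy yet: x and y share the same memory
  y[1] <- 0          # modifying y now triggers a full 80 MB duplication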
• 33. Extending capabilities versus extending the language
  Extending R’s capabilities:
    Providing a new algorithm, statistical analysis, or data structure
    Examples: lm(), or package bcp
    Most of the 2500+ packages on CRAN and elsewhere
  Extending the language:
    Example: grDevices, a low-level interface to create and manipulate graphics devices
    Example: grid graphics
  The packages of the Bigmemory Project provide:
    a developer’s interface to underlying operating system functionality which could not have been written in R itself
    a higher-level interface designed to mimic R’s matrix objects, so that statisticians can use their current knowledge of computing with data sets as a starting point for dealing with massive ones.