Streaming Data,
Concurrency And R

     Rory Winston

   rory@theresearchkitchen.com
About Me




      Independent Software Consultant
      M.Sc. Applied Computing, 2000
      M.Sc. Finance, 2008
      Apache Committer
      Working in the financial sector for the last 7 years or so
      Interested in practical applications of functional languages and
      machine learning
      Relatively recent convert to R (≈ 2 years)
R - Pros and Cons


   Pro
         Designed by statisticians
         Can be extremely elegant
         Comprehensive extension library
         Open-source
         Huge parallelization effort
         Fantastic reporting capabilities
         Incredibly Popular

   Con
         Designed by statisticians
         Can be clunky (S4)
         Bewildering array of overlapping extensions
         Inherently single-threaded
         Incredibly Popular
Parallelization vs. Concurrency



        R interpreter is single threaded
        Some historical context for this (BLAS implementations)
        Not necessarily a limitation in the general context
        Multithreading can be complex and problematic
        Instead a focus on parallelization:
             Distributed computation: gridR, nws, snow
             Multicore/multi-cpu scaling: Rmpi, Romp, pnmath/pnmath0
             Interfaces to Pthreads/PBLAS/OpenMP/MPI/Globus/etc.
        Parallelization suits cpu-bound large data processing
        applications
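
As a rough illustration of the parallelization style listed above, the sketch below uses the parallel package bundled with current R releases, which exposes the snow-style makeCluster/parLapply interface. The worker count and the toy task slow_square are assumptions made for illustration and are not taken from the talk.

```r
library(parallel)

# Toy CPU-bound task (hypothetical): burn some CPU, then square the input.
slow_square <- function(x) {
  for (i in 1:1e5) tmp <- sqrt(i)
  x^2
}

inputs <- 1:8

# Start two worker R processes; each one is its own single-threaded interpreter.
cl <- makeCluster(2)
parallel_result <- parLapply(cl, inputs, slow_square)
stopCluster(cl)

# The same computation run serially in the master process, for comparison.
serial_result <- lapply(inputs, slow_square)
```

The work is split across processes rather than threads, so the single-threaded interpreter is never asked to run two pieces of R code at once.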
Other Scalability and Performance Work




        JIT/bytecode compilation (Ra)
        Implicit vectorization a la Matlab (code analysis)
        Large (≥ RAM) dataset handling (bigmemory, ff)
        Many incremental performance improvements (e.g. less
        internal copying)
        Next: GPU/massive multicore...?
What Benefit Concurrency?




       Real-time (streaming to be more precise) data analysis
       Growing interest in using R for streaming data, not just offline
       analysis
       GUI toolkit integration
       Fine-grained control over independent task execution
       "I believe that explicit concurrency management tools (i.e. a
       threads toolkit) are what we really need in R at this point." -
       Luke Tierney, 2001
Will There Be A Multithreaded R?


        Short answer is: probably not
        At least not in its current incarnation
        Internal workings of the interpreter not particularly amenable
        to concurrency:
            Functions can manipulate caller state (<<- vs. <-)
            Lazy evaluation machinery (promises)
            Dynamic State, garbage collection, etc.
            Scoping: global environments
            Management of resources: streams, I/O, connections, sinks
        Implications for current code
        Possibly in the next language evolution (cf. Ihaka?)
        Large amount of work (but potentially do-able)
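
Two of the points above can be seen from plain R code. The sketch below (all names are illustrative) shows the superassignment operator changing state outside the function that runs it, and a promise whose side effect happens only if and when the argument is forced. Both make it hard to say what two threads sharing one interpreter would observe.

```r
counter <- 0

bump_local <- function() {
  counter <- counter + 1   # `<-` creates a local variable; the outer one is untouched
  invisible(counter)
}

bump_super <- function() {
  counter <<- counter + 1  # `<<-` rebinds `counter` in the enclosing environment
  invisible(counter)
}

bump_local()
after_local <- counter     # outer counter is unchanged
bump_super()
after_super <- counter     # outer counter has been incremented

# Lazy evaluation: the argument expression is wrapped in a promise and is
# evaluated in the caller's environment only when the function touches it.
evaluated <- FALSE
maybe_use <- function(arg, use_it) {
  if (use_it) arg
  invisible(NULL)
}

maybe_use({ evaluated <- TRUE }, use_it = FALSE)
after_unused <- evaluated  # promise never forced, so the assignment did not run
maybe_use({ evaluated <- TRUE }, use_it = TRUE)
after_forced <- evaluated  # forcing the promise ran the assignment in the caller
```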
Example Application




        Based on work I did last year and presented at UseR! 2008
        Wrote a real-time and historical market data service from
        Reuters/R
        The real-time interface used the Reuters C++ API
        R extension in C++ that spawned listening thread and
        handled updates
Simplified Architecture




                                R


                         extension (C++)



                           realtime bus
Example Usage



          rsub <- function(duration, items, callback)


   The call rsub will subscribe to the specified rate(s) for the duration
   of time specified by duration (ms). When a tick arrives, the
   callback function callback is invoked, with a data frame
   containing the fields specified in items.

   Multiple market data items may be subscribed to, and any
   combination of fields may be specified.

   Uses the underlying RFA API, which provides a C++ interface to
   real-time market updates.
Real-Time Example


   # Specify field names to retrieve
   fields <- c("BID","ASK","TIMCOR")

   # Subscribe to EUR/USD and GBP/USD ticks
   items <- list()
   items[[1]] <- c("IDN_SELECTFEED", "EUR=", fields)
   items[[2]] <- c("IDN_SELECTFEED", "GBP=", fields)

   # Simple Callback Function
   callback <- function(df) { print(paste("Received",df)) }

   # Subscribe for 1 hour
   ONE_HOUR <- 1000*(60)^2
   rsub(ONE_HOUR, items, callback)
Issues With This Approach




        As R interpreter is single threaded, cannot spawn thread for
        callbacks
        Thus, interpreter thread is locked for the duration of
        subscription
        Not a great user experience
        Need to find alternative mechanism
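
The blocking behaviour can be mimicked without the Reuters API. mock_rsub below is a hypothetical pure-R stand-in, not the C++ extension from the talk: it fabricates random values for the requested fields and invokes the callback in a loop. The call returns only once the duration has elapsed, which is the user-experience problem described above.

```r
# Hypothetical stand-in for rsub(): synthetic ticks, same argument order.
mock_rsub <- function(duration, items, callback, interval = 50) {
  deadline <- Sys.time() + duration / 1000      # duration is in milliseconds
  delivered <- 0L
  while (Sys.time() < deadline) {
    for (item in items) {
      fields <- item[-(1:2)]                    # drop service and instrument name
      values <- stats::setNames(as.list(round(runif(length(fields)), 4)), fields)
      tick <- data.frame(item = item[2], values, stringsAsFactors = FALSE)
      callback(tick)                            # runs on the interpreter thread
      delivered <- delivered + 1L
    }
    Sys.sleep(interval / 1000)
  }
  invisible(delivered)
}

fields <- c("BID", "ASK", "TIMCOR")
items <- list(c("IDN_SELECTFEED", "EUR=", fields),
              c("IDN_SELECTFEED", "GBP=", fields))

# Callback that collects every tick it is handed.
received <- list()
collect <- function(df) received[[length(received) + 1]] <<- df

started <- Sys.time()
n_ticks <- mock_rsub(300, items, collect)       # nothing else can run until this returns
elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))
```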
Alternative Approach



        If we cannot run subscriber threads in-process, need to
        decouple
        Standard approach: add an extra layer and use some form of
        IPC
        For instance, we could:
            Subscribe in a dedicated R process (A)
            Push incoming data onto a socket
            R process (B) reads from a listening socket
        Sockets could also be another IPC primitive, e.g. pipes
        Also note that R supports asynchronous I/O (?isIncomplete)
        Look at the IBrokers package for examples of this
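
The decoupling described above can be sketched with base R socket connections. This is a minimal illustration under several assumptions: the port number, the comma-separated tick format, and the fabricated prices are all made up, and process A is launched as a child Rscript only so that the example fits in one file. In practice A and B would be two independent long-running sessions.

```r
port <- 56789L  # arbitrary port, assumed to be free

# Process A (producer): stands in for the subscribing session.
# It retries until B is listening, then pushes three fabricated ticks.
producer <- tempfile(fileext = ".R")
writeLines(c(
  sprintf("port <- %dL", port),
  "con <- NULL",
  "for (attempt in 1:50) {",
  "  con <- tryCatch(",
  "    suppressWarnings(socketConnection('localhost', port, open = 'w', blocking = TRUE)),",
  "    error = function(e) NULL)",
  "  if (!is.null(con)) break",
  "  Sys.sleep(0.1)",
  "}",
  "for (i in 1:3) writeLines(sprintf('EUR=,%d,%.4f', i, 1.30 + i / 10000), con)",
  "close(con)"
), producer)
system2(file.path(R.home("bin"), "Rscript"), producer, wait = FALSE)

# Process B (consumer): listens on the socket and reads the incoming ticks.
con <- socketConnection(port = port, server = TRUE, open = "r",
                        blocking = TRUE, timeout = 30)
ticks <- readLines(con, n = 3)
close(con)

# Split each line into its item name, sequence number, and price.
parsed <- strsplit(ticks, ",", fixed = TRUE)
```

Here B blocks while reading. A non-blocking connection polled periodically would leave the analysis session free between reads.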
The bigmemoRy package



       From the description: "Use C++ to create, store,
       access, and manipulate massive matrices"
       Allows creation of large matrices
       These matrices can be mapped to files/shared memory
       It is the shared memory functionality that we will use
       The next version (3.0) will be unveiled at UseR! 2009

   big.matrix(nrow, ncol, type = "integer", ...)
   shared.big.matrix(nrow, ncol, type = "integer", ...)
   filebacked.big.matrix(nrow, ncol, type = "integer", ...)
Sample Usage




   > library(bigmemory) # Note: I'm using pre-release
   > X <- shared.big.matrix(type="double", ncol=1000, nrow=1000)
   > X
   An object of class “big.matrix”
   Slot "address":
   <pointer: 0x7378a0>
Create Shared Memory Descriptor

   > desc <- describe(X)
   > desc
   $sharedType
   [1] "SharedMemory"

   $sharedName
   [1] "53f14925-dca1-42a8-a547-e1bccae999ce"

   $nrow
   [1] 1000

   $ncol
   [1] 1000

   $rowNames
   NULL
Export the Descriptor




    In R session 1:

    > dput(desc, file="~/matrix.desc")

    In R session 2:

    > library(bigmemory)
    > desc <- dget("~/matrix.desc")
    > X <- attach.big.matrix(desc)

    Now R sessions 1 and 2 share the same big.matrix instance
Share Data Between Sessions




   R session 1:

   > X[1,1] <- 1.2345

   R session 2:

   > X[1,1]
   [1] 1.2345

   Thus, streaming data can be continuously fed into session 1
   And concurrently processed in session 2
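
The write-then-read round trip can be tried in a single session. The sketch assumes a CRAN release of bigmemory in which shared matrices are created with big.matrix(..., shared = TRUE) rather than the pre-release shared.big.matrix shown in the slides. Both handles live in one session here only to keep the example self-contained; in the setup described in the talk the descriptor would be handed to a second R process. The matrix size and variable names are illustrative.

```r
# Run the demonstration only if bigmemory is installed.
has_bigmemory <- requireNamespace("bigmemory", quietly = TRUE)
seen_by_reader <- NA_real_

if (has_bigmemory) {
  library(bigmemory)

  # Writer side: a small shared matrix acting as the tick store.
  writer <- big.matrix(nrow = 10, ncol = 2, type = "double",
                       init = 0, shared = TRUE)
  desc <- describe(writer)

  # Reader side: attach to the same memory via the descriptor.
  reader <- attach.big.matrix(desc)

  # A value written through one handle is visible through the other.
  writer[1, 1] <- 1.2345
  seen_by_reader <- reader[1, 1]
}
```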
Summary




      Lack of threads not a barrier to concurrent analysis
      Packages like bigmemory, nws, etc. facilitate decoupling via
      IPC
      nws goes a step further, with a distributed workspace
      Many applications for streaming data:
          Data collection/monitoring
          Development of pricing/risk algorithms
          Low-frequency execution (??)
          ...
References




        http://cran.r-project.org/web/packages/bigmemory/
        http://www.cs.uiowa.edu/~luke/R/thrgui/
        http://www.milbo.users.sonic.net/ra/index.html
        http://www.cs.kent.ac.uk/projects/cxxr/
        http://www.theresearchkitchen.com/blog

 
Analyze this
Analyze thisAnalyze this
Analyze thisAjay Ohri
 

Mehr von Ajay Ohri (20)

Introduction to R ajay Ohri
Introduction to R ajay OhriIntroduction to R ajay Ohri
Introduction to R ajay Ohri
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
 
Social Media and Fake News in the 2016 Election
Social Media and Fake News in the 2016 ElectionSocial Media and Fake News in the 2016 Election
Social Media and Fake News in the 2016 Election
 
Pyspark
PysparkPyspark
Pyspark
 
Download Python for R Users pdf for free
Download Python for R Users pdf for freeDownload Python for R Users pdf for free
Download Python for R Users pdf for free
 
Install spark on_windows10
Install spark on_windows10Install spark on_windows10
Install spark on_windows10
 
Ajay ohri Resume
Ajay ohri ResumeAjay ohri Resume
Ajay ohri Resume
 
Statistics for data scientists
Statistics for  data scientistsStatistics for  data scientists
Statistics for data scientists
 
National seminar on emergence of internet of things (io t) trends and challe...
National seminar on emergence of internet of things (io t)  trends and challe...National seminar on emergence of internet of things (io t)  trends and challe...
National seminar on emergence of internet of things (io t) trends and challe...
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data science
 
How Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help businessHow Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help business
 
Training in Analytics and Data Science
Training in Analytics and Data ScienceTraining in Analytics and Data Science
Training in Analytics and Data Science
 
Tradecraft
Tradecraft   Tradecraft
Tradecraft
 
Software Testing for Data Scientists
Software Testing for Data ScientistsSoftware Testing for Data Scientists
Software Testing for Data Scientists
 
Craps
CrapsCraps
Craps
 
A Data Science Tutorial in Python
A Data Science Tutorial in PythonA Data Science Tutorial in Python
A Data Science Tutorial in Python
 
How does cryptography work? by Jeroen Ooms
How does cryptography work?  by Jeroen OomsHow does cryptography work?  by Jeroen Ooms
How does cryptography work? by Jeroen Ooms
 
Using R for Social Media and Sports Analytics
Using R for Social Media and Sports AnalyticsUsing R for Social Media and Sports Analytics
Using R for Social Media and Sports Analytics
 
Kush stats alpha
Kush stats alpha Kush stats alpha
Kush stats alpha
 
Analyze this
Analyze thisAnalyze this
Analyze this
 

Kürzlich hochgeladen

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 

Kürzlich hochgeladen (20)

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 

Realtime r

  • 1. Streaming Data, Concurrency And R Rory Winston rory@theresearchkitchen.com
  • 2. About Me
      - Independent Software Consultant
      - M.Sc. Applied Computing, 2000
      - M.Sc. Finance, 2008
      - Apache Committer
      - Working in the financial sector for the last 7 years or so
      - Interested in practical applications of functional languages and machine learning
      - Relatively recent convert to R (≈ 2 years)
  • 3. R - Pros and Cons
      Pro:
      - Designed by statisticians
      - Can be extremely elegant
      - Comprehensive extension library
      - Open-source
      - Huge parallelization effort
      - Fantastic reporting capabilities
      - Incredibly Popular
      Con:
      - Designed by statisticians
      - Can be clunky (S4)
      - Bewildering array of overlapping extensions
      - Inherently single-threaded
      - Incredibly Popular
  • 18. Parallelization vs. Concurrency
      - R interpreter is single-threaded
      - Some historical context for this (BLAS implementations)
      - Not necessarily a limitation in the general context
      - Multithreading can be complex and problematic
      - Instead, a focus on parallelization:
          - Distributed computation: gridR, nws, snow
          - Multicore/multi-CPU scaling: Rmpi, Romp, pnmath/pnmath0
          - Interfaces to Pthreads/PBLAS/OpenMP/MPI/Globus/etc.
      - Parallelization suits CPU-bound large data processing applications
  • 19. Other Scalability and Performance Work
      - JIT/bytecode compilation (Ra)
      - Implicit vectorization a la Matlab (code analysis)
      - Large (≥ RAM) dataset handling (bigmemory, ff)
      - Many incremental performance improvements (e.g. less internal copying)
      - Next: GPU/massive multicore...?
  • 20. What Benefit Concurrency?
      - Real-time (streaming, to be more precise) data analysis
      - Growing interest in using R for streaming data, not just offline analysis
      - GUI toolkit integration
      - Fine-grained control over independent task execution
      - "I believe that explicit concurrency management tools (i.e. a threads toolkit) are what we really need in R at this point." - Luke Tierney, 2001
  • 21. Will There Be A Multithreaded R?
      - Short answer is: probably not
      - At least not in its current incarnation
      - Internal workings of the interpreter not particularly amenable to concurrency:
          - Functions can manipulate caller state (<<- vs. <-)
          - Lazy evaluation machinery (promises)
          - Dynamic state, garbage collection, etc.
          - Scoping: global environments
          - Management of resources: streams, I/O, connections, sinks
      - Implications for current code
      - Possibly in the next language evolution (cf. Ihaka?)
      - Large amount of work (but potentially do-able)
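To make the first two interpreter points concrete, here is a small base-R sketch. The names `make_counter` and `lazy` are illustrative and do not come from the talk. The first part shows `<<-` rebinding a variable that lives outside the function's own frame (strictly, in an enclosing environment); the second shows that an argument is a promise which is evaluated only if it is used. Both are examples of hidden shared state that would presumably need protecting if several threads ran R code at once.

```r
# `<-` binds in the current frame; `<<-` rebinds a variable found in an
# enclosing environment, so a call can change state that outlives it.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1
    count
  }
}

counter <- make_counter()
first  <- counter()
second <- counter()

# Arguments are promises: `x` is never forced here, so stop() never runs.
lazy <- function(x) "x was never evaluated"
result <- lazy(stop("this error is never raised"))
```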
  • 33. Example Application
      - Based on work I did last year and presented at UseR! 2008
      - Wrote a real-time and historical market data service from Reuters/R
      - The real-time interface used the Reuters C++ API
      - R extension in C++ that spawned listening thread and handled updates
  • 34. Simplified Architecture [diagram; labels: R extension (C++), realtime bus]
  • 35. Example Usage
      rsub <- function(duration, items, callback)
      - The call rsub will subscribe to the specified rate(s) for the period given by duration (in ms).
      - When a tick arrives, the callback function callback is invoked with a data frame containing the fields specified in items.
      - Multiple market data items may be subscribed to, and any combination of fields may be specified.
      - Uses the underlying RFA API, which provides a C++ interface to real-time market updates.
  • 36. Real-Time Example
      # Specify field names to retrieve
      fields <- c("BID","ASK","TIMCOR")
      # Subscribe to EUR/USD and GBP/USD ticks
      items <- list()
      items[[1]] <- c("IDN_SELECTFEED", "EUR=", fields)
      items[[2]] <- c("IDN_SELECTFEED", "GBP=", fields)
      # Simple Callback Function
      callback <- function(df) {
        print(paste("Received",df))
      }
      # Subscribe for 1 hour
      ONE_HOUR <- 1000*(60)^2
      rsub(ONE_HOUR, items, callback)
  • 37. Issues With This Approach
      - As the R interpreter is single-threaded, cannot spawn a thread for callbacks
      - Thus, the interpreter thread is locked for the duration of the subscription
      - Not a great user experience
      - Need to find an alternative mechanism
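The blocking behaviour can be reproduced without a market data feed. The sketch below is a hypothetical stand-in, `rsub_mock`, which is not the real `rsub`: it fabricates random values, and its `interval` parameter is an assumption added for the simulation. It follows the same calling convention (duration in ms, a list of items, a callback receiving a data frame) and shows that control does not return to the caller until the subscription period has elapsed.

```r
# Hypothetical stand-in for rsub(): no market data feed is involved.
rsub_mock <- function(duration, items, callback, interval = 50) {
  deadline <- Sys.time() + duration / 1000   # duration is in milliseconds
  ticks <- 0
  while (Sys.time() < deadline) {
    for (item in items) {
      # item = c(service, instrument, field names...), as in the example slide
      fields <- item[-(1:2)]
      values <- as.list(round(runif(length(fields)), 4))  # made-up values
      names(values) <- fields
      callback(data.frame(ITEM = item[2], values))        # runs synchronously
      ticks <- ticks + 1
    }
    Sys.sleep(interval / 1000)
  }
  invisible(ticks)
}

fields <- c("BID", "ASK")
items <- list(c("IDN_SELECTFEED", "EUR=", fields),
              c("IDN_SELECTFEED", "GBP=", fields))

received <- list()
callback <- function(df) received[[length(received) + 1]] <<- df

started <- Sys.time()
n <- rsub_mock(200, items, callback)   # the session is busy for about 200 ms
elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))
```

With a one-hour subscription, as in the real-time example, the R prompt would be unavailable for the whole hour.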
  • 38. Alternative Approach
      - If we cannot run subscriber threads in-process, we need to decouple
      - Standard approach: add an extra layer and use some form of IPC
      - For instance, we could:
          - Subscribe in a dedicated R process (A)
          - Push incoming data onto a socket
          - R process (B) reads from a listening socket
      - Instead of sockets, another IPC primitive could be used, e.g. pipes
      - Also note that R supports asynchronous I/O (?isIncomplete)
      - Look at the IBrokers package for examples of this
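A minimal sketch of the polling side of this pattern, using only base R. To keep it runnable in a single session, a temporary file stands in for the socket or pipe, and the tick format (`EUR=,1.3301`) is invented for illustration. The consumer opens the connection with `blocking = FALSE`, so `readLines()` returns only complete lines, and `isIncomplete()` reports that a partial line is still pending.

```r
# Process A (producer) appends text; process B (consumer) polls for lines.
# A temporary file stands in for the socket or pipe.
feed <- tempfile(fileext = ".txt")

# Producer: one complete tick, and one that is still being written.
cat("EUR=,1.3301\nGBP=,1.6", file = feed)

# Consumer: non-blocking reads never wait for data that is not there yet.
con <- file(feed, open = "r", blocking = FALSE)
first_poll <- readLines(con)     # only the complete line is returned
pending <- isIncomplete(con)     # a partial line was held back

# Producer finishes the second tick.
cat("102\n", file = feed, append = TRUE)

second_poll <- readLines(con)    # the completed line is delivered whole
close(con)
unlink(feed)
```

The same polling loop should carry over to `socketConnection(..., blocking = FALSE)`, with process A writing to the socket and process B reading from it.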
  • 39. The bigmemoRy package
      - From the description: "Use C++ to create, store, access, and manipulate massive matrices"
      - Allows creation of large matrices
      - These matrices can be mapped to files/shared memory
      - It is the shared memory functionality that we will use
      - The next version (3.0) will be unveiled at UseR! 2009
      big.matrix(nrow, ncol, type = "integer", ...)
      shared.big.matrix(nrow, ncol, type = "integer", ...)
      filebacked.big.matrix(nrow, ncol, type = "integer", ...)
  • 40. Sample Usage
      > library(bigmemory)  # Note: I'm using pre-release
      > X <- shared.big.matrix(type="double", ncol=1000, nrow=1000)
      > X
      An object of class “big.matrix”
      Slot "address":
      <pointer: 0x7378a0>
  • 41. Create Shared Memory Descriptor
      > desc <- describe(X)
      > desc
      $sharedType
      [1] "SharedMemory"
      $sharedName
      [1] "53f14925-dca1-42a8-a547-e1bccae999ce"
      $nrow
      [1] 1000
      $ncol
      [1] 1000
      $rowNames
      NULL
  • 42. Export the Descriptor
      In R session 1:
      > dput(desc, file="~/matrix.desc")
      In R session 2:
      > library(bigmemory)
      > desc <- dget("~/matrix.desc")
      > X <- attach.big.matrix(desc)
      Now R sessions 1 and 2 (A and B) share the same big.matrix instance
  • 43. Share Data Between Sessions
      R session 1 (A):
      > X[1,1] <- 1.2345
      R session 2 (B):
      > X[1,1]
      [1] 1.2345
      - Thus, streaming data can be continuously fed into session A
      - And concurrently processed in session B
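The slides use a pre-release API (`shared.big.matrix`). The following is a hedged sketch of the same idea against the released CRAN bigmemory package, assuming it is installed: `big.matrix(..., shared = TRUE)` allocates the matrix in shared memory, `describe()` produces the descriptor, and `attach.big.matrix()` opens a second handle onto the same memory. For brevity both handles live in one session here, and the wrapper name `share_demo` is illustrative. In practice the descriptor would be exported from one session and attached in another, as on the slides.

```r
# Assumes the CRAN bigmemory package is installed; share_demo is an
# illustrative wrapper, not part of the package.
share_demo <- function() {
  # "Session A": allocate a matrix in shared memory and describe it.
  X <- bigmemory::big.matrix(nrow = 2, ncol = 2, type = "double",
                             init = 0, shared = TRUE)
  desc <- bigmemory::describe(X)

  # "Session B": attach to the same memory through the descriptor.
  Y <- bigmemory::attach.big.matrix(desc)

  X[1, 1] <- 1.2345   # write through the first handle
  Y[1, 1]             # read it back through the second handle
}

if (requireNamespace("bigmemory", quietly = TRUE)) {
  print(share_demo())
}
```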
  • 44. Summary
      - Lack of threads not a barrier to concurrent analysis
      - Packages like bigmemory, nws, etc. facilitate decoupling via IPC
      - nws goes a step further, with a distributed workspace
      - Many applications for streaming data:
          - Data collection/monitoring
          - Development of pricing/risk algorithms
          - Low-frequency execution (??)
          - ...
  • 45. References
      - http://cran.r-project.org/web/packages/bigmemory/
      - http://www.cs.uiowa.edu/~luke/R/thrgui/
      - http://www.milbo.users.sonic.net/ra/index.html
      - http://www.cs.kent.ac.uk/projects/cxxr/
      - http://www.theresearchkitchen.com/blog