Introduction to the Future of R




                                                   Avram Aelony
                                                         November 2010




Wednesday, November 17, 2010
Talk Outline:
     I. Strengths

     II. Criticisms

     III. Challenges

     IV. Remedies and Solutions

     V. The Future




Quick disclaimer:
        - I don’t consider myself an R expert

        - I don’t have a crystal ball informing of the Future

        - This talk is about polite observations

        - The future is dynamic

            YMMD <- your-mileage-may-differ()




R’s Strengths
        - many good things, too many to mention individually

         ... but let’s try...




Strengths of R

    - A high quality statistical platform, yielding reproducible results

    - Open Source, free and available

    - Large, active community

    - Intuitive language structure

    - Data as rows and columns

    - Package plugin architecture - there are many packages, top packages in widespread use

    - Distributed contributions, written and maintained by many individuals

    - Data processing for most individual needs.

    - Emerging success and increasing corporate adoption
        e.g. for some corporate needs (often prototyping and ad hoc analytics)



Strengths of R

   More succinctly... based on a paraphrasing of a post by Ted Dunning *

   I. Library

   II. Language

   III. Community


   * http://www.stat.columbia.edu/~cook/movabletype/archives/2010/09/the_future_of_r.html




Criticisms of R
   - Small grievances: syntax, elegance, and managing complexity
  “Most packages are very good, but I regret to say some are pretty inefficient and others downright
  dangerous.”
            -Bill Venables, quote from 2007
                   http://www.mail-archive.com/r-help@r-project.org/msg06853.html


  “...R functions used to be lean and mean, and now they’re full of exception-handling and calls to other
  packages. R functions are spaghetti-like messes of connections in which I keep expecting to run into
  syntax like ‘GOTO 120’...”
            - comment taken from Gelman blog on the future of R.
           http://www.stat.columbia.edu/~cook/movabletype/archives/2010/09/the_future_of_r.html




 - Larger grievances: memory and inefficiency
 “One of the most vexing issues in R is memory. For anyone who works
 with large datasets - even if you have 64-bit R running and lots (e.g.,
 18Gb) of RAM, memory can still confound, frustrate, and stymie even
 experienced R users.”

                                 http://www.matthewckeller.com/html/memory.html
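
 A back-of-the-envelope sketch of why memory bites, assuming every value is
 stored as an 8-byte double. The function name estimate_gb and the dataset
 dimensions are made up for illustration.

```r
# Rough estimate of the RAM needed to hold a numeric table in R.
# Assumption: every column is stored as a double (8 bytes per value).
estimate_gb <- function(n_rows, n_cols, bytes_per_value = 8) {
  n_rows * n_cols * bytes_per_value / 1024^3
}

# Hypothetical dataset: 100 million rows by 20 numeric columns.
needed_gb <- estimate_gb(1e8, 20)
print(needed_gb)  # roughly 14.9 GB, before R makes any working copies

# Sanity check against a real object: one million doubles take about 8 MB.
x <- numeric(1e6)
print(object.size(x), units = "MB")
```

 A single copy made during a transformation would push such a table well past
 the 18Gb machine mentioned in the quote above.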




However, greater challenges for R lie ahead

        I. Big Data is coming...

        II. Isn’t Big Data already here?



              How can we imagine an ideal environment to address Big Data?




- What is Big Data?

     "Every 2 Days We Create As Much Information As We Did Up To 2003"
           - Eric Schmidt, Chairman & CEO, Google.
                        http://techcrunch.com/2010/08/04/schmidt-data/




      "Data is abundant, Information is useful, Knowledge is precious."
               http://hadoop-karma.blogspot.com/2010/03/how-much-data-is-generated-on-internet.html




     - Freshness: this data will self-destruct in 5 seconds...!

           "How Much Time Do You Have Before Web‐Generated Leads Go Cold?"
               http://www.matrixintegratedmarketing.com/MIT.pdf



      Get ready:
           “Web Scale Big Data - 100’s of Terabytes”
                  -John Sichi, Facebook, on intended usage with Hive.
                        http://www.slideshare.net/jsichi/hive-evolution-apachecon-2010 slide #6.




What is Big Data?




                               Wikipedia - http://en.wikipedia.org/wiki/Big_data




Solving the “Big” Data problem


             ... as I see it,

                          there are 5 competing possible solution “avenues”




The “Big” Data problem:

    Solution #1

                  Use R in conjunction with other specialized tools.


       Examples:
       - R remains a language for small datasets but has “hooks” and “bridges”
         that enable use with MapReduce-style tools (Hadoop, Streaming, Hive, Pig,
         Cascading, others...)
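
       A minimal sketch of what such a "bridge" might look like: a word-count
       mapper and reducer in base R, written in the tab-separated key/value
       style that Hadoop Streaming expects. The function names map_lines and
       reduce_records are hypothetical, and the dry run below is local only.

```r
# Mapper: turn each input line into "word<TAB>1" records.
map_lines <- function(lines) {
  words <- unlist(strsplit(tolower(lines), "[^a-z]+"))
  words <- words[nzchar(words)]
  if (length(words) == 0) return(character(0))
  paste(words, 1, sep = "\t")
}

# Reducer: sum the counts for each key.
reduce_records <- function(records) {
  parts <- strsplit(records, "\t", fixed = TRUE)
  keys <- vapply(parts, `[`, character(1), 1)
  vals <- as.numeric(vapply(parts, `[`, character(1), 2))
  totals <- tapply(vals, keys, sum)
  paste(names(totals), totals, sep = "\t")
}

# Local dry run; sort() stands in for Hadoop's shuffle-and-sort phase.
input <- c("R is free", "R is open source")
mapped <- map_lines(input)
reduced <- reduce_records(sort(mapped))
print(reduced)
```

       In an actual streaming job, each function would presumably live in its
       own script that reads with readLines(file("stdin")) and emits records
       with writeLines().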




The “Big” Data problem:

    Solution #2
                 Packages that enable new functionality for reading
                        and processing very large data sets

      Examples:
      - Saptarshi Guha’s RHIPE (R and Hadoop Processing Environment)
      - Kane & Emerson’s bigmemory
      - Adler et al.’s ff package
      - Henrik Bengtsson’s R.huge package (deprecated)
      - (many new yet-to-be-developed possibilities here)

      So...
        enhance functions, but
        no enhancements to the core language


The “Big” Data problem:

    Solution #3
         Same language but have R “do the right thing”
                      under the hood.
   Examples:
   - Out-of-memory algorithms,
        think: “I see you’re trying to analyze a sizable amount of data...”

   - Either seamlessly or after user approval to go ahead...
    # perhaps, perhaps... (enormous.data is a hypothetical argument)
    d <- read.table(file = "s3://mybucket.name", enormous.data = TRUE)



     or if possible, enhance core language as well as
     functionality!!!
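
     Until R does this under the hood, an out-of-memory computation can be
     sketched by hand in base R. The example below reads a CSV in fixed-size
     chunks and keeps only running totals in RAM. The function name
     chunked_mean and the demonstration file are hypothetical.

```r
# Out-of-memory style mean: read a CSV in fixed-size chunks so that only
# one chunk is held in RAM at a time.
chunked_mean <- function(path, column, chunk_rows = 1000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # Parse the header line; scan() strips the quotes written by write.csv().
  header <- scan(text = readLines(con, n = 1), what = "", sep = ",", quiet = TRUE)
  total <- 0
  n <- 0
  repeat {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0) break
    chunk <- read.csv(text = lines, header = FALSE, col.names = header)
    total <- total + sum(chunk[[column]])
    n <- n + nrow(chunk)
  }
  total / n
}

# Small demonstration file; a real use case would be far larger.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10, value = (1:10) * 2), tmp, row.names = FALSE)
result <- chunked_mean(tmp, "value", chunk_rows = 3)
print(result)
```

     The mean works here because it decomposes into a running sum and a
     running count; statistics without such a decomposition need different
     algorithms.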




The “Big” Data problem:

    Solution #4 - Completely start over




                                                                                              2008

                        http://www.stat.auckland.ac.nz/%7Eihaka/downloads/Compstat-2008.pdf




The “Big” Data problem:


                                                                                                2010

                               http://www.stat.auckland.ac.nz/%7Eihaka/downloads/JSM-2010.pdf




The “Big” Data problem:

       The Ihaka/Lang “Back to the Future” paper came out in 2008.

       The Ihaka “Lessons Learned” 2010 paper mentions:

           - the need for an “effective language for handling large-scale computations”

           - nostalgia for Lisp


       Have there been any Lisp-like advances since then?

       What about Clojure?




The “Big” Data problem:

    Solution #5 - Does Clojure fit the bill?
           H0: Clojure already has many of the things Ross Ihaka would ask for
           H1: Really?


                                                         -Rich Hickey
                                                         http://clojure.org/rationale




                    Clojure may be seen as a solution, or as an example path for R to
                               follow, improve upon, or choose to differ...




Clojure
                               -Rich Hickey
                                  http://clojure.org




The problem with many new languages is that initially there are no libraries...

      Clojure already has many, and can use any Java library directly as necessary.


     - Core Clojure

     - Incanter: "a Clojure-based, R-like platform for statistical computing and graphics"
                 http://incanter.org/

     - Infer: "a (Clojure) library for machine learning and statistical inference,
               designed to be used in real production systems."
                 https://github.com/bradford/infer

     - Cascalog: “Data processing on Hadoop without the hassle”
                “a Clojure-based query language for Hadoop”




What will the Future really hold for R?




Thanks for listening...




Appendix:

                   A few slides on Clojure, and three
                   powerful Clojure libraries:

                   Incanter
                   Infer
                   Cascalog




Clojure - a quick tour
                                  -Rich Hickey
                                     http://clojure.org




David Edgar Liebke’s Incanter




                               Please see http://incanter.org/docs/data-sorcery-new.pdf
                                for an excellent intro to Incanter.




Below are example snippets from Incanter




Bradford Cross’ Infer:
          "a (Clojure) library for machine learning and statistical inference, designed
              to be used in real production systems."




            https://github.com/bradford/infer




Nathan Marz’s Cascalog:
       http://nathanmarz.com/blog/introducing-cascalog-a-clojure-based-query-language-for-hado.html




Los Angeles R users group - Nov 17 2010 - Part 2
