SlideShare a Scribd company logo
1 of 24
Download to read offline
R and the Data Science
   ToolKit (RDSTK)
             Ryan Elmore
    National Renewable Energy Lab

             @rtelmore
        rtelmore@gmail.com
What is Data Science?
What is Data Science?
•   Short Answer: Statistics!
What is Data Science?
         •      Short Answer: Statistics!

         •      Long Answer: Some
                hybrid of math/statistics,
                hacking skills, substantive
                expertise, possibly
                nunchaku and/or bow
                hunting skills [1]

         •      However nebulous,
                there is a lot of hype                                                  http://www.dataists.com/2010/09/the-data-science-venn-diagram/


                around the term.
[1] - “You know, like nunchaku skills, bow hunting skills, computer hacking skills... Girls only want boyfriends who have great skills.” -- Napoleon Dynamite
Hype?
Hype Indeed!
And Even More Hype...
The Economist May 14-20,
2011: “Corporate chefs
are in demand again, office
rents are soaring and the
pay being offered to
talented folk in fashionable
fields like data science
is reaching Hollywood
levels.”
DATASCIENCETOOLKIT
• www.datasciencetoolkit.com
• Pete Warden [2] is the author
• You can download and run it as a self-
  contained virtual machine, python,
  javascript, or use the API online.
• To call the API, you can make either a GET
  or POST request.
• No R package...until now
    [2] - You may remember him from classics such as his fight with facebook and/or the iphone tracking stuff.
RDSTK
• github.com/rtelmore/RDSTK
• All of the DSTK functionality is supported
  except for geodict and file2text. (hmmm,
  hack?)
• Dependencies: plyr, rcurl, rjson
• Typical usage involves passing in a text
  string and a data.frame or json string is
Github
RDSTK:
       The R Package for DSTK
Supported functions:
   ★   street2coordinates
   ★   coordinates2politics     getURL
   ★   ip2coordinates
   ★   text2sentences
   ★   html2text
   ★   html2story             curlPerform
   ★   text2people
   ★   text2times
street2coordinates()
street2coordinates <- function(address, session = getCurlHandle()){
  api <- "http://www.datasciencetoolkit.org/street2coordinates/"
  get.addy <- getURL(paste(api, URLencode(address), sep = ""),
                     curl = session)
  result <- ldply(fromJSON(get.addy), data.frame)
  names(result)[1] <- "full.address"
  return(result)
}


address can be “5874 Green Dr., Florence, KY”
session allows the user to specify curl parameters
fromJSON and ldply R-ify everything
street2coordinates using
   the DSTK website
text2people()
text2people <- function(text, session=getCurlHandle()) {
  api <- "http://www.datasciencetoolkit.org/text2people"
  r = dynCurlReader()
  curlPerform(postfields=text, url=api, post=1L,
              writefunction=r$update, curl=session)
  result <- ldply(fromJSON(r$value()), data.frame)
  return(result)
}


Similar to previous function in its inputs
Note that we are calling the API using a POST
request (postfields=text)
Go To The R Session
Building an R Package

Resources:
   ★   Writing R Extensions
   ★   http://cran.r-project.org/doc/manuals/R-exts.html
   ★   Friedrich Leisch’s Tutorial
Package Creation
• package.skeleton(name, list, et al.) # RTFM
• Essentially, this will create the directory
  hierarchy, help files, auxiliary files, etc.
• Once everything is in working order, it’s a
  simple R CMD BUILD package_name from
  the command line to create tar.gz
• Then R CMD CHECK package_name
Description File
Package: RDSTK
Type: Package
Title: An R wrapper for the Data Science Toolkit API
Version: 1.0
Depends: plyr, rjson, RCurl
Date: 2011-04-30
Author: Ryan Elmore
Maintainer: Ryan Elmore <rtelmore@gmail.com>
Description: This package provides an R interface to Pete Warden's Data
              Science Toolkit. See www.datasciencetoolkit.org for more
              information. The source code for this package can be
               found at github.com/rtelmore/RDSTK Happy hacking!
License: BSD
LazyLoad: yes
Data Science in Action!
• I asked everybody where they are from if
  not Denver.
• 16 or so respondents and I added a few
  myself
• Inspired by a recent post on FlowingData [3]
• Unfortunately, I asked this before I knew
  how to use the RDSTK!
           [3] - http://flowingdata.com/2011/05/11/how-to-map-connections-with-great-circles/
Where are we from?




     Note to @RevoDavid: had to cut out Adelaide for aesthetics!
U.S.-centric Version
                            Going to Belgium, not CA




Adelaide, Aus
Code
• I’ll put it on github [4] and my blog [5]
• Packages: maps & geosphere (& ggplot2)
    map("state", col="#f2f2f2", fill=TRUE, bg="white", lwd=0.15)
    for(i in 1:dim(places)[1]){
      inter <- gcIntermediate(denver[2:1], places[i, 5:4], n=50,
                               addStartEnd=TRUE)
      lines(inter, col="navy")
    }


•   I tried using ggplot2, but couldn’t limit the lat
    and long appropriately.
                            [4] - github.com/rtelmore
                        [5] - thelogcabin.wordpress.com
Summary
• The RDSTK is available on github; I haven’t
  even thought about packaging it up for
  CRAN yet.
• Feel free to add city2coordinates; the
  night’s project!
• Unfortunately, the term ‘data science’ has a
  lot of traction.
• We all need a raise! :)
Acknowledgments
• Pete Warden for making the DSTK
• Twitter and the #rstats hashtag
• TwitteR and InfoChimps R packages
• Duncan Temple Lang for answering a
  question on the Rstats-help mailing list
• Andy Gayton [6] and “Noah” from the
  StackOverflow site.
          [6] - Check out staticloud.com for all your static website hosting needs!

More Related Content

What's hot

100 Billion Documents And Counting: Rebuilding Message Search at Slack - Josh...
100 Billion Documents And Counting: Rebuilding Message Search at Slack - Josh...100 Billion Documents And Counting: Rebuilding Message Search at Slack - Josh...
100 Billion Documents And Counting: Rebuilding Message Search at Slack - Josh...Lucidworks
 
PuppetConf 2017: How People Actually Write Puppet- Gareth Rushgrove, Puppet
PuppetConf 2017: How People Actually Write Puppet- Gareth Rushgrove, PuppetPuppetConf 2017: How People Actually Write Puppet- Gareth Rushgrove, Puppet
PuppetConf 2017: How People Actually Write Puppet- Gareth Rushgrove, PuppetPuppet
 
Block replication on HDFS
Block replication on HDFSBlock replication on HDFS
Block replication on HDFSKoos van Strien
 
The Reality of Digital Transfer @ArchivesNZ
The Reality of Digital Transfer @ArchivesNZThe Reality of Digital Transfer @ArchivesNZ
The Reality of Digital Transfer @ArchivesNZRoss Spencer
 

What's hot (7)

100 Billion Documents And Counting: Rebuilding Message Search at Slack - Josh...
100 Billion Documents And Counting: Rebuilding Message Search at Slack - Josh...100 Billion Documents And Counting: Rebuilding Message Search at Slack - Josh...
100 Billion Documents And Counting: Rebuilding Message Search at Slack - Josh...
 
PuppetConf 2017: How People Actually Write Puppet- Gareth Rushgrove, Puppet
PuppetConf 2017: How People Actually Write Puppet- Gareth Rushgrove, PuppetPuppetConf 2017: How People Actually Write Puppet- Gareth Rushgrove, Puppet
PuppetConf 2017: How People Actually Write Puppet- Gareth Rushgrove, Puppet
 
Block replication on HDFS
Block replication on HDFSBlock replication on HDFS
Block replication on HDFS
 
The Reality of Digital Transfer @ArchivesNZ
The Reality of Digital Transfer @ArchivesNZThe Reality of Digital Transfer @ArchivesNZ
The Reality of Digital Transfer @ArchivesNZ
 
Jisc
JiscJisc
Jisc
 
Ring
RingRing
Ring
 
Cscope and ctags
Cscope and ctagsCscope and ctags
Cscope and ctags
 

Similar to DRUG - RDSTK Talk

Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
 
Querying 1.8 billion reddit comments with python
Querying 1.8 billion reddit comments with pythonQuerying 1.8 billion reddit comments with python
Querying 1.8 billion reddit comments with pythonDaniel Rodriguez
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with HadoopJosh Devins
 
Hands on with Apache Spark
Hands on with Apache SparkHands on with Apache Spark
Hands on with Apache SparkDan Lynn
 
Extensions on PostgreSQL
Extensions on PostgreSQLExtensions on PostgreSQL
Extensions on PostgreSQLAlpaca
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
 
Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015dhiguero
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterDon Drake
 
Scaling PyData Up and Out
Scaling PyData Up and OutScaling PyData Up and Out
Scaling PyData Up and OutTravis Oliphant
 
Quadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystemQuadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystemRob Vesse
 
Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...source{d}
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopAmanda Casari
 
Big data beyond the JVM - DDTX 2018
Big data beyond the JVM -  DDTX 2018Big data beyond the JVM -  DDTX 2018
Big data beyond the JVM - DDTX 2018Holden Karau
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Holden Karau
 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark Hubert Fan Chiang
 
Php melb cqrs-ddd-predaddy
Php melb cqrs-ddd-predaddyPhp melb cqrs-ddd-predaddy
Php melb cqrs-ddd-predaddyDouglas Reith
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Josef A. Habdank
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksAnyscale
 
Introduction to MapReduce Data Transformations
Introduction to MapReduce Data TransformationsIntroduction to MapReduce Data Transformations
Introduction to MapReduce Data Transformationsswooledge
 

Similar to DRUG - RDSTK Talk (20)

Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Querying 1.8 billion reddit comments with python
Querying 1.8 billion reddit comments with pythonQuerying 1.8 billion reddit comments with python
Querying 1.8 billion reddit comments with python
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
 
Hands on with Apache Spark
Hands on with Apache SparkHands on with Apache Spark
Hands on with Apache Spark
 
Extensions on PostgreSQL
Extensions on PostgreSQLExtensions on PostgreSQL
Extensions on PostgreSQL
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 
Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
 
Scaling PyData Up and Out
Scaling PyData Up and OutScaling PyData Up and Out
Scaling PyData Up and Out
 
Quadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystemQuadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystem
 
Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
 
Big data beyond the JVM - DDTX 2018
Big data beyond the JVM -  DDTX 2018Big data beyond the JVM -  DDTX 2018
Big data beyond the JVM - DDTX 2018
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018
 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark
 
Php melb cqrs-ddd-predaddy
Php melb cqrs-ddd-predaddyPhp melb cqrs-ddd-predaddy
Php melb cqrs-ddd-predaddy
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with Databricks
 
Pig latin
Pig latinPig latin
Pig latin
 
Introduction to MapReduce Data Transformations
Introduction to MapReduce Data TransformationsIntroduction to MapReduce Data Transformations
Introduction to MapReduce Data Transformations
 

Recently uploaded

Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxUdaiappa Ramachandran
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdfPedro Manuel
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesMd Hossain Ali
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Commit University
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesDavid Newbury
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostMatt Ray
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDELiveplex
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsSafe Software
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Brian Pichman
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioChristian Posta
 

Recently uploaded (20)

Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptx
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdf
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 

DRUG - RDSTK Talk

  • 1. R and the Data Science ToolKit (RDSTK) Ryan Elmore National Renewable Energy Lab @rtelmore rtelmore@gmail.com
  • 2. What is Data Science?
  • 3. What is Data Science? • Short Answer: Statistics!
  • 4. What is Data Science? • Short Answer: Statistics! • Long Answer: Some hybrid of math/statistics, hacking skills, substantive expertise, possibly nunchaku and/or bow hunting skills [1] • However nebulous, there is a lot of hype http://www.dataists.com/2010/09/the-data-science-venn-diagram/ around the term. [1] - “You know, like nunchaku skills, bow hunting skills, computer hacking skills... Girls only want boyfriends who have great skills.” -- Napoleon Dynamite
  • 7. And Even More Hype... The Economist May 14-20, 2011: “Corporate chefs are in demand again, office rents are soaring and the pay being offered to talented folk in fashionable fields like data science is reaching Hollywood levels.”
  • 8. DATASCIENCETOOLKIT • www.datasciencetoolkit.com • Pete Warden [2] is the author • You can download and run it as a self- contained virtual machine, python, javascript, or use the API online. • To call the API, you can make either a GET or POST request. • No R package...until now [2] - You may remember him from classics such as his fight with facebook and/or the iphone tracking stuff.
  • 9. RDSTK • github.com/rtelmore/RDSTK • All of the DSTK functionality is supported except for geodict and file2text. (hmmm, hack?) • Dependencies: plyr, rcurl, rjson • Typical usage involves passing in a text string and a data.frame or json string is
  • 11. RDSTK: The R Package for DSTK Supported functions: ★ street2coordinates ★ coordinates2politics getURL ★ ip2coordinates ★ text2sentences ★ html2text ★ html2story curlPerform ★ text2people ★ text2times
  • 12. street2coordinates() street2coordinates <- function(address, session = getCurlHandle()){ api <- "http://www.datasciencetoolkit.org/street2coordinates/" get.addy <- getURL(paste(api, URLencode(address), sep = ""), curl = session) result <- ldply(fromJSON(get.addy), data.frame) names(result)[1] <- "full.address" return(result) } address can be “5874 Green Dr., Florence, KY” session allows the user to specify curl parameters fromJSON and ldply R-ify everything
  • 13. street2coordinates using the DSTK website
  • 14. text2people() text2people <- function(text, session=getCurlHandle()) { api <- "http://www.datasciencetoolkit.org/text2people" r = dynCurlReader() curlPerform(postfields=text, url=api, post=1L, writefunction=r$update, curl=session) result <- ldply(fromJSON(r$value()), data.frame) return(result) } Similar to previous function in its inputs Note that we are calling the API using a POST request (postfields=text)
  • 15. Go To The R Session
  • 16. Building an R Package Resources: ★ Writing R Extensions ★ http://cran.r-project.org/doc/manuals/R-exts.html ★ Friedrich Leisch’s Tutorial
  • 17. Package Creation • package.skeleton(name, list, et al.) # RTFM • Essentially, this will create the directory hierarchy, help files, auxiliary files, etc. • Once everything is in working order, it’s a simple R CMD BUILD package_name from the command line to create tar.gz • Then R CMD CHECK package_name
  • 18. Description File Package: RDSTK Type: Package Title: An R wrapper for the Data Science Toolkit API Version: 1.0 Depends: plyr, rjson, RCurl Date: 2011-04-30 Author: Ryan Elmore Maintainer: Ryan Elmore <rtelmore@gmail.com> Description: This package provides an R interface to Pete Warden's Data Science Toolkit. See www.datasciencetoolkit.org for more information. The source code for this package can be found at github.com/rtelmore/RDSTK Happy hacking! License: BSD LazyLoad: yes
  • 19. Data Science in Action! • I asked everybody where they are from if not Denver. • 16 or so respondents and I added a few myself • Inspired by a recent post on FlowingData [3] • Unfortunately, I asked this before I knew how to use the RDSTK! [3] - http://flowingdata.com/2011/05/11/how-to-map-connections-with-great-circles/
  • 20. Where are we from? Note to @RevoDavid: had to cut out Adelaide for aesthetics!
  • 21. U.S.-centric Version Going to Belgium, not CA Adelaide, Aus
  • 22. Code • I’ll put it on github [4] and my blog [5] • Packages: maps & geosphere (& ggplot2) map("state", col="#f2f2f2", fill=TRUE, bg="white", lwd=0.15) for(i in 1:dim(places)[1]){ inter <- gcIntermediate(denver[2:1], places[i, 5:4], n=50, addStartEnd=TRUE) lines(inter, col="navy") } • I tried using ggplot2, but couldn’t limit the lat and long appropriately. [4] - github.com/rtelmore [5] - thelogcabin.wordpress.com
  • 23. Summary • The RDSTK is available on github; I haven’t even thought about packaging it up for CRAN yet. • Feel free to add city2coordinates; the night’s project! • Unfortunately, the term ‘data science’ has a lot of traction. • We all need a raise! :)
  • 24. Acknowledgments • Pete Warden for making the DSTK • Twitter and the #rstats hashtag • TwitteR and InfoChimps R packages • Duncan Temple Lang for answering a question on the Rstats-help mailing list • Andy Gayton [6] and “Noah” from the StackOverflow site. [6] - Check out staticloud.com for all your static website hosting needs!