R and the Data Science ToolKit (RDSTK)
Ryan Elmore
National Renewable Energy Lab

@rtelmore
rtelmore@gmail.com
What is Data Science?
• Short Answer: Statistics!
• Long Answer: Some hybrid of math/statistics,
  hacking skills, substantive expertise, possibly
  nunchaku and/or bow hunting skills [1]
• However nebulous, there is a lot of hype
  around the term.
http://www.dataists.com/2010/09/the-data-science-venn-diagram/
[1] - “You know, like nunchaku skills, bow hunting skills, computer hacking skills... Girls only want boyfriends who have great skills.” -- Napoleon Dynamite
Hype?
Hype Indeed!
And Even More Hype...
The Economist May 14-20,
2011: “Corporate chefs
are in demand again, office
rents are soaring and the
pay being offered to
talented folk in fashionable
fields like data science
is reaching Hollywood
levels.”
DATASCIENCETOOLKIT
• www.datasciencetoolkit.org
• Pete Warden [2] is the author
• You can download and run it as a self-contained
  virtual machine, use it from Python or
  JavaScript, or call the API online.
• To call the API, you can make either a GET
  or POST request.
• No R package...until now
    [2] - You may remember him from classics such as his fight with Facebook and/or the iPhone tracking stuff.
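To make the GET versus POST distinction concrete, here is a minimal base-R sketch that only builds the two kinds of request and sends nothing. The helper names `build_get` and `build_post` are hypothetical and exist only for this illustration; the endpoint names are the ones used by the functions later in the talk.

```r
api_root <- "http://www.datasciencetoolkit.org"

# GET: the input travels in the URL itself, so it must be URL-encoded.
# (build_get is a hypothetical helper, not part of RDSTK.)
build_get <- function(endpoint, input) {
  paste0(api_root, "/", endpoint, "/", URLencode(input))
}

# POST: the URL stays fixed and the input travels in the request body.
# (build_post is a hypothetical helper, not part of RDSTK.)
build_post <- function(endpoint, input) {
  list(url = paste0(api_root, "/", endpoint), body = input)
}

get_url  <- build_get("street2coordinates", "5874 Green Dr., Florence, KY")
post_req <- build_post("text2people", "Pete Warden is the author")

print(get_url)
str(post_req)
```

GET suits short inputs such as an address; POST is the natural choice for longer text, since nothing has to fit inside the URL.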
RDSTK
• github.com/rtelmore/RDSTK
• All of the DSTK functionality is supported
  except for geodict and file2text. (hmmm,
  hack?)
• Dependencies: plyr, RCurl, rjson
• Typical usage involves passing in a text
  string; a data.frame or JSON string is returned.
GitHub
RDSTK: The R Package for DSTK
Supported functions:
   Called with getURL:
   ★   street2coordinates
   ★   coordinates2politics
   ★   ip2coordinates
   Called with curlPerform:
   ★   text2sentences
   ★   html2text
   ★   html2story
   ★   text2people
   ★   text2times
street2coordinates()
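library(RCurl)  # getCurlHandle, getURL
library(rjson)  # fromJSON
library(plyr)   # ldply

street2coordinates <- function(address, session = getCurlHandle()){
  api <- "http://www.datasciencetoolkit.org/street2coordinates/"
  # GET request: the URL-encoded address is appended to the endpoint
  get.addy <- getURL(paste(api, URLencode(address), sep = ""),
                     curl = session)
  # One row per address; ldply puts the address key in the first column
  result <- ldply(fromJSON(get.addy), data.frame)
  names(result)[1] <- "full.address"
  return(result)
}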


address can be “5874 Green Dr., Florence, KY”
session allows the user to specify curl parameters
fromJSON parses the JSON response; ldply turns it into a data.frame
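The reshaping step can be tried without calling the API. The sketch below starts from a hand-written list standing in for what fromJSON would return, and uses a base-R stand-in for ldply (swapped in so the example needs no packages). The field names and coordinate values are made up for illustration and are not real API output.

```r
# Stand-in for fromJSON(get.addy): a named list keyed by the input address.
# Field names and values are illustrative only, not real API output.
parsed <- list(
  "5874 Green Dr., Florence, KY" = list(latitude = 39.0, longitude = -84.6)
)

# Base-R stand-in for ldply(parsed, data.frame): one row per list element,
# with the element name carried along in a leading ".id" column.
rows <- lapply(names(parsed), function(key) {
  data.frame(.id = key, parsed[[key]], stringsAsFactors = FALSE)
})
result <- do.call(rbind, rows)

# Same renaming step as in street2coordinates()
names(result)[1] <- "full.address"
print(result)
```

This also shows why the function renames the first column: the address comes back as the list key, and ldply stores that key in a column named ".id".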
street2coordinates using the DSTK website
text2people()
text2people <- function(text, session=getCurlHandle()) {
  api <- "http://www.datasciencetoolkit.org/text2people"
  # Collects the response body as curl receives it
  r <- dynCurlReader()
  # POST request: the text is sent as the request body
  curlPerform(postfields=text, url=api, post=1L,
              writefunction=r$update, curl=session)
  result <- ldply(fromJSON(r$value()), data.frame)
  return(result)
}


Similar to the previous function in its inputs
Note that we are calling the API using a POST request (postfields=text)
Go To The R Session
Building an R Package

Resources:
   ★   Writing R Extensions
   ★   http://cran.r-project.org/doc/manuals/R-exts.html
   ★   Friedrich Leisch’s Tutorial
Package Creation
• package.skeleton(name, list, et al.) # RTFM
• Essentially, this will create the directory
  hierarchy, help files, auxiliary files, etc.
• Once everything is in working order, it’s a
  simple R CMD build package_name from
  the command line to create the tar.gz
• Then R CMD check package_name
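As a rough illustration of the first step, the sketch below runs package.skeleton on a single toy function in a temporary directory and lists what it creates. The names `demoPkg` and `hello` are hypothetical, chosen only for this example. The shell commands appear as comments because they run outside R.

```r
# A throwaway environment holding one function to package.
# `demoPkg` and `hello` are hypothetical names used only for this sketch.
env <- new.env()
env$hello <- function() "Hello from demoPkg"

pkg_root <- tempdir()
package.skeleton(name = "demoPkg", list = "hello", environment = env,
                 path = pkg_root, force = TRUE)

# Show the generated hierarchy: DESCRIPTION, NAMESPACE, R/, man/, ...
pkg_dir <- file.path(pkg_root, "demoPkg")
print(list.files(pkg_dir, recursive = TRUE))

# After editing the generated help files, from a shell in the directory
# that contains demoPkg:
#   R CMD build demoPkg
#   R CMD check demoPkg
```

The generated help files are templates, so expect to fill them in before the build and check steps go through cleanly.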
Description File
Package: RDSTK
Type: Package
Title: An R wrapper for the Data Science Toolkit API
Version: 1.0
Depends: plyr, rjson, RCurl
Date: 2011-04-30
Author: Ryan Elmore
Maintainer: Ryan Elmore <rtelmore@gmail.com>
Description: This package provides an R interface to Pete Warden's Data
    Science Toolkit. See www.datasciencetoolkit.org for more
    information. The source code for this package can be
    found at github.com/rtelmore/RDSTK. Happy hacking!
License: BSD
LazyLoad: yes
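A DESCRIPTION file is plain "Field: value" text, so it can be read back from R as a quick sanity check. The sketch below writes a few of the fields above to a temporary file and parses them with read.dcf. The temporary file path is an assumption of the example, not part of the package.

```r
# Write a cut-down copy of the DESCRIPTION fields to a temporary file.
desc_path <- file.path(tempdir(), "DESCRIPTION")
writeLines(c(
  "Package: RDSTK",
  "Type: Package",
  "Title: An R wrapper for the Data Science Toolkit API",
  "Version: 1.0",
  "Depends: plyr, rjson, RCurl",
  "License: BSD"
), desc_path)

# read.dcf returns a character matrix with one column per field.
desc <- read.dcf(desc_path)

# Split the comma-separated Depends field into package names.
deps <- strsplit(desc[1, "Depends"], ",\\s*")[[1]]
print(deps)
```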
Data Science in Action!
• I asked everybody where they are from if
  not Denver.
• 16 or so respondents and I added a few
  myself
• Inspired by a recent post on FlowingData [3]
• Unfortunately, I asked this before I knew
  how to use the RDSTK!
           [3] - http://flowingdata.com/2011/05/11/how-to-map-connections-with-great-circles/
Where are we from?
[Figure: map of where attendees are from]
Note to @RevoDavid: had to cut out Adelaide for aesthetics!
U.S.-centric Version
[Figure: U.S.-centric map, annotated “Going to Belgium, not CA” and “Adelaide, Aus”]
Code
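• I’ll put it on GitHub [4] and my blog [5]
• Packages: maps & geosphere (& ggplot2)
    library(maps)       # map
    library(geosphere)  # gcIntermediate
    # Base map of the U.S. states
    map("state", col="#f2f2f2", fill=TRUE, bg="white", lwd=0.15)
    # One great-circle arc from Denver to each place; geosphere expects (lon, lat)
    for(i in seq_len(nrow(places))){
      inter <- gcIntermediate(denver[2:1], places[i, 5:4], n=50,
                               addStartEnd=TRUE)
      lines(inter, col="navy")
    }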


• I tried using ggplot2, but couldn’t limit the latitude
  and longitude ranges appropriately.
    [4] - github.com/rtelmore
    [5] - thelogcabin.wordpress.com
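For anyone curious what a great-circle path involves, here is a base-R sketch that interpolates points between two locations on a sphere. It is a hand-rolled stand-in using spherical linear interpolation, not geosphere's implementation, and `great_circle` is a hypothetical helper name. The coordinates are approximate and for illustration only.

```r
# Convert degrees to radians.
to_rad <- function(deg) deg * pi / 180

# Points along the great circle from `from` to `to`, each given as c(lon, lat)
# in degrees. Returns a matrix with n intermediate points plus both endpoints.
# Not defined for identical or exactly opposite points (sin(d) would be 0).
great_circle <- function(from, to, n = 50) {
  to_xyz <- function(p) {
    lon <- to_rad(p[1]); lat <- to_rad(p[2])
    c(cos(lat) * cos(lon), cos(lat) * sin(lon), sin(lat))
  }
  a <- to_xyz(from); b <- to_xyz(to)
  d <- acos(min(1, max(-1, sum(a * b))))  # angular distance between endpoints
  f <- seq(0, 1, length.out = n + 2)      # fractions along the path
  # Spherical linear interpolation between the two unit vectors.
  xyz <- (outer(sin((1 - f) * d), a) + outer(sin(f * d), b)) / sin(d)
  cbind(lon = atan2(xyz[, 2], xyz[, 1]) * 180 / pi,
        lat = atan2(xyz[, 3], sqrt(xyz[, 1]^2 + xyz[, 2]^2)) * 180 / pi)
}

denver <- c(39.74, -104.99)  # stored as (lat, lon), hence the [2:1] below
boston <- c(42.36, -71.06)   # approximate coordinates, illustration only
path <- great_circle(denver[2:1], boston[2:1], n = 50)
print(head(path, 3))
```

On top of an existing map, `lines(path, col = "navy")` would draw the arc in the same way as the loop above.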
Summary
• The RDSTK is available on GitHub; I haven’t
  even thought about packaging it up for
  CRAN yet.
• Feel free to add city2coordinates; it’s
  tonight’s project!
• Unfortunately, the term ‘data science’ has a
  lot of traction.
• We all need a raise! :)
Acknowledgments
• Pete Warden for making the DSTK
• Twitter and the #rstats hashtag
• twitteR and InfoChimps R packages
• Duncan Temple Lang for answering a
  question on the Rstats-help mailing list
• Andy Gayton [6] and “Noah” from the
  StackOverflow site.
          [6] - Check out staticloud.com for all your static website hosting needs!

DRUG - RDSTK Talk
