1) Application Programming Interfaces (APIs) allow programs and services to access and share data across different computers, operating systems, and programming languages through standardized protocols like HTTP.
2) JSON is a popular data format for APIs because it is lightweight, human-readable, and can represent complex data structures like lists and dictionaries. The R package RJSONIO can convert between R lists and their JSON representations.
3) TopicWatch is a platform that provides text analytics through a web API. The R package TopicWatchr is a wrapper that simplifies accessing this API and working with the returned data.
2. Application Programming Interfaces
Why?
I want my code to have access to your code or data... from a different
computer!
we might be using different operating systems!
different programming languages!
have different compression capabilities!
security!
etc.
At least you don't have to install tons of code or download all of the data.
3. The Internet Suggests a Solution
HyperText Transfer Protocol: HTTP
Since the WWW has caught on, HTTP has become a dominant protocol.
Pretty much all computers support some kind of HTTP client
Browsers are just fancy HTTP clients
R can be a client too!
Duncan Temple Lang's RCurl package offers R access to libcurl, a popular HTTP library.
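A minimal sketch of R acting as an HTTP client through RCurl. The endpoint http://example.com/ and the query parameter q are placeholders chosen for illustration, not part of any API discussed in these slides. The request is wrapped in tryCatch so the sketch still runs without a network connection.

```r
library(RCurl)

# Placeholder endpoint and parameter, used only for illustration.
base_url <- "http://example.com/"

# curlEscape() percent-encodes a value so it is safe inside a query string.
query <- paste0("q=", curlEscape("hello world"))
full_url <- paste0(base_url, "?", query)

# getURL() performs an HTTP GET and returns the response body as a string.
# If the request fails (e.g. no network), fall back to NA instead of stopping.
body <- tryCatch(
  getURL(full_url, timeout = 5),
  error = function(e) NA_character_
)
```

Here body holds the raw text of the response; what to do with it depends on the format the server returns.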
4. But what data will we transfer?
HTTP gives us a nearly universal way to pass data between machines. Now we have to decide what format
messages ought to have.
Let's choose something lightweight and human readable
(so no XML :p)
but it should be something easily serializable, and should have some structure
JSON is the popular choice
5. JSON
JSON looks like this:
{
  "hello" : "world",
  "universe" : 42,
  "pizza" : null,
  "cookies" : ["chocolate", "molasses", "oatmeal"],
  "eggs" : {
    "over" : "easy"
  }
}
JSON has types, can be nested, and has analogues (e.g. 'dicts' or 'hashes' or 'maps') in most major programming
languages.
smells like a list in R
The RJSONIO package, also by Duncan Temple Lang, converts R lists to and from their JSON representations.
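A short sketch of the round trip between JSON text and R lists, assuming the RJSONIO package by Duncan Temple Lang is installed. The JSON is a trimmed version of the example on this slide; the variable names are only illustrative.

```r
library(RJSONIO)

# JSON text modeled on the example above.
json_text <- '{
  "hello": "world",
  "universe": 42,
  "cookies": ["chocolate", "molasses", "oatmeal"],
  "eggs": {"over": "easy"}
}'

# fromJSON() parses JSON text into nested R structures.
parsed <- fromJSON(json_text)

# Elements are reached by name, just like any other R list.
universe <- parsed[["universe"]]
second_cookie <- parsed[["cookies"]][[2]]
egg_style <- parsed[["eggs"]][["over"]]

# toJSON() goes the other way: an R list becomes JSON text.
round_trip <- fromJSON(toJSON(list(hello = "world", universe = 42)))
```

By default fromJSON simplifies homogeneous arrays into atomic vectors, which is why the cookies come back as a character vector rather than a list.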
6. Numerous Examples
Computational
geocoding, Google, et al.
face-recognition, face.com
prediction, Google
Data
Federal Register
Bloomberg
"Data APIs/feeds available as packages in R"
asked on stats.stackexchange.com a couple of months ago. The list of packages included:
quantmod, tseries, fImport, WDI, RGoogleTrends, RGoogleDocs, twitteR, Zillow, RNYTimes,
UScensus2000, infochimps, rdatamarket, factualR, RDSTK, RBloomberg, LIM, RTAQ, IBrokers,
rnpn, RClimate
7. API example: TopicWatch
TopicWatch is a platform for text analytics and visualization
currently developing 3 interfaces to the API:
iPad app
web app
R package
We collect streaming data from a variety of sources including Twitter, RSS feeds, government publications,
and others.
8. API Outline
The API is still under development, and is unstable. We're always adding new features and polishing old ones.
Just a few concrete capabilities that are already running:
time series of n-gram frequencies & counts
aggregated at several resolutions
n-grams ranked by frequency
also aggregated at several resolutions
can be filtered by sub-grams
raw documents that contain a gram
topics that contain a gram
time series counts of documents that contain co-occurring n-grams
ranking grams by usage change between any two times
9. TopicWatchr
The R package is a thin wrapper for the HTTP API. It (unsurprisingly) works by
sending a request to a URL
parsing JSON results
re-arranging lists into data frames
But it has some nice functionality to make working with the API a bit
smoother:
parses timestamps in data
paginates large requests automatically
handles authentication
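The parse-and-rearrange steps can be sketched roughly as follows. This is not TopicWatchr's actual code: the response body, its field names (time, gram, freq), and the resulting column layout are all assumptions made up for illustration, since the real response format is not shown in these slides. The HTTP request itself is replaced by a hard-coded string so the sketch runs offline.

```r
library(RJSONIO)

# Stand-in for the body an HTTP request might return (hypothetical shape).
response <- '[
  {"time": "2011-11-15 08:00:00", "gram": "Ron Paul", "freq": 0},
  {"time": "2011-11-15 08:30:00", "gram": "Ron Paul", "freq": 0.00148}
]'

# Parse JSON; simplify = FALSE keeps every record as a plain named list.
records <- fromJSON(response, simplify = FALSE)

# Re-arrange the list of records into a data frame, one column per field,
# and parse the timestamp strings into POSIXct along the way.
df <- data.frame(
  time = as.POSIXct(
    vapply(records, function(r) r[["time"]], character(1)),
    tz = "UTC"
  ),
  gram = vapply(records, function(r) r[["gram"]], character(1)),
  freq = vapply(records, function(r) as.numeric(r[["freq"]]), numeric(1)),
  stringsAsFactors = FALSE
)
```

Using vapply rather than sapply makes the sketch fail loudly if a record is missing a field or has an unexpected type, which is handy while an API is still changing.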
10. Example 1: Presidential Candidates
Code to get data:
library(TopicWatchr)
set_credentials("PRUG", "12345")

candidates <- c("Herman Cain", "Mitt Romney", "Rick Perry",
                "Newt Gingrich", "Ron Paul", "Michelle Bachmann",
                "Jon Huntsman", "Rick Santorum")

twitter_counts <- wordCounts("twitter_sample", candidates)
rss_counts <- wordCounts("rss-majorpapers", candidates)
The wordCounts function constructs the proper API call, makes the call, and arranges the results into a data
frame. Each data frame looks like this:
'data.frame': 5 obs. of 9 variables:
$ times : POSIXct, format: "2011-11-15 08:00:00" "2011-11-15 08:30:00" ...
$ Herman Cain : num 0 0.00148 0 0.00326 0.00274
$ Mitt Romney : num 0 0.00148 0 0.00326 0.00548
$ Rick Perry : num 0 0.00148 0 0 0
$ Newt Gingrich : num 0 0.00148 0 0.00326 0
$ Ron Paul : num 0 0 0 0 0
$ Michelle Bachmann: num 0 0 0 0 0
$ Jon Huntsman : num 0 0 0 0 0
$ Rick Santorum : num 0 0.00148 0 0 0
Then we combine data frames and polish with ggplot2 ...
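One way the combining step might look, as a sketch rather than the code behind the original plots. The two small data frames below are toy stand-ins with made-up values that mimic the structure shown above (a times column plus one column per candidate). The helper to_long and the column names candidate, frequency, and source are hypothetical.

```r
library(ggplot2)

# Toy stand-ins for the wordCounts() results; the values are invented.
times <- as.POSIXct("2011-11-15 08:00:00", tz = "UTC") + 1800 * 0:2
twitter_counts <- data.frame(times = times,
                             "Ron Paul" = c(0, 0.001, 0.002),
                             "Mitt Romney" = c(0.001, 0, 0.003),
                             check.names = FALSE)
rss_counts <- data.frame(times = times,
                         "Ron Paul" = c(0, 0, 0.001),
                         "Mitt Romney" = c(0.002, 0.001, 0),
                         check.names = FALSE)

# Hypothetical helper: reshape one wide data frame into long form,
# one row per (time, candidate), tagged with the source it came from.
to_long <- function(df, source) {
  grams <- setdiff(names(df), "times")
  do.call(rbind, lapply(grams, function(g) {
    data.frame(times = df$times, candidate = g, frequency = df[[g]],
               source = source, stringsAsFactors = FALSE)
  }))
}

combined <- rbind(to_long(twitter_counts, "twitter"),
                  to_long(rss_counts, "rss"))

# One line per candidate, one panel per source.
p <- ggplot(combined, aes(x = times, y = frequency, colour = candidate)) +
  geom_line() +
  facet_wrap(~ source)
```

Long form is used here because ggplot2 maps columns to aesthetics, so candidate and source each need to be a single column.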
13. `Likely' phrases from earlier today:
Twitter: "im going back :) lt3 please follow back :) lt3 please"
Technology RSS feeds: "user interface displays users click scheme federal trade commission ftc antitrust
complaint outside occupy wall street"
same source, seeded with the word "statistics": "statistics showing highlights google apps like behavioral
advertising refers obliquely suggested session sounded viable business edition"
Politics RSS feeds: "washington university battleground poll numbers superfan badge request may become
president obama administration asked whether congress approval"
Major papers RSS feeds: "percent stake throughout california chapter 11 years ago effectively sealed george
w street movement prefers birds early"
Federal Register: "revision incorporates provisions related investigative actions could result based upon fresh
prunes grown ornamentals ca fip"
14. Feeling Adventurous?
We're looking for beta testers for the R package! In Shackleton's words, what to expect:
...BITTER COLD, LONG MONTHS OF COMPLETE DARKNESS, CONSTANT DANGER, SAFE RETURN DOUBTFUL...
But it can still be fun! You can talk with me about it, or get in touch later at
homer@luckysort.com