SlideShare ist ein Scribd-Unternehmen logo
1 von 49
Downloaden Sie, um offline zu lesen
Language-agnostic data analysis workflows and
reproducible research
Andrew John Lowe
Wigner Research Centre for Physics,
Hungarian Academy of Sciences
28 April 2017
Overview
This talk: language-agnostic (or polyglot) analysis workflows
I’ll show how it is possible to work in multiple languages and
switch between them without leaving the workflow you started
Additionally, I’ll demonstrate how an entire workflow can be
encapsulated in a markdown file that is rendered to a
publishable paper with cross-references and a bibliography (and
with raw a LATEX file produced as a by-product) in a simple
process, making the whole analysis workflow reproducible
Which tool/language is best?
TMVA, scikit-learn, h2o, caret, mlr, WEKA, Shogun, . . .
C++, Python, R, Java, MATLAB/Octave, Julia, . . .
Many languages, environments and tools
People may have strong opinions
A false polychotomy?
Some tools are best suited to specific problems than others
Be a polyglot!
Approaches
Save-and-load
Language-specific bindings
Notebook
knitr
Save-and-load approach
Flat files
Language agnostic
Formats like text, CSV, and JSON are well-supported
Breaks workflow
Data types may not be preserved (e.g., datetime, NULL)
New binary format Feather solves this
Feather
Feather: A fast on-disk format for data frames for R and Python,
powered by Apache Arrow, developed by Wes McKinney and Hadley
Wickham
# In R:
library(feather)
path <- "my_data.feather"
write_feather(df, path)
# In Python:
import feather
path = 'my_data.feather'
df = feather.read_dataframe(path)
Other languages, such as Julia or Scala (for Spark users), can read
and write Feather files without knowledge of details of Python or R
RootTreeToR
RootTreeToR allows users to import ROOT data directly into R
Written by Adam Lyon (Fermilab), presented at useR! 2007
cdcvs.fnal.gov/redmine/projects/roottreetor
Requires ROOT to be installed, but no need to run ROOT
Can export R data.frames to ROOT trees
# Open and load ROOT tree:
rt <- openRootChain("TreeName", "FileName")
N <- nEntries(rt) # number of rows of data
# Names of branches:
branches <- RootTreeToR::getNames(rt)
# Read in a subset of branches (varsList), M rows:
df <- toR(rt, varsList, nEntries=M)
# Use writeDFToRoot to write a data.frame to ROOT
root_numpy
root_numpy is a Python extension module that provides an interface
between ROOT and NumPy. Example from root_numpy homepage:
import ROOT
from root_numpy import root2array, root2rec, tree2rec
from root_numpy.testdata import get_filepath
filename = get_filepath('test.root')
# Convert a TTree in a ROOT file into a NumPy
# structured array:
arr = root2array(filename, 'tree')
# Convert a TTree in a ROOT file into a NumPy
# record array:
rec = root2rec(filename, 'tree')
# Get the TTree from the ROOT file:
rfile = ROOT.TFile(filename)
intree = rfile.Get('tree')
# Convert the TTree into a NumPy record array:
rec = tree2rec(intree)
Language-specific approaches
rPython (using Python from R)
library(rPython)
python.call( "len", 1:3 )
a <- 1:4
b <- 5:8
python.exec( "def concat(a,b): return a+b" )
python.call( "concat", a, b)
python.assign( "a", "hola hola" )
python.method.call( "a", "split", " " )
rpy2 (using R from Python)
from rpy2.robjects.lib.dplyr import (filter,
mutate,
group_by,
summarize)
dataf = (DataFrame(mtcars) >>
filter('gear>3') >>
mutate(powertoweight='hp*36/wt') >>
group_by('gear') >>
summarize(mean_ptw='mean(powertoweight)'))
dataf
Others
Calling Python from Julia: PyCall
Calling R from Julia: RCall
Calling Python from Excel: xlwings
Calling Python from Ruby: Rupy
. . .
Notebook approach
Jupyter Notebook
Each notebook has one main language
Cells can contain other languages through “magic”
For example: ggplot2 in Jupyter Notebook
Many Jupyter kernels available
ROOTbooks: Notebooks running ROOT Jupyter kernel
Assume that most people here have already seen these
Beaker Notebook
Each notebook can contain any language
Many languages are supported
Auto translation of data (copied)
http://beakernotebook.com/
Figure 1: “A universal translator for data scientists”
Beaker Notebook
Beaker’s individual cells support different languages within the same
notebook and allow you to pass data from one cell to another —
e.g. Python to R to JavaScript — seamlessly. This autotranslate
functionality allows you to use the language best suited to the
problem.
Figure 2: “A polyglot workflow in Beaker.”
Reproducible research
Typical research pipeline
1. Measured data
processing code
−−−−−−−−−→ analysis data
2. Analysis data
analysis code
−−−−−−−→ computational results
3. Computational results
presentation code
−−−−−−−−−−→ plots, tables, numbers
4. Plots, tables, numbers + text
manual editing
−−−−−−−−→ ARTICLE
5. (Maybe) put the data and code on the web somewhere
Seldom done
Challenges with this approach
Onus is on researchers to make their data and code available
Authors may need to undertake considerable effort to put data
and code on the web
Readers must download data and code individually and piece
together which data go with which code sections, etc.
Typically, authors just put stuff on the web
Readers just download the data and try to figure it out, piece
together the software and run it
Authors/readers must manually interact with websites
There is no single document to integrate data analysis with
textual representations; i.e. data, code, and text are not linked
Your experiment may impose restrictions on what kind of data
can be shared publicly. Nevertheless, making an analysis
reproducible benefits both your collaboration colleagues and
your future self!
Literate Statistical Programming
Original idea comes from Donald Knuth
An article is a stream of text and code
Analysis code is divided into text and code “chunks”
Presentation code formats results (tables, figures, etc.)
Article text explains what is going on
Literate programs are weaved to produce human-readable
documents and tangled to produce machine-readable
documents
knitr
knitr is a system for literate (statistical) programming
Uses R as the programming language, but others are allowed
Benefits: text and code are all in one place, results
automatically updated to reflect external changes, workflow
supports reproducibility from the outset
Almost no extra effort required by the researcher after the
article has been written
Develop your analysis interactively, like in a Jupyter Notebook
When finished, do rmarkdown::render("my_paper.Rmd") to
get PDF/HTML/doc
Like Notebooks, but output is publication quality, and you get
the raw LATEX also as a freebie
R Markdown
This is an R Markdown presentation. Markdown is a simple
formatting syntax for authoring HTML, PDF, and MS Word
documents. For more details on using R Markdown see
http://rmarkdown.rstudio.com
It looks like LATEX, or to be more precise, it’s a Beamer
presentation, but this slide was created like:
## R Markdown
This is an R Markdown presentation. Markdown is a simple
formatting syntax for authoring HTML, PDF, and MS Word
documents. For more details on using R Markdown see
<http://rmarkdown.rstudio.com>
It looks like LaTeX, or to be more precise, it's a
**Beamer** presentation, but this slide was created like:
R Markdown (continued)
The Markdown syntax has some enhancements. For example, you
can include LATEX equations, like this:
i γµ
∂µψ − mcψ = 0 (1)
We can also add tables and figures, just as we would do in LATEX:
Table 1: Effectiveness of Insect Sprays. The mean counts of insects in
agricultural experimental units treated with different insecticides.
Count Spray
A 14.500000
B 15.333333
C 2.083333
D 4.916667
E 3.500000
F 16.666667
Example code chunk
A code chunk using R is defined like this:
```{r}
print('hello world!')
```
A code chunk using some other execution engine is defined like this:
```{<execution-engine>}
print('hello world!')
```
For example:
```{python}
x = 'hello, python world!'
print(x.split(' '))
```
Example R code chunk
ggplot(
data = gapminder, aes(x = lifeExp, y = gdpPercap)) +
geom_point(
aes(color = continent, shape = continent)) +
scale_y_log10()
1e+03
1e+04
1e+05
40 60 80
lifeExp
gdpPercap
continent
Africa
Americas
Asia
Europe
Oceania
Code execution engines
Although R is the default code execution engine and a
first-class citizen in the knitr system, many other code
execution engines are available; this is the current list: awk,
bash, coffee, gawk, groovy, haskell, lein, mysql, node, octave,
perl, psql, python, Rscript, ruby, sas, scala, sed, sh, stata, zsh,
highlight, Rcpp, tikz, dot, c, fortran, fortran95, asy, cat, asis,
stan, block, block2, js, css, sql, go
I’ve already shown examples of R and Python code chunks
earlier in this talk
Except for R, all chunks are executed in separate sessions, so
the variables cannot be directly shared. If we want to make use
of objects created in previous chunks, we usually have to write
them to files (as side effects). For the bash engine, we can use
Sys.setenv() to export variables from R to bash.
Code chunks can be read from external files, in the event that
you don’t want to put everything in a monster-length
markdown document (use bookdown)
FORTRAN
Define a subroutine:
C Fortran test
subroutine fexp(n, x)
double precision x
C output
integer n, i
C input value
do 10 i=1,n
x=dexp(dcos(dsin(dble(float(i)))))
10 continue
return
end
FORTRAN (continued)
Call it from R:
# Call the function from R:
res = .Fortran("fexp", n=100000L, x=0)
str(res)
## List of 2
## $ n: int 100000
## $ x: num 2.72
Haskell
[x | x <- [1..10], odd x]
## [1,3,5,7,9]
SQL
db = dbConnect(RSQLite::SQLite(), dbname = ":memory:")
DROP TABLE IF EXISTS packages
CREATE TABLE packages (id INTEGER, name TEXT)
INSERT INTO packages VALUES (1, 'readr'), (2, 'tm')
SELECT * FROM packages
/* Can direct query results to an R data frame */
Table 2: 2 records
id name
1 readr
2 tm
C
C code is compiled automatically:
void square(double *x) {
*x = *x * *x;
}
// Compiler now runs...
## gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -fpi
## gcc -std=gnu99 -shared -L/usr/lib/R/lib -Wl,-Bsymbolic-f
Test the square() function:
.C('square', 9)
## [[1]]
## [1] 81
Rcpp
The C++ code is compiled through the Rcpp package; here we
show how we can write a function in C+ that can be called from R:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector timesTwo(NumericVector x) {
return x * 2;
}
# In R:
print(timesTwo(3.1415926))
## [1] 6.283185
cat
A special engine cat can be used to save the content of a code
chunk to a file using the cat() function, defined like this:
```{cat engine.opts=list(file='source.cxx')}
// Some code here...
```
The contents can be anything, but perhaps the most interesting use
is as container for source code that you compile and run later.
BASH
echo hello bash!
echo 'a b c' | sed 's/ /|/g'
# We can also compile the source.cxx file that we
# created in the previous code chunk; maybe compile
# against ROOT libraries, etc.
# Then run the application
## hello bash!
## a|b|c
We can run anything that we can run from the command line,
including stuff we built in previous code chunks
Data exchange
Since the Python engine executes code in an external process,
exchanging data between R chunks and Python chunks is done via
the file system. If you are exchanging data frames, you can use the
Feather package for very high performance transfer of even large
data frames between Python and R:
import pandas
import feather
# Read flights data and select flights to O'Hare
flights = pandas.read_csv("flights.csv")
flights = flights[flights['dest'] == "ORD"]
# Select carrier and delay columns and drop rows with
# missing values
flights = flights[['carrier','dep_delay','arr_delay']]
flights = flights.dropna()
print flights.head(10)
# Write to feather file for reading from R
feather.write_dataframe(flights, "flights.feather")
Data exchange (continued)
library(feather)
library(ggplot2)
# Read from feather and plot
flights <- read_feather("flights.feather")
ggplot(flights, aes(carrier, arr_delay)) +
geom_point() +
geom_jitter()
Data exchange (summary)
ROOT ↔ R: RootTreeToR
ROOT ↔ Python: root_numpy
Python ↔ R: Feather
Scala, Julia and other languages supported
ROOT → Octave/MATLAB: I wrote an interface to do this
Knitron
The knitron package, available on GitHub but not yet on
CRAN, is intended to allow users the ability to use
IPython/Jupyter and matplotlib in R Markdown code chunks
and render them with knitr
According to the author’s website, “Knitron works by lazily
starting a global IPython kernel the first time a code chunk
gets rendered by knitr and this kernel is reused for all
consecutive chunks. This way all the computation done in any
previous chunk is available in the current chunk, providing
R-like behaviour for Python”.
Knitron example
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
for c, z in zip(['r', 'g', 'b', 'y'], [30, 20, 10, 0]):
xs = np.arange(20)
ys = np.random.rand(20)
cs = [c] * len(xs)
cs[0] = 'c'
ax.bar(xs, ys, zs=z, zdir='y', color=cs, alpha=0.8)
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
plt.savefig("myplot.pdf")
## GLib-GIO-Message: Using the 'memory' GSettings backend.
Knitron example (continued)
Then plot the figure using a standard knitr mechanism, for example:
cat("![My matplotlib plot.](myplot.pdf)")
X
0
5
10
15
Y
0.04
0.02
0.00
0.02
0.04
0.06
Z
0.0
0.2
0.4
0.6
0.8
1.0
Knitron example (continued)
Alternatively, we can use the following recipe for JPEG/PNG files to
get a figure caption and allow cross-referencing and image resizing:
require(png); require(grid)
img <- readPNG("myplot.png")
grid.raster(img)
Figure 4: This figure has been produced from matplotlib in a Python
code chunk using the knitron R package.
ROOT
In the R tutorial that I gave at the IML Workshop last month, I
spoke about the possibility of using code execution engines
other than R, but noted then that there is no way to use
ROOT as a code execution engine
Well, now there is!
I have written an interface to allow ROOT to be used as a
code execution engine
Simple proof-of-concept at the moment; nothing too
sophisticated at this stage
Runs root -q -b macro.C and runs the macro
If not sufficient, it’s always possible to run more complicated
analyses, perhaps using stand-alone C++ code using ROOT
libraries, or using your experiment’s software framework, in a
UNIX shell code chunk
ROOT code execution engine example
// Builds a graph with errors, displays it and saves it
// as image. First, include some header files (within,
// CINT these will be ignored).
#include "TCanvas.h"
#include "TROOT.h"
#include "TGraphErrors.h"
#include "TF1.h"
#include "TLegend.h"
#include "TArrow.h"
#include "TLatex.h"
void macro1(){
// The values and the errors on the Y axis
const int n_points=10;
double x_vals[n_points]=
{1,2,3,4,5,6,7,8,9,10};
lenght [cm]
2 4 6 8 10
Arb.Units
0
10
20
30
40
50
60
Measurement XYZ
Lab. Lesson 1
Exp. Points
Th. Law
Deviation
Maximum
Measurement XYZ
Figure 5: This is the graph generated by the ROOT macro.
ROOT + markdown
ROOT + Jupyter Notebook = ROOTbook
Develop your analysis interactively
Work collaboratively
Reproducible document
Contains the plots and numeric results
Version control is difficult
Not immediately fit for publication in an academic journal
ROOT + R markdown = “ROOTdown”?
Interactivity not currently available for ROOT (needs work)
Plots and numeric results only in the rendered document
Plain text format → version control is simple
Immediately fit for publication in an academic journal
Example paper output
See toy_example_paper.pdf to see example output from knitr
This Beamer presentation, but rendered as a paper instead
To demonstrate that the necessary infrastructure works
Contains (hyperlinked) cross-references and bibliography
Note that I wouldn’t have to do anything special to make this
happen for a real paper; I ran a single command to run all the
“analysis code” and generate the document with all the plots,
tables, numerical results, etc.
Raw LATEX (to submit to journal) generated as a by-product
Messages to take away
It is already possible to write a reproducible analysis in your
favourite programming language and have your paper rendered
fit for publication
You can mix and match programming languages in a single
uninterrupted workflow
Enabling an analysis to be performed in a single unbroken
workflow greatly facilitates reproducibility
There are ways to exchange data between code chunks that are
written in different programming languages
ROOT, Python, R, . . .
Works nicely with version control (it’s just plain text!)
You can now embed your ROOT analysis code in a
reproducible academic publication
Acknowledgements
This work is supported by NKFIH OTKA grant K120660.

Weitere ähnliche Inhalte

Was ist angesagt?

Unit 3
Unit  3Unit  3
Unit 3siddr
 
1. python programming
1. python programming1. python programming
1. python programmingsreeLekha51
 
web programming UNIT VIII python by Bhavsingh Maloth
web programming UNIT VIII python by Bhavsingh Malothweb programming UNIT VIII python by Bhavsingh Maloth
web programming UNIT VIII python by Bhavsingh MalothBhavsingh Maloth
 
What we can learn from Rebol?
What we can learn from Rebol?What we can learn from Rebol?
What we can learn from Rebol?lichtkind
 
web programming Unit VI PPT by Bhavsingh Maloth
web programming Unit VI PPT  by Bhavsingh Malothweb programming Unit VI PPT  by Bhavsingh Maloth
web programming Unit VI PPT by Bhavsingh MalothBhavsingh Maloth
 
Win pcap filtering expression syntax
Win pcap  filtering expression syntaxWin pcap  filtering expression syntax
Win pcap filtering expression syntaxVota Ppt
 
220 runtime environments
220 runtime environments220 runtime environments
220 runtime environmentsJ'tong Atong
 
101 3.7 search text files using regular expressions
101 3.7 search text files using regular expressions101 3.7 search text files using regular expressions
101 3.7 search text files using regular expressionsAcácio Oliveira
 
WEB PROGRAMMING UNIT VI BY BHAVSINGH MALOTH
WEB PROGRAMMING UNIT VI BY BHAVSINGH MALOTHWEB PROGRAMMING UNIT VI BY BHAVSINGH MALOTH
WEB PROGRAMMING UNIT VI BY BHAVSINGH MALOTHBhavsingh Maloth
 
Text Mining Infrastructure in R
Text Mining Infrastructure in RText Mining Infrastructure in R
Text Mining Infrastructure in RAshraf Uddin
 
intro unix/linux 05
intro unix/linux 05intro unix/linux 05
intro unix/linux 05duquoi
 
Parallella: Embedded HPC For Everybody
Parallella: Embedded HPC For EverybodyParallella: Embedded HPC For Everybody
Parallella: Embedded HPC For Everybodyjerlbeck
 
Program Structure in GNU/Linux (ELF Format)
Program Structure in GNU/Linux (ELF Format)Program Structure in GNU/Linux (ELF Format)
Program Structure in GNU/Linux (ELF Format)Varun Mahajan
 
Python and Machine Learning
Python and Machine LearningPython and Machine Learning
Python and Machine Learningtrygub
 
Introduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in RIntroduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in RYanchang Zhao
 
RProtoBuf: protocol buffers for R
RProtoBuf: protocol buffers for RRProtoBuf: protocol buffers for R
RProtoBuf: protocol buffers for RRomain Francois
 

Was ist angesagt? (20)

Unit 3
Unit  3Unit  3
Unit 3
 
1. python programming
1. python programming1. python programming
1. python programming
 
web programming UNIT VIII python by Bhavsingh Maloth
web programming UNIT VIII python by Bhavsingh Malothweb programming UNIT VIII python by Bhavsingh Maloth
web programming UNIT VIII python by Bhavsingh Maloth
 
What we can learn from Rebol?
What we can learn from Rebol?What we can learn from Rebol?
What we can learn from Rebol?
 
web programming Unit VI PPT by Bhavsingh Maloth
web programming Unit VI PPT  by Bhavsingh Malothweb programming Unit VI PPT  by Bhavsingh Maloth
web programming Unit VI PPT by Bhavsingh Maloth
 
File mangement
File mangementFile mangement
File mangement
 
Win pcap filtering expression syntax
Win pcap  filtering expression syntaxWin pcap  filtering expression syntax
Win pcap filtering expression syntax
 
220 runtime environments
220 runtime environments220 runtime environments
220 runtime environments
 
101 3.7 search text files using regular expressions
101 3.7 search text files using regular expressions101 3.7 search text files using regular expressions
101 3.7 search text files using regular expressions
 
WEB PROGRAMMING UNIT VI BY BHAVSINGH MALOTH
WEB PROGRAMMING UNIT VI BY BHAVSINGH MALOTHWEB PROGRAMMING UNIT VI BY BHAVSINGH MALOTH
WEB PROGRAMMING UNIT VI BY BHAVSINGH MALOTH
 
The Bund language
The Bund languageThe Bund language
The Bund language
 
Text Mining Infrastructure in R
Text Mining Infrastructure in RText Mining Infrastructure in R
Text Mining Infrastructure in R
 
intro unix/linux 05
intro unix/linux 05intro unix/linux 05
intro unix/linux 05
 
Srgoc dotnet
Srgoc dotnetSrgoc dotnet
Srgoc dotnet
 
Parallella: Embedded HPC For Everybody
Parallella: Embedded HPC For EverybodyParallella: Embedded HPC For Everybody
Parallella: Embedded HPC For Everybody
 
Program Structure in GNU/Linux (ELF Format)
Program Structure in GNU/Linux (ELF Format)Program Structure in GNU/Linux (ELF Format)
Program Structure in GNU/Linux (ELF Format)
 
Unit VI
Unit VI Unit VI
Unit VI
 
Python and Machine Learning
Python and Machine LearningPython and Machine Learning
Python and Machine Learning
 
Introduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in RIntroduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in R
 
RProtoBuf: protocol buffers for R
RProtoBuf: protocol buffers for RRProtoBuf: protocol buffers for R
RProtoBuf: protocol buffers for R
 

Ähnlich wie Language-agnostic data analysis workflows and reproducible research

Reproducible research (and literate programming) in R
Reproducible research (and literate programming) in RReproducible research (and literate programming) in R
Reproducible research (and literate programming) in Rliz__is
 
R Introduction
R IntroductionR Introduction
R Introductionschamber
 
Data Science - Part II - Working with R & R studio
Data Science - Part II -  Working with R & R studioData Science - Part II -  Working with R & R studio
Data Science - Part II - Working with R & R studioDerek Kane
 
Using R on High Performance Computers
Using R on High Performance ComputersUsing R on High Performance Computers
Using R on High Performance ComputersDave Hiltbrand
 
BUSINESS ANALYTICS WITH R SOFTWARE DIAST
BUSINESS ANALYTICS WITH R SOFTWARE DIASTBUSINESS ANALYTICS WITH R SOFTWARE DIAST
BUSINESS ANALYTICS WITH R SOFTWARE DIASTHaritikaChhatwal1
 
Modeling in R Programming Language for Beginers.ppt
Modeling in R Programming Language for Beginers.pptModeling in R Programming Language for Beginers.ppt
Modeling in R Programming Language for Beginers.pptanshikagoel52
 
Reproducible Computational Research in R
Reproducible Computational Research in RReproducible Computational Research in R
Reproducible Computational Research in RSamuel Bosch
 
Lecture1_R.pdf
Lecture1_R.pdfLecture1_R.pdf
Lecture1_R.pdfBusyBird2
 
Workshop presentation hands on r programming
Workshop presentation hands on r programmingWorkshop presentation hands on r programming
Workshop presentation hands on r programmingNimrita Koul
 
r,rstats,r language,r packages
r,rstats,r language,r packagesr,rstats,r language,r packages
r,rstats,r language,r packagesAjay Ohri
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosEuangelos Linardos
 
Docopt, beautiful command-line options for R, user2014
Docopt, beautiful command-line options for R,  user2014Docopt, beautiful command-line options for R,  user2014
Docopt, beautiful command-line options for R, user2014Edwin de Jonge
 
Best corporate-r-programming-training-in-mumbai
Best corporate-r-programming-training-in-mumbaiBest corporate-r-programming-training-in-mumbai
Best corporate-r-programming-training-in-mumbaiUnmesh Baile
 
postgres loader
postgres loaderpostgres loader
postgres loaderINRIA-OAK
 
summer training report on python
summer training report on pythonsummer training report on python
summer training report on pythonShubham Yadav
 
Reproducible Research in R and R Studio
Reproducible Research in R and R StudioReproducible Research in R and R Studio
Reproducible Research in R and R StudioSusan Johnston
 

Ähnlich wie Language-agnostic data analysis workflows and reproducible research (20)

Reproducible research (and literate programming) in R
Reproducible research (and literate programming) in RReproducible research (and literate programming) in R
Reproducible research (and literate programming) in R
 
R Introduction
R IntroductionR Introduction
R Introduction
 
Data Science - Part II - Working with R & R studio
Data Science - Part II -  Working with R & R studioData Science - Part II -  Working with R & R studio
Data Science - Part II - Working with R & R studio
 
Using R on High Performance Computers
Using R on High Performance ComputersUsing R on High Performance Computers
Using R on High Performance Computers
 
BUSINESS ANALYTICS WITH R SOFTWARE DIAST
BUSINESS ANALYTICS WITH R SOFTWARE DIASTBUSINESS ANALYTICS WITH R SOFTWARE DIAST
BUSINESS ANALYTICS WITH R SOFTWARE DIAST
 
Lecture1_R.ppt
Lecture1_R.pptLecture1_R.ppt
Lecture1_R.ppt
 
Lecture1_R.ppt
Lecture1_R.pptLecture1_R.ppt
Lecture1_R.ppt
 
Lecture1 r
Lecture1 rLecture1 r
Lecture1 r
 
Modeling in R Programming Language for Beginers.ppt
Modeling in R Programming Language for Beginers.pptModeling in R Programming Language for Beginers.ppt
Modeling in R Programming Language for Beginers.ppt
 
Reproducible Computational Research in R
Reproducible Computational Research in RReproducible Computational Research in R
Reproducible Computational Research in R
 
Unit 3
Unit 3Unit 3
Unit 3
 
Lecture1_R.pdf
Lecture1_R.pdfLecture1_R.pdf
Lecture1_R.pdf
 
Workshop presentation hands on r programming
Workshop presentation hands on r programmingWorkshop presentation hands on r programming
Workshop presentation hands on r programming
 
r,rstats,r language,r packages
r,rstats,r language,r packagesr,rstats,r language,r packages
r,rstats,r language,r packages
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
 
Docopt, beautiful command-line options for R, user2014
Docopt, beautiful command-line options for R,  user2014Docopt, beautiful command-line options for R,  user2014
Docopt, beautiful command-line options for R, user2014
 
Best corporate-r-programming-training-in-mumbai
Best corporate-r-programming-training-in-mumbaiBest corporate-r-programming-training-in-mumbai
Best corporate-r-programming-training-in-mumbai
 
postgres loader
postgres loaderpostgres loader
postgres loader
 
summer training report on python
summer training report on pythonsummer training report on python
summer training report on python
 
Reproducible Research in R and R Studio
Reproducible Research in R and R StudioReproducible Research in R and R Studio
Reproducible Research in R and R Studio
 

Kürzlich hochgeladen

➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...amitlee9823
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Pooja Nehwal
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsJoseMangaJr1
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...amitlee9823
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...amitlee9823
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...karishmasinghjnh
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachBoston Institute of Analytics
 

Kürzlich hochgeladen (20)

➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 

Language-agnostic data analysis workflows and reproducible research

  • 1. Language-agnostic data analysis workflows and reproducible research Andrew John Lowe Wigner Research Centre for Physics, Hungarian Academy of Sciences 28 April 2017
  • 2. Overview This talk: language-agnostic (or polyglot) analysis workflows I’ll show how it is possible to work in multiple languages and switch between them without leaving the workflow you started Additionally, I’ll demonstrate how an entire workflow can be encapsulated in a markdown file that is rendered to a publishable paper with cross-references and a bibliography (and with raw a LATEX file produced as a by-product) in a simple process, making the whole analysis workflow reproducible
  • 3. Which tool/language is best? TMVA, scikit-learn, h2o, caret, mlr, WEKA, Shogun, . . . C++, Python, R, Java, MATLAB/Octave, Julia, . . . Many languages, environments and tools People may have strong opinions A false polychotomy? Some tools are best suited to specific problems than others Be a polyglot!
  • 6. Flat files Language agnostic Formats like text, CSV, and JSON are well-supported Breaks workflow Data types may not be preserved (e.g., datetime, NULL) New binary format Feather solves this
  • 7. Feather Feather: A fast on-disk format for data frames for R and Python, powered by Apache Arrow, developed by Wes McKinney and Hadley Wickham # In R: library(feather) path <- "my_data.feather" write_feather(df, path) # In Python: import feather path = 'my_data.feather' df = feather.read_dataframe(path) Other languages, such as Julia or Scala (for Spark users), can read and write Feather files without knowledge of details of Python or R
  • 8. RootTreeToR RootTreeToR allows users to import ROOT data directly into R Written by Adam Lyon (Fermilab), presented at useR! 2007 cdcvs.fnal.gov/redmine/projects/roottreetor Requires ROOT to be installed, but no need to run ROOT Can export R data.frames to ROOT trees # Open and load ROOT tree: rt <- openRootChain("TreeName", "FileName") N <- nEntries(rt) # number of rows of data # Names of branches: branches <- RootTreeToR::getNames(rt) # Read in a subset of branches (varsList), M rows: df <- toR(rt, varsList, nEntries=M) # Use writeDFToRoot to write a data.frame to ROOT
  • 9. root_numpy root_numpy is a Python extension module that provides an interface between ROOT and NumPy. Example from root_numpy homepage: import ROOT from root_numpy import root2array, root2rec, tree2rec from root_numpy.testdata import get_filepath filename = get_filepath('test.root') # Convert a TTree in a ROOT file into a NumPy # structured array: arr = root2array(filename, 'tree') # Convert a TTree in a ROOT file into a NumPy # record array: rec = root2rec(filename, 'tree') # Get the TTree from the ROOT file: rfile = ROOT.TFile(filename) intree = rfile.Get('tree') # Convert the TTree into a NumPy record array: rec = tree2rec(intree)
  • 11. rPython (using Python from R) library(rPython) python.call( "len", 1:3 ) a <- 1:4 b <- 5:8 python.exec( "def concat(a,b): return a+b" ) python.call( "concat", a, b) python.assign( "a", "hola hola" ) python.method.call( "a", "split", " " )
  • 12. rpy2 (using R from Python) from rpy2.robjects.lib.dplyr import (filter, mutate, group_by, summarize) dataf = (DataFrame(mtcars) >> filter('gear>3') >> mutate(powertoweight='hp*36/wt') >> group_by('gear') >> summarize(mean_ptw='mean(powertoweight)')) dataf
  • 13. Others Calling Python from Julia: PyCall Calling R from Julia: RCall Calling Python from Excel: xlwings Calling Python from Ruby: Rupy . . .
  • 15. Jupyter Notebook Each notebook has one main language Cells can contain other languages through “magic” For example: ggplot2 in Jupyter Notebook Many Jupyter kernels available ROOTbooks: Notebooks running ROOT Jupyter kernel Assume that most people here have already seen these
  • 16. Beaker Notebook Each notebook can contain any language Many languages are supported Auto translation of data (copied) http://beakernotebook.com/ Figure 1: “A universal translator for data scientists”
  • 17. Beaker Notebook Beaker’s individual cells support different languages within the same notebook and allow you to pass data from one cell to another — e.g. Python to R to JavaScript — seamlessly. This autotranslate functionality allows you to use the language best suited to the problem. Figure 2: “A polyglot workflow in Beaker.”
  • 19. Typical research pipeline 1. Measured data processing code −−−−−−−−−→ analysis data 2. Analysis data analysis code −−−−−−−→ computational results 3. Computational results presentation code −−−−−−−−−−→ plots, tables, numbers 4. Plots, tables, numbers + text manual editing −−−−−−−−→ ARTICLE 5. (Maybe) put the data and code on the web somewhere Seldom done
  • 20. Challenges with this approach Onus is on researchers to make their data and code available Authors may need to undertake considerable effort to put data and code on the web Readers must download data and code individually and piece together which data go with which code sections, etc. Typically, authors just put stuff on the web Readers just download the data and try to figure it out, piece together the software and run it Authors/readers must manually interact with websites There is no single document to integrate data analysis with textual representations; i.e. data, code, and text are not linked Your experiment may impose restrictions on what kind of data can be shared publicly. Nevertheless, making an analysis reproducible benefits both your collaboration colleagues and your future self!
  • 21. Literate Statistical Programming Original idea comes from Donald Knuth An article is a stream of text and code Analysis code is divided into text and code “chunks” Presentation code formats results (tables, figures, etc.) Article text explains what is going on Literate programs are weaved to produce human-readable documents and tangled to produce machine-readable documents
  • 22. knitr knitr is a system for literate (statistical) programming Uses R as the programming language, but others are allowed Benefits: text and code are all in one place, results automatically updated to reflect external changes, workflow supports reproducibility from the outset Almost no extra effort required by the researcher after the article has been written Develop your analysis interactively, like in a Jupyter Notebook When finished, do rmarkdown::render("my_paper.Rmd") to get PDF/HTML/doc Like Notebooks, but output is publication quality, and you get the raw LATEX also as a freebie
  • 23. R Markdown This is an R Markdown presentation. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com It looks like LATEX, or to be more precise, it’s a Beamer presentation, but this slide was created like: ## R Markdown This is an R Markdown presentation. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com> It looks like LaTeX, or to be more precise, it's a **Beamer** presentation, but this slide was created like:
  • 24. R Markdown (continued) The Markdown syntax has some enhancements. For example, you can include LATEX equations, like this: i γµ ∂µψ − mcψ = 0 (1) We can also add tables and figures, just as we would do in LATEX: Table 1: Effectiveness of Insect Sprays. The mean counts of insects in agricultural experimental units treated with different insecticides. Count Spray A 14.500000 B 15.333333 C 2.083333 D 4.916667 E 3.500000 F 16.666667
  • 25. Example code chunk A code chunk using R is defined like this: ```{r} print('hello world!') ``` A code chunk using some other execution engine is defined like this: ```{<execution-engine>} print('hello world!') ``` For example: ```{python} x = 'hello, python world!' print(x.split(' ')) ```
  • 26. Example R code chunk ggplot( data = gapminder, aes(x = lifeExp, y = gdpPercap)) + geom_point( aes(color = continent, shape = continent)) + scale_y_log10() 1e+03 1e+04 1e+05 40 60 80 lifeExp gdpPercap continent Africa Americas Asia Europe Oceania
  • 27. Code execution engines Although R is the default code execution engine and a first-class citizen in the knitr system, many other code execution engines are available; this is the current list: awk, bash, coffee, gawk, groovy, haskell, lein, mysql, node, octave, perl, psql, python, Rscript, ruby, sas, scala, sed, sh, stata, zsh, highlight, Rcpp, tikz, dot, c, fortran, fortran95, asy, cat, asis, stan, block, block2, js, css, sql, go I’ve already shown examples of R and Python code chunks earlier in this talk Except for R, all chunks are executed in separate sessions, so the variables cannot be directly shared. If we want to make use of objects created in previous chunks, we usually have to write them to files (as side effects). For the bash engine, we can use Sys.setenv() to export variables from R to bash. Code chunks can be read from external files, in the event that you don’t want to put everything in a monster-length markdown document (use bookdown)
  • 28. FORTRAN Define a subroutine: C Fortran test subroutine fexp(n, x) double precision x C output integer n, i C input value do 10 i=1,n x=dexp(dcos(dsin(dble(float(i))))) 10 continue return end
  • 29. FORTRAN (continued) Call it from R: # Call the function from R: res = .Fortran("fexp", n=100000L, x=0) str(res) ## List of 2 ## $ n: int 100000 ## $ x: num 2.72
  • 30. Haskell [x | x <- [1..10], odd x] ## [1,3,5,7,9]
  • 31. SQL db = dbConnect(RSQLite::SQLite(), dbname = ":memory:") DROP TABLE IF EXISTS packages CREATE TABLE packages (id INTEGER, name TEXT) INSERT INTO packages VALUES (1, 'readr'), (2, 'tm') SELECT * FROM packages /* Can direct query results to an R data frame */ Table 2: 2 records id name 1 readr 2 tm
  • 32. C C code is compiled automatically: void square(double *x) { *x = *x * *x; } // Compiler now runs... ## gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -fpi ## gcc -std=gnu99 -shared -L/usr/lib/R/lib -Wl,-Bsymbolic-f Test the square() function: .C('square', 9) ## [[1]] ## [1] 81
  • 33. Rcpp The C++ code is compiled through the Rcpp package; here we show how we can write a function in C+ that can be called from R: #include <Rcpp.h> using namespace Rcpp; // [[Rcpp::export]] NumericVector timesTwo(NumericVector x) { return x * 2; } # In R: print(timesTwo(3.1415926)) ## [1] 6.283185
  • 34. cat A special engine cat can be used to save the content of a code chunk to a file using the cat() function, defined like this: ```{cat engine.opts=list(file='source.cxx')} // Some code here... ``` The contents can be anything, but perhaps the most interesting use is as container for source code that you compile and run later.
  • 35. BASH echo hello bash! echo 'a b c' | sed 's/ /|/g' # We can also compile the source.cxx file that we # created in the previous code chunk; maybe compile # against ROOT libraries, etc. # Then run the application ## hello bash! ## a|b|c We can run anything that we can run from the command line, including stuff we built in previous code chunks
  • 36. Data exchange Since the Python engine executes code in an external process, exchanging data between R chunks and Python chunks is done via the file system. If you are exchanging data frames, you can use the Feather package for very high performance transfer of even large data frames between Python and R: import pandas import feather # Read flights data and select flights to O'Hare flights = pandas.read_csv("flights.csv") flights = flights[flights['dest'] == "ORD"] # Select carrier and delay columns and drop rows with # missing values flights = flights[['carrier','dep_delay','arr_delay']] flights = flights.dropna() print flights.head(10) # Write to feather file for reading from R feather.write_dataframe(flights, "flights.feather")
  • 37. Data exchange (continued) library(feather) library(ggplot2) # Read from feather and plot flights <- read_feather("flights.feather") ggplot(flights, aes(carrier, arr_delay)) + geom_point() + geom_jitter()
  • 38. Data exchange (summary) ROOT ↔ R: RootTreeToR ROOT ↔ Python: root_numpy Python ↔ R: Feather Scala, Julia and other languages supported ROOT → Octave/MATLAB: I wrote an interface to do this
  • 39. Knitron The knitron package, available on GitHub but not yet on CRAN, is intended to allow users the ability to use IPython/Jupyter and matplotlib in R Markdown code chunks and render them with knitr According to the author’s website, “Knitron works by lazily starting a global IPython kernel the first time a code chunk gets rendered by knitr and this kernel is reused for all consecutive chunks. This way all the computation done in any previous chunk is available in the current chunk, providing R-like behaviour for Python”.
  • 40. Knitron example from mpl_toolkits.mplot3d import Axes3D import matplotlib.pyplot as plt import numpy as np fig = plt.figure() ax = fig.add_subplot(111, projection='3d') for c, z in zip(['r', 'g', 'b', 'y'], [30, 20, 10, 0]): xs = np.arange(20) ys = np.random.rand(20) cs = [c] * len(xs) cs[0] = 'c' ax.bar(xs, ys, zs=z, zdir='y', color=cs, alpha=0.8) ax.set_xlabel('X') ax.set_ylabel('Y') ax.set_zlabel('Z') plt.savefig("myplot.pdf") ## GLib-GIO-Message: Using the 'memory' GSettings backend.
  • 41. Knitron example (continued) Then plot the figure using a standard knitr mechanism, for example: cat("![My matplotlib plot.](myplot.pdf)") X 0 5 10 15 Y 0.04 0.02 0.00 0.02 0.04 0.06 Z 0.0 0.2 0.4 0.6 0.8 1.0
  • 42. Knitron example (continued) Alternatively, we can use the following recipe for JPEG/PNG files to get a figure caption and allow cross-referencing and image resizing: require(png); require(grid) img <- readPNG("myplot.png") grid.raster(img) Figure 4: This figure has been produced from matplotlib in a Python code chunk using the knitron R package.
  • 43. ROOT In the R tutorial that I gave at the IML Workshop last month, I spoke about the possibility of using code execution engines other than R, but noted then that there is no way to use ROOT as a code execution engine Well, now there is! I have written an interface to allow ROOT to be used as a code execution engine Simple proof-of-concept at the moment; nothing too sophisticated at this stage Runs root -q -b macro.C and runs the macro If not sufficient, it’s always possible to run more complicated analyses, perhaps using stand-alone C++ code using ROOT libraries, or using your experiment’s software framework, in a UNIX shell code chunk
  • 44. ROOT code execution engine example // Builds a graph with errors, displays it and saves it // as image. First, include some header files (within, // CINT these will be ignored). #include "TCanvas.h" #include "TROOT.h" #include "TGraphErrors.h" #include "TF1.h" #include "TLegend.h" #include "TArrow.h" #include "TLatex.h" void macro1(){ // The values and the errors on the Y axis const int n_points=10; double x_vals[n_points]= {1,2,3,4,5,6,7,8,9,10};
  • 45. lenght [cm] 2 4 6 8 10 Arb.Units 0 10 20 30 40 50 60 Measurement XYZ Lab. Lesson 1 Exp. Points Th. Law Deviation Maximum Measurement XYZ Figure 5: This is the graph generated by the ROOT macro.
  • 46. ROOT + markdown ROOT + Jupyter Notebook = ROOTbook Develop your analysis interactively Work collaboratively Reproducible document Contains the plots and numeric results Version control is difficult Not immediately fit for publication in an academic journal ROOT + R markdown = “ROOTdown”? Interactivity not currently available for ROOT (needs work) Plots and numeric results only in the rendered document Plain text format → version control is simple Immediately fit for publication in an academic journal
  • 47. Example paper output See toy_example_paper.pdf to see example output from knitr This Beamer presentation, but rendered as a paper instead To demonstrate that the necessary infrastructure works Contains (hyperlinked) cross-references and bibliography Note that I wouldn’t have to do anything special to make this happen for a real paper; I ran a single command to run all the “analysis code” and generate the document with all the plots, tables, numerical results, etc. Raw LATEX (to submit to journal) generated as a by-product
  • 48. Messages to take away It is already possible to write a reproducible analysis in your favourite programming language and have your paper rendered fit for publication You can mix and match programming languages in a single uninterrupted workflow Enabling an analysis to be performed in a single unbroken workflow greatly facilitates reproducibility There are ways to exchange data between code chunks that are written in different programming languages ROOT, Python, R, . . . Works nicely with version control (it’s just plain text!) You can now embed your ROOT analysis code in a reproducible academic publication
  • 49. Acknowledgements This work is supported by NKFIH OTKA grant K120660.