SlideShare ist ein Scribd-Unternehmen logo
1 von 25
Downloaden Sie, um offline zu lesen
Tools for large-scale collection &
analysis of source code repositories
OPEN SOURCE GIT REPOSITORY COLLECTION PIPELINE
Alexander Bezzubov source{d}
➔ committer & PMC @ apache zeppelin
➔ engineer @source{d}
➔ startup in Madrid
➔ builds the open-source components that enable large-scale
code analysis and machine learning on source code
Intro
motivation & vision
MOTIVATION: WHY COLLECTING SOURCE CODE
VISION:
➔ Academia: material for research in IR/ML/PL communities
➔ Industry: fuel for building data-driven products (i.e for sourcing candidates for hiring)
➔ OSS collection pipeline
➔ Use it to build public datasets industry and academia
➔ Use Git as “source of truth”, the most popular VCS
➔ A crawler (find URLs, git clone them), Distributed storage (FS, DB), Parallel processing framework
custom, in Golang standard, Apache HDFS + Postgres custom, library for Apache Spark
Tech Stack
CoreOS
infrastructure
Rovers
collection
storage
processing
analysis
K8s
go-git
HDFS
śiva
Apache Spark
source{d} Engine
Bblfsh
Enry
tech stack
Borders
CoreOS
infrastructure
Rovers
collection
storage
processing
analysis
K8s
go-git
HDFS
śiva
Apache Spark
source{d} Engine
Bblfsh
Enry
tech stack
Borders
infrastructure
➔ Dedicated cluster (cloud becomes prohibitively expensive for storing ~100sTb)
➔ CoreOS provisioned on bare-menta w Terraform
➔ Booting and OS configuration Matchbox and Ignition
➔ K8s deployed on top of that
More details at talk at CfgMgmtCamp
http://cfgmgmtcamp.eu/schedule/terraform/CoreOS.html
CoreOS
infrastructure
Rovers
collection
storage
processing
analysis
K8s
go-git
HDFS
śiva
Apache Spark
source{d} Engine
Bblfsh
Enry
tech stack
Borders
collection
➔ Rovers: search for Git repository URLs
➔ Borges: fetching repository w “git pull”
➔ Git storage format & protocol implementation
➔ Optimize for on-disk size: forks that share history, saved together
go-git to talk Git Last year had a talk at FOSDEM
https://archive.fosdem.org/2017/schedule/event/go_git/
GIT LIBRARY FOR GO
• need to clone and analyze tens of millions of
repositories with our core language Go
• be able to do so in memory, and by using custom
filesystem implementations
• easy to use and stable API for the Go community
• used in production by companies, e.g.: keybase.io
motivation
go-git A HIGHLY EXTENSIBLE IMPLEMENTATION OF GIT IN GO
GO-GIT IN ACTION
example
• the most complete git library for any language after
libgit2 and jgit
• highly extensible by design
• idiomatic API for plumbing and porcelain commands
• 2+ years of continuous development
• used by a significant number of open source projects
PURE GO SOURCE CODE
features
example mimicking `git clone` using go-git:
usage
• https://github.com/src-d/go-git
• go-git presentation at FOSDEM 2017
• go-git presentation at Git Merge 2017
• compatibility table of git vs. go-git
• comparing git trees in go
resources
YOUR NEXT STEPS
TRY IT YOURSELF
# installation
$ go get -u gopkg.in/src-d/go-git.v4/...
• list of more go-git usage examples
// Clone the repo to the given directory
url := "https://github.com/src-d/go-git",
_, err := git.PlainClone(
"/tmp/foo", false,
&git.CloneOptions{
URL: url,
Progress: os.Stdout,
},
)
CheckIfError(err)
output:
Counting objects: 4924, done.
Compressing objects: 100% (1333/1333), done.
Total 4924 (delta 530), reused 6 (delta 6),
pack-reused 3533
CODE COLLECTION AT SCALE
• collection and storage of repositories at large scale
• automated process
• optimal usage of storage
• optimal to keep repositories up-to-date with the origin
motivation
rovers & borges LARGE SCALE CODE REPOSITORY COLLECTION AND STORAGE
KEY CONCEPT
• set up and run rovers
• set up borges
• run borges producer
• run borges consumer
architecture
• distributed system similar to a search engine
• src-d/rovers retrieves URLs from git hosting providers
via API, plus self-hosted git repositories
• src-d/borges producer reads URL list, schedules fetching
• borges consumer fetches and pushes repo to storage
• borges packer also available as a standalone command,
transforming repository urls into siva files
• stores using src-d/śiva repository storage file format
• optimized for storage and keeping repos up-to-date
SEEK, FETCH, STORE
architecture
• rooted repositories are standard git repositories that
store all objects from all repositories that share a
common history, identified by same initial commit:
• a rooted repository is saved in a single śiva file
• updates stored in concatenated siva files: no need to
rewriting the whole repository file
• distributed-file-system backed, supports GCS & HDFS
usage
• https://github.com/src-d/rovers
• https://github.com/src-d/borges
• https://github.com/src-d/go-siva
• śiva: Why We Created Yet Another Archive
Format
resources
YOUR NEXT STEPS
SETUP & RUN
CoreOS
infrastructure
Rovers
collection
storage
processing
analysis
K8s
go-git
HDFS
śiva
Apache Spark
source{d} Engine
Bblfsh
Enry
tech stack
Borders
storage
➔ Metadata: PostgreSQL
➔ Built small type-safe ORM for Go<->Postgres
https://github.com/src-d/go-kallax
➔ Data: Apache Hadoop HDFS
➔ Custom (seekable, appendable) archive format: Siva 1 RootedRepository <-> 1 Siva file
SMART REPO STORAGE
• store a git repository in a single file
• updates possible without rewriting the whole file
• friendly to distributed file systems
• seekable to allow random access to any file position
motivation
śiva SEEKABLE INDEXED BLOCK ARCHIVER FILE FORMAT
SIVA FILE BLOCK SCHEMA
architecture
CHARACTERISTICS
• src-d/go-siva is an archiving format similar to tar or zip
• allows constant-time random file access
• allows seekable read access to the contained files
• allows file concatenation given the block-based design
• command-line tool + implementations in Go and Java
architecture usage
• https://github.com/src-d/go-siva
• śiva: Why We Created Yet Another Archive
Format
resources
YOUR NEXT STEPSAPPENDING FILES
# pack into siva file
$ siva pack example.siva qux
# append into siva file
$ siva pack --append example.siva
bar
# list siva file contents
$ siva list example.siva
Sep 20 13:04 4 B qux -rw-r--r--
Sep 20 13:07 4 B bar -rw-r--r--
Core OS
infrastructure
Rovers
collection
storage
processing
analysis
K8s
go-git
HDFS
śiva
Apache Spark
source{d} Engine
Bblfsh
Enry
tech stack
Borders
processing
Apache Spark
➔ For batch processing, SparkSQL
Engine
➔ Library, w custom DataSource implementation GitDataSource
➔ Read repositories from Siva archives in HDFS, exposes though DataFrame
➔ API for accessing refs/commits/files/blobs
➔ Talks to external services though gRPC for parsing/lexing, and other analysis
YOUR NEXT STEPS
UNIFIED SCALABLE PIPELINE
• easy-to-use pipeline for git repository analysis
• integrated with standard tools for large scale data analysis
• avoid custom code in operations across millions of repos
motivation
engine UNIFIED SCALABLE CODE ANALYSIS PIPELINE ON SPARK
APACHE SPARK DATAFRAME
architecture
• listing and retrieval of git repositories
• Apache Spark datasource on top of git repositories
• iterators over any git object, references
• code exploration and querying using XPath expressions
• language identification and source code parsing
• feature extraction for machine learning at scale
PREPARATION
architecture
usage sample
• https://github.com/src-d/engine
• Early example jupyter notebook:
https://github.com/src-d/spark-api/blob/maste
r/examples/notebooks/Example.ipynb
resources
EngineAPI(spark, 'siva',
'/path/to/siva-files')
.repositories
.references
.head_ref
.files
.classify_languages()
.extract_uasts()
.query_uast('//*[@roleImport and
@roleDeclaration]',
'imports')
.filter("lang = 'java'")
.select('imports',
'path',
'repository_id')
.write
.parquet("hdfs://...")
• extends Apache SparkSQL
• git repositories stored as siva files or standard
repositories in HDFS
• metadata caching for faster lookups over all the
dataset.
• fetches repositories in batches and on demand
• available APIs for Spark and PySpark
• can run either locally or in a distributed cluster
CoreOS
infrastructure
Rovers
collection
storage
processing
analysis
K8s
go-git
HDFS
śiva
Apache Spark
source{d} Engine
Bblfsh
Enry
tech stack
Borders
analysis
Enry
➔ Programming language identification
➔ Re-write of github/linguist in Golang, ~370 langs
Project Babelfish
➔ Distributed parser infrastructure for source code analysis
➔ Unified interface though gRPC to native parsers in containers: src -> uAST
Talk in Source Code Analysis devRoom
Room: UD2.119, Sunday, 12:40
https://fosdem.org/2018/schedule/event/code_babelfish_a_universal_code_parser_for_source_c
ode_analysis/
• enry speed
improvement over
linguist when
applied to
linguist/samples
folder file samples
usable in Go as a native library, in Java as
shared library and as a CLI tool.
LANG DETECTION AT SCALE
• need to detect programming languages of every file in
a git repository
• initially used github/linguist, but needed more
performance for large scale applications
• keep compatibility with the original linguist project
motivation
enry A FASTER FILE PROGRAMMING LANGUAGE DETECTOR
benchmarks
COMPATIBLE AND FLEXIBLE
• linguist as source of information on language detection
• ignores binary and vendored files
• command line tool mimics the original linguist one
• can be used in Go (native library) or Java (shared library)
architecture
usage
• https://github.com/src-d/enry
• enry: detecting languages
• benchmark methodology and results
resources
YOUR NEXT STEPS
• src-d/enry is at least 4x faster than linguist
• 5x (larger repos) to 20x faster (smaller repos)
GO FASTER
• additional info on benchmarking enry
$ enry /path/to/src-d/go-git
98.28% Go
0.69% Shell
0.34% Makefile
0.34% Markdown
0.34% Text
UNIVERSAL CODE ANALYSIS
• was born as a solution for massive code analysis
• parsing single files in any programming language
• analyze all source code from all repositories in the world
• analyze many languages using a shared
structure/format
motivation
babelfish A SELF-HOSTED SERVER FOR UNIVERSAL SOURCE CODE PARSING
CONTAINER-BASED
architecture
• AST-based diff'ing. Understanding changes made to
code with finer-grained granularity.
• extract features for Machine Learning on Source Code.
• statistics of language features
• detecting similar coding patterns across languages
POWERFUL OPPORTUNITIES
use cases
• language drivers as the main building blocks
• parsing service via one driver per language
• language drivers can be written in any language
and are packaged as standard Docker containers
• containers are executed by the babelfish server
in a specific runtime built on-top of libcontainer.
usage
• https://github.com/bblfsh
• Babelfish documentation
• announcing Babelfish
• Babelfish presentation
• join the Babelfish community
resources
YOUR NEXT STEPSUNIVERSAL AST
architecture
• UAST is a universal (normalized and annotated)
form of Abstract Syntax Tree (AST)
• language-independent annotations (roles) such as
Expression, Statement, Operator, Arithmetic, etc.
• can be easily ported to many languages using
gogo/protobuf
TRY BABELFISH ONLINE
• or run babelfish server & dashboard locally:
$ docker run --privileged -d -p 
9432:9432 --name bblfsh 
bblfsh/server
$ docker run -p 8080:80 --link 
bblfsh bblfsh/dashboard 
--bblfsh-addr bblfsh:9432
CoreOS
infrastructure
Rovers
collection
storage
processing
analysis
K8s
go-git
HDFS
śiva
Apache Spark
source{d} Engine
Bblfsh
Enry
tech stack
Borders
Further directions
further directions
INFRASTRUCTURE
➔ Persistent storage in k8s on bare-metal cluster
COLLECTION
STORAGE
PROCESSING
ANALYSIS
➔ Explore SEDA architecture, to dynamically saturate throughput
➔ Better splittable Git object storage format (w delta-encoding, etc)
➔ Distributed Indexes to speed up common Apache Spark queries
➔ AST-diff, cross-language abstractions on top of ASTs
thank you.

Weitere ähnliche Inhalte

Was ist angesagt?

Openbar 7 - Leuven - OpenShift - The Enterprise Container Platform - Piros
Openbar 7 - Leuven - OpenShift - The Enterprise Container Platform - PirosOpenbar 7 - Leuven - OpenShift - The Enterprise Container Platform - Piros
Openbar 7 - Leuven - OpenShift - The Enterprise Container Platform - PirosOpenbar
 
Ceph Overview for Distributed Computing Denver Meetup
Ceph Overview for Distributed Computing Denver MeetupCeph Overview for Distributed Computing Denver Meetup
Ceph Overview for Distributed Computing Denver Meetupktdreyer
 
A Dive Into Containers and Docker
A Dive Into Containers and DockerA Dive Into Containers and Docker
A Dive Into Containers and DockerMatthew Farina
 
Cloud Native Applications on OpenShift
Cloud Native Applications on OpenShiftCloud Native Applications on OpenShift
Cloud Native Applications on OpenShiftSerhat Dirik
 
Containers - Portable, repeatable user-oriented application delivery. Build, ...
Containers - Portable, repeatable user-oriented application delivery. Build, ...Containers - Portable, repeatable user-oriented application delivery. Build, ...
Containers - Portable, repeatable user-oriented application delivery. Build, ...Walid Shaari
 
Containers in depth – Understanding how containers work to better work with c...
Containers in depth – Understanding how containers work to better work with c...Containers in depth – Understanding how containers work to better work with c...
Containers in depth – Understanding how containers work to better work with c...All Things Open
 
What is a Ceph (and why do I care). OpenStack storage - Colorado OpenStack Me...
What is a Ceph (and why do I care). OpenStack storage - Colorado OpenStack Me...What is a Ceph (and why do I care). OpenStack storage - Colorado OpenStack Me...
What is a Ceph (and why do I care). OpenStack storage - Colorado OpenStack Me...Ian Colle
 
Interop 2017 - Managing Containers in Production
Interop 2017 - Managing Containers in ProductionInterop 2017 - Managing Containers in Production
Interop 2017 - Managing Containers in ProductionBrian Gracely
 
Introduction to Kubernetes
Introduction to KubernetesIntroduction to Kubernetes
Introduction to KubernetesPaul Czarkowski
 
Docker Meetup - Melbourne 2015 - Kubernetes Deep Dive
Docker Meetup - Melbourne 2015 - Kubernetes Deep DiveDocker Meetup - Melbourne 2015 - Kubernetes Deep Dive
Docker Meetup - Melbourne 2015 - Kubernetes Deep DiveKen Thompson
 
Oracle Database on Docker
Oracle Database on DockerOracle Database on Docker
Oracle Database on DockerFranck Pachot
 
Kubernetes for the PHP developer
Kubernetes for the PHP developerKubernetes for the PHP developer
Kubernetes for the PHP developerPaul Czarkowski
 
Open shift enterprise 3.1 paas on kubernetes
Open shift enterprise 3.1   paas on kubernetesOpen shift enterprise 3.1   paas on kubernetes
Open shift enterprise 3.1 paas on kubernetesSamuel Terburg
 
Generating Linked Data descriptions of Debian packages in the Debian PTS
Generating Linked Data descriptions of Debian packages in the Debian PTSGenerating Linked Data descriptions of Debian packages in the Debian PTS
Generating Linked Data descriptions of Debian packages in the Debian PTSolberger
 
Journey to the devops automation with docker kubernetes and openshift
Journey to the devops automation with docker kubernetes and openshiftJourney to the devops automation with docker kubernetes and openshift
Journey to the devops automation with docker kubernetes and openshiftYusuf Hadiwinata Sutandar
 
DCEU 18: Continuous Delivery with Docker Containers and Java: The Good, the B...
DCEU 18: Continuous Delivery with Docker Containers and Java: The Good, the B...DCEU 18: Continuous Delivery with Docker Containers and Java: The Good, the B...
DCEU 18: Continuous Delivery with Docker Containers and Java: The Good, the B...Docker, Inc.
 
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...DataWorks Summit
 
Introduction to GitHub Actions - How to easily automate and integrate with Gi...
Introduction to GitHub Actions - How to easily automate and integrate with Gi...Introduction to GitHub Actions - How to easily automate and integrate with Gi...
Introduction to GitHub Actions - How to easily automate and integrate with Gi...All Things Open
 

Was ist angesagt? (20)

Openbar 7 - Leuven - OpenShift - The Enterprise Container Platform - Piros
Openbar 7 - Leuven - OpenShift - The Enterprise Container Platform - PirosOpenbar 7 - Leuven - OpenShift - The Enterprise Container Platform - Piros
Openbar 7 - Leuven - OpenShift - The Enterprise Container Platform - Piros
 
Ceph Overview for Distributed Computing Denver Meetup
Ceph Overview for Distributed Computing Denver MeetupCeph Overview for Distributed Computing Denver Meetup
Ceph Overview for Distributed Computing Denver Meetup
 
A Dive Into Containers and Docker
A Dive Into Containers and DockerA Dive Into Containers and Docker
A Dive Into Containers and Docker
 
Cloud Native Applications on OpenShift
Cloud Native Applications on OpenShiftCloud Native Applications on OpenShift
Cloud Native Applications on OpenShift
 
Containers - Portable, repeatable user-oriented application delivery. Build, ...
Containers - Portable, repeatable user-oriented application delivery. Build, ...Containers - Portable, repeatable user-oriented application delivery. Build, ...
Containers - Portable, repeatable user-oriented application delivery. Build, ...
 
Containers in depth – Understanding how containers work to better work with c...
Containers in depth – Understanding how containers work to better work with c...Containers in depth – Understanding how containers work to better work with c...
Containers in depth – Understanding how containers work to better work with c...
 
Openshift presentation
Openshift presentationOpenshift presentation
Openshift presentation
 
What is a Ceph (and why do I care). OpenStack storage - Colorado OpenStack Me...
What is a Ceph (and why do I care). OpenStack storage - Colorado OpenStack Me...What is a Ceph (and why do I care). OpenStack storage - Colorado OpenStack Me...
What is a Ceph (and why do I care). OpenStack storage - Colorado OpenStack Me...
 
Interop 2017 - Managing Containers in Production
Interop 2017 - Managing Containers in ProductionInterop 2017 - Managing Containers in Production
Interop 2017 - Managing Containers in Production
 
Introduction to Kubernetes
Introduction to KubernetesIntroduction to Kubernetes
Introduction to Kubernetes
 
Docker Meetup - Melbourne 2015 - Kubernetes Deep Dive
Docker Meetup - Melbourne 2015 - Kubernetes Deep DiveDocker Meetup - Melbourne 2015 - Kubernetes Deep Dive
Docker Meetup - Melbourne 2015 - Kubernetes Deep Dive
 
Oracle Database on Docker
Oracle Database on DockerOracle Database on Docker
Oracle Database on Docker
 
Kubernetes for the PHP developer
Kubernetes for the PHP developerKubernetes for the PHP developer
Kubernetes for the PHP developer
 
Red hat cloud platforms
Red hat cloud platformsRed hat cloud platforms
Red hat cloud platforms
 
Open shift enterprise 3.1 paas on kubernetes
Open shift enterprise 3.1   paas on kubernetesOpen shift enterprise 3.1   paas on kubernetes
Open shift enterprise 3.1 paas on kubernetes
 
Generating Linked Data descriptions of Debian packages in the Debian PTS
Generating Linked Data descriptions of Debian packages in the Debian PTSGenerating Linked Data descriptions of Debian packages in the Debian PTS
Generating Linked Data descriptions of Debian packages in the Debian PTS
 
Journey to the devops automation with docker kubernetes and openshift
Journey to the devops automation with docker kubernetes and openshiftJourney to the devops automation with docker kubernetes and openshift
Journey to the devops automation with docker kubernetes and openshift
 
DCEU 18: Continuous Delivery with Docker Containers and Java: The Good, the B...
DCEU 18: Continuous Delivery with Docker Containers and Java: The Good, the B...DCEU 18: Continuous Delivery with Docker Containers and Java: The Good, the B...
DCEU 18: Continuous Delivery with Docker Containers and Java: The Good, the B...
 
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...
 
Introduction to GitHub Actions - How to easily automate and integrate with Gi...
Introduction to GitHub Actions - How to easily automate and integrate with Gi...Introduction to GitHub Actions - How to easily automate and integrate with Gi...
Introduction to GitHub Actions - How to easily automate and integrate with Gi...
 

Ähnlich wie FOSDEM '18 - Tools for large scale collection and analysis of source code repositories

Analysing GitHub commits with R
Analysing GitHub commits with RAnalysing GitHub commits with R
Analysing GitHub commits with RBarbara Fusinska
 
Managing Open Source Software in the GitHub Era
Managing Open Source Software in the GitHub EraManaging Open Source Software in the GitHub Era
Managing Open Source Software in the GitHub EranexB Inc.
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopEvans Ye
 
August OpenNTF Webinar - Git and GitHub Explained
August OpenNTF Webinar - Git and GitHub ExplainedAugust OpenNTF Webinar - Git and GitHub Explained
August OpenNTF Webinar - Git and GitHub ExplainedHoward Greenberg
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataWes McKinney
 
Docker data science pipeline
Docker data science pipelineDocker data science pipeline
Docker data science pipelineDataWorks Summit
 
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache AccumuloReal-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache AccumuloJoe Stein
 
Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...
Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...
Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...Accumulo Summit
 
Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...source{d}
 
Jenkins Pipeline @ Scale. Building Automation Frameworks for Systems Integration
Jenkins Pipeline @ Scale. Building Automation Frameworks for Systems IntegrationJenkins Pipeline @ Scale. Building Automation Frameworks for Systems Integration
Jenkins Pipeline @ Scale. Building Automation Frameworks for Systems IntegrationOleg Nenashev
 
LasCon 2014 DevOoops
LasCon 2014 DevOoops LasCon 2014 DevOoops
LasCon 2014 DevOoops Chris Gates
 
Analysis of-quality-of-pkgs-in-packagist-univ-20171024
Analysis of-quality-of-pkgs-in-packagist-univ-20171024Analysis of-quality-of-pkgs-in-packagist-univ-20171024
Analysis of-quality-of-pkgs-in-packagist-univ-20171024Clark Everetts
 
DSAG Jahreskongress 2018 - DevOps and Deployment Pipelines in SAP ABAP Landsc...
DSAG Jahreskongress 2018 - DevOps and Deployment Pipelines in SAP ABAP Landsc...DSAG Jahreskongress 2018 - DevOps and Deployment Pipelines in SAP ABAP Landsc...
DSAG Jahreskongress 2018 - DevOps and Deployment Pipelines in SAP ABAP Landsc...Sascha Junkert
 
Women Who Code - RSpec JSON API Workshop
Women Who Code - RSpec JSON API WorkshopWomen Who Code - RSpec JSON API Workshop
Women Who Code - RSpec JSON API WorkshopEddie Lau
 
betterCode Workshop: Effizientes DevOps-Tooling mit Go
betterCode Workshop:  Effizientes DevOps-Tooling mit GobetterCode Workshop:  Effizientes DevOps-Tooling mit Go
betterCode Workshop: Effizientes DevOps-Tooling mit GoQAware GmbH
 
Big Data Day LA 2016/ Big Data Track - Fluentd and Embulk: Collect More Data,...
Big Data Day LA 2016/ Big Data Track - Fluentd and Embulk: Collect More Data,...Big Data Day LA 2016/ Big Data Track - Fluentd and Embulk: Collect More Data,...
Big Data Day LA 2016/ Big Data Track - Fluentd and Embulk: Collect More Data,...Data Con LA
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Michael Rys
 
Road to Opscon (Pisa '15) - DevOoops
Road to Opscon (Pisa '15) - DevOoopsRoad to Opscon (Pisa '15) - DevOoops
Road to Opscon (Pisa '15) - DevOoopsGianluca Varisco
 
How to Manage Open Source requirements with AboutCode
How to Manage Open Source requirements with AboutCodeHow to Manage Open Source requirements with AboutCode
How to Manage Open Source requirements with AboutCodenexB Inc.
 

Ähnlich wie FOSDEM '18 - Tools for large scale collection and analysis of source code repositories (20)

Analysing GitHub commits with R
Analysing GitHub commits with RAnalysing GitHub commits with R
Analysing GitHub commits with R
 
Managing Open Source Software in the GitHub Era
Managing Open Source Software in the GitHub EraManaging Open Source Software in the GitHub Era
Managing Open Source Software in the GitHub Era
 
Migrating To GitHub
Migrating To GitHub  Migrating To GitHub
Migrating To GitHub
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache Bigtop
 
August OpenNTF Webinar - Git and GitHub Explained
August OpenNTF Webinar - Git and GitHub ExplainedAugust OpenNTF Webinar - Git and GitHub Explained
August OpenNTF Webinar - Git and GitHub Explained
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
 
Docker data science pipeline
Docker data science pipelineDocker data science pipeline
Docker data science pipeline
 
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache AccumuloReal-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
 
Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...
Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...
Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...
 
Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...
 
Jenkins Pipeline @ Scale. Building Automation Frameworks for Systems Integration
Jenkins Pipeline @ Scale. Building Automation Frameworks for Systems IntegrationJenkins Pipeline @ Scale. Building Automation Frameworks for Systems Integration
Jenkins Pipeline @ Scale. Building Automation Frameworks for Systems Integration
 
LasCon 2014 DevOoops
LasCon 2014 DevOoops LasCon 2014 DevOoops
LasCon 2014 DevOoops
 
Analysis of-quality-of-pkgs-in-packagist-univ-20171024
Analysis of-quality-of-pkgs-in-packagist-univ-20171024Analysis of-quality-of-pkgs-in-packagist-univ-20171024
Analysis of-quality-of-pkgs-in-packagist-univ-20171024
 
DSAG Jahreskongress 2018 - DevOps and Deployment Pipelines in SAP ABAP Landsc...
DSAG Jahreskongress 2018 - DevOps and Deployment Pipelines in SAP ABAP Landsc...DSAG Jahreskongress 2018 - DevOps and Deployment Pipelines in SAP ABAP Landsc...
DSAG Jahreskongress 2018 - DevOps and Deployment Pipelines in SAP ABAP Landsc...
 
Women Who Code - RSpec JSON API Workshop
Women Who Code - RSpec JSON API WorkshopWomen Who Code - RSpec JSON API Workshop
Women Who Code - RSpec JSON API Workshop
 
betterCode Workshop: Effizientes DevOps-Tooling mit Go
betterCode Workshop:  Effizientes DevOps-Tooling mit GobetterCode Workshop:  Effizientes DevOps-Tooling mit Go
betterCode Workshop: Effizientes DevOps-Tooling mit Go
 
Big Data Day LA 2016/ Big Data Track - Fluentd and Embulk: Collect More Data,...
Big Data Day LA 2016/ Big Data Track - Fluentd and Embulk: Collect More Data,...Big Data Day LA 2016/ Big Data Track - Fluentd and Embulk: Collect More Data,...
Big Data Day LA 2016/ Big Data Track - Fluentd and Embulk: Collect More Data,...
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
 
Road to Opscon (Pisa '15) - DevOoops
Road to Opscon (Pisa '15) - DevOoopsRoad to Opscon (Pisa '15) - DevOoops
Road to Opscon (Pisa '15) - DevOoops
 
How to Manage Open Source requirements with AboutCode
How to Manage Open Source requirements with AboutCodeHow to Manage Open Source requirements with AboutCode
How to Manage Open Source requirements with AboutCode
 

Kürzlich hochgeladen

Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 

Kürzlich hochgeladen (20)

Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 

FOSDEM '18 - Tools for large scale collection and analysis of source code repositories

  • 1. Tools for large-scale collection & analysis of source code repositories OPEN SOURCE GIT REPOSITORY COLLECTION PIPELINE
  • 2. Alexander Bezzubov source{d} ➔ committer & PMC @ apache zeppelin ➔ engineer @source{d} ➔ startup in Madrid ➔ builds the open-source components that enable large-scale code analysis and machine learning on source code Intro
  • 3. motivation & vision MOTIVATION: WHY COLLECTING SOURCE CODE VISION: ➔ Academia: material for research in IR/ML/PL communities ➔ Industry: fuel for building data-driven products (i.e for sourcing candidates for hiring) ➔ OSS collection pipeline ➔ Use it to build public datasets industry and academia ➔ Use Git as “source of truth”, the most popular VCS ➔ A crawler (find URLs, git clone them), Distributed storage (FS, DB), Parallel processing framework custom, in Golang standard, Apache HDFS + Postgres custom, library for Apache Spark
  • 7. infrastructure ➔ Dedicated cluster (cloud becomes prohibitively expensive for storing ~100sTb) ➔ CoreOS provisioned on bare-menta w Terraform ➔ Booting and OS configuration Matchbox and Ignition ➔ K8s deployed on top of that More details at talk at CfgMgmtCamp http://cfgmgmtcamp.eu/schedule/terraform/CoreOS.html
  • 9. collection ➔ Rovers: search for Git repository URLs ➔ Borges: fetching repository w “git pull” ➔ Git storage format & protocol implementation ➔ Optimize for on-disk size: forks that share history, saved together go-git to talk Git Last year had a talk at FOSDEM https://archive.fosdem.org/2017/schedule/event/go_git/
  • 10. GIT LIBRARY FOR GO • need to clone and analyze tens of millions of repositories with our core language Go • be able to do so in memory, and by using custom filesystem implementations • easy to use and stable API for the Go community • used in production by companies, e.g.: keybase.io motivation go-git A HIGHLY EXTENSIBLE IMPLEMENTATION OF GIT IN GO GO-GIT IN ACTION example • the most complete git library for any language after libgit2 and jgit • highly extensible by design • idiomatic API for plumbing and porcelain commands • 2+ years of continuous development • used by a significant number of open source projects PURE GO SOURCE CODE features example mimicking `git clone` using go-git: usage • https://github.com/src-d/go-git • go-git presentation at FOSDEM 2017 • go-git presentation at Git Merge 2017 • compatibility table of git vs. go-git • comparing git trees in go resources YOUR NEXT STEPS TRY IT YOURSELF # installation $ go get -u gopkg.in/src-d/go-git.v4/... • list of more go-git usage examples // Clone the repo to the given directory url := "https://github.com/src-d/go-git", _, err := git.PlainClone( "/tmp/foo", false, &git.CloneOptions{ URL: url, Progress: os.Stdout, }, ) CheckIfError(err) output: Counting objects: 4924, done. Compressing objects: 100% (1333/1333), done. Total 4924 (delta 530), reused 6 (delta 6), pack-reused 3533
  • 11. CODE COLLECTION AT SCALE • collection and storage of repositories at large scale • automated process • optimal usage of storage • optimal to keep repositories up-to-date with the origin motivation rovers & borges LARGE SCALE CODE REPOSITORY COLLECTION AND STORAGE KEY CONCEPT • set up and run rovers • set up borges • run borges producer • run borges consumer architecture • distributed system similar to a search engine • src-d/rovers retrieves URLs from git hosting providers via API, plus self-hosted git repositories • src-d/borges producer reads URL list, schedules fetching • borges consumer fetches and pushes repo to storage • borges packer also available as a standalone command, transforming repository urls into siva files • stores using src-d/śiva repository storage file format • optimized for storage and keeping repos up-to-date SEEK, FETCH, STORE architecture • rooted repositories are standard git repositories that store all objects from all repositories that share a common history, identified by same initial commit: • a rooted repository is saved in a single śiva file • updates stored in concatenated siva files: no need to rewriting the whole repository file • distributed-file-system backed, supports GCS & HDFS usage • https://github.com/src-d/rovers • https://github.com/src-d/borges • https://github.com/src-d/go-siva • śiva: Why We Created Yet Another Archive Format resources YOUR NEXT STEPS SETUP & RUN
  • 13. storage ➔ Metadata: PostgreSQL ➔ Built small type-safe ORM for Go<->Postgres https://github.com/src-d/go-kallax ➔ Data: Apache Hadoop HDFS ➔ Custom (seekable, appendable) archive format: Siva 1 RootedRepository <-> 1 Siva file
  • 14. SMART REPO STORAGE • store a git repository in a single file • updates possible without rewriting the whole file • friendly to distributed file systems • seekable to allow random access to any file position motivation śiva SEEKABLE INDEXED BLOCK ARCHIVER FILE FORMAT SIVA FILE BLOCK SCHEMA architecture CHARACTERISTICS • src-d/go-siva is an archiving format similar to tar or zip • allows constant-time random file access • allows seekable read access to the contained files • allows file concatenation given the block-based design • command-line tool + implementations in Go and Java architecture usage • https://github.com/src-d/go-siva • śiva: Why We Created Yet Another Archive Format resources YOUR NEXT STEPSAPPENDING FILES # pack into siva file $ siva pack example.siva qux # append into siva file $ siva pack --append example.siva bar # list siva file contents $ siva list example.siva Sep 20 13:04 4 B qux -rw-r--r-- Sep 20 13:07 4 B bar -rw-r--r--
  • 16. processing Apache Spark ➔ For batch processing, SparkSQL Engine ➔ Library, w custom DataSource implementation GitDataSource ➔ Read repositories from Siva archives in HDFS, exposes though DataFrame ➔ API for accessing refs/commits/files/blobs ➔ Talks to external services though gRPC for parsing/lexing, and other analysis
  • 17. YOUR NEXT STEPS UNIFIED SCALABLE PIPELINE • easy-to-use pipeline for git repository analysis • integrated with standard tools for large scale data analysis • avoid custom code in operations across millions of repos motivation engine UNIFIED SCALABLE CODE ANALYSIS PIPELINE ON SPARK APACHE SPARK DATAFRAME architecture • listing and retrieval of git repositories • Apache Spark datasource on top of git repositories • iterators over any git object, references • code exploration and querying using XPath expressions • language identification and source code parsing • feature extraction for machine learning at scale PREPARATION architecture usage sample • https://github.com/src-d/engine • Early example jupyter notebook: https://github.com/src-d/spark-api/blob/maste r/examples/notebooks/Example.ipynb resources EngineAPI(spark, 'siva', '/path/to/siva-files') .repositories .references .head_ref .files .classify_languages() .extract_uasts() .query_uast('//*[@roleImport and @roleDeclaration]', 'imports') .filter("lang = 'java'") .select('imports', 'path', 'repository_id') .write .parquet("hdfs://...") • extends Apache SparkSQL • git repositories stored as siva files or standard repositories in HDFS • metadata caching for faster lookups over all the dataset. • fetches repositories in batches and on demand • available APIs for Spark and PySpark • can run either locally or in a distributed cluster
  • 19. analysis Enry ➔ Programming language identification ➔ Re-write of github/linguist in Golang, ~370 langs Project Babelfish ➔ Distributed parser infrastructure for source code analysis ➔ Unified interface though gRPC to native parsers in containers: src -> uAST Talk in Source Code Analysis devRoom Room: UD2.119, Sunday, 12:40 https://fosdem.org/2018/schedule/event/code_babelfish_a_universal_code_parser_for_source_c ode_analysis/
  • 20. • enry speed improvement over linguist when applied to linguist/samples folder file samples usable in Go as a native library, in Java as shared library and as a CLI tool. LANG DETECTION AT SCALE • need to detect programming languages of every file in a git repository • initially used github/linguist, but needed more performance for large scale applications • keep compatibility with the original linguist project motivation enry A FASTER FILE PROGRAMMING LANGUAGE DETECTOR benchmarks COMPATIBLE AND FLEXIBLE • linguist as source of information on language detection • ignores binary and vendored files • command line tool mimics the original linguist one • can be used in Go (native library) or Java (shared library) architecture usage • https://github.com/src-d/enry • enry: detecting languages • benchmark methodology and results resources YOUR NEXT STEPS • src-d/enry is at least 4x faster than linguist • 5x (larger repos) to 20x faster (smaller repos) GO FASTER • additional info on benchmarking enry $ enry /path/to/src-d/go-git 98.28% Go 0.69% Shell 0.34% Makefile 0.34% Markdown 0.34% Text
  • 21. UNIVERSAL CODE ANALYSIS • was born as a solution for massive code analysis • parsing single files in any programming language • analyze all source code from all repositories in the world • analyze many languages using a shared structure/format motivation babelfish A SELF-HOSTED SERVER FOR UNIVERSAL SOURCE CODE PARSING CONTAINER-BASED architecture • AST-based diff'ing. Understanding changes made to code with finer-grained granularity. • extract features for Machine Learning on Source Code. • statistics of language features • detecting similar coding patterns across languages POWERFUL OPPORTUNITIES use cases • language drivers as the main building blocks • parsing service via one driver per language • language drivers can be written in any language and are packaged as standard Docker containers • containers are executed by the babelfish server in a specific runtime built on-top of libcontainer. usage • https://github.com/bblfsh • Babelfish documentation • announcing Babelfish • Babelfish presentation • join the Babelfish community resources YOUR NEXT STEPSUNIVERSAL AST architecture • UAST is a universal (normalized and annotated) form of Abstract Syntax Tree (AST) • language-independent annotations (roles) such as Expression, Statement, Operator, Arithmetic, etc. • can be easily ported to many languages using gogo/protobuf TRY BABELFISH ONLINE • or run babelfish server & dashboard locally: $ docker run --privileged -d -p 9432:9432 --name bblfsh bblfsh/server $ docker run -p 8080:80 --link bblfsh bblfsh/dashboard --bblfsh-addr bblfsh:9432
  • 24. further directions INFRASTRUCTURE ➔ Persistent storage in k8s on bare-metal cluster COLLECTION STORAGE PROCESSING ANALYSIS ➔ Explore SEDA architecture, to dynamically saturate throughput ➔ Better splittable Git object storage format (w delta-encoding, etc) ➔ Distributed Indexes to speed up common Apache Spark queries ➔ AST-diff, cross-language abstractions on top of ASTs