"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Software Engineer - Machine Learning Team at Source {d}
Watch more from Data Natives Berlin 2016 here: http://bit.ly/2fE1sEo
About the Author:
Vadim is currently a Senior Machine Learning Engineer at source{d}, where he works on deep neural networks that aim to understand all of the world's developers through their code. Vadim is one of the creators of Veles (https://velesnet.ml), a distributed deep learning platform built while he was at Samsung. Afterwards he was responsible for the machine learning efforts to fight email spam at Mail.Ru. Vadim has also been a visiting associate professor at the Moscow Institute of Physics and Technology, teaching new technologies and running ACM-style internal coding competitions. He is a big fan of GitHub (vmarkovtsev) and HackerRank (markhor), and likes writing technical articles for a number of websites.
5. Motivation
So to understand the latter, we need to understand what they do and how they
do it. Feature origins:
• Social networks
• Version control statistics
• History
• Style
• Source code
• Algorithms
• Dependency graph
7. Motivation
Let's check how deep we can drill with machine learning on source code style.
Toy task: binary classification between two projects using only features that
originate in code style.
8. Feature engineering
Requirements:
1. Ignore text files, Markdown, etc.
2. Ignore autogenerated files
3. Support many languages with minimal effort
4. Include as much information about the source code as possible
9. Feature engineering
(1) and (2) are solved by github/linguist and source{d}'s own tool
• Used by GitHub for language bars
• Supports 400+ languages
10. Feature engineering
(3) and (4) are solved by Pygments
• Highlights source code (tokenizer)
• Supports 400+ languages (though only 50% intersects with github/linguist)
• ≈90 token types (not all are used for every language)
11. Feature engineering
Pygments example:
# prints "Hello, World!"
if True:
    print("Hello, World!")
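The example above can be reproduced with Pygments directly; a minimal sketch, assuming the `pygments` package is installed:

```python
from pygments import lex
from pygments.lexers import PythonLexer

code = '# prints "Hello, World!"\nif True:\n    print("Hello, World!")\n'

# lex() yields (token_type, value) pairs -- the raw material for the features
tokens = [(str(ttype), value) for ttype, value in lex(code, PythonLexer())]

for ttype, value in tokens:
    print(ttype, repr(value))
```

Token types such as Token.Comment.Single and Token.Keyword are what the classifier sees; the concrete identifier text is what gets abstracted away.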
14. Feature engineering
• Split the token stream into lines, each containing ≤40 tokens
• Merge indents
• "One against all" (one-hot) encoding, storing the token value's length
• Some tokens occupy more than one dimension, e.g. Token.Name reflects the
naming style
• About 200 dimensions overall
• 8,000 features per line, most of them zeros
• Mean-variance normalization
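The per-line encoding above can be sketched as follows; this is a toy with 5 token types standing in for the ~200 real dimensions, and the names `TOKEN_TYPES` and `encode_line` are ours, not from the talk:

```python
import numpy as np

# Toy stand-in for the ~200 real feature dimensions
TOKEN_TYPES = ["Keyword", "Name", "Punctuation", "Text", "String"]
DIM = len(TOKEN_TYPES)
MAX_TOKENS = 40  # each line carries at most 40 tokens

def encode_line(tokens):
    """One-hot ("one against all") encode a line of (type, value) tokens,
    storing the token value's length in the hot cell."""
    features = np.zeros((MAX_TOKENS, DIM), dtype=np.float32)
    for i, (ttype, value) in enumerate(tokens[:MAX_TOKENS]):
        features[i, TOKEN_TYPES.index(ttype)] = len(value)
    return features.ravel()  # flat vector, mostly zeros

# 40 slots x 5 dims = 200 features here; 8000 in the real setup
vec = encode_line([("Keyword", "if"), ("Text", " "),
                   ("Keyword", "True"), ("Punctuation", ":")])
```

With 40 slots of ~200 dimensions each, the real encoding yields the 8,000 mostly-zero features per line mentioned above.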
15. Feature engineering
Though extracted, names as words may not be used in this scheme.
We've explored two approaches to using this extra information:
1. LSTM sequence modelling (link to presentation)
2. ARTM topic modelling (article in our blog)
19. The Network
• Merge all project files together; feed 50 LOT (lines of tokens) as a single
sample.
• Does not converge without randomly shuffling the files (sample borders are,
of course, fixed).
• Batch size is 50.
• Truncate projects to the smallest LOT count.
• Fragile to small hyperparameter deviations.
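The sample construction above can be sketched in pure Python; `make_samples` is our name for illustration, not from the talk:

```python
import random

LINES_PER_SAMPLE = 50  # 50 LOT (lines of tokens) per sample

def make_samples(project_files, seed=0):
    """Shuffle the files of one project, merge their token lines,
    then cut the stream into fixed 50-line samples.
    Only the file order is random; sample borders stay fixed."""
    files = list(project_files)
    random.Random(seed).shuffle(files)  # without this, training does not converge
    lines = [line for f in files for line in f]
    n_samples = len(lines) // LINES_PER_SAMPLE  # drop the incomplete tail
    return [lines[i * LINES_PER_SAMPLE:(i + 1) * LINES_PER_SAMPLE]
            for i in range(n_samples)]

# 4 files of 30 token-lines each -> 120 lines -> 2 samples of 50
samples = make_samples([[(f, i) for i in range(30)] for f in range(4)])
```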
20. The Network
• Python 3 / TensorFlow / NVIDIA GPU
• Preprocessing is done on Dataproc (Spark)
• The feature database is stored in Cloud Storage
• Sparse matrices ⇒ normalization on the fly
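The last point follows from the encoding: subtracting the mean turns every zero into a nonzero, so a pre-normalized matrix could no longer be stored sparsely. A minimal sketch of normalizing only the dense batch that is handed to the GPU (names are ours):

```python
import numpy as np

def normalize_on_the_fly(dense_batch, mean, std):
    """Mean-variance normalization applied at batch time, so the
    stored feature matrix can stay sparse (mostly zeros)."""
    return (dense_batch - mean) / np.maximum(std, 1e-8)

# mean/std would be precomputed once over the whole (sparse) dataset
batch = np.array([[0.0, 2.0], [4.0, 0.0]])
mean = batch.mean(axis=0)
std = batch.std(axis=0)
normed = normalize_on_the_fly(batch, mean, std)
```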
21. Results
projects              description                 size                accuracy
Django vs Twisted     Web frameworks, Python      800 ktok each       84%
Matplotlib vs Bokeh   Plotting libraries, Python  1 Mtok vs 250 ktok  60%
Matplotlib vs Django  Plotting vs web, Python     1 Mtok vs 800 ktok  76%
Django vs Guava       Python vs Java              800 ktok            >99%
Hibernate vs Guava    Java libraries              3 Mtok vs 800 ktok  96%
22. Results
Conclusion: the network likely extracts the internal similarity within each
project and uses it. Just like humans do.
If the languages differ, distinguishing the projects is very easy (at least
because of token types unique to each language).
25. Other work
GitHub has ≈6M active users (≈3M after reasonable filtering). If we can
extract such features for each of them, we can cluster them. Vision:
1. Run k-means with K = 45,000 (using src-d/kmcuda)
2. Run t-SNE to visualize the landscape
BTW, kmcuda implements Yinyang k-means.
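The clustering step can be sketched with plain Lloyd's k-means on a toy dataset; this is a stand-in for illustration, not the kmcuda API, and kmcuda's Yinyang variant is a faster GPU implementation of the same algorithm:

```python
import numpy as np

def lloyd_kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means: alternate nearest-centroid assignment
    and centroid recomputation until the iteration budget runs out."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each point to its nearest centroid (squared Euclidean)
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its cluster
        for j in range(k):
            if (labels == j).any():
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels

# two well-separated toy clusters instead of millions of developer vectors
points = np.vstack([np.zeros((10, 2)), np.full((10, 2), 10.0)])
centroids, labels = lloyd_kmeans(points, k=2)
```

At K = 45,000 over millions of points, a naive distance matrix like the one above is infeasible, which is exactly why a GPU implementation such as kmcuda is needed.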