2. Tech companies compete for talent
• Recruiting is difficult and expensive.
• Companies use GitHub for code repositories.
• GitRecruit can automate talent search.
4. Algorithm: using README files to find similar repositories
NLTK README:
NLTK -- the Natural Language Toolkit -- is a suite of open source Python modules, data sets and tutorials supporting research and development in Natural Language Processing.
6. Algorithm: using README files to find similar repositories
NLTK README -> tf-idf vector
Scikit-learn is a Python module for machine learning built on top of SciPy and distributed under the 3-Clause BSD license. The project was started in 2007 by David Cournapeau as a Google Summer of
Matplotlib is a python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. matplotlib can be used in python
~110,000 repository README files
~70% pull requests
NumPy is the fundamental package needed for scientific computing with Python. This package contains: a powerful N-dimensional array object; sophisticated (broadcasting) functions
tf-idf vector (term: weight):
language: 1.8
python: 2.4
natural: 2.4
tool: 2
toolkit: 2
⋮
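The README-to-vector step on these slides can be sketched in a few lines. This is a minimal illustration, not the presenter's implementation: the tokenizer, the smoothed idf formula, the shortened README strings, and the function names (`tokenize`, `tfidf_vectors`, `cosine_similarity`) are all assumptions made for the example. Each README becomes a sparse term-to-weight mapping, and repositories are compared by the cosine of the angle between their vectors.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document.

    Uses raw term counts and a smoothed idf, ln((1 + N) / (1 + df)) + 1.
    This idf variant is an assumption, not something stated on the slides.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Shortened README excerpts, used only as toy input.
readmes = {
    "nltk": "NLTK the Natural Language Toolkit is a suite of open source "
            "Python modules for Natural Language Processing",
    "scikit-learn": "Scikit-learn is a Python module for machine learning "
                    "built on top of SciPy",
    "matplotlib": "Matplotlib is a Python 2D plotting library which "
                  "produces publication quality figures",
}
names = list(readmes)
vectors = tfidf_vectors([readmes[n] for n in names])
query = vectors[0]  # the NLTK README

# Rank the other repositories by similarity to the query README.
ranked = sorted(
    ((cosine_similarity(query, v), n) for n, v in zip(names[1:], vectors[1:])),
    reverse=True,
)
for score, name in ranked:
    print(f"{name}: {score:.3f}")
```

With a real corpus, the same ranking would be computed against every other repository's vector rather than two toy strings.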
16. Parameters optimized
• Number of words included
• Minimum document frequency
• Maximum document frequency
• Sublinear term frequency
• Cosine similarity
• Maximum n-gram
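To make the parameters above concrete, here is a small sketch of how most of them could act on the vocabulary and on term weights. It is an illustration under stated assumptions: the argument names (`max_features`, `min_df`, `max_df`, `max_ngram`, `sublinear_tf`) are chosen for this example, the tie-breaking rule is arbitrary, and the slides do not say which library or settings were actually used.

```python
import math
import re
from collections import Counter


def ngrams(tokens, max_n):
    """Return all word n-grams of length 1..max_n as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def build_vocabulary(docs, max_features=None, min_df=1, max_df=1.0, max_ngram=1):
    """Select terms by document frequency (hypothetical helper).

    min_df is an absolute document count; max_df is a fraction of documents.
    max_features caps the number of words included.
    """
    doc_terms = [
        set(ngrams(re.findall(r"[a-z]+", d.lower()), max_ngram)) for d in docs
    ]
    df = Counter(t for terms in doc_terms for t in terms)
    n_docs = len(docs)
    kept = [t for t, c in df.items() if c >= min_df and c / n_docs <= max_df]
    # Most widespread terms first; ties broken alphabetically (an arbitrary choice).
    kept.sort(key=lambda t: (-df[t], t))
    return kept[:max_features]


def term_frequency(count, sublinear_tf=False):
    """Raw count, or 1 + ln(count) when sublinear scaling is on (count >= 1)."""
    return 1.0 + math.log(count) if sublinear_tf else float(count)


docs = [
    "python module for machine learning",
    "python plotting library",
    "python package for scientific computing",
    "natural language toolkit for python",
]

# "python" is in every document, so max_df=0.9 drops it;
# terms seen in only one document fail min_df=2.
print(build_vocabulary(docs, min_df=2, max_df=0.9))
# Cap the vocabulary at the two most widespread terms.
print(build_vocabulary(docs, max_features=2))
# Sublinear scaling dampens a term repeated 10 times.
print(term_frequency(10), term_frequency(10, sublinear_tf=True))
```

Optimizing these parameters would then mean rebuilding the vectors for each candidate setting and scoring the resulting similarity rankings against some held-out signal; that evaluation loop is not shown here.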