SlideShare ist ein Scribd-Unternehmen logo
1 von 16
Downloaden Sie, um offline zu lesen
GitRecruit
Recruit the right tech talent using Github
Insight Data Science Fellow:
Yinghan Fu
Tech companies compete for talent
• Recruiting is difficult and expensive.
• Companies use Github for code repositories.
• GitRecruit can automate talent search.
Algorithm-using README files to find
similar repositories
NLTK
Algorithm-using README files to find
similar repositories
NLTK -- the Natural
Language Toolkit -- is a
suite of open source
Python modules, data
sets and tutorials
supporting research
and development in
Natural Language
Processing.
NLTK
README
Algorithm-using README files to find
similar repositories
NLTK
README tf-idf vector
݈ܽ݊݃‫݁݃ܽݑ‬
‫݊݋݄ݐݕ݌‬
݊ܽ‫݈ܽݎݑݐ‬
‫ݐ݋݈݌‬
‫ݐ݈݅݇݋݋ݐ‬
⋮
1.8
2.4
2.4
2
2
⋮
Algorithm-using README files to find
similar repositories
NLTK
README tf-idf vector
Scikit-learn is a Python
module for machine
learning built on top of
SciPy and distributed
under the 3-Clause
BSD license.
The project was
started in 2007 by
David Cournapeau as a
Google Summer of
Matplotlib is a python
2D plotting library
which produces
publication quality
figures in a variety of
hardcopy formats and
interactive
environments across
platforms. matplotlib
can be used in python
~110,000 repository README files
~70% pull requests
NumPy is the
fundamental package
needed for scientific
computing with
Python. This package
contains: a powerful N-
dimensional array
object . sophisticated
(broadcasting)
functions
݈ܽ݊݃‫݁݃ܽݑ‬
‫݊݋݄ݐݕ݌‬
݊ܽ‫݈ܽݎݑݐ‬
‫ݐ݋݈݌‬
‫ݐ݈݅݇݋݋ݐ‬
⋮
1.8
2.4
2.4
2
2
⋮
Algorithm-using README files to find
similar repositories
NLTK
README tf-idf vector
~110,000 repository README files
~70% pull requests
݈ܽ݊݃‫݁݃ܽݑ‬
‫݊݋݄ݐݕ݌‬
݊ܽ‫݈ܽݎݑݐ‬
‫ݐ݋݈݌‬
‫݂ܿ݅݅ݐ݊݁݅ܿݏ‬
⋮
0 0 0
2.5 2.3 2.8
0 0 0
0 3.1 0
5.4 4.3 6.7
⋮ ⋮ ⋮
݈ܽ݊݃‫݁݃ܽݑ‬
‫݊݋݄ݐݕ݌‬
݊ܽ‫݈ܽݎݑݐ‬
‫ݐ݋݈݌‬
‫ݐ݈݅݇݋݋ݐ‬
⋮
1.8
2.4
2.4
2
2
⋮
Algorithm-using README files to find
similar repositories
ܰ‫	ܭܶܮ‬ 0.5 0.3 0.8
Cosine similarity
Algorithm-using README files to find
similar repositories
ܰ‫	ܭܶܮ‬ 0.5 0.3 0.8
Cosine similarity
Optimization and benchmark show
the search engine is effective
Database Driver Audio Text Processing
Training
133 repository readme files
Optimize mean average precision
Testing
123 repository readme files
Mean average precision 0.39
(random 0.09, p-value<0.00001)
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
....
About me, Yinghan Fu
About me, Yinghan Fu
• GitRecruit
– MySQL, Python, NLTK, scikit-learn etc.
• Machine learning
– C++, Python
• Dynamic programming algorithm
– C++
• User-user/item-item collaborative filtering
– Python, mrjob, MongoDB
The search engine can concentrate
python packages from the same category
Github activities are highly concentrated
16 mil repos
6 mil users
2.7 mil pull requests merged
0.4 mil repos have any pull
requests merged
tf-idf vectorization
ܶ‫ܨ‬ = 	1 + log	(‫݂ݐ‬௧,ௗ)
‫ܨܦܫ‬ = 	log	
ܰ
݂݀௧
Parameters optimized
• Number of words included
• Minimum document frequency
• Maximum document frequency
• Sublinear term frequency
• Cosine similarity
• Maximum n-gram

Weitere ähnliche Inhalte

Was ist angesagt?

Researh toolbox-data-analysis-with-python
Researh toolbox-data-analysis-with-pythonResearh toolbox-data-analysis-with-python
Researh toolbox-data-analysis-with-pythonWaternomics
 
Bypassing DEP using ROP
Bypassing DEP using ROPBypassing DEP using ROP
Bypassing DEP using ROPJapneet Singh
 
Machine Learning on Code - SF meetup
Machine Learning on Code - SF meetupMachine Learning on Code - SF meetup
Machine Learning on Code - SF meetupsource{d}
 
IRIS-HEP: Boost-histogram and Hist
IRIS-HEP: Boost-histogram and HistIRIS-HEP: Boost-histogram and Hist
IRIS-HEP: Boost-histogram and HistHenry Schreiner
 
Concurrency and Python - PyCon MY 2015
Concurrency and Python - PyCon MY 2015Concurrency and Python - PyCon MY 2015
Concurrency and Python - PyCon MY 2015Boey Pak Cheong
 
CHEP 2018: A Python upgrade to the GooFit package for parallel fitting
CHEP 2018: A Python upgrade to the GooFit package for parallel fittingCHEP 2018: A Python upgrade to the GooFit package for parallel fitting
CHEP 2018: A Python upgrade to the GooFit package for parallel fittingHenry Schreiner
 
CHEP 2019: Recent developments in histogram libraries
CHEP 2019: Recent developments in histogram librariesCHEP 2019: Recent developments in histogram libraries
CHEP 2019: Recent developments in histogram librariesHenry Schreiner
 
Python in a physics lab
Python in a physics labPython in a physics lab
Python in a physics labGergely Imreh
 
Business logic with PostgreSQL and Python
Business logic with PostgreSQL and PythonBusiness logic with PostgreSQL and Python
Business logic with PostgreSQL and PythonHubert Piotrowski
 
Digital RSE: automated code quality checks - RSE group meeting
Digital RSE: automated code quality checks - RSE group meetingDigital RSE: automated code quality checks - RSE group meeting
Digital RSE: automated code quality checks - RSE group meetingHenry Schreiner
 
DIANA: Recent developments in GooFit
DIANA: Recent developments in GooFitDIANA: Recent developments in GooFit
DIANA: Recent developments in GooFitHenry Schreiner
 
[PyCon 2014 APAC] How to integrate python into a scala stack to build realtim...
[PyCon 2014 APAC] How to integrate python into a scala stack to build realtim...[PyCon 2014 APAC] How to integrate python into a scala stack to build realtim...
[PyCon 2014 APAC] How to integrate python into a scala stack to build realtim...Jerry Chou
 
Cpu.ppt INTRODUCTION TO “C”
Cpu.ppt INTRODUCTION TO “C” Cpu.ppt INTRODUCTION TO “C”
Cpu.ppt INTRODUCTION TO “C” Sukhvinder Singh
 
The Computer Science Behind a modern Distributed Database
The Computer Science Behind a modern Distributed DatabaseThe Computer Science Behind a modern Distributed Database
The Computer Science Behind a modern Distributed DatabaseArangoDB Database
 
ROOT 2018: iminuit and MINUIT2 Standalone
ROOT 2018: iminuit and MINUIT2 StandaloneROOT 2018: iminuit and MINUIT2 Standalone
ROOT 2018: iminuit and MINUIT2 StandaloneHenry Schreiner
 

Was ist angesagt? (20)

Researh toolbox-data-analysis-with-python
Researh toolbox-data-analysis-with-pythonResearh toolbox-data-analysis-with-python
Researh toolbox-data-analysis-with-python
 
Numba lightning
Numba lightningNumba lightning
Numba lightning
 
Bypassing DEP using ROP
Bypassing DEP using ROPBypassing DEP using ROP
Bypassing DEP using ROP
 
Python to scala
Python to scalaPython to scala
Python to scala
 
Machine Learning on Code - SF meetup
Machine Learning on Code - SF meetupMachine Learning on Code - SF meetup
Machine Learning on Code - SF meetup
 
IRIS-HEP: Boost-histogram and Hist
IRIS-HEP: Boost-histogram and HistIRIS-HEP: Boost-histogram and Hist
IRIS-HEP: Boost-histogram and Hist
 
Concurrency and Python - PyCon MY 2015
Concurrency and Python - PyCon MY 2015Concurrency and Python - PyCon MY 2015
Concurrency and Python - PyCon MY 2015
 
CHEP 2018: A Python upgrade to the GooFit package for parallel fitting
CHEP 2018: A Python upgrade to the GooFit package for parallel fittingCHEP 2018: A Python upgrade to the GooFit package for parallel fitting
CHEP 2018: A Python upgrade to the GooFit package for parallel fitting
 
CHEP 2019: Recent developments in histogram libraries
CHEP 2019: Recent developments in histogram librariesCHEP 2019: Recent developments in histogram libraries
CHEP 2019: Recent developments in histogram libraries
 
Python in a physics lab
Python in a physics labPython in a physics lab
Python in a physics lab
 
Using HDF5 and Python: The H5py module
Using HDF5 and Python: The H5py moduleUsing HDF5 and Python: The H5py module
Using HDF5 and Python: The H5py module
 
ACAT 2017: GooFit 2.0
ACAT 2017: GooFit 2.0ACAT 2017: GooFit 2.0
ACAT 2017: GooFit 2.0
 
Business logic with PostgreSQL and Python
Business logic with PostgreSQL and PythonBusiness logic with PostgreSQL and Python
Business logic with PostgreSQL and Python
 
Pybind11 - SciPy 2021
Pybind11 - SciPy 2021Pybind11 - SciPy 2021
Pybind11 - SciPy 2021
 
Digital RSE: automated code quality checks - RSE group meeting
Digital RSE: automated code quality checks - RSE group meetingDigital RSE: automated code quality checks - RSE group meeting
Digital RSE: automated code quality checks - RSE group meeting
 
DIANA: Recent developments in GooFit
DIANA: Recent developments in GooFitDIANA: Recent developments in GooFit
DIANA: Recent developments in GooFit
 
[PyCon 2014 APAC] How to integrate python into a scala stack to build realtim...
[PyCon 2014 APAC] How to integrate python into a scala stack to build realtim...[PyCon 2014 APAC] How to integrate python into a scala stack to build realtim...
[PyCon 2014 APAC] How to integrate python into a scala stack to build realtim...
 
Cpu.ppt INTRODUCTION TO “C”
Cpu.ppt INTRODUCTION TO “C” Cpu.ppt INTRODUCTION TO “C”
Cpu.ppt INTRODUCTION TO “C”
 
The Computer Science Behind a modern Distributed Database
The Computer Science Behind a modern Distributed DatabaseThe Computer Science Behind a modern Distributed Database
The Computer Science Behind a modern Distributed Database
 
ROOT 2018: iminuit and MINUIT2 Standalone
ROOT 2018: iminuit and MINUIT2 StandaloneROOT 2018: iminuit and MINUIT2 Standalone
ROOT 2018: iminuit and MINUIT2 Standalone
 

Ähnlich wie GitRecruit final 1

Python standard library &amp; list of important libraries
Python standard library &amp; list of important librariesPython standard library &amp; list of important libraries
Python standard library &amp; list of important librariesgrinu
 
The Future of Computing is Distributed
The Future of Computing is DistributedThe Future of Computing is Distributed
The Future of Computing is DistributedAlluxio, Inc.
 
Deep Learning Frameworks 2019 | Which Deep Learning Framework To Use | Deep L...
Deep Learning Frameworks 2019 | Which Deep Learning Framework To Use | Deep L...Deep Learning Frameworks 2019 | Which Deep Learning Framework To Use | Deep L...
Deep Learning Frameworks 2019 | Which Deep Learning Framework To Use | Deep L...Simplilearn
 
DHT_62196cbe17eefc645ce6794676313372.pptx
DHT_62196cbe17eefc645ce6794676313372.pptxDHT_62196cbe17eefc645ce6794676313372.pptx
DHT_62196cbe17eefc645ce6794676313372.pptxGovadaDhana
 
Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container EraSadayuki Furuhashi
 
Tips and tricks for data science projects with Python
Tips and tricks for data science projects with Python Tips and tricks for data science projects with Python
Tips and tricks for data science projects with Python Jose Manuel Ortega Candel
 
Role of python in hpc
Role of python in hpcRole of python in hpc
Role of python in hpcDr Reeja S R
 
Python 3.5: An agile, general-purpose development language.
Python 3.5: An agile, general-purpose development language.Python 3.5: An agile, general-purpose development language.
Python 3.5: An agile, general-purpose development language.Carlos Miguel Ferreira
 
Scientific Computing @ Fred Hutch
Scientific Computing @ Fred HutchScientific Computing @ Fred Hutch
Scientific Computing @ Fred HutchDirk Petersen
 
ACM Sunnyvale Meetup.pdf
ACM Sunnyvale Meetup.pdfACM Sunnyvale Meetup.pdf
ACM Sunnyvale Meetup.pdfAnyscale
 
Introduction to Python Programming
Introduction to Python ProgrammingIntroduction to Python Programming
Introduction to Python ProgrammingAkhil Kaushik
 
License Plate Recognition System using Python and OpenCV
License Plate Recognition System using Python and OpenCVLicense Plate Recognition System using Python and OpenCV
License Plate Recognition System using Python and OpenCVVishal Polley
 
WRENCH: Workflow Management System Simulation Workbench
WRENCH: Workflow Management System Simulation WorkbenchWRENCH: Workflow Management System Simulation Workbench
WRENCH: Workflow Management System Simulation WorkbenchRafael Ferreira da Silva
 
Anaconda Python KNIME & Orange Installation
Anaconda Python KNIME & Orange InstallationAnaconda Python KNIME & Orange Installation
Anaconda Python KNIME & Orange InstallationGirinath Pillai
 
Robot Framework with Python | Edureka
Robot Framework with Python | EdurekaRobot Framework with Python | Edureka
Robot Framework with Python | EdurekaEdureka!
 
Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump ...
Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump ...Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump ...
Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump ...Igor Korkin
 
Getting More Out of the Node.js, PHP, and Python Agents - AppSphere16
Getting More Out of the Node.js, PHP, and Python Agents - AppSphere16Getting More Out of the Node.js, PHP, and Python Agents - AppSphere16
Getting More Out of the Node.js, PHP, and Python Agents - AppSphere16AppDynamics
 
Improving your team’s source code searching capabilities
Improving your team’s source code searching capabilitiesImproving your team’s source code searching capabilities
Improving your team’s source code searching capabilitiesNikos Katirtzis
 

Ähnlich wie GitRecruit final 1 (20)

Python standard library &amp; list of important libraries
Python standard library &amp; list of important librariesPython standard library &amp; list of important libraries
Python standard library &amp; list of important libraries
 
The Future of Computing is Distributed
The Future of Computing is DistributedThe Future of Computing is Distributed
The Future of Computing is Distributed
 
Deep Learning Frameworks 2019 | Which Deep Learning Framework To Use | Deep L...
Deep Learning Frameworks 2019 | Which Deep Learning Framework To Use | Deep L...Deep Learning Frameworks 2019 | Which Deep Learning Framework To Use | Deep L...
Deep Learning Frameworks 2019 | Which Deep Learning Framework To Use | Deep L...
 
DHT_62196cbe17eefc645ce6794676313372.pptx
DHT_62196cbe17eefc645ce6794676313372.pptxDHT_62196cbe17eefc645ce6794676313372.pptx
DHT_62196cbe17eefc645ce6794676313372.pptx
 
Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container Era
 
Tips and tricks for data science projects with Python
Tips and tricks for data science projects with Python Tips and tricks for data science projects with Python
Tips and tricks for data science projects with Python
 
Python ml
Python mlPython ml
Python ml
 
Role of python in hpc
Role of python in hpcRole of python in hpc
Role of python in hpc
 
Session 2
Session 2Session 2
Session 2
 
Python 3.5: An agile, general-purpose development language.
Python 3.5: An agile, general-purpose development language.Python 3.5: An agile, general-purpose development language.
Python 3.5: An agile, general-purpose development language.
 
Scientific Computing @ Fred Hutch
Scientific Computing @ Fred HutchScientific Computing @ Fred Hutch
Scientific Computing @ Fred Hutch
 
ACM Sunnyvale Meetup.pdf
ACM Sunnyvale Meetup.pdfACM Sunnyvale Meetup.pdf
ACM Sunnyvale Meetup.pdf
 
Introduction to Python Programming
Introduction to Python ProgrammingIntroduction to Python Programming
Introduction to Python Programming
 
License Plate Recognition System using Python and OpenCV
License Plate Recognition System using Python and OpenCVLicense Plate Recognition System using Python and OpenCV
License Plate Recognition System using Python and OpenCV
 
WRENCH: Workflow Management System Simulation Workbench
WRENCH: Workflow Management System Simulation WorkbenchWRENCH: Workflow Management System Simulation Workbench
WRENCH: Workflow Management System Simulation Workbench
 
Anaconda Python KNIME & Orange Installation
Anaconda Python KNIME & Orange InstallationAnaconda Python KNIME & Orange Installation
Anaconda Python KNIME & Orange Installation
 
Robot Framework with Python | Edureka
Robot Framework with Python | EdurekaRobot Framework with Python | Edureka
Robot Framework with Python | Edureka
 
Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump ...
Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump ...Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump ...
Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump ...
 
Getting More Out of the Node.js, PHP, and Python Agents - AppSphere16
Getting More Out of the Node.js, PHP, and Python Agents - AppSphere16Getting More Out of the Node.js, PHP, and Python Agents - AppSphere16
Getting More Out of the Node.js, PHP, and Python Agents - AppSphere16
 
Improving your team’s source code searching capabilities
Improving your team’s source code searching capabilitiesImproving your team’s source code searching capabilities
Improving your team’s source code searching capabilities
 

Kürzlich hochgeladen

Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRLLucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRLimonikaupta
 
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service AvailableCall Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service AvailableSeo
 
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...Diya Sharma
 
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...SUHANI PANDEY
 
Dubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls Dubai
Dubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls DubaiDubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls Dubai
Dubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls Dubaikojalkojal131
 
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...APNIC
 
Russian Call Girls Pune (Adult Only) 8005736733 Escort Service 24x7 Cash Pay...
Russian Call Girls Pune  (Adult Only) 8005736733 Escort Service 24x7 Cash Pay...Russian Call Girls Pune  (Adult Only) 8005736733 Escort Service 24x7 Cash Pay...
Russian Call Girls Pune (Adult Only) 8005736733 Escort Service 24x7 Cash Pay...SUHANI PANDEY
 
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...tanu pandey
 
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort ServiceEnjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort ServiceDelhi Call girls
 
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark WebGDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark WebJames Anderson
 
Trump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts SweatshirtTrump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts Sweatshirtrahman018755
 
Hot Service (+9316020077 ) Goa Call Girls Real Photos and Genuine Service
Hot Service (+9316020077 ) Goa  Call Girls Real Photos and Genuine ServiceHot Service (+9316020077 ) Goa  Call Girls Real Photos and Genuine Service
Hot Service (+9316020077 ) Goa Call Girls Real Photos and Genuine Servicesexy call girls service in goa
 
Real Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirtReal Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirtrahman018755
 
Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.soniya singh
 
Busty Desi⚡Call Girls in Vasundhara Ghaziabad >༒8448380779 Escort Service
Busty Desi⚡Call Girls in Vasundhara Ghaziabad >༒8448380779 Escort ServiceBusty Desi⚡Call Girls in Vasundhara Ghaziabad >༒8448380779 Escort Service
Busty Desi⚡Call Girls in Vasundhara Ghaziabad >༒8448380779 Escort ServiceDelhi Call girls
 
Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝soniya singh
 

Kürzlich hochgeladen (20)

Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRLLucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
 
6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...
6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...
6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...
 
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service AvailableCall Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
 
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
 
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...
 
Dubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls Dubai
Dubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls DubaiDubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls Dubai
Dubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls Dubai
 
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
 
Russian Call Girls Pune (Adult Only) 8005736733 Escort Service 24x7 Cash Pay...
Russian Call Girls Pune  (Adult Only) 8005736733 Escort Service 24x7 Cash Pay...Russian Call Girls Pune  (Adult Only) 8005736733 Escort Service 24x7 Cash Pay...
Russian Call Girls Pune (Adult Only) 8005736733 Escort Service 24x7 Cash Pay...
 
Dwarka Sector 26 Call Girls | Delhi | 9999965857 🫦 Vanshika Verma More Our Se...
Dwarka Sector 26 Call Girls | Delhi | 9999965857 🫦 Vanshika Verma More Our Se...Dwarka Sector 26 Call Girls | Delhi | 9999965857 🫦 Vanshika Verma More Our Se...
Dwarka Sector 26 Call Girls | Delhi | 9999965857 🫦 Vanshika Verma More Our Se...
 
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
 
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort ServiceEnjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort Service
 
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
 
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark WebGDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
 
Trump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts SweatshirtTrump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts Sweatshirt
 
Hot Service (+9316020077 ) Goa Call Girls Real Photos and Genuine Service
Hot Service (+9316020077 ) Goa  Call Girls Real Photos and Genuine ServiceHot Service (+9316020077 ) Goa  Call Girls Real Photos and Genuine Service
Hot Service (+9316020077 ) Goa Call Girls Real Photos and Genuine Service
 
Real Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirtReal Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirt
 
Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.
 
Busty Desi⚡Call Girls in Vasundhara Ghaziabad >༒8448380779 Escort Service
Busty Desi⚡Call Girls in Vasundhara Ghaziabad >༒8448380779 Escort ServiceBusty Desi⚡Call Girls in Vasundhara Ghaziabad >༒8448380779 Escort Service
Busty Desi⚡Call Girls in Vasundhara Ghaziabad >༒8448380779 Escort Service
 
Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝
 
Low Sexy Call Girls In Mohali 9053900678 🥵Have Save And Good Place 🥵
Low Sexy Call Girls In Mohali 9053900678 🥵Have Save And Good Place 🥵Low Sexy Call Girls In Mohali 9053900678 🥵Have Save And Good Place 🥵
Low Sexy Call Girls In Mohali 9053900678 🥵Have Save And Good Place 🥵
 

GitRecruit final 1

  • 1. GitRecruit Recruit the right tech talent using Github Insight Data Science Fellow: Yinghan Fu
  • 2. Tech companies compete for talent • Recruiting is difficult and expensive. • Companies use Github for code repositories. • GitRecruit can automate talent search.
  • 3. Algorithm-using README files to find similar repositories NLTK
  • 4. Algorithm-using README files to find similar repositories NLTK -- the Natural Language Toolkit -- is a suite of open source Python modules, data sets and tutorials supporting research and development in Natural Language Processing. NLTK README
  • 5. Algorithm-using README files to find similar repositories NLTK README tf-idf vector ݈ܽ݊݃‫݁݃ܽݑ‬ ‫݊݋݄ݐݕ݌‬ ݊ܽ‫݈ܽݎݑݐ‬ ‫ݐ݋݈݌‬ ‫ݐ݈݅݇݋݋ݐ‬ ⋮ 1.8 2.4 2.4 2 2 ⋮
  • 6. Algorithm-using README files to find similar repositories NLTK README tf-idf vector Scikit-learn is a Python module for machine learning built on top of SciPy and distributed under the 3-Clause BSD license. The project was started in 2007 by David Cournapeau as a Google Summer of Matplotlib is a python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. matplotlib can be used in python ~110,000 repository README files ~70% pull requests NumPy is the fundamental package needed for scientific computing with Python. This package contains: a powerful N- dimensional array object . sophisticated (broadcasting) functions ݈ܽ݊݃‫݁݃ܽݑ‬ ‫݊݋݄ݐݕ݌‬ ݊ܽ‫݈ܽݎݑݐ‬ ‫ݐ݋݈݌‬ ‫ݐ݈݅݇݋݋ݐ‬ ⋮ 1.8 2.4 2.4 2 2 ⋮
  • 7. Algorithm-using README files to find similar repositories NLTK README tf-idf vector ~110,000 repository README files ~70% pull requests ݈ܽ݊݃‫݁݃ܽݑ‬ ‫݊݋݄ݐݕ݌‬ ݊ܽ‫݈ܽݎݑݐ‬ ‫ݐ݋݈݌‬ ‫݂ܿ݅݅ݐ݊݁݅ܿݏ‬ ⋮ 0 0 0 2.5 2.3 2.8 0 0 0 0 3.1 0 5.4 4.3 6.7 ⋮ ⋮ ⋮ ݈ܽ݊݃‫݁݃ܽݑ‬ ‫݊݋݄ݐݕ݌‬ ݊ܽ‫݈ܽݎݑݐ‬ ‫ݐ݋݈݌‬ ‫ݐ݈݅݇݋݋ݐ‬ ⋮ 1.8 2.4 2.4 2 2 ⋮
  • 8. Algorithm-using README files to find similar repositories ܰ‫ ܭܶܮ‬ 0.5 0.3 0.8 Cosine similarity
  • 9. Algorithm-using README files to find similar repositories ܰ‫ ܭܶܮ‬ 0.5 0.3 0.8 Cosine similarity
  • 10. Optimization and benchmark show the search engine is effective Database Driver Audio Text Processing Training 133 repository readme files Optimize mean average precision Testing 123 repository readme files Mean average precision 0.39 (random 0.09, p-value<0.00001) …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… ....
  • 12. About me, Yinghan Fu • GitRecruit – MySQL, Python, NLTK, scikit-learn etc. • Machine learning – C++, Python • Dynamic programming algorithm – C++ • User-user/item-item collaborative filtering – Python, mrjob, MongoDB
  • 13. The search engine can concentrate python packages from the same category
  • 14. Github activities are highly concentrated 16 mil repos 6 mil users 2.7 mil pull requests merged 0.4 mil repos have any pull requests merged
  • 15. tf-idf vectorization ܶ‫ܨ‬ = 1 + log (‫݂ݐ‬௧,ௗ) ‫ܨܦܫ‬ = log ܰ ݂݀௧
  • 16. Parameters optimized • Number of words included • Minimum document frequency • Maximum document frequency • Sublinear term frequency • Cosine similarity • Maximum n-gram