Assisting Code Search with
Automatic Query Reformulation
for Bug Localization
Bunyamin Sisman & Avinash Kak
bsisman@purdue.edu | kak@purdue.edu
Purdue University
MSR’13
Outline
I. Motivation
II. Past Work on Automatic Query Reformulation
III. Proposed Approach: Spatial Code Proximity (SCP)
IV. Data Preparation
V. Results
VI. Conclusions
Main Motivation
It is extremely common for software developers to use arbitrary
abbreviations & concatenations in software. These are generally
difficult to predict when searching the code base of a project.
The question is: “Is there some way to automatically reformulate a
user’s query so that all such relevant terms are also used in retrieval?”
Summary of Our Contribution
• We show how a query can be automatically reformulated for superior retrieval accuracy
• We propose a new framework for Query Reformulation, which leverages the spatial proximity of the terms in files
• The approach leads to significant improvements over the baseline and the competing Query Reformulation approaches
Summary of Our Contribution (Cont’d)
• Our approach preserves or improves the retrieval accuracy for 76% of the 4,393 bugs we analyzed for the Eclipse and Chrome projects
• Our approach improves the retrieval accuracy for 42% of the 4,393 bugs
• Improvements are 66% for Eclipse and 90% for Chrome in terms of MAP (Mean Average Precision)
• We also describe the conditions under which Query Reformulation may perform poorly.
Query Reformulation with Relevance Feedback
1. Perform an initial retrieval with the original query
2. Analyze the set of top retrieved documents vis-à-vis the query
3. Reformulate the query
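The three relevance-feedback steps can be sketched as a small loop. This is a minimal illustration only: the term-count `score` function, the toy `corpus`, and the parameters `k` and `n_new_terms` are assumptions made for the example, not the retrieval engine or settings used in the talk.

```python
from collections import Counter

def score(query_terms, doc_terms):
    """Toy relevance score: how often the query terms occur in the document."""
    counts = Counter(doc_terms)
    return sum(counts[t] for t in set(query_terms))

def retrieve(query_terms, corpus, k):
    """Return the names of the k highest-scoring documents (ties broken by name)."""
    ranked = sorted(corpus, key=lambda name: (-score(query_terms, corpus[name]), name))
    return ranked[:k]

def reformulate(query_terms, corpus, k=2, n_new_terms=2):
    """Feedback loop that treats the top-k retrieved documents as relevant."""
    top_docs = retrieve(query_terms, corpus, k)              # step 1: initial retrieval
    candidates = Counter()                                    # step 2: analyze top documents
    for name in top_docs:
        candidates.update(t for t in corpus[name] if t not in query_terms)
    best = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))[:n_new_terms]
    return list(query_terms) + [t for t, _ in best]           # step 3: reformulate

# Hypothetical three-file corpus, already tokenized.
corpus = {
    "tab_strip.cc": ["tab", "strip", "animation", "browser", "tab", "pinned"],
    "browser_view.cc": ["browser", "view", "animation", "tab", "layout"],
    "net_cache.cc": ["cache", "http", "socket"],
}
new_query = reformulate(["browser", "animation"], corpus)
print(new_query)
```

Here the expansion terms are simply the most frequent non-query terms in the top documents; any other term-selection rule could be plugged into step 2.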
Acquiring Relevance Feedback
• Implicitly: infer feedback from user interactions
• Explicitly: user provides feedback [Gay et al. 2009]
• Pseudo Relevance Feedback (PRF): Automatic QR
  • This is our work!
Data Flow in the Proposed Retrieval Framework
Automatic Query Reformulation
• No user involvement!
• It takes less than a second to reformulate a query on ordinary desktop hardware!
• It is cheap!
• It is effective!
Previous Work on Automatic QR (for Text Retrieval)
• Rocchio’s Formula (ROCC)
• Relevance Model (RM)
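The formula itself did not survive on this slide. As a reminder, the textbook positive-feedback form of Rocchio's update is q' = alpha * q + beta * (mean of the feedback document vectors). The sketch below implements that textbook form over sparse term-weight dictionaries; the values of `alpha` and `beta` and the example vectors are illustrative assumptions, not the settings evaluated in the talk.

```python
from collections import defaultdict

def rocchio(query_vec, feedback_docs, alpha=1.0, beta=0.75):
    """Textbook Rocchio update without a negative-feedback term.

    query_vec and each feedback document are dicts mapping term -> weight.
    """
    new_query = defaultdict(float)
    for term, weight in query_vec.items():
        new_query[term] += alpha * weight
    for doc in feedback_docs:
        for term, weight in doc.items():
            # Each feedback document contributes 1/|feedback_docs| of its weight.
            new_query[term] += beta * weight / len(feedback_docs)
    return dict(new_query)

# Hypothetical query and two pseudo-relevant documents.
query = {"browser": 1.0, "animation": 1.0}
feedback = [
    {"browser": 2.0, "tab": 4.0},
    {"animation": 2.0, "tab": 2.0, "strip": 2.0},
]
reformulated = rocchio(query, feedback, alpha=1.0, beta=0.5)
print(reformulated)
```

Terms such as `tab` that recur across the feedback documents enter the query with a weight comparable to the original terms, which is the behavior Rocchio-style expansion relies on.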
The Proposed Approach to QR: Spatial Code Proximity (SCP)
• Spatial Code Proximity is an elegant approach to giving greater weights to terms in source code that occur in the vicinity of the terms in a user’s query
• Proximities may be created through commonly used concatenations
  • Punctuation characters
  • Camel casing, etc.
• Underscores: tab_strip_gtk
• Camel casing: kPinnedTabAnimationDurationMs
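To make the two concatenation styles concrete, here is a small sketch that splits identifiers such as `tab_strip_gtk` and `kPinnedTabAnimationDurationMs` into their constituent terms. The regular expressions and the function name `split_identifier` are assumptions chosen for illustration; the tokenizer actually used in the study may differ.

```python
import re

def split_identifier(identifier):
    """Split an identifier on punctuation/underscores, then on camel-case boundaries."""
    terms = []
    for part in re.split(r"[^A-Za-z0-9]+", identifier):
        # Acronym runs (e.g. "HTML"), capitalized or lowercase words, and digit runs.
        terms.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", part))
    return [t.lower() for t in terms]

print(split_identifier("tab_strip_gtk"))
print(split_identifier("kPinnedTabAnimationDurationMs"))
```

After splitting, the terms that came from one identifier sit at adjacent positions in the token stream, which is what creates the proximity that SCP exploits.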
Spatial Code Proximity (SCP) (Cont’d)
• Tokenize source files and index the positions of the terms in each source file.
• Use the distance between terms to find relevant terms vis-à-vis a query!
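A positional index of this kind can be sketched in a few lines. The token list below is a hypothetical tokenized file, and `min_distance` is just one simple way to measure how close two terms are; the distance measure used in the study is not shown on the slide.

```python
from collections import defaultdict

def build_positional_index(tokens):
    """Map each term to the list of token positions at which it occurs in one file."""
    index = defaultdict(list)
    for position, term in enumerate(tokens):
        index[term].append(position)
    return dict(index)

def min_distance(index, term_a, term_b):
    """Smallest gap (in token positions) between two terms, or None if either is absent."""
    if term_a not in index or term_b not in index:
        return None
    return min(abs(i - j) for i in index[term_a] for j in index[term_b])

# Hypothetical token stream of a single source file.
tokens = ["k", "pinned", "tab", "animation", "duration", "ms", "browser", "tab"]
index = build_positional_index(tokens)
print(index["tab"])
print(min_distance(index, "animation", "tab"))
```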
SCP: Bringing the Query into the Picture
• Example Query: “Browser Animation”
• First perform an initial retrieval with the original query
• Increase the weights of the terms that occur near the query terms in the retrieved files!
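One way to picture this weighting step: scan each top-retrieved file, and every time a query term occurs, give credit to the non-query terms within a fixed window around it. The window size and the simple co-occurrence count are assumptions made for this sketch and should not be read as the exact SCP weighting.

```python
from collections import Counter

def proximity_weights(query_terms, retrieved_files, window=2):
    """Count how often each non-query term falls within `window` tokens of a query term."""
    query = set(query_terms)
    weights = Counter()
    for tokens in retrieved_files:
        for pos, term in enumerate(tokens):
            if term not in query:
                continue
            lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
            for neighbor in tokens[lo:hi]:
                if neighbor not in query:
                    weights[neighbor] += 1
    return weights

# Hypothetical token streams of the top two files for the query "browser animation".
retrieved = [
    ["k", "pinned", "tab", "animation", "duration", "ms"],
    ["browser", "tab", "strip", "layout", "cache"],
]
weights = proximity_weights(["browser", "animation"], retrieved, window=2)
print(weights.most_common(3))
```

The highest-weighted terms are then natural candidates to append to the original query before the second retrieval.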
Research Questions
• Question 1: Does the proposed QR approach improve the accuracy of source code retrieval? If so, to what extent?
• Question 2: How do the QR techniques that are currently in the literature perform for source code retrieval?
• Question 3: How does the initial retrieval performance affect the performance of QR?
• Question 4: What are the conditions under which QR may perform poorly?
Data Preparation
• For evaluation, we need a set of queries and the relevant files
• We use the titles of the bug reports as queries
• We have to link the repository commits to the bug tracking database!
  • We used regular expressions to detect bug-fix commits based on commit messages
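The regular expressions themselves are not given on the slide. The following sketch shows the general idea with a hypothetical pattern that picks up bug IDs from messages such as "Fix for bug 98765" or "BUG=12345"; both the pattern and the example messages are assumptions, and a real pipeline would need project-specific rules.

```python
import re

# Hypothetical pattern: a "bug"/"fix" keyword followed by an optional separator and an ID.
# It can produce false positives (e.g. "fixed 3 typos"), so it is only a starting point.
BUG_ID = re.compile(r"\b(?:bug|fix(?:es|ed)?)\b\s*(?:for\s+)?[#=:]?\s*(\d+)", re.IGNORECASE)

def linked_bug_ids(commit_message):
    """Return the bug IDs mentioned in a commit message; an empty list means no link found."""
    return [int(m) for m in BUG_ID.findall(commit_message)]

print(linked_bug_ids("Fix for bug 98765: NPE in editor"))
print(linked_bug_ids("BUG=12345"))
print(linked_bug_ids("Refactor tab strip layout"))
```

Each ID found this way can then be looked up in the bug tracking database, and the files touched by the commit become the relevant files for that bug.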
Data Preparation (Cont’d)
• Resulting dataset: BUGLinks¹

                        Eclipse v3.1   Chrome v4.0
#Bugs                   4,035          358
Avg. #Relevant Files    2.76           3.82
Avg. #Commits           1.36           1.23

¹ https://engineering.purdue.edu/RVL/Database/BUGLinks/
Evaluation Framework
• We use precision- and recall-based metrics to evaluate the retrieval accuracy.
• Determine the query sets for which the proposed QR approaches lead to
  1. improvements in the retrieval accuracy
  2. degradation in the retrieval accuracy
  3. no change in the retrieval accuracy
• Analyze these sets to understand the characteristics of the queries each set contains
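As a concrete reading of this procedure, the sketch below computes Average Precision for one query, averages it into MAP, and assigns each query to one of the three sets by comparing its score before and after reformulation. The file names and the tolerance `eps` are hypothetical values chosen for the example.

```python
def average_precision(ranked_files, relevant_files):
    """Mean of precision@rank taken at every rank where a relevant file appears."""
    relevant = set(relevant_files)
    hits, precision_sum = 0, 0.0
    for rank, name in enumerate(ranked_files, start=1):
        if name in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_files, relevant_files) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

def compare(ap_before, ap_after, eps=1e-12):
    """Place a query into one of the three sets used in the evaluation."""
    if ap_after > ap_before + eps:
        return "improved"
    if ap_after < ap_before - eps:
        return "degraded"
    return "unchanged"

# Hypothetical rankings for one bug whose fix touched a.cc and c.cc.
before = average_precision(["a.cc", "b.cc", "c.cc", "d.cc"], ["a.cc", "c.cc"])
after = average_precision(["a.cc", "c.cc", "b.cc", "d.cc"], ["a.cc", "c.cc"])
print(before, after, compare(before, after))
```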
Evaluation Framework (Cont’d)
• For comparison of these sets, we used the following Query Performance Prediction (QPP) metrics [Haiduc et al. 2012, He and Ounis 2004]:
  • Average Inverse Document Frequency (avgIDF)
  • Average Inverse Collection Term Frequency (avgICTF)
  • Query Scope (QS)
  • Simplified Clarity Score (SCS)
• Additionally, we analyzed
  • Query lengths
  • Number of relevant files per bug
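For orientation, here is a sketch of the four QPP metrics in commonly cited pre-retrieval forms. Log bases and smoothing vary across the literature, and the slide does not say which variants were used, so the exact formulas below are assumptions; the function also assumes that every query term occurs somewhere in the corpus.

```python
import math
from collections import Counter

def qpp_metrics(query_terms, corpus):
    """corpus: dict of file name -> token list. Assumes each query term occurs in the corpus."""
    n_docs = len(corpus)
    coll_tf = Counter(t for tokens in corpus.values() for t in tokens)        # collection term counts
    coll_len = sum(coll_tf.values())                                          # total tokens in collection
    doc_freq = Counter(t for tokens in corpus.values() for t in set(tokens))  # document frequencies
    q_tf = Counter(query_terms)
    unique = list(q_tf)
    q_len = len(query_terms)

    avg_idf = sum(math.log2(n_docs / doc_freq[t]) for t in unique) / len(unique)
    avg_ictf = sum(math.log2(coll_len / coll_tf[t]) for t in unique) / len(unique)
    # Query scope: negative log of the fraction of files containing at least one query term.
    n_matching = sum(1 for tokens in corpus.values() if set(tokens) & set(unique))
    query_scope = -math.log(n_matching / n_docs)
    # Simplified clarity: divergence between the query model and the collection model.
    scs = sum((q_tf[t] / q_len) * math.log2((q_tf[t] / q_len) / (coll_tf[t] / coll_len))
              for t in unique)
    return {"avgIDF": avg_idf, "avgICTF": avg_ictf, "QS": query_scope, "SCS": scs}

# Hypothetical four-file corpus, already tokenized.
corpus = {
    "a.cc": ["browser", "tab", "tab", "strip"],
    "b.cc": ["browser", "animation", "view", "layout"],
    "c.cc": ["cache", "http", "socket", "cache"],
    "d.cc": ["tab", "layout", "view", "cache"],
}
metrics = qpp_metrics(["browser", "animation"], corpus)
print(metrics)
```

Higher avgIDF, avgICTF, and SCS values indicate a more specific query, while a low QS value means the query terms match a large share of the files.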
QR with Bug Report Titles
[Figure: bar chart of #Bugs (axis 0 to 2000) for ROCC, RM, and SCP (Proposed)]
Improvements in Retrieval Accuracy (% Increase in MAP)
[Figure: bar chart of % increase in MAP (axis 0% to 100%) on Eclipse and Chrome for ROCC, RM, and SCP (Proposed)]
Conclusions & Future Work
• Our framework can use a weak initial query as a jumping-off point for a better query.
• No user input is necessary
• We obtained significant improvements over the baseline and the well-known Automatic QR methods.
• Future Work includes evaluation of different term proximity metrics in source code for QR
References
[1] B. Sisman and A. Kak, “Incorporating version histories in information retrieval based bug localization,” in Proceedings of the 9th Working Conference on Mining Software Repositories (MSR’12). IEEE, 2012, pp. 50–59.
[2] G. Gay, S. Haiduc, A. Marcus, and T. Menzies, “On the use of relevance feedback in IR-based concept location,” in International Conference on Software Maintenance (ICSM’09), Sept. 2009, pp. 351–360.
[3] A. Marcus, A. Sergeyev, V. Rajlich, and J. I. Maletic, “An information retrieval approach to concept location in source code,” in Proceedings of the 11th Working Conference on Reverse Engineering (WCRE’04). IEEE Computer Society, 2004, pp. 214–223.
[4] S. Haiduc, G. Bavota, R. Oliveto, A. De Lucia, and A. Marcus, “Automatic query performance assessment during the retrieval of software artifacts,” in Proceedings of the 27th International Conference on Automated Software Engineering (ASE’12). ACM, 2012, pp. 90–99.
[5] B. He and I. Ounis, “Inferring query performance using pre-retrieval predictors,” in Proc. Symposium on String Processing and Information Retrieval. Springer Verlag, 2004, pp. 43–54.

How to Troubleshoot Apps for the Modern Connected Worker
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
WhatsApp 9892124323 âś“Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 âś“Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 âś“Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 âś“Call Girls In Kalyan ( Mumbai ) secure service
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 

Assisting Code Search with Automatic Query Reformulation for Bug Localization

  • 1. Assisting Code Search with Automatic Query Reformulation for Bug Localization. Bunyamin Sisman & Avinash Kak, bsisman@purdue.edu | kak@purdue.edu, Purdue University, MSR’13
  • 2. Outline: I. Motivation; II. Past Work on Automatic Query Reformulation; III. Proposed Approach: Spatial Code Proximity (SCP); IV. Data Preparation; V. Results; VI. Conclusions
  • 3. Main Motivation: It is extremely common for software developers to use arbitrary abbreviations & concatenations in software. These are generally difficult to predict when searching the code base of a project. The question is “Is there some way to automatically reformulate a user’s query so that all such relevant terms are also used in retrieval?”
  • 4. Summary of Our Contribution
      • We show how a query can be automatically reformulated for superior retrieval accuracy
      • We propose a new framework for Query Reformulation, which leverages the spatial proximity of the terms in files
      • The approach leads to significant improvements over the baseline and the competing Query Reformulation approaches
  • 5. Summary of Our Contribution
      • Our approach preserves or improves the retrieval accuracy for 76% of the 4,393 bugs we analyzed for Eclipse and Chrome projects
      • Our approach improves the retrieval accuracy for 42% of the 4,393 bugs
      • Improvements are 66% for Eclipse and 90% for Chrome in terms of MAP (Mean Average Precision)
      • We also describe the conditions under which Query Reformulation may perform poorly.
  • 6. Query Reformulation with Relevance Feedback
      1. Perform an initial retrieval with the original query
      2. Analyze the set of top retrieved documents vis-à-vis the query
      3. Reformulate the query
  • 7. Acquiring Relevance Feedback
      • Implicitly: infer feedback from user interactions
      • Explicitly: user provides feedback [Gay et al. 2009]
      • Pseudo Relevance Feedback (PRF): Automatic QR. This is our work!
  • 8. Data Flow in the Proposed Retrieval Framework
  • 9. Automatic Query Reformulation
      • No user involvement!
      • It takes less than a second to reformulate a query on ordinary desktop hardware!
      • It is cheap!
      • It is effective!
  • 10. Previous Work on Automatic QR (for Text Retrieval)
      • Rocchio’s Formula (ROCC)
      • Relevance Model (RM)
  • 11. The Proposed Approach to QR: Spatial Code Proximity (SCP)
      • Spatial Code Proximity is an elegant approach to giving greater weights to terms in source code that occur in the vicinity of the terms in a user’s query
      • Proximities may be created through commonly used concatenations: punctuation characters, camel casing, etc.
      • Underscores: tab_strip_gtk
      • Camel casing: kPinnedTabAnimationDurationMs
  • 12. Spatial Code Proximity (SCP) (Cont’d)
      • Tokenize source files and index the positions of the terms in each source file.
      • Use the distance between terms to find relevant terms vis-à-vis a query!
  • 13. SCP: Bringing the Query into the Picture
      • Example Query: “Browser Animation”
      • First perform an initial retrieval with the original query
      • Increase the weights of the terms that occur near the query terms!
  • 14. Research Questions
      • Question 1: Does the proposed QR approach improve the accuracy of source code retrieval? If so, to what extent?
      • Question 2: How do the QR techniques that are currently in the literature perform for source code retrieval?
      • Question 3: How does the initial retrieval performance affect the performance of QR?
      • Question 4: What are the conditions under which QR may perform poorly?
  • 15. Data Preparation
      • For evaluation, we need a set of queries and the relevant files
      • We use the titles of the bug reports as queries
      • We have to link the repository commits to the bug tracking database!
      • We used regular expressions to detect bug-fix commits based on commit messages
  • 16. Data Preparation (Cont’d)
      • Resulting dataset: BUGLinks (https://engineering.purdue.edu/RVL/Database/BUGLinks/)

                                 Eclipse v3.1   Chrome v4.0
        # Bugs                   4,035          358
        Avg. # Relevant Files    2.76           3.82
        Avg. # Commits           1.36           1.23

  • 17. Evaluation Framework
      • We use Precision and Recall based metrics to evaluate the retrieval accuracy.
      • Determine the query sets for which the proposed QR approaches lead to
          1. improvements in the retrieval accuracy
          2. degradation in the retrieval accuracy
          3. no change in the retrieval accuracy
      • Analyze these sets to understand the characteristics of the queries each set contains
  • 18. Evaluation Framework (Cont’d)
      • For comparison of these sets, we used the following Query Performance Prediction (QPP) metrics [Haiduc et al. 2012, He et al. 2004]:
          • Average Inverse Document Frequency (avgIDF)
          • Average Inverse Collection Term Frequency (avgICTF)
          • Query Scope (QS)
          • Simplified Clarity Score (SCS)
      • Additionally, we analyzed
          • Query lengths
          • Number of relevant files per bug
  • 19. QR with Bug Report Titles [Bar chart: number of bugs (#Bugs, axis from 0 to 2000) for ROCC, RM, and SCP (Proposed)]
  • 20. Improvements in Retrieval Accuracy (% Increase in MAP) [Bar chart: % increase in MAP (axis from 0% to 100%) on Eclipse and Chrome for ROCC, RM, and SCP (Proposed)]
  • 21. Conclusions & Future Work
      • Our framework can use a weak initial query as a jumping-off point for a better query.
      • No user input is necessary
      • We obtained significant improvements over the baseline and the well-known Automatic QR methods.
      • Future Work includes evaluation of different term proximity metrics in source code for QR
  • 22. References
      • [1] B. Sisman and A. Kak, “Incorporating version histories in information retrieval based bug localization,” in Proceedings of the 9th Working Conference on Mining Software Repositories (MSR’12). IEEE, 2012, pp. 50–59.
      • [2] G. Gay, S. Haiduc, A. Marcus, and T. Menzies, “On the use of relevance feedback in IR-based concept location,” in International Conference on Software Maintenance (ICSM’09), Sept. 2009, pp. 351–360.
      • [3] A. Marcus, A. Sergeyev, V. Rajlich, and J. I. Maletic, “An information retrieval approach to concept location in source code,” in Proceedings of the 11th Working Conference on Reverse Engineering (WCRE’04). IEEE Computer Society, 2004, pp. 214–223.
  • 23. References
      • [4] S. Haiduc, G. Bavota, R. Oliveto, A. De Lucia, and A. Marcus, “Automatic query performance assessment during the retrieval of software artifacts,” in Proceedings of the 27th International Conference on Automated Software Engineering (ASE’12). ACM, 2012, pp. 90–99.
      • [5] B. He and I. Ounis, “Inferring query performance using pre-retrieval predictors,” in Proc. Symposium on String Processing and Information Retrieval. Springer Verlag, 2004, pp. 43–54.
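
The tokenization step on slides 11 and 12 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `split_identifier` and `index_positions`, the regular expressions, and the choice to lowercase terms are all assumptions made for illustration. It splits identifiers on underscores and camel-case boundaries and records the position of each resulting term in a file.

```python
import re
from collections import defaultdict

def split_identifier(identifier):
    """Split an identifier on underscores and camel-case boundaries (hypothetical helper)."""
    terms = []
    for part in identifier.split("_"):
        # Uppercase runs, capitalized or lowercase words, and digit runs become separate terms.
        terms.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [t.lower() for t in terms]

def index_positions(source_text):
    """Map each term to the list of positions at which it occurs in one file."""
    positions = defaultdict(list)
    pos = 0
    for identifier in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source_text):
        for term in split_identifier(identifier):
            positions[term].append(pos)
            pos += 1
    return dict(positions)

print(split_identifier("tab_strip_gtk"))
print(split_identifier("kPinnedTabAnimationDurationMs"))
print(index_positions("tab_strip_gtk kPinnedTab"))
```

Because the terms of one identifier receive consecutive positions, terms that a developer concatenated end up close together in the index, which is the property a proximity-based reformulation relies on.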

Editor's Notes

  1. Despite the naming conventions in programming languages, the textual content of software is built from arbitrary abbreviations and concatenations.
  2. Among the different approaches to QR, Relevance Feedback has been shown to be an effective method. It has three basic steps.
  3. It has been shown that Relevance Feedback is an effective method for Query Reformulation.
  4. First, we index the source code. When a query arrives, we perform an initial first-pass retrieval to get the highest-ranked documents. The QR module then takes the initial query and the first-pass retrieval results and reformulates the query. Finally, we perform a second retrieval with the reformulated query to obtain the final results.
  5. There are several previous studies on QR, the most important of which are …
  6. The fact that the developer concatenated those terms to declare a variable or class name indicates that those terms are conceptually related. So, for a given term, nearby terms are more likely to be relevant.
  7. The query plays no role at this point.
  8. I am showing you an indexed file. Let's consider a query that consists of two terms. We will use a window of size 3 to capture the spatial proximity effects in this file vis-à-vis this particular query. We have the query term "browser" in the first line, so using the window we increase the weights of the nearby terms!
  9. While all three methods degrade about the same number of queries, SCP improves the largest number of queries. For more than 200 queries, SCP does better than the next best model, RM. In the paper, you will also find an analysis of the queries for which QR does not lead to improvements.
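
The windowing idea in note 8 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the weighting used in the paper: it counts how often a term falls within `window` positions on either side of a query term, and the names `proximity_weights` and `expand_query`, the count-based weight, the `top_k` cutoff, and the token list are all hypothetical.

```python
from collections import Counter

def proximity_weights(tokens, query_terms, window=3):
    """Count how often each non-query term appears within `window` positions of a query term."""
    query = {q.lower() for q in query_terms}
    weights = Counter()
    for i, tok in enumerate(tokens):
        if tok not in query:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] not in query:
                weights[tokens[j]] += 1  # assumed weight: a simple co-occurrence count
    return weights

def expand_query(query_terms, tokens, window=3, top_k=2):
    """Append the top_k highest-weighted nearby terms to the original query."""
    weights = proximity_weights(tokens, query_terms, window)
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return list(query_terms) + [term for term, _ in ranked[:top_k]]

# Hypothetical token sequence for one top-ranked file from the first-pass retrieval.
tokens = ["k", "pinned", "tab", "animation", "duration", "ms",
          "browser", "window", "tab", "strip", "gtk"]
print(expand_query(["browser", "animation"], tokens))
```

In a full pipeline, the weights would be accumulated over the top-ranked files of the first-pass retrieval before the reformulated query is used for the second retrieval.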