This document presents a systematic literature review of automated query reformulation for source code search. It addresses seven research questions covering the methods, algorithms, data sources, evaluation metrics, challenges, publication trends, and the differences between query reformulation for local and Internet-scale code search. The review analyzed 56 primary studies identified through a multi-database search and filtering process. Key findings include the predominant use of term weighting, query expansion, and query reduction techniques; evaluations based on standard information retrieval metrics; and persistent challenges, such as vocabulary mismatch, that remain unsolved. Opportunities for future work are also identified, such as leveraging bug reports for keyword selection and using semantic representations to address vocabulary issues.
1. A SYSTEMATIC LITERATURE REVIEW OF AUTOMATED QUERY REFORMULATIONS IN SOURCE CODE SEARCH
Masud Rahman
Department of Computer Science
University of Saskatchewan, Canada
Advisor: Dr. Chanchal K. Roy
@masud233
2. MASUD RAHMAN: ACADEMICS
2019: PhD (in progress), University of Saskatchewan (Award: Dr. Keith Geddes Award)
2014: MSc, University of Saskatchewan (Award: Best MSc Thesis Nomination)
2009: BSc, Khulna University, Bangladesh (Award: President Gold Medal)
5. A SYSTEMATIC LITERATURE REVIEW OF AUTOMATED QUERY REFORMULATIONS IN SOURCE CODE SEARCH
BACKGROUND CONCEPTS: (1) Source Code Search, (2) Automated Query Reformulation
6. MCAS: A SOFTWARE BUG THAT KILLS
Boeing 737 MAX 8
7. A TALE OF SOURCE CODE SEARCH
[Diagram: a Boeing customer submits an MCAS bug report; a Boeing developer runs a code search over the Boeing codebase, aided by query suggestion and query reformulation.]
8. QUERY REFORMULATION: 2 WORKING CONTEXTS
Local code search (e.g., bug localization in a local codebase such as Boeing's) vs. Internet-scale code search (e.g., GitHub).
14. SYSTEMATIC LITERATURE REVIEW: 6 STEPS
Research questions (7 RQs) → Search keywords → Literature search → Literature bulk → Noise filtration → Primary studies → In-depth investigation
15. SYSTEMATIC LITERATURE REVIEW: PRIMARY STUDY SELECTION
Databases searched: ACM DL, CrossRef, DBLP, Mendeley, Google Scholar, IEEE Xplore, ProQuest, ScienceDirect, SpringerLink, Web of Science, Wiley Online Library.
Selection pipeline: 2,871 initial results → 2,317 after impurity removal → 562 after filtering by title → 195 after filtering by abstract → 93 after filtering by full texts → 56 primary studies after merging and duplicate removal.
Search keywords, group 1 (code search): information retrieval, IR, text retrieval, TR, bug localization, concept location, feature location, FLT, concern location, Internet-scale code search, code search engine, search engine, local code search, code search, source code search, and code search query.
Search keywords, group 2 (query reformulation): query reformulation, query expansion, query reduction, query formulation, query refinement, automated query expansion, AQE, query suggestion, query recommendation, term selection, query replacement, query difficulty, query quality, keyword selection, keyword extraction, search term identification, search query, search term, and search keyword.
The two groups were combined (the '+') to form the final search string.
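As a rough illustration of how such a combined boolean search string could be assembled for a digital-library search, here is a minimal Python sketch. The OR-within-group, AND-between-group structure and the phrase quoting are my assumptions, and both keyword lists are abridged:

```python
# Abridged keyword groups from the slide above (hypothetical OR-grouping).
code_search_terms = ["information retrieval", "bug localization", "concept location",
                     "feature location", "source code search"]
reformulation_terms = ["query reformulation", "query expansion", "query reduction",
                       "keyword selection", "search term identification"]

def boolean_query(group_a, group_b):
    """Conjunction of two OR-groups, with phrases quoted for exact matching."""
    def clause(terms):
        return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"
    return clause(group_a) + " AND " + clause(group_b)

print(boolean_query(code_search_terms, reformulation_terms))
```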
16. OUR RESEARCH QUESTIONS
RQ1: Which methods, algorithms, and data sources have been used for automated query reformulations targeting code search in the literature?
RQ2: Which methods, metrics, or subject systems have been used to evaluate and validate the research on automated query reformulations?
RQ3: What are the major challenges of automated query reformulations intended for code search? How many of them have been solved to date by the literature?
RQ4: How much research activity on automated query reformulations has been performed to date? At which venues has this research been published?
RQ5: What are the differences and similarities between query reformulations for local code search and query reformulations for Internet-scale code search?
RQ6: Which is the most appropriate among term weighting, query-term co-occurrence, and thesaurus-based approaches for query keyword selection?
RQ7: What are the scopes for future work in the area of automated query reformulation targeting code search?
17. RQ1: WHICH METHODS & ALGORITHMS ARE USED BY THE LITERATURE?
Analysis method: Grounded Theory (open coding, axial coding, selective coding).
18. RQ1: WHICH ALGORITHMS & REFORMULATION TYPES ARE USED BY THE LITERATURE?
19. RQ2: WHICH EVALUATION & VALIDATION SETTINGS ARE EMPLOYED?
20. RQ3: WHAT ARE COMMON CHALLENGES & LIMITATIONS OF EXISTING LITERATURE?
Analysis method: Grounded Theory (open coding, axial coding, selective coding).
21. RQ3: WHAT ARE COMMON CHALLENGES & LIMITATIONS OF EXISTING LITERATURE? (continued)
22. RQ4: PUBLICATION STATS & INTERESTS ON QUERY REFORMULATION RESEARCH
23. RQ5: COMPARISON BETWEEN LOCAL CODE SEARCH & INTERNET-SCALE CODE SEARCH
Legend: TW = Term Weighting, TQC = Term-Query Co-occurrence, TS = Thesaurus, ON = Ontology, SLM = Search Log Mining, ML = Machine Learning, HM = Heuristics & Miscellaneous.
24. RQ5: COMPARISON BETWEEN LOCAL CODE SEARCH & INTERNET-SCALE CODE SEARCH (continued)
Legend: CH1 = Vocabulary Mismatch Unsolved, CH2 = Extra Burden on Developers, CH3 = Lack of Generalizability, CH4 = Lack of Practical Use, CH5 = Inappropriate Use of Tools, CH6 = Human Bias + Weak Evaluation, CH7 = Unverified Assumptions.
25. RQ6: CHALLENGES WITH THREE KEYWORD SELECTION METHODS
Method                      #Studies    CH1    CH2    CH3    CH6
Term Weighting              22 (39%)    36%    18%    91%    50%
Term-Query Co-occurrence    11 (20%)     9%    27%    64%    91%
Thesaurus                   17 (30%)    12%    12%    47%    41%
Legend: CH1 = Vocabulary Mismatch Unsolved, CH2 = Extra Burden on Developers, CH3 = Lack of Generalizability, CH6 = Human Bias + Weak Evaluation.
27. 27
MasudRahman,,UofS
R1: KEYWORD SELECTION FROM BUG REPORT
[Bug report screenshot: Title and Description fields]
ID | Query | QE
1 | Custom search results view iresource | 1331
2 | Custom search results search results view | 636
3 | element iresource provider level tree | 01
4 | Custom search results hierarchically java search results | 570
Lower QE is better.
28. R2: TERM WEIGHTING FOR SOURCE CODE
$TF\text{-}IDF(t) = (1 + \log(tf_{t,d})) \times \log\frac{D}{df_t}$
where $tf_{t,d}$ is the frequency of term $t$ in document $d$, $D$ is the total number of documents, and $df_t$ is the number of documents containing $t$.
• Different syntax
• Different semantics
• Different structures
Hello everyone! Good afternoon!
Thank you all for coming and attending this talk.
My name is Masud Rahman. I am a PhD student in the Department of Computer Science.
I work with Dr. Chanchal K. Roy.
Today, I will be talking about automated query reformulations for code search.
A little bit of background about me:
Currently, I am a PhD student at USASK.
I completed my MSc in Software Engineering from the same university in 2014.
Before that, I completed my BSc in Computer Science & Engineering from Khulna University, back in 2009.
I have received a couple of awards along the way.
Today, my talk will be divided into four sections.
In the first section, I will provide a background overview on automated query reformulations & code search in general.
In the second section, I will present a systematic literature review of automated query reformulations.
In the third section, I will discuss about the future research opportunities in this domain.
Finally, we will have a Q&A session.
Part 1: Background concepts.
If we look at my talk’s title, we can see two major concepts.
Source code search
Automated query reformulations.
Now we will go into details. But first, let's look at two recent events.
You are looking at two aircraft: one from Ethiopian Airlines and one from Lion Air (Indonesia).
Both went into nose-down situations, and due to these nose-down situations, we had two fatal crashes in a single calendar year.
These crashes took 346 precious human lives and cost billions of dollars.
Now, the culprit is MCAS. This is a software component that was added to the Boeing 737 MAX 8.
In short, this is a faulty, poorly designed component, and it ultimately led to the crashes.
That is why Boeing 737 MAX planes are grounded right now.
Now, let's say a Boeing customer has submitted a bug report.
Now, a Boeing developer is responsible for locating and repairing the faulty code triggering that bug.
As a frequent practice, the developer chooses a few important keywords and attempts to locate the buggy code within the Boeing codebase.
But a study shows that 88% of the keywords chosen by developers are incorrect; that is, they do not return the buggy code.
So, the obvious next step is to reformulate the query with automated tool support so that the buggy code can be located.
There are also tools that take a bug report and suggest appropriate search queries in the first place.
Now, the developer not only searches the Boeing codebase; she might also search an Internet-scale codebase such as GitHub.
So as discussed, the code search could be done in two working contexts.
It could be in a local codebase such as Boeing.
It could also be in the large-scale open source repository such as GitHub.
Now, based on these contexts, there are different challenges in query reformulation.
The local codebase is small, domain specific and organized.
On the contrary, GitHub is huge, cross-domain and very noisy.
So, yes, they need different strategies to suggest queries for them.
We can reformulate a search query in three ways, as the sketch after this list illustrates.
-- It could be query expansion by adding new keywords.
-- It could be query reduction by discarding the noisy keywords.
-- Or it could be total query replacement by using a new set of keywords.
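To make these three operations concrete, here is a minimal Python sketch; the keyword lists and the noisy-term set are made up purely for illustration:

```python
# Illustrative sketch of the three reformulation types.
# All keyword lists here are hypothetical examples.

original_query = ["custom", "search", "results", "view"]

# Query expansion: add new keywords to the original ones.
expanded = original_query + ["iresource", "provider"]

# Query reduction: discard keywords judged to be noisy.
noisy = {"custom", "view"}
reduced = [term for term in original_query if term not in noisy]

# Query replacement: use an entirely new set of keywords.
replaced = ["element", "iresource", "provider", "tree"]

print(expanded, reduced, replaced, sep="\n")
```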
Now, there are many steps in query reformulation.
But three major steps are common; a sketch follows these steps.
First, you need to collect feedback on the given query: the query is executed, and the top-K results are collected for developer inspection.
The developer marks each of them as relevant or irrelevant. This is Step I.
In the second step, these annotated results are mined using various text mining tools, and candidate keywords are selected using various keyword selection methods.
In the third step, the most important keywords are returned to the developer for query expansion.
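The following is a minimal sketch of this three-step loop. The `search_engine` interface and all names are hypothetical, not taken from any surveyed tool; here the top-K results are simply treated as relevant (pseudo-relevance feedback), whereas in a real tool the relevance marks would come from the developer as in Step I:

```python
from collections import Counter

def reformulate(query, search_engine, top_k=10, n_terms=3):
    """Sketch of the three-step feedback loop for query expansion.

    `search_engine.search(query)` is a hypothetical method assumed
    to return ranked documents, each represented as a list of terms.
    """
    # Step I: execute the query and collect the top-K results as feedback.
    feedback = search_engine.search(query)[:top_k]

    # Step II: mine the feedback documents and gather candidate keywords.
    candidates = Counter(term for doc in feedback for term in doc
                         if term not in query)

    # Step III: return the most important keywords for query expansion.
    best = [term for term, _ in candidates.most_common(n_terms)]
    return query + best
```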
Now, there are automated query reformulations and semi-automated query reformulations.
Positives:
It can improve code search performance by up to 20%, which is significant.
It helps to redefine the information needs. Developers are often not sure which keywords to choose; automated suggestions can help them.
It also reduces the cost and effort of code search, i.e., in software maintenance tasks such as bug fixing.
Negatives:
Automated reformulation might degrade already good queries. If you already have a good query, reformulation should stop.
There is also a chance of topic drift. That is, through reformulations, the original topic might be lost.
Now, we are done with Background concepts, Part 1.
Now, we are going into Part 2 -- Systematic Literature Review.
Systematic literature review starts with several research questions.
We ask 7 research questions in our survey about automated query reformulations.
These questions are broken down into search keywords.
Then we use these keywords, and retrieve a bulk of literature from various publication databases.
Then we perform several steps for noise filtration, and select a set of primary studies on automated query reformulations.
Then we do an in-depth investigation of these primary studies.
Now, let's take a closer look at this section.
We choose 11 publication databases and collect about 3,000 results from these databases on automated query reformulations.
Then we remove the impurities from the results. Sometimes, keyword matching can produce unexpected results.
For example, these results contain studies from database management systems, multimedia retrieval or image retrieval.
Since we are looking for query reformulation for code search, we only keep results for code search, and discard the rest.
This step leaves about 2,300 results. That is still huge.
Then we filter the results by title and abstract. That is, we look at the title and abstract, and determine whether they are related to code search and query reformulations or not.
These steps leave us with 195 results.
Then we do the merging and duplicate removal. Still, the topics of a few results were not clear to us.
We thus read their full texts, especially the Introduction part.
Finally, after all these filtration steps, we arrive at a collection of 56 studies.
We call these the primary studies on automated query reformulations.
We answer 7 research questions in our systematic survey.
We answer three general questions about methodology, evaluation and challenges/limitations from the existing literature.
We answer one statistical question.
Then we answer three specialized questions including future research opportunities.
In the first research question, we identify which underlying algorithms and methodologies are used by the existing literature.
In order to do that, we use the Grounded Theory approach, a well-known method for qualitative research.
How do we do that?
Well, we read the Introduction and Methodology sections of each of the 56 primary studies and identify the algorithms and techniques used.
Since we are trying to develop a theory about the existing literature, we apply the three types of coding in Grounded Theory.
--Open coding: In this stage, we describe each study with a list of appropriate key phrases. The idea is to keep an open mind and use as many key phrases as possible.
-- Axial coding: In this stage, we try to make connections among different key phrases, and color code similar phrases. We basically look for topical similarity.
-- Selective coding: In this stage, we develop the underlying variables for a dependent variable.
Thus, we develop a mental model about existing literature based on our qualitative analysis of the primary studies.
Then we do various quantitative analysis using this theory.
For example, we discover that seven major methodologies and algorithms are employed in the query reformulations.
About 40% of the studies use term weighting approaches such as TF-IDF.
About 30% of the studies use thesauri such as WordNet for query expansion with synonyms.
Besides, 50% of the studies employ various advanced heuristics and ad hoc methods for query reformulation.
We also identify that 70% of the studies do query expansion, which is the most common reformulation type.
Another 15%-25% of the studies do query reduction or replacement.
The majority of the approaches do not collect any feedback during query reformulation.
We also discover that 40% of the literature on query reformulation targets Internet-scale code search.
The remaining studies target various code searches within a local codebase for bug localization, concept location and feature location.
In the second research question, we investigate which evaluation and validation approaches were used by the existing literature on query reformulations.
We found that 50% of the studies used more than two performance metrics.
50% of the studies used at least two subject systems for their experiments.
About 38% of the studies involve developers in their experiments, and 50% of those use fewer than 16 developers.
Most of the studies used some means to validate their work; 50% of the studies compare with at least two existing works.
In terms of search queries, we found that 50% of the studies used at least 74 queries for their evaluation and validation.
Now, we see that the subject systems and validation targets are often not sufficient, which frequently leads to generalizability issues.
In the third research question, we identify the threats and limitations of the existing literature.
To do that, we consult the Methodology and Threats to Validity sections of each of the primary studies.
In particular, we check the threats or issues reported by the authors and identify several further issues through inference.
As in RQ1, we apply the Grounded Theory approach and identify the common challenges, issues and limitations of the existing literature.
We found seven major challenges/limitations with the existing literature. Now, the details are in the report. We are just providing the summary here.
We see that 80% of the studies suffer from one or more generalizability issues.
That is, they use subject systems from only a single programming platform; for example, the findings from Java-based systems might not always generalize to C-based systems.
Or the number of queries or developers involved is insufficient, or the validation is not thorough enough.
We also see that 50% of the studies are affected by human bias and suffer from weak evaluation.
We also found that the vocabulary mismatch problem remains unsolved. This is a long-standing problem in any type of document search, and all query reformulation approaches attempt to address it; still, we found that 30% of the studies do a poor job of it.
We also see that 30% of the studies impose extra cognitive burdens on developers during query reformulation and code search.
In the fourth research question, we find out the statistics on the research activities conducted on automated query reformulation over the last 15 years.
We see that the first work on query reformulation targeting concept location was published in 2004.
Then there was moderate activity. However, over the last 5-6 years, we have seen significant activity from the community.
Especially since 2013, we see major interest in this domain.
In terms of venues, we see that ASE and ICSE, which are A/A* conferences, are the pioneers. So, yes, this is top-quality research.
In fact, 70% of the primary studies were published in the last five years, which shows how promising this domain is.
These are some of the top authors. According to our investigation, about 150 researchers have worked or are working in this domain.
So, this is a well-established and promising area for research.
Besides these analyses, we did a more in-depth comparison between local and Internet-scale code search in RQ5.
We see that local code search relies on term weighting for keyword selection, since bug reports are available there.
On the contrary, people use thesauri for query expansion in code search on the web, where no bug reports are available.
More details can be found in the comprehensive paper.
We also see that queries in Internet-scale code search suffer more from vocabulary mismatch issues.
In this case, developers do not have materials to help them formulate queries.
So, they generally guess some keywords that describe their information needs, which are often not sufficient.
This leads to the vocabulary mismatch issue.
In the sixth research question, we see that term weighting has some connection with the vocabulary mismatch problem.
Inappropriate term weighting can choose inappropriate keywords for query reformulation.
This leads to noise in the query and degraded performance.
On the contrary, thesaurus-based and term-query co-occurrence approaches attempt to deliver synonyms or similar words.
They create comparatively fewer vocabulary mismatch issues.
OK! Now we are done with the literature survey.
Now, we will focus on the third part, the future research opportunities.
Let us see an example.
This is a bug report: this is the title, and this is the description.
Now, developer JOE would use this bug report to localize the bug in the source code.
Now he chooses some ad hoc queries.
Which one do you think is the best here? PAUSE!
Well, let's see. This one returns the correct result at this position. That means the developer needs to check 1,300+ results before reaching the correct result if he tries this query.
… oh… this one is the best.
So, selecting appropriate keywords from the bug report is not that simple.
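Here QE stands for Query Effectiveness, the rank at which the first buggy file appears in the result list; this reading follows the "Lower QE is better" note on the slide. A minimal sketch, with made-up file names:

```python
def query_effectiveness(ranked_results, buggy_files):
    """Return the 1-based rank of the first buggy file in the
    ranked result list, or None if it is never retrieved.
    Lower is better: QE = 1 means the very first hit is correct."""
    for rank, result in enumerate(ranked_results, start=1):
        if result in buggy_files:
            return rank
    return None

# Illustrative call with hypothetical file names:
print(query_effectiveness(["Viewer.java", "SearchResult.java"],
                          {"SearchResult.java"}))  # prints 2
```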
Now, this is a metric that has been in play since the last century; it was proposed in the 1970s.
It is a good metric, but it was actually proposed for regular texts such as news articles.
On the other hand, we are dealing with source code here.
Now, regular texts and source code have different semantics and different structures.
They are not the same.
So, our hypothesis is that metrics designed for regular text are not appropriate for source code.
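For reference, the metric in question is the classic TF-IDF weighting shown on the earlier slide; the sketch below is a straightforward implementation, with variable names of our own choosing:

```python
import math

def tf_idf(term, document, corpus):
    """TF-IDF as on the slide: (1 + log(tf)) * log(D / df).
    `document` is a list of terms and is assumed to be one of the
    documents in `corpus`, so df >= 1 whenever tf >= 1."""
    tf = document.count(term)
    if tf == 0:
        return 0.0
    df = sum(1 for doc in corpus if term in doc)
    return (1 + math.log(tf)) * math.log(len(corpus) / df)
```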
That is, each of the three people, the customer, the past developer, and JOE, has their own vocabulary for describing a certain problem or concept.
In fact, the probability that any two people describe the same problem with the same vocabulary is only 15%-20%.
So, naturally, developer JOE finds it a great challenge to make a connection between the bug report and the buggy code.
This costs development time, money and valuable efforts.
Here we see that burger is close to sandwich. Why? Is it because they are eaten together? I do that all the time.
Well, that is not the case.
They are mentioned in similar contexts by people across the whole corpus.
The model recognizes such co-occurrences and thus puts burger and sandwich close together.
Similarly, dumpling and ramen are close to each other.
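This behaviour can be reproduced with an off-the-shelf word-embedding library; the toy corpus below is entirely made up, and a real model would need a large corpus such as Stack Overflow posts:

```python
# Toy demonstration: words used in similar contexts end up close
# together in the embedding space. The corpus is purely illustrative.
from gensim.models import Word2Vec

corpus = [
    ["i", "ate", "a", "burger", "for", "lunch"],
    ["i", "ate", "a", "sandwich", "for", "lunch"],
    ["we", "had", "dumpling", "and", "soup", "for", "dinner"],
    ["we", "had", "ramen", "and", "soup", "for", "dinner"],
]
model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, epochs=100)
print(model.wv.most_similar("burger", topn=3))
```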
Now, we propose this. This is the original query, and this is the reformulated query.
Now, a good reformulated query will cluster together with the original query.
A bad reformulated query will NOT be able to cluster with the original query.
So, clustering tendency within the hyperspace is our weapon here.
We calculate the Hopkins statistic and the polygon area to measure the clustering tendency.
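The talk does not spell out the exact formulation used, so the sketch below is a standard version of the Hopkins statistic (values near 1 suggest clustered data, values near 0.5 suggest random data):

```python
import numpy as np
from scipy.spatial import cKDTree

def hopkins(points, sample_size=None, seed=0):
    """Standard Hopkins statistic for clustering tendency.
    `points` is an (n, d) array, e.g., word vectors of query terms."""
    rng = np.random.default_rng(seed)
    n, d = points.shape
    m = sample_size or max(1, n // 10)
    tree = cKDTree(points)

    # u_i: distances from m uniform random points to their nearest data point.
    lo, hi = points.min(axis=0), points.max(axis=0)
    u, _ = tree.query(rng.uniform(lo, hi, size=(m, d)), k=1)

    # w_i: distances from m sampled data points to their nearest
    # neighbour (k=2 because the nearest hit is the point itself).
    sample = points[rng.choice(n, size=m, replace=False)]
    w = tree.query(sample, k=2)[0][:, 1]

    return u.sum() / (u.sum() + w.sum())
```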
Now, I am not going to discuss those studies in details.
But here is the glimpse.
Developers generally look for relevant code on the web using natural language queries.
Please note that we are not talking simply about web search; rather, we are talking about source code repositories such as GitHub.
Now, GitHub provides this result. You see, it tries to match the query keywords against comments and identifiers.
But we are dealing with source code, right? So, we need a source-code-friendly query for a better result.
So, we identify relevant API classes against this natural language query through extensive data mining and data analytics.
And once again, Stack Overflow is our friend in this grand challenge.
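A crude sketch of the idea: rank API classes by how often they co-occur with the query keywords in Stack Overflow threads. The data format and all names here are hypothetical; the actual approach involves far richer mining and analytics:

```python
from collections import Counter

def suggest_api_classes(query_keywords, so_threads, top_n=5):
    """`so_threads` is assumed to be an iterable of
    (text_tokens, api_classes) pairs mined from Stack Overflow."""
    scores = Counter()
    for tokens, api_classes in so_threads:
        # Score each API class by how many query keywords its thread shares.
        overlap = sum(1 for kw in query_keywords if kw in tokens)
        if overlap:
            for api in api_classes:
                scores[api] += overlap
    return [api for api, _ in scores.most_common(top_n)]

# Illustrative call with made-up data:
threads = [(["parse", "html", "page"], ["Jsoup", "Document"]),
           (["send", "http", "request"], ["HttpURLConnection"])]
print(suggest_api_classes(["parse", "html"], threads))  # ['Jsoup', 'Document']
```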
And that concludes my talk.
Thanks a lot for your attention.
Now, I am ready to take some questions.