EXPANDING IDENTIFIERS TO
                NORMALIZE SOURCE
                 CODE VOCABULARY
                            PRESENTED BY DAWN LAWRIE
                           LOYOLA UNIVERSITY MARYLAND


                        IN COLLABORATION WITH DAVE BINKLEY




Friday, October 7, 11
VOCABULARY MISMATCH


                        DIFFERENT VOCABULARY IN SOURCE CODE AND OTHER
                        SOFTWARE ARTIFACTS

                        EXAMPLE

                          REQUIREMENT - “FEATURE LOCATION”

                          SOURCE CODE - “FEATURELOCATION”

                            OR WORSE     “FLOC”




PURPOSE OF NORMALIZE



                        COPE WITH VOCABULARY MISMATCH

                         SOURCE CODE

                         OTHER SOFTWARE DOCUMENTS




EXAMPLE PROBLEMS



                        CONSIDER IDENTIFIERS

                         FEATURELOCATION  ->  FEATURE LOCATION      SPLITTING PROBLEM

                         FLOC  ->  F LOC                            SPLITTING PROBLEM

                         FLOC  ->  FEATURE LOCATION                 SPLITTING AND
                                                                    EXPANSION PROBLEM




WHY NORMALIZE?



                        MANY SE PROBLEMS CAN BE ADDRESSED USING
                        INFORMATION RETRIEVAL (IR) TECHNIQUES

                        UN-NORMALIZED CODE LEADS TO AN UNDERESTIMATE
                        OF THE IMPORTANCE OF CRUCIAL WORDS




NORMALIZE PROBLEM STATEMENT




                        FIND THE BEST EXPANSION OVER ALL POSSIBLE SPLITS




                            FLOC    ->    FEATURE LOCATION


NORMALIZE ALGORITHM



                        TERMINOLOGY

                         HARD-WORD - WHITEHOUSE_LAWN    (2)

                         SOFT-WORD - WHITE-HOUSE_LAWN   (3)
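
The counts on this slide can be reproduced with a small sketch. It assumes hard words are the pieces separated by explicit markers (underscores or camelCase boundaries); the function name `hard_words` is hypothetical and is not part of Normalize. Soft words (WHITE, HOUSE) need a dictionary-based splitter and are not computed here.

```python
import re

def hard_words(identifier):
    """Split an identifier at explicit markers: underscores and camelCase boundaries."""
    words = []
    for part in identifier.split("_"):
        # Acronym runs, capitalized or lowercase words, and digit runs.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", part))
    return [w.lower() for w in words]

print(hard_words("whitehouse_lawn"))   # two hard words
print(hard_words("featureLocation"))   # camelCase boundary
```

The hard word `whitehouse` still hides two soft words, which is the part of the problem the rest of the talk addresses.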




NORMALIZE ALGORITHM


                        STRLEN  ->  STRING LENGTH




MACHINE TRANSLATION
                              APPROACH


                        EL      PAPA      VISITA     LA      IGLESIA

                        THE     FATHER    VISITS     THE     CHURCH
                                POTATO    VISITOR
                                POPE      HIT

                        STRONG COHESION




NORMALIZE ALGORITHM

                         STRLEN
                         S-TRLEN     ST-RLEN     STR-LEN     STRL-EN     STRLE-N
                         S-T-RLEN    S-TR-LEN    S-TRL-EN    S-TRLE-N    ST-R-LEN
                         ST-RL-EN    ST-RLE-N    STR-L-EN    STR-LE-N    STRL-E-N
                         S-T-R-LEN   S-T-RL-EN   S-T-RLE-N   S-TR-L-EN   S-TR-LE-N
                         S-TRL-E-N   ST-R-L-EN   ST-R-LE-N   ST-RL-E-N   STR-L-E-N
                         S-T-R-L-EN  S-T-R-LE-N  S-T-RL-E-N  S-TR-L-E-N  ST-R-L-E-N
                         S-T-R-L-E-N

                         E(ST) = {SET, STOP, STRING}
                         E(RLEN) = {RIFLEMEN}          WILDCARD EXPANSION  R*L*E*N*
                         E(STR) = {STEER, STRING}
                         E(LEN) = {LENDER, LENGTH}

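
The list of splits and the wildcard expansion shown for STRLEN can be sketched as follows. This is only an illustration: `all_splits`, `wildcard_expansions`, and the tiny `DICTIONARY` are hypothetical names and data, and the wildcard is read as "each letter of the soft word, in order, with anything in between", matching the R*L*E*N* pattern on the slide.

```python
import itertools
import re

def all_splits(identifier):
    """Enumerate every way to cut the identifier into contiguous soft words."""
    n = len(identifier)
    splits = []
    for k in range(n):  # k = number of cut points, from 0 to n - 1
        for cuts in itertools.combinations(range(1, n), k):
            bounds = [0, *cuts, n]
            splits.append([identifier[a:b] for a, b in zip(bounds, bounds[1:])])
    return splits

def wildcard_expansions(soft_word, dictionary):
    """Return dictionary words matching e.g. r.*l.*e.*n.* for the soft word 'rlen'."""
    pattern = re.compile(".*".join(map(re.escape, soft_word)) + ".*")
    return [word for word in dictionary if pattern.fullmatch(word)]

# Hypothetical dictionary; a real one would be far larger.
DICTIONARY = ["set", "stop", "string", "steer", "lender", "length", "riflemen"]

print(len(all_splits("strlen")))                 # 2 ** 5 candidate splits
print(wildcard_expansions("rlen", DICTIONARY))
print(wildcard_expansions("len", DICTIONARY))
```

An identifier of n characters has 2^(n-1) splits, which is why the expansion step needs a way to rank candidates rather than inspect them by hand.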
NORMALIZE ALGORITHM PART I
                        STR

                         STRING               VS         STEER
                           LENDER                          LENDER
                         + LENGTH                        + LENGTH
                         COHESION A                      COHESION B

                        1. FIND COHESION BY SUMMING LOG OF
                           PROBABILITIES OF WORD PAIRS
                        2. SELECT EXPANSION THAT MAXIMIZES
                           COHESION

                        STR  ->  STRING
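
A minimal sketch of the two steps for STR, assuming co-occurrence probabilities are already available. The values in `PAIR_PROB` are made up for illustration, `FLOOR` is an assumed smoothing value for unseen pairs, and the function names are hypothetical.

```python
import math

# Hypothetical co-occurrence probabilities (illustrative values only).
PAIR_PROB = {
    frozenset(("string", "length")): 1e-4,
    frozenset(("string", "lender")): 1e-8,
    frozenset(("steer", "length")): 1e-7,
    frozenset(("steer", "lender")): 1e-8,
}
FLOOR = 1e-12  # assumed probability for pairs never seen together

def cohesion(candidate, other_words):
    """Step 1: sum the log probabilities of the candidate paired with each other word."""
    return sum(
        math.log(PAIR_PROB.get(frozenset((candidate, word)), FLOOR))
        for word in other_words
    )

def select_expansion(candidates, other_words):
    """Step 2: keep the candidate with the highest cohesion."""
    return max(candidates, key=lambda c: cohesion(c, other_words))

e_len = ["lender", "length"]
print(select_expansion(["string", "steer"], e_len))
```

With these made-up numbers STRING wins because it co-occurs with LENGTH far more often than STEER does.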
NORMALIZE ALGORITHM PART II

                                         VS

                          STR-LEN                 ST-RLEN
                        STRING LENGTH           STOP RIFLEMEN

                             STRING LENGTH
                    1. FIND COHESION OVER EXPANSIONS
                        2. SELECT EXPANSION OF THE SPLIT
                            THAT MAXIMIZES COHESION
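
The comparison between splits can be sketched in the same way. This is an assumption-laden illustration: the probabilities are invented, `FLOOR` is an assumed smoothing value, and cohesion here is summed over all word pairs of one expansion, which the slides do not spell out. Both splits have two soft words, so no length normalization is attempted.

```python
import itertools
import math

# Candidate splits with the expansion sets shown on the STRLEN slide.
SPLITS = {
    ("str", "len"): [["string", "steer"], ["lender", "length"]],
    ("st", "rlen"): [["set", "stop", "string"], ["riflemen"]],
}
# Hypothetical co-occurrence probabilities (illustrative values only).
PAIR_PROB = {
    frozenset(("string", "length")): 1e-4,
    frozenset(("stop", "riflemen")): 1e-9,
}
FLOOR = 1e-12  # assumed probability for unseen pairs

def cohesion(words):
    """Sum of log probabilities over every pair of words in one expansion."""
    return sum(
        math.log(PAIR_PROB.get(frozenset(pair), FLOOR))
        for pair in itertools.combinations(words, 2)
    )

def best_expansion(expansion_lists):
    """Best combination of expansions for a single split."""
    return max(itertools.product(*expansion_lists), key=cohesion)

def normalize(splits):
    """Best expansion over all candidate splits."""
    return max((best_expansion(lists) for lists in splits.values()), key=cohesion)

print(normalize(SPLITS))
```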

ADDING CONTEXT

             DIR             E(DIR) = {DIRECTION, DIRECTORY}
                            CONTEXT = {FORWARD, BACKWARD}



                        FIND COHESION WITH CONTEXT WORDS IN ADDITION TO
                        EXPANSIONS OF OTHER SOFT WORDS

                        USED IN BOTH PART I AND PART II
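
A sketch of how context words could enter the cohesion score for DIR. The probabilities are invented, `FLOOR` is an assumed smoothing value, and `expand_with_context` is a hypothetical name; the only idea taken from the slide is that context words are scored alongside the expansions of the other soft words.

```python
import math

# Hypothetical co-occurrence probabilities (illustrative values only).
PAIR_PROB = {
    frozenset(("direction", "forward")): 1e-5,
    frozenset(("direction", "backward")): 1e-5,
    frozenset(("directory", "forward")): 1e-8,
    frozenset(("directory", "backward")): 1e-9,
}
FLOOR = 1e-12  # assumed probability for unseen pairs

def cohesion(candidate, other_expansions, context):
    """Score a candidate against other soft-word expansions plus context words."""
    words = list(other_expansions) + list(context)
    return sum(
        math.log(PAIR_PROB.get(frozenset((candidate, word)), FLOOR))
        for word in words
    )

def expand_with_context(candidates, other_expansions, context):
    return max(candidates, key=lambda c: cohesion(c, other_expansions, context))

# DIR is the only soft word here, so the context words decide the expansion.
print(expand_with_context(["direction", "directory"], [], ["forward", "backward"]))
```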




NORMALIZE IMPLEMENTATION




                        USES GenTest TO SPLIT IDENTIFIERS

                          RETURNS MULTIPLE SPLITS

                        CO-OCCURRENCE DATA FROM GOOGLE 5-GRAM DATASET




EVALUATION

                    Program          LoC         SLoC       Unique Ids

                    which-2.20        3,670       2,293         487

                    a2ps-4.14        62,347      38,436       4,393


                    Program          Selected Ids   Hard Words   Soft Words

                    which-2.20           487           903         1,214

                    a2ps-4.14            211           459           618




EVALUATION

                        THREE GROUPS OF IDENTIFIERS

                          STANDARD LIBRARY CALLS

                          NAMES FROM STANDARD HEADER FILES / KEYWORDS

                          DOMAIN NAMES


                         Program         Filtered Ids   Reported Ids

                         which-2.20          152            335

                         a2ps-4.14            46            166

EXAMPLE EXPANSIONS

                        id          Top 10 Expansion     Top Expansion

                        nextchar    next_character       next_character
                        indfound    index_found_need     index_found
                        optarg      option_are_g         optarg
                        itemno      i_them_not           itemno




RESEARCH QUESTIONS



                        WHAT IS THE OVERALL ACCURACY OF NORMALIZE?

                        DOES THE VOCABULARY USED HAVE A SIGNIFICANT
                        IMPACT ON THE EXPANSION’S ACCURACY?

                        CAN THE EXPANDER INFORM THE SPLITTER?

                        CAN THE SPLITTER INFORM THE EXPANDER?




ACCURACY ON DOMAIN IDS




SOURCE OF EXPANSION WORDS



                        SOURCE CODE

                        INTERNAL DOCUMENTATION

                        MANUAL




BEST VOCABULARY SOURCE?




FUTURE WORK


                        EXPLORING DIFFERENT SOURCES OF CO-OCCURRENCE
                        DATA

                        EXPLORING DIFFERENT WAYS OF CALCULATING
                        PROBABILITIES

                        EXAMINING NORMALIZATION IN CONTEXT OF AN
                        INFORMATION RETRIEVAL TASK




SUMMARY



                        IDENTIFIERS ARE WRITTEN DIFFERENTLY THAN OTHER
                        SOFTWARE DOCUMENTS

                          DEGRADES PERFORMANCE OF IR TECHNIQUES

                        NORMALIZE CURRENTLY EXPANDS ABOUT HALF OF
                        SOFT WORDS CORRECTLY




QUESTIONS?


                         Need an identifier split?
                        GenTest Splitter available at
                            splitit.cs.loyola.edu



Commit 2024 - Secret Management made easy
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 

Natural Language Analysis - Expanding Identifiers to Normalize Source Code Vocabulary

  • 1. EXPANDING IDENTIFIERS TO NORMALIZE SOURCE CODE VOCABULARY. PRESENTED BY DAWN LAWRIE, LOYOLA UNIVERSITY MARYLAND, IN COLLABORATION WITH DAVE BINKLEY. Friday, October 7, 11
  • 2. VOCABULARY MISMATCH: DIFFERENT VOCABULARY IN SOURCE CODE AND OTHER SOFTWARE ARTIFACTS. EXAMPLE: REQUIREMENT - “FEATURE LOCATION”; SOURCE CODE - “FEATURELOCATION” OR, WORSE, “FLOC”
  • 3. PURPOSE OF NORMALIZE: COPE WITH VOCABULARY MISMATCH BETWEEN SOURCE CODE AND OTHER SOFTWARE DOCUMENTS
  • 4. EXAMPLE PROBLEMS: CONSIDER IDENTIFIERS FEATURELOCATION AND FLOC
  • 5. EXAMPLE PROBLEMS: CONSIDER IDENTIFIERS FEATURE LOCATION (SPLITTING PROBLEM); FLOC
  • 6. EXAMPLE PROBLEMS: CONSIDER IDENTIFIERS FEATURE LOCATION (SPLITTING PROBLEM); F LOC (SPLITTING PROBLEM)
  • 7. EXAMPLE PROBLEMS: CONSIDER IDENTIFIERS FEATURE LOCATION (SPLITTING PROBLEM); FEATURE LOCATION (SPLITTING AND EXPANSION PROBLEM)
  • 8. WHY NORMALIZE? MANY SE PROBLEMS CAN BE ADDRESSED USING INFORMATION RETRIEVAL (IR) TECHNIQUES. UN-NORMALIZED CODE LEADS TO AN UNDERESTIMATE OF THE IMPORTANCE OF CRUCIAL WORDS.
  • 9. NORMALIZE PROBLEM STATEMENT: FIND THE BEST EXPANSION OVER ALL POSSIBLE SPLITS. EXAMPLE: FLOC EXPANDS TO FEATURE LOCATION
  • 10. NORMALIZE ALGORITHM TERMINOLOGY: HARD-WORD - WHITEHOUSE_LAWN; SOFT-WORD - WHITE-HOUSE_LAWN
  • 11. NORMALIZE ALGORITHM TERMINOLOGY: HARD-WORD - WHITEHOUSE_LAWN (2); SOFT-WORD - WHITE-HOUSE_LAWN
  • 12. NORMALIZE ALGORITHM TERMINOLOGY: HARD-WORD - WHITEHOUSE_LAWN (2); SOFT-WORD - WHITE-HOUSE_LAWN (3)
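The counts on the terminology slide (2 hard words, 3 soft words) suggest that hard words are the pieces separated by explicit markers, while soft words are the dictionary words hidden inside them. A minimal sketch of the hard-word step follows, assuming the markers are underscores and case changes; the function name `hard_words` is hypothetical and not taken from the talk.

```python
import re


def hard_words(identifier: str) -> list[str]:
    """Split an identifier at explicit markers (assumed: '_' and case changes)."""
    parts = []
    for chunk in identifier.split("_"):
        # Acronym runs, capitalised or lowercase words, and digit runs
        # each become one hard word.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return parts


print(hard_words("whitehouse_lawn"))   # two hard words
print(hard_words("featureLocation"))
```

Finding the soft words inside `whitehouse` needs a dictionary or a trained splitter, which is the harder problem the remaining slides address.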
  • 14. NORMALIZE ALGORITHM: STRLEN EXPANDS TO STRING LENGTH
  • 15. MACHINE TRANSLATION APPROACH: EL PAPA VISITA LA IGLESIA
  • 16. MACHINE TRANSLATION APPROACH: EL PAPA VISITA LA IGLESIA. CANDIDATE TRANSLATIONS: PAPA = {FATHER, POTATO, POPE}; VISITA = {VISITS, VISITOR, HIT}; EL, LA = THE; IGLESIA = CHURCH
  • 18. MACHINE TRANSLATION APPROACH: EL PAPA VISITA LA IGLESIA. CANDIDATE TRANSLATIONS: PAPA = {FATHER, POTATO, POPE}; VISITA = {VISITS, VISITOR, HIT}; EL, LA = THE; IGLESIA = CHURCH. COHESION: STRONG
  • 21. NORMALIZE ALGORITHM: STRLEN
  • 22. NORMALIZE ALGORITHM: STRLEN S-TRLEN ST-RLEN STR-LEN STRL_EN STRLE_N S_T_RLEN S-TR-LEN S_TRL_EN S_TRLE_N ST_R_LEN ST_RL_EN ST_RLE_N STR_L_EN STR_LE_N STRL_E_N S_T_R_LEN S_T_RL_EN S_T_RLE_N S_TR_L_EN S_TR_LE_N S_TRL_E_N ST_R_L_EN ST_R_LE_N ST_RL_E_N STR_L_E_N S_T_R_L_EN S_T_R_LE_N S_TR_L_E_N ST_R_L_E_N S-T-R-L-E-N
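The candidate splits of a hard word can be generated mechanically: a word of n letters has n-1 possible cut positions, and each subset of them gives one split, for 2^(n-1) candidates including the unsplit word. The sketch below is only an illustration of that enumeration; `all_splits` is a hypothetical name, and the talk's implementation obtains its splits from GenTest rather than by brute force.

```python
from itertools import combinations


def all_splits(word: str) -> list[list[str]]:
    """Return every way of cutting `word` into contiguous pieces."""
    n = len(word)
    splits = []
    for k in range(n):  # k = number of cut points, from 0 to n-1
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            splits.append([word[a:b] for a, b in zip(bounds, bounds[1:])])
    return splits


for pieces in all_splits("strlen")[:6]:
    print("-".join(pieces))
```

The exponential growth is one reason a practical system would prune candidates before expanding them.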
  • 23. NORMALIZE ALGORITHM: STRLEN S-TRLEN ST-RLEN STR-LEN STRL_EN STRLE_N S_T_RLEN S-TR-LEN S_TRL_EN S_TRLE_N ST_R_LEN ST_RL_EN ST_RLE_N STR_L_EN STR_LE_N STRL_E_N S_T_R_LEN S_T_RL_EN S_T_RLE_N S_TR_L_EN S_TR_LE_N S_TRL_E_N ST_R_L_EN ST_R_LE_N ST_RL_E_N STR_L_E_N S_T_R_L_EN S_T_R_LE_N S_TR_L_E_N ST_R_L_E_N S-T-R-L-E-N. ST-RLEN: E(RLEN) = {RIFLEMEN}
  • 24. NORMALIZE ALGORITHM: STRLEN S-TRLEN ST-RLEN STR-LEN STRL_EN STRLE_N S_T_RLEN S-TR-LEN S_TRL_EN S_TRLE_N ST_R_LEN ST_RL_EN ST_RLE_N STR_L_EN STR_LE_N STRL_E_N S_T_R_LEN S_T_RL_EN S_T_RLE_N S_TR_L_EN S_TR_LE_N S_TRL_E_N ST_R_L_EN ST_R_LE_N ST_RL_E_N STR_L_E_N S_T_R_L_EN S_T_R_LE_N S_TR_L_E_N ST_R_L_E_N S-T-R-L-E-N. ST-RLEN: E(RLEN) = {RIFLEMEN}, WILDCARD EXPANSION R*L*E*N*
  • 25. NORMALIZE ALGORITHM: STRLEN S-TRLEN ST-RLEN STR-LEN STRL_EN STRLE_N S_T_RLEN S-TR-LEN S_TRL_EN S_TRLE_N ST_R_LEN ST_RL_EN ST_RLE_N STR_L_EN STR_LE_N STRL_E_N S_T_R_LEN S_T_RL_EN S_T_RLE_N S_TR_L_EN S_TR_LE_N S_TRL_E_N ST_R_L_EN ST_R_LE_N ST_RL_E_N STR_L_E_N S_T_R_L_EN S_T_R_LE_N S_TR_L_E_N ST_R_L_E_N S-T-R-L-E-N. ST-RLEN: E(ST) = {SET, STOP, STRING}, E(RLEN) = {RIFLEMEN}. STR-LEN: E(STR) = {STEER, STRING}, E(LEN) = {LENDER, LENGTH}
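The wildcard pattern R*L*E*N* can be read as "each letter of the soft word, in order, with anything in between". One way to sketch it is as a regular expression matched against a word list. The word list below is a toy assumption built from the words on the slides, and `expand` is a hypothetical helper name; how the real dictionary is chosen is not stated here.

```python
import re

# Toy dictionary for illustration only (assumption, not the talk's word list).
WORDS = ["set", "stop", "string", "steer", "lender", "length", "riflemen"]


def expand(soft_word: str, dictionary: list[str]) -> list[str]:
    """Return dictionary words matching the pattern c1*c2*...cn*."""
    # Each character must appear in order; '.*' allows any letters after it.
    pattern = re.compile("".join(re.escape(ch) + ".*" for ch in soft_word.lower()))
    return sorted(w for w in dictionary if pattern.fullmatch(w.lower()))


print(expand("rlen", WORDS))
print(expand("str", WORDS))
print(expand("len", WORDS))
```

Because the pattern is so permissive, a realistic dictionary will return many candidates per soft word, which is why the following slides rank them by cohesion.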
  • 26. NORMALIZE ALGORITHM PART I: STR: STRING VS STEER
  • 27. NORMALIZE ALGORITHM PART I: STR: STRING VS STEER, EACH PAIRED WITH LENDER AND LENGTH
  • 28. NORMALIZE ALGORITHM PART I: STR: STRING VS STEER, EACH PAIRED WITH LENDER AND LENGTH. 1. FIND COHESION BY SUMMING LOG OF PROBABILITIES OF WORD PAIRS
  • 29. NORMALIZE ALGORITHM PART I: STR: STRING (LENDER + LENGTH = COHESION A) VS STEER (LENDER + LENGTH = COHESION B). 1. FIND COHESION BY SUMMING LOG OF PROBABILITIES OF WORD PAIRS
  • 30. NORMALIZE ALGORITHM PART I: STR: STRING (LENDER + LENGTH = COHESION A) VS STEER (LENDER + LENGTH = COHESION B). 1. FIND COHESION BY SUMMING LOG OF PROBABILITIES OF WORD PAIRS 2. SELECT EXPANSION THAT MAXIMIZES COHESION
  • 32. NORMALIZE ALGORITHM PART I: STR: STRING (LENDER + LENGTH = COHESION A) VS STEER (LENDER + LENGTH = COHESION B). RESULT: STRING. 1. FIND COHESION BY SUMMING LOG OF PROBABILITIES OF WORD PAIRS 2. SELECT EXPANSION THAT MAXIMIZES COHESION
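Part I can be sketched as follows: for each candidate expansion of one soft word, sum the log co-occurrence probabilities with every candidate expansion of the other soft words, then keep the candidate with the largest sum. The probabilities below are invented for illustration, and `PAIR_PROB`, `FLOOR`, `cohesion` and `best_expansion` are hypothetical names, not the talk's implementation.

```python
import math

# Invented co-occurrence probabilities (assumption, for illustration only).
PAIR_PROB = {
    frozenset(("string", "length")): 1e-4,
    frozenset(("string", "lender")): 1e-8,
    frozenset(("steer", "length")): 1e-7,
    frozenset(("steer", "lender")): 1e-8,
}
FLOOR = 1e-12  # assumed fallback for unseen pairs, so log() stays defined


def cohesion(candidate: str, others: list[str]) -> float:
    """Sum of log pair probabilities between `candidate` and `others`."""
    return sum(
        math.log(PAIR_PROB.get(frozenset((candidate, other)), FLOOR))
        for other in others
    )


def best_expansion(candidates: list[str], others: list[str]) -> str:
    """Pick the candidate expansion with the highest cohesion."""
    return max(candidates, key=lambda c: cohesion(c, others))


print(best_expansion(["string", "steer"], ["lender", "length"]))
```

With these made-up numbers STRING wins because it co-occurs with LENGTH far more often than STEER does, mirroring the outcome shown on the slide.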
  • 33. NORMALIZE ALGORITHM PART II: STR-LEN VS ST-RLEN
  • 34. NORMALIZE ALGORITHM PART II: STR-LEN (STRING LENGTH) VS ST-RLEN (STOP RIFLEMEN)
  • 35. NORMALIZE ALGORITHM PART II: STR-LEN (STRING LENGTH) VS ST-RLEN (STOP RIFLEMEN). 1. FIND COHESION OVER EXPANSIONS
  • 36. NORMALIZE ALGORITHM PART II: STR-LEN (STRING LENGTH) VS ST-RLEN (STOP RIFLEMEN). 1. FIND COHESION OVER EXPANSIONS 2. SELECT EXPANSION OF THE SPLIT THAT MAXIMIZES COHESION
  • 38. NORMALIZE ALGORITHM PART II: STR-LEN (STRING LENGTH) VS ST-RLEN (STOP RIFLEMEN). RESULT: STRING LENGTH. 1. FIND COHESION OVER EXPANSIONS 2. SELECT EXPANSION OF THE SPLIT THAT MAXIMIZES COHESION
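Part II compares whole splits. A rough sketch, under the assumption that each split's score is the cohesion of its best combination of expansions: enumerate the combinations for every split, score each by summed log pair probabilities, and return the best split together with its expansion. All probabilities and names here (`PAIR_PROB`, `FLOOR`, `best_normalization`, `CANDIDATES`) are illustrative assumptions.

```python
import math
from itertools import combinations, product

# Invented co-occurrence probabilities (assumption, for illustration only).
PAIR_PROB = {
    frozenset(("string", "length")): 1e-4,
    frozenset(("stop", "riflemen")): 1e-9,
}
FLOOR = 1e-12  # assumed fallback for unseen pairs


def cohesion(words: tuple[str, ...]) -> float:
    """Sum of log pair probabilities over all word pairs in one expansion."""
    return sum(
        math.log(PAIR_PROB.get(frozenset(pair), FLOOR))
        for pair in combinations(words, 2)
    )


def best_normalization(candidate_splits: dict) -> tuple:
    """Return (split, expansion) with the highest cohesion over all splits."""
    best = None
    for split, expansion_sets in candidate_splits.items():
        for expansion in product(*expansion_sets):
            score = cohesion(expansion)
            if best is None or score > best[0]:
                best = (score, split, expansion)
    return best[1], best[2]


CANDIDATES = {
    ("str", "len"): [["string", "steer"], ["lender", "length"]],
    ("st", "rlen"): [["set", "stop", "string"], ["riflemen"]],
}
print(best_normalization(CANDIDATES))
```

Note that splits with more soft words contribute more pairs and therefore a more negative sum; this sketch does not attempt any normalization for that.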
  • 40. ADDING CONTEXT: DIR
  • 41. ADDING CONTEXT: DIR, E(DIR) = {DIRECTION, DIRECTORY}
  • 42. ADDING CONTEXT: DIR, E(DIR) = {DIRECTION, DIRECTORY}, CONTEXT = {FORWARD, BACKWARD}
  • 43. ADDING CONTEXT: DIR, E(DIR) = {DIRECTION, DIRECTORY}, CONTEXT = {FORWARD, BACKWARD}. FIND COHESION WITH CONTEXT WORDS IN ADDITION TO EXPANSIONS OF OTHER SOFT WORDS. USED IN BOTH PART I AND PART II.
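The DIR example shows why context matters: an identifier with a single soft word has no sibling expansions to score against, so every candidate ties. A small sketch of adding context words to the cohesion sum follows; the probabilities are invented and the names `PAIR_PROB`, `FLOOR`, `cohesion` and `pick` are hypothetical.

```python
import math

# Invented co-occurrence probabilities (assumption, for illustration only).
PAIR_PROB = {
    frozenset(("direction", "forward")): 1e-5,
    frozenset(("direction", "backward")): 1e-5,
    frozenset(("directory", "forward")): 1e-8,
    frozenset(("directory", "backward")): 1e-8,
}
FLOOR = 1e-12  # assumed fallback for unseen pairs


def cohesion(candidate, other_words, context_words=()):
    """Score a candidate against sibling expansions plus context words."""
    neighbours = list(other_words) + list(context_words)
    return sum(
        math.log(PAIR_PROB.get(frozenset((candidate, w)), FLOOR))
        for w in neighbours
    )


def pick(candidates, other_words, context_words=()):
    """Return the candidate with the highest cohesion (first one on ties)."""
    return max(candidates, key=lambda c: cohesion(c, other_words, context_words))


candidates = ["directory", "direction"]
print(pick(candidates, []))                         # tie: falls back to list order
print(pick(candidates, [], ["forward", "backward"]))
```

Without context both candidates score zero and the choice is arbitrary; with the context words FORWARD and BACKWARD the made-up numbers favour DIRECTION.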
  • 44. NORMALIZE IMPLEMENTATION: USES GenTest TO SPLIT IDENTIFIERS (RETURNS MULTIPLE SPLITS); GOOGLE 5-GRAM DATASET
  • 45. EVALUATION
        Program      LoC      SLoC     Unique Ids
        which-2.20   3,670    2,293    487
        a2ps-4.14    62,347   38,436   4,393

        Program      Selected Ids   Hard Words   Soft Words
        which-2.20   487            903          1,214
        a2ps-4.14    211            459          618
  • 46. EVALUATION: THREE GROUPS OF IDENTIFIERS: STANDARD LIBRARY CALLS; NAMES FROM STANDARD HEADER FILES / KEYWORDS; DOMAIN NAMES
  • 48. EVALUATION: THREE GROUPS OF IDENTIFIERS: STANDARD LIBRARY CALLS; NAMES FROM STANDARD HEADER FILES / KEYWORDS; DOMAIN NAMES
        Program      Filtered Ids   Reported Ids
        which-2.20   152            335
        a2ps-4.14    46             166
  • 49. EXAMPLE EXPANSIONS
        id         Top 10 Expansion    Top Expansion
        nextchar   next_character      next_character
        indfound   index_found_need    index_found
        optarg     option_are_g        optarg
        itemno     i_them_not          itemno
  • 50. RESEARCH QUESTIONS: WHAT IS THE OVERALL ACCURACY OF NORMALIZE? DOES THE VOCABULARY USED HAVE A SIGNIFICANT IMPACT ON THE EXPANSION’S ACCURACY? CAN THE EXPANDER INFORM THE SPLITTER? CAN THE SPLITTER INFORM THE EXPANDER?
  • 51. ACCURACY ON DOMAIN IDS
  • 52. SOURCE OF EXPANSION WORDS: SOURCE CODE; INTERNAL DOCUMENTATION; MANUAL
  • 54. FUTURE WORK: EXPLORING DIFFERENT SOURCES OF CO-OCCURRENCE DATA; EXPLORING DIFFERENT WAYS OF CALCULATING PROBABILITIES; EXAMINING NORMALIZATION IN THE CONTEXT OF AN INFORMATION RETRIEVAL TASK
  • 55. SUMMARY: IDENTIFIERS ARE WRITTEN DIFFERENTLY THAN OTHER SOFTWARE DOCUMENTS, WHICH DEGRADES PERFORMANCE OF IR TECHNIQUES. NORMALIZE CURRENTLY EXPANDS ABOUT HALF OF SOFT WORDS CORRECTLY.
  • 56. QUESTIONS? Need an identifier split? GenTest Splitter available at splitit.cs.loyola.edu