
Put Your Hands in the Mud: What Technique, Why, and How

Mining Unstructured Data (MUD 2015) Workshop tutorial


  1. 1. Put Your Hands in the Mud: What Technique, Why, and How? Massimiliano Di Penta, University of Sannio, Italy dipenta@unisannio.it http://www.ing.unisannio.it/mdipenta
  2. 2. Outline MUD of software repositories Available techniques:
 Pattern matching
 Island parsers
 IR algebraic methods
 Natural Language Parsing
Choosing the solution that fits your needs
  3. 3. Textual Analysis very successful in SE...
  4. 4. Why? Code identifiers and comments reflect semantics. Software repositories contain a mix of structured and unstructured data.
  5. 5. (Some) Applications • Traceability Link Recovery • Feature Location • Clone Detection • Conceptual Cohesion/Coupling Metrics • Bug prediction • Bug triaging • Software remodularization • Textual smells/antipatterns • Duplicate bug report detection • Software artifact summarization and labeling • Generating assertions from textual descriptions • ...
  6. 6. Different kinds of Repositories
  7. 7. Versioning Systems Besides code itself, the main source of unstructured data is represented by commit messages
  8. 8. Example • Add Checkclipse preferences to all projects so Checkstyle is preconfigured • Fix for issue 4503. • Fix for issue 4517: "changeability" on association ends not displayed properly. This was never adapted since UML 1.3 & NSUML were replaced. What kind of problem do you see here?
  9. 9. Be aware! • Commit notes are often insufficient to know everything about a change • Need to merge with issue tracker data
  10. 10. Issue Trackers
  11. 11. Issue Trackers • Contain a mix of structured and non-structured data • Structured: classification, priority, severity, status, but also stack traces and code snippets • Unstructured: issue description, comments • Often natural language mixed with method, class, package names, etc.
  12. 12. Challenges • Including comments in the text corpus may or may not add useful information • The mix of natural language and code elements may challenge the application of techniques such as natural language parsing
  13. 13. Techniques Nicolas Bettenburg, Stephen W. Thomas, Ahmed E. Hassan: Using Fuzzy Code Search to Link Code Fragments in Discussions to Source Code. CSMR 2012: 319-328 (Slide shows the paper's first page: token-based clone detection is used as a fuzzy code search to link code fragments mentioned in discussions to their locations in the source code, evaluated on Eclipse and compared to change log analysis and information retrieval.)
  14. 14. Emails Similar characteristics to issue trackers, with less structure. From lisa at usna.navy.MIL Thu Jul 1 15:42:27 1999 From: lisa at usna.navy.MIL (Lisa Becktold {CADIG STAFF}) Date: Tue Dec 2 03:03:10 2003 Subject: nmbd/nmbd_processlogon.c - CODE 12??? Message-ID: <99Jul1.114236-0400edt.4995-357+39@jupiter.usna.navy.mil> Hi: I have Samba 2.1.0-prealpha running on a Sun Ultra 10. My NT PC joined the domain without a problem, but I can't logon. Every time I attempt to log into the Samba domain, the NT screen blanks as it usually does during login, but then the "Begin Login" screen reappears. I see this message in my samba/var/log.nmb file whenever I try to login:
  15. 15. Web Forums
  16. 16. Stack Overflow
  17. 17. Mining Forums: Challenges • The usual problems you face with issue trackers and emails are still there • However, in general the separation between different elements is much more consistent
  18. 18. StackOverflow Embedded Code
  19. 19. StackOverflow Embedded Code
  20. 20. However, this convention is not always followed!
  21. 21. User reviews
  22. 22. Not all reviews are equal Ning Chen, Jialiu Lin, Steven C. H. Hoi, Xiaokui Xiao, Boshen Zhang: AR-miner: mining informative reviews for developers from mobile app marketplace. ICSE 2014: 767-778 (Slide shows the paper's first page: AR-Miner filters out noisy and irrelevant reviews, groups the informative ones with topic modeling, prioritizes them with a review ranking scheme, and visualizes the resulting groups.)
  23. 23. Informative reviews…
• Functional: None of the pictures will load in my news feed.
• Performance: It lags and doesn't respond to my touch which almost always causes me to run into stuff.
• Feature (change) request: Amazing app, although I wish there were more themes to choose from / Please make it a little easy to get bananas please and make more power ups that would be awesome.
• Remove ads: So many ads its unplayable!
• Fix permissions: This game is adding for too much unexplained permissions.
  24. 24. Non-Informative reviews
• Purely emotional: Great fun can't put it down! / This is a crap app.
• Description of app/actions: I have changed my review from 2 star to 1 star.
• Unclear issue description: Bad game this is not working on my phone.
• Questions: How can I get more points?
  25. 25. @ICSME 2015 
 Thursday, 13.50 - Mobile applications: Fabio Palomba, Mario Linares-Vásquez, Gabriele Bavota, Rocco Oliveto, Massimiliano Di Penta, Denys Poshyvanyk, Andrea De Lucia: User Reviews Matter! Tracking Crowdsourced Reviews to Support Evolution of Successful Apps. ICSME 2015. (Slide shows the paper's first page: the CRISTAL approach traces informative crowd reviews onto source code changes and monitors how developers accommodate crowd requests; a study on 100 Android apps indicates that developers implementing user reviews are rewarded in terms of ratings.)
  26. 26. So what do we get? Emails, Versioning Systems, Issue Reports
  27. 27. • Continuous interleaving of structured and unstructured data • Format not always consistent • Technical/domain terms, abbreviations, acronyms • Short documents, documents of varying size
  28. 28. Mining unstructured data: techniques
  29. 29. Different techniques • Techniques based on string / regexp matching • Information Retrieval Models • Natural Language Parsing • Island and lake parsers
  30. 30. Pattern/Regular Expression Matching
  31. 31. Pattern matching: when and where • You're not interested in the whole document content • Rather, you want to match some keywords • Very few variants • Context insensitivity
  32. 32. Basic tools you might want to use • Unix command line tools: wc, cut, sort, uniq, head, tail • Regular expression engines: grep, sed, awk • Scripting Languages: Perl, Python • General-purpose programming languages with regexp support:
 java.util.regex (similar to Perl regexp)
  33. 33. Pattern Matching in Perl • The =~ or !~ operator searches a string for a pattern • /PATTERN/ defines a pattern • e.g. if($a=~/PATTERN/i){...} or 
 if($a!~/PATTERN/){...} • s/PATTERN/REPLACEMENT/egimosx • e.g. $a=~s/PATTERN/REPLACEMENT/g;
  34. 34. Special Symbols
\w Match a "word" character (alphanumeric plus "_")
\W Match a non-word character
\s Match a whitespace character
\S Match a non-whitespace character
\d Match a digit character
\D Match a non-digit character
  35. 35. Assertions
\b Match a word boundary
\B Match a non-(word boundary)
\A Match only at beginning of string
\Z Match only at end of string, or before newline at the end
\z Match only at end of string
  36. 36. Matching issue ids “fix 367920 setting pop3 messages as junk/not junk ignored when message quarantining turned on sr=mscott”
$note=~/Issue (\d+)/i ||
$note=~/Issue number:\s+(\d+)/i ||
$note=~/Defect\s+(\d+)/i ||
$note=~/Fix\s+(\d+)/i
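The same extraction can be sketched in Python. The patterns below are illustrative variants adapted from the Perl ones above, not the exact ones from any study; real commit conventions vary per project, so treat this as a starting point to tune.

import re

# Illustrative issue-id patterns (assumptions, not a project's actual conventions)
ISSUE_PATTERNS = [
    re.compile(r"issue\s+#?(\d+)", re.IGNORECASE),
    re.compile(r"issue number:\s*(\d+)", re.IGNORECASE),
    re.compile(r"defect\s+(\d+)", re.IGNORECASE),
    re.compile(r"fix(?:es|ed)?\s+(?:for\s+)?#?(\d+)", re.IGNORECASE),
]

def extract_issue_ids(commit_message):
    # Collect every id matched by any pattern
    ids = set()
    for pattern in ISSUE_PATTERNS:
        ids.update(pattern.findall(commit_message))
    return ids

print(extract_issue_ids("fix 367920 setting pop3 messages as junk/not junk"))
# -> {'367920'}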
  37. 37. Mapping emails onto classes/files • Emails often refer to classes / source files • One can trace emails to files using IR-based traceability • In the end, simple regexp matching is more precise and efficient Alberto Bacchelli, Michele Lanza, Romain Robbes: 
 Linking e-mails and source code artifacts. ICSE (1) 2010: 375-384
  38. 38. Example (Apache httpd) Hi, I'm running SunOS4.1.3 (with vif kernel hacks) and the Apache server. www.xyz.com is a CNAME alias for xyz.com. <VirtualHost www.xyz.com etc., is in my httpd.conf file. Accessing http://www.xyz.com *fails* (actually, it brings up the default DocumentRoot). Accessing http://xyz.com *succeeds*, however, bringing up the Virtual DocumentRoot. The get_local_addr() function in the Apache util.c file seems to return the wrong socket name, but I can't figure out why. Any help is greatly appreciated!
  39. 39. (Simplified) Perl implementation
#!/usr/bin/perl
@fileNames=`cat filelist.txt`; # content of filelist.txt goes into array @fileNames
%hashFiles=();
foreach $f (@fileNames) {
  chomp($f);
  $hashFiles{$f}++; # populates hash table %hashFiles
}
while (<>) {
  $l = $_;
  chomp($l);
  # matches a file name composed of word chars, ending with .c and a word boundary (\b)
  while ($l =~ /(\w+\.c\b)/) {
    $name = $1;
    if (defined($hashFiles{$name})) { # checks if the file name is in the hash table
      print "Mail mapped to file $name\n";
    }
    $l = $'; # assigns the post-match part to $l
  }
}
  40. 40. RegExp: Pros and Cons
✔ Often the easiest solution
✔ Fast and precise enough
✔ Unsupervised
✘ Pattern variants can lead to misclassifications
✘ Context-insensitive
  41. 41. Key Ingredients • Often finding the right pattern is not straightforward • A lot of manual analysis might be needed • Iterative process
  42. 42. Island and Lake Parsers
  43. 43. General Ideas Island parser: ignore everything (the lake) until you find something matching a specific grammar rule (an island). Lake parser: parse everything (the lake), except for things that do not match any rule (islands).
  44. 44. Traditional applications Leon Moonen: Generating Robust Parsers Using Island Grammars. WCRE 2001: 13-
free text error log free text source code free text stack trace
  45. 45. Applications to MUD (Slide shows a document interleaving free text with an error log, source code, and a stack trace.)
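To give a flavor of the island idea, here is a minimal Python sketch that treats Java-style stack trace frames as islands and everything else as water; the frame pattern is an assumption made for illustration, not the grammar of any cited tool.

import re

# Island: a Java-style stack trace frame, e.g. "at org.example.Foo.bar(Foo.java:42)"
FRAME = re.compile(r"^\s*at\s+[\w$.]+\([\w$]+\.java:\d+\)")

def split_islands(text):
    # Classify each line as a stack-trace island or free-text water
    islands, water = [], []
    for line in text.splitlines():
        (islands if FRAME.match(line) else water).append(line)
    return islands, water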
  46. 46. Approach Alberto Bacchelli, Tommaso Dal Sasso, Marco D'Ambros, Michele Lanza: Content classification of development emails. ICSE 2012: 375-385 (Slide shows the paper's first page: email content is classified at line level into five categories, i.e., text, junk, code, patch, and stack trace, so that ad hoc analysis techniques can subsequently be applied to each category.)
  47. 47. Alternative approach Lines/paragraphs belonging to different elements contain different proportions of keywords, special characters, etc. Alberto Bacchelli, Marco D'Ambros, Michele Lanza: Extracting Source Code from E-Mails. ICPC 2010: 24-33 (Slide shows the paper's first page: lightweight techniques classify e-mails that hold source code fragments and extract those fragments, evaluated in terms of precision and recall on a manually annotated benchmark from five Java systems.)
  48. 48. Learning hidden languages Luigi Cerulo, Massimiliano Di Penta, Alberto Bacchelli, Michele Ceccarelli, Gerardo Canfora: Irish: A Hidden Markov Model to detect coded information islands in free text. Sci. Comput. Program. 105: 26-43 (2015)
  49. 49. Hidden Markov Models A language category is modeled as an HMM where hidden states are language tokens. The state space is Q = {TXT, SRC}, where TXT = {WORD_TXT, KEY_TXT, ...} and SRC = {WORD_SRC, KEY_SRC, ...}; each state emits the corresponding alphabet symbol (without the TXT or SRC label) with probability 1. If p is the probability of staying in natural-language text and q the probability of staying in source code, the transition probabilities are:
t_kl = P(π_i = l | π_{i−1} = k) · p, if k, l ∈ TXT
t_kl = P(π_i = l | π_{i−1} = k) · q, if k, l ∈ SRC
t_kl = (1 − p) / |SRC|, if k ∈ TXT, l ∈ SRC
t_kl = (1 − q) / |TXT|, if k ∈ SRC, l ∈ TXT
and the emission probabilities are e_kb = 1 if k = b_TXT or k = b_SRC, and 0 otherwise. Heuristics can refine the estimates: for instance, development mailing list messages usually contain at most one source code fragment, so the text-to-code transition probability can be approximated a priori as 1/N, where N is the number of tokens in the message. (Slide shows Fig. 2, the combined source code / natural text island HMM; Fig. 3, a natural-text HMM trained on the Frankenstein novel; and Fig. 4, a source-code HMM trained on PostgreSQL source code.)
  50. 50. Transition between languages… Language HMMs are connected by a language transition matrix into a single HMM. For example, in the natural-language HMM numbers (NUM) are preceded mostly by the dollar symbol ($), indicating currency, and are likely followed by a dot in item enumerations; in the source-code HMM, numbers are part of arithmetic/logic expressions, array indexing, or function argument enumeration. (Slide shows Fig. 2: the natural language HMM and the source code HMM connected by transition probabilities 1−p and 1−q.)
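A minimal numpy sketch of how two token-level sub-HMMs could be stitched together following the scheme above; the matrices and the p, q values are toy numbers invented for illustration, not trained probabilities.

import numpy as np

A_txt = np.array([[0.7, 0.3],
                  [0.4, 0.6]])   # toy transitions between 2 "text" token states
A_src = np.array([[0.5, 0.5],
                  [0.2, 0.8]])   # toy transitions between 2 "code" token states
p, q = 0.95, 0.90                # probability of staying in text / in code

n, m = len(A_txt), len(A_src)
A = np.zeros((n + m, n + m))
A[:n, :n] = A_txt * p            # text -> text
A[n:, n:] = A_src * q            # code -> code
A[:n, n:] = (1 - p) / m          # text -> code, uniform over code states
A[n:, :n] = (1 - q) / n          # code -> text, uniform over text states

assert np.allclose(A.sum(axis=1), 1.0)   # each row is a valid distribution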
  51. 51. Information Retrieval Models
  52. 52. Indexing-based IR (Slide shows the classic indexing-based IR pipeline: documents are indexed and queries are analyzed into representations, which are then compared during query evaluation.)
  53. 53. Document indexing (Slide shows the indexing pipeline: term extraction → stop word removal → stemming → term indexing → algebraic method.) The above process may vary a lot...
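A minimal version of this pipeline in Python, assuming NLTK is installed and its stopword corpus downloaded (nltk.download('stopwords')); every choice below, tokenizer, stop list, stemmer, is a plain default offered as a sketch, not the configuration of any cited study.

import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop = set(stopwords.words("english"))

def index_document(text):
    # Term extraction: lowercase alphabetic tokens
    terms = re.findall(r"[a-z]+", text.lower())
    # Stop word removal, then stemming
    return [stemmer.stem(t) for t in terms if t not in stop]

print(index_document("The parser extracts and stems the terms"))
# -> ['parser', 'extract', 'stem', 'term']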
  54. 54. Tokenization in SE • Often identifiers are compound words
 getPointer, getpointer • Sometimes they contain abbreviations and acronyms
 getPntr, computeIPAddr
  55. 55. Camel Case splitter: getWords → get Words; openIP → open IP. Pros: simple and fast. Cons: convention not adopted everywhere.
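A camel-case splitter fits in a few lines of Python; the regex is one common heuristic (keep runs of capitals such as acronyms together) and is only a sketch.

import re

def camel_split(identifier):
    # Acronym runs, capitalized words, lowercase words, or digit runs
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)

print(camel_split("getWords"))   # -> ['get', 'Words']
print(camel_split("openIP"))     # -> ['open', 'IP']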
  56. 56. Samurai • Assumption 1: A substring composing an identifier is also likely to be used elsewhere • Assumption 2: Given two possible splits, prefer the one that contains identifiers having a high frequency in the program Eric Enslen, Emily Hill, Lori L. Pollock, K. Vijay-Shanker: Mining source code to automatically split identifiers for software analysis. MSR 2009: 71-80
  57. 57. More advanced approaches • TIDIER/TRIS, and Normalize • Allow expanding abbreviations and acronyms
 getPntr → get pointer • Based on more advanced techniques, e.g. speech recognition Dawn Lawrie, David Binkley: Expanding identifiers to normalize source code vocabulary. ICSM 2011: 113-122 Latifa Guerrouj, Massimiliano Di Penta, Giuliano Antoniol and Yann-Gaël Guéhéneuc. TIDIER: an identifier splitting approach using speech recognition techniques. Journal of Software: Evolution and Process. Vol. 25, Issue 6, June 2013, p. 575-599
  58. 58. Stop words for SE • English not enough • Programming language keywords may or may not be included • What about standard API? • Specific, recurring words of some documents (e.g., of bug reports or of emails)
  59. 59. Use of smoothing filters Andrea De Lucia, Massimiliano Di Penta, Rocco Oliveto, Annibale Panichella, Sebastiano Panichella: Applying a smoothing filter to improve IR-based traceability recovery processes: An empirical investigation. Information & Software Technology 55(4): 741-754 (2013) (Slide shows the paper's first page: as a complement and alternative to stop word removal, a smoothing filter removes "noise" from the textual corpus of artifacts to be traced; evaluated on five projects with VSM, LSI, and the Jensen-Shannon similarity model, it significantly improves traceability recovery except for some specific kinds of artifacts, such as tracing test cases to source code.)
  60. 60. Smoothing Filters (Slide shows a diagram: the mean source vector S, computed from source documents s1...sk, is subtracted from each of them to obtain the filtered source set; likewise the mean target vector T is subtracted from target documents t1...tz to obtain the filtered target set.)
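The core of the filter is a single vector subtraction; a numpy sketch follows, with a toy matrix invented for illustration (whether the mean is the right smoothing kernel for your artifacts is exactly the calibration the paper investigates).

import numpy as np

def smooth(doc_term_matrix):
    # Subtract the mean document vector from every document (row)
    return doc_term_matrix - doc_term_matrix.mean(axis=0)

docs = np.array([[2.0, 3.0, 0.0],
                 [1.0, 3.0, 4.0]])
print(smooth(docs))   # -> [[ 0.5  0. -2. ] [-0.5  0.  2. ]]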
  61. 61. From stemming to lexical databases…
  62. 62. • Lexical database where concepts are organised in a semantic network • Organize lexical information in terms of word meaning, rather than word form • Interfaces for Java, Prolog, Lisp, Python, Perl, C# WordNet (http://wordnet.princeton.edu)
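For example, with NLTK's WordNet interface in Python (a sketch assuming the wordnet corpus has been downloaded via nltk.download('wordnet')):

from nltk.corpus import wordnet as wn

# Synsets group words by meaning rather than by form
for synset in wn.synsets("bug")[:3]:
    print(synset.name(), "-", synset.definition())

# Hypernymy: climb to more general concepts for one sense of "bug"
insect_sense = wn.synset("bug.n.01")
print(insect_sense.hypernyms())   # e.g. [Synset('insect.n.01')]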
  63. 63. Word Categories Nouns • Topical hierarchies with lexical inheritance (hyponymy/hypernymy and meronymy/holonymy). Verbs • Entailment relations Adjectives and adverbs • Bipolar opposition relations (antonymy) Function words (articles, prepositions, etc.) • Simply omitted
  64. 64. Application: mining and classifying identifier renamings Generalization/specialization: 
 thrownExceptionSize → thrownExceptionLength Opposite meaning: 
 hasClosingBracket → hasOpeningBracket Whole/part relation: 
 filename → extension Venera Arnaoudova, Laleh Mousavi Eshkevari, Massimiliano Di Penta, Rocco Oliveto, Giuliano Antoniol, Yann-Gaël Guéhéneuc: REPENT: Analyzing the Nature of Identifier Renamings. IEEE Trans. Software Eng. 40(5): 502-532 (2014)
  65. 65. Challenge • Word relations in WordNet do not fully cover the relations relevant in the IT domain • Alternative approaches try to “learn” these relations Jinqiu Yang, Lin Tan: SWordNet: Inferring semantically related words from software context. Empirical Software Engineering 19(6): 1856-1886 (2014)
  66. 66. Term Indexing: Assigning Weights to Terms • Binary Weights: terms weighted as 0 (not appearing) or 1 (appearing) • (Raw) term frequency: terms weighted by the number of occurrences/frequency in the document • tf-idf: want to weight terms highly if they are frequent in relevant documents … BUT ... infrequent in the collection as a whole
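A toy tf-idf computation in Python, using the common log-scaled idf; many weighting variants exist, and this one is only illustrative.

import math

docs = [["bug", "report", "crash"],
        ["bug", "fix", "patch"],
        ["feature", "request"]]

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df)   # rare terms weigh more

print(tf_idf("bug", docs[0], docs))    # ~0.135: frequent in the collection
print(tf_idf("crash", docs[0], docs))  # ~0.366: rare in the collection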
  67. 67. What to use? tf-idf might be useful for text retrieval but… not (necessarily) the best choice to identify keywords, e.g. for classification purposes
  68. 68. Algebraic Models • Aim at providing representations to documents and at comparing/clustering documents • Examples • Vector Space Model • Latent Semantic Indexing (LSI) • latent Dirichlet allocation (LDA)
  69. 69. The Vector Space Model (VSM) • Assume t distinct terms remain after preprocessing • call them index terms or the vocabulary. • These “orthogonal” terms form a vector space. • Dimension = t = |vocabulary| • Each term, i, in a document or query, j, is given a real-valued weight, wij. • Both documents and queries are expressed as t-dimensional vectors: dj = (w1j, w2j, …, wtj)
  70. 70. VSM Representation (Slide shows the documents and query as vectors in a 3-D term space with axes T1, T2, T3.) D1 = 2T1 + 3T2 + 5T3; D2 = 3T1 + 7T2 + T3; Q = 0T1 + 0T2 + 2T3
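With those example vectors, cosine similarity (the same measure computed later by the R cosine() function) shows Q is much closer to D1 than to D2; a few lines of Python:

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

D1, D2, Q = (2, 3, 5), (3, 7, 1), (0, 0, 2)
print(cosine(Q, D1))   # ~0.81
print(cosine(Q, D2))   # ~0.13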
  71. 71. Latent Semantic Indexing (LSI) Overcomes limitations of VSM (synonymy, polysemy, considering words occurring in documents as independent events) Shifting from a document-term space towards a document-concept space Dumais, S. T., Furnas, G. W., Landauer, T. K. and Deerwester, S. (1988), Using latent semantic analysis to improve information retrieval. In Proceedings of CHI'88: Conference on Human Factors in Computing, New York: ACM, 281-285.
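Computationally, LSI boils down to a truncated SVD of the term-document matrix; a minimal numpy sketch with an invented toy matrix and k = 2 concepts:

import numpy as np

A = np.array([[1.0, 0.0, 1.0],    # rows = terms
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])   # columns = documents

k = 2                             # number of latent concepts to keep
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T   # documents in concept space
print(doc_vectors)                # one row per document, k coordinates each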
  72. 72. latent Dirichlet Allocation (LDA) • Probabilistic model • Documents treated as a distributions of topics • Topics are distributions of words D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet Allocation,” The Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
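A sketch with the gensim library, assuming it is installed; the corpus is invented, and the parameter values are placeholders that need the calibration discussed a few slides below.

from gensim import corpora, models

texts = [["bug", "crash", "stack", "trace"],
         ["feature", "request", "theme"],
         ["bug", "fix", "patch", "crash"]]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# num_topics (k), alpha, eta (the beta prior), and passes all need tuning
lda = models.LdaModel(corpus, id2word=dictionary,
                      num_topics=2, alpha="auto", passes=10)
print(lda.print_topics())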
  73. 73. Applications Alessandra Gorla, Ilaria Tavecchia, Florian Gross, Andreas Zeller: Checking app behavior against app descriptions. ICSE 2014: 1025-1035 (Slide shows the paper's first page: CHABADA clusters Android apps by their description topics and identifies outliers in each cluster with respect to their API usage, e.g. a "weather" app that sends messages; applied to 22,500+ apps, it flagged 56% of novel malware without requiring known malware patterns.)
  74. 74. Need to properly calibrate the techniques • Many IR techniques require a careful calibration of many parameters • LSI: number of concepts (k) • LDA: k, α, β, number of iterations (n) • Without that, performance can be suboptimal
  75. 75. … and beyond that… • Should I use stop word removal? Which one? • Stemming? • Which weighting scheme? tf? tf-idf? others? • Which technique? VSM? LSI? LDA?
  76. 76. Approaches (I) Sugandha Lohar, Sorawit Amornborvornwong, Andrea Zisman, Jane Cleland-Huang: Improving trace accuracy through data-driven configuration and composition of tracing features. ESEC/SIGSOFT FSE 2013: 378-388 (Slide shows the paper's first page: the trace retrieval infrastructure is configured at runtime via a machine-learning search over a feature model of available tracing techniques, starting from a training set of validated trace links, to optimize trace quality.)
  77. 77. Discussion • Supervised, task specific • Model parameters varied through Genetic Algorithms • You train the model on a training set • Maximize precision/recall for a given task
  78. 78. Approaches (II) Annibale Panichella, Bogdan Dit, Rocco Oliveto, Massimiliano Di Penta, Denys Poshyvanyk, Andrea De Lucia: How to effectively use topic models for software engineering tasks? An approach based on genetic algorithms. ICSE 2013: 522-531 (Slide shows the paper's first page: LDA-GA uses genetic algorithms to determine a near-optimal LDA configuration for three SE tasks: traceability link recovery, feature location, and software artifact labeling.)
  79. 79. Choose a random population of LDA parameters LDA Determine fitness of each chromosome (individual) Fitness = silhouette coefficient
  80. 80. Choose a random population of LDA parameters LDA Determine fitness of each chromosome (individual) Select next generation Crossover, mutation
  81. 81. Choose a random population of LDA parameters LDA Determine fitness of each chromosome (individual) Select next generation Crossover, mutation Next generation Best model for first generation Best model for last generation
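To make the loop above concrete, here is a minimal R sketch of the idea (not the LDA-GA implementation from the paper): it assumes a document-term matrix dtm, recent versions of the GA, topicmodels, and cluster packages, and illustrative parameter ranges; the dominant topic of each document is treated as its cluster label when computing the silhouette coefficient.

library(GA)           # genetic algorithm
library(topicmodels)  # LDA with Gibbs sampling
library(cluster)      # silhouette coefficient

# Fitness of one chromosome x = (k, alpha, beta, iterations); dtm is assumed
fitness <- function(x) {
  k <- max(2, round(x[1]))
  model <- LDA(dtm, k = k, method = "Gibbs",
               control = list(alpha = x[2], delta = x[3], iter = round(x[4])))
  labels <- topics(model)                 # dominant topic = cluster label
  if (length(unique(labels)) < 2) return(-1)
  d <- dist(posterior(model)$topics)      # distances in the topic space
  mean(silhouette(labels, d)[, "sil_width"])
}

res <- ga(type = "real-valued", fitness = fitness,
          lower = c(2, 0.01, 0.01, 100),  # k, alpha, beta, iterations
          upper = c(50, 1, 1, 500),
          popSize = 20, maxiter = 30)
res@solution                              # near-optimal LDA configuration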
  82. 82. Supporting technology
  83. 83. R (www.r-project.org) • Integrated suite of software facilities for data manipulation, calculation and graphical display • VSM and LSI implemented in the lsa package • topicmodels and lda also available • many other text processing packages…
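For instance, assuming a standard R installation:

install.packages("lsa")          # VSM and LSI
install.packages("topicmodels")  # LDA
library(lsa)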
  84. 84. Creating a term-document matrix from a directory: tm <- textmatrix("/Users/Max/mydocs") Syntax: textmatrix(mydir, stemming=FALSE, language="english", minWordLength=2, minDocFreq=1, stopwords=NULL, vocabulary=NULL)
  85. 85. Applying a weighting schema • lw_tf(m): returns the unmodified n × m matrix • lw_bintf(m): returns binary values of the n × m matrix • gw_normalisation(m): returns a normalised n × m matrix • gw_gfidf(m): returns the global frequency multiplied by the inverse document frequency (idf)
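Local and global weights can be combined by element-wise multiplication; for example, a tf-idf weighting of the matrix tm built above, using functions from the lsa package:

tm_w <- lw_logtf(tm) * gw_idf(tm)  # log term frequency times inverse document frequency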
  86. 86. Applying LSI • Decompose tm, reducing the concept space to 2 dimensions: lspace <- lsa(tm, dims=2) • Convert lspace back into a textmatrix: m2 <- as.textmatrix(lspace)
  87. 87. Computing the cosine cosine(v1,v2) for two vectors, or cosine(matrix) for all pairwise column similarities
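For example, given the reduced matrix m2 from the previous slide (column indices are illustrative):

cosine(m2[, 1], m2[, 2])  # similarity between the first two documents
sim <- cosine(m2)         # all pairwise document similarities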
  88. 88. Application: Identifying duplicate bug reports Problem: people often report bugs someone has already reported Solution (in a nutshell): • Compute textual similarity among bug reports • Match (where available) stack traces Xiaoyin Wang, Lu Zhang, Tao Xie, John Anvik, Jiasu Sun: An approach to detecting duplicate bug reports using natural language and execution information. ICSE 2008: 461-470
  89. 89. Example (Eclipse) Bug 123 - Synchronize View: files nodes in tree should provide replace with action. This scenario happens very often to me - I change a file in my workspace to test something - I do a release - In the release I encounter that there is no need to release the file - I want to replace it from the stream but a corresponding action is missing. Now I have to go to the navigator to do the job. Bug 4934 - DCR: Replace from Stream in Release mode. Sometimes I touch files, without intention. These files are then unwanted outgoing changes. To fix it, I have to go back to the package view/navigator and replace the file with the stream. After this a new synchronize is needed, for that I have a clean view. It would be very nice to have the 'replace with stream' in the outgoing tree. Or 'Revert'? t <- textmatrix("/Users/mdipenta/docs", stemming=TRUE) cosine(t) d1 d2 d1 1.0000000 0.6721491 d2 0.6721491 1.0000000
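A possible way to turn this into duplicate candidates, sketched in R on top of the code above (the directory path is illustrative):

t <- textmatrix("/Users/mdipenta/docs", stemming = TRUE)
sim <- cosine(t)
diag(sim) <- 0                          # ignore self-similarity
candidates <- apply(sim, 2, which.max)  # most similar other report, per report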
  90. 90. Other tools Lucene http://lucene.apache.org RapidMiner www.rapidminer.com Mallet http://mallet.cs.umass.edu
  91. 91. Choosing the IR model
  92. 92. Careful choice • Unless you want a GA to choose it for you… • The most exotic model is not necessarily the best one • Simple models have advantages (e.g. scalability) • Experiment with multiple models
  93. 93. IR methods: Pros and Cons ✔ Can be used to match similar (related) discussions ✔ Some techniques deal with issues such as homonymy and polysemy ✘ Software-related discussions can be very noisy ✘ Noise differs across artifacts (bug reports, commit notes, code) ✘ You have less control than with regular expressions ✘ Do not capture dependencies in sentences, e.g., "this bug will not be fixed"
  94. 94. Natural Language Parsing
  95. 95. Technology Stanford Natural Language Parser http://nlp.stanford.edu Dependency Visualizer: DependenSee Grammar Browser: GrammarScope
  96. 96. (ROOT (S (NP (DT this) (NN bug)) (VP (MD will) (RB not) (VP (VB be) (VP (VBN fixed)))))) det(bug-2, this-1) nsubjpass(fixed-6, bug-2) aux(fixed-6, will-3) neg(fixed-6, not-4) auxpass(fixed-6, be-5) root(ROOT-0, fixed-6) Example this bug will not be fixed DT NN MD RB VB VBN det aux nsubjpass neg auxpass
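Part-of-speech tags like those above can also be obtained directly from R, e.g. with the openNLP package (a sketch based on Apache OpenNLP rather than the Stanford parser; it assumes the openNLPdata models are installed):

library(NLP)
library(openNLP)
s <- as.String("this bug will not be fixed")
a <- annotate(s, list(Maxent_Sent_Token_Annotator(),
                      Maxent_Word_Token_Annotator()))
a <- annotate(s, Maxent_POS_Tag_Annotator(), a)   # add POS tags to the tokens
words <- subset(a, type == "word")
sapply(words$features, `[[`, "POS")               # "DT" "NN" "MD" "RB" "VB" "VBN"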
  97. 97. Is it applicable to analyzing source code identifiers? • It could be, once identifiers are split into compound words • Useful for part-of-speech tagging • Results may not be very precise, because identifiers are not natural-language sentences
  98. 98. Rules (Abebe and Tonella, 2010) Surafel Lemma Abebe, Paolo Tonella: Natural Language Parsing of Program Element Names for Concept Extraction. ICPC 2010: 156-159
  99. 99. Example: OpenStructure open structure Let’s insert an artificial subject and an article: subjects open the structure (ROOT (NP (JJ open) (NN structure))) amod(structure-2, open-1) root(ROOT-0, structure-2) (ROOT (S (NP (NNS subjects)) (VP (VBP open) (NP (DT the) (NN structure))))) nsubj(open-2, subjects-1) root(ROOT-0, open-2) det(structure-4, the-3) dobj(open-2, structure-4)
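The splitting step itself can be as simple as a regular expression over case changes; a minimal sketch (split_identifier is a hypothetical helper, not from the paper):

split_identifier <- function(id) {
  spaced <- gsub("([a-z0-9])([A-Z])", "\\1 \\2", id)  # break at lower/upper boundaries
  tolower(unlist(strsplit(spaced, "[ _]+")))
}
split_identifier("OpenStructure")  # "open" "structure"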
  100. 100. Applications Rahul Pandita, Xusheng Xiao, Hao Zhong, Tao Xie, Stephen Oney, Amit M. Paradkar: Inferring method specifications from natural language API descriptions. ICSE 2012: 815-825 [slide embeds the first page of the paper, which infers formal specifications (code contracts) from the natural language text of API documents]
  101. 101. Applications S. Panichella, A. Di Sorbo, E. Guzman, C. A. Visaggio, G. Canfora, H. C. Gall: How Can I Improve My App? Classifying User Reviews for Software Maintenance and Evolution. ICSME 2015 [slide embeds the first page of the paper, which combines Natural Language Processing, Text Analysis, and Sentiment Analysis to classify app reviews into maintenance- and evolution-relevant categories] @ICSME 2015 - Thursday, 13.50 - Mobile applications
  102. 102. Natural Language Parsing: Pros and Cons ✔ Can identify dependencies between words ✔ Accurate part-of-speech analysis, better than lexical databases ✘ Application to software entities (e.g. identifiers) may be problematic ✘ Processing the data is difficult ✘ NL parsing is hard, especially when the corpus is noisy
  103. 103. Summary
  104. 104. What technique?
  105. 105. What technique? Need to match precise elements
  106. 106. What technique? Separate structured and unstructured data
  107. 107. What technique? Account for the whole textual corpus of a document
  108. 108. What technique? Compare documents
  109. 109. What technique? Identify “topics” discussed in documents
  110. 110. What technique? Not just bag of words, but also dependencies, parts of speech, negations
  111. 111. Take-aways • Use structure when available • Carefully (empirically) choose the most suitable technique • Out-of-the-box configuration might not work for you • Manual analysis cannot be replaced
  112. 112. Techniques: Natural Language Parsing, for example with the Stanford Natural Language Parser (http://nlp.stanford.edu). Example: "this bug will not be fixed" (DT NN MD RB VB VBN) parses to (ROOT (S (NP (DT this) (NN bug)) (VP (MD will) (RB not) (VP (VB be) (VP (VBN fixed)))))) with dependencies det(bug-2, this-1), nsubjpass(fixed-6, bug-2), aux(fixed-6, will-3), neg(fixed-6, not-4), auxpass(fixed-6, be-5), root(ROOT-0, fixed-6)
Matching issue ids: a simple yet powerful technique to link commit notes to issue reports, e.g. "fix 367920 setting pop3 messages as junk/not junk ignored when message quarantining turned on sr=mscott" matched with $note=~/Issue (\d+)/i || $note=~/Issue number:\s+(\d+)/i || $note=~/Defect\s+(\d+)/i || $note=~/Fix\s+(\d+)/i
VSM Representation: documents and queries as vectors in term space, e.g. D1 = 2T1 + 3T2 + 5T3, D2 = 3T1 + 7T2 + T3, Q = 0T1 + 0T2 + 2T3
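The same issue-id heuristic can be sketched in R (the pattern list is illustrative, not the original one):

note <- "fix 367920 setting pop3 messages as junk/not junk ignored when message quarantining turned on sr=mscott"
m <- regexec("(?:issue(?: number:)?|defect|fix)\\s*#?\\s*([0-9]+)",
             note, ignore.case = TRUE, perl = TRUE)
regmatches(note, m)[[1]][2]  # "367920", the linked issue id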
