Software Analytics: Towards Software Mining that Matters (2014)
1. Software Analytics:
Towards Software Mining that Matters
Tao Xie
Department of Computer Science
University of Illinois at Urbana-Champaign, USA
taoxie@illinois.edu
In Collaboration with Microsoft Research
2. Machine Learning that Matters
“The basic argument in her paper is that machine learning
might be in danger of losing its impact because the
community as a whole has become quite self-referential.
People are probably solving real-world problems using ML
methods, but there is little sharing of these results within
the community. Instead, people focus on existing
benchmarks which might have originally had some
connection to real-world problems which has been long
forgotten, however.”
“She proposes a number of tasks like $100M solved
through ML based decision making or a human life saved
through a diagnosis or an intervention recommended by
an ML system to get ML back on track.”
ICML’12
http://icml.cc/2012/papers/298.pdf
http://blog.mikiobraun.de/2012/06/is-machine-learning-losing-impact.html
3. 2012 NSF Workshop on Formal Methods
• Goal: to identify the future directions in research in
formal methods and its transition to industrial
practice.
• Success examples mentioned by the attendees
– SLAM/SDV
– ASTREE
– SMT-based tools
– …
http://goto.ucsd.edu/~rjhala/NSFWorkshop/
4. “What Happened to the Promise
of Software Tools?” – Jim Larus
http://www.srl.inf.ethz.ch/workshop2014/eth-larus.pdf
https://www.youtube.com/watch?v=kO9OYnkeRTM
5. Software Analytics
Software analytics is to enable software
practitioners to perform data exploration and
analysis in order to obtain insightful and
actionable information for data-driven tasks
around software and services.
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software
Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
6. Software Analytics
http://research.microsoft.com/en-us/groups/sa/
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
12. Performance debugging in the large
[Diagram] Components: Trace collection, Network, Trace Storage, Trace analysis, Pattern Matching, Problematic Pattern Repository, Bug Database, Bug filing, Bug update
Callouts on the diagram:
• Key to issue discovery
• Bottleneck of scalability
• How many issues are still unknown?
• Which trace file should I investigate first?
13. Technical highlights
• Data mining for software domain
– Discovery of problematic execution patterns formulated as
callstack mining & clustering
– Domain knowledge incorporated systematically
• Interactive performance analysis system
– Parallel mining infrastructure based on HPC + MPI
– Visualization aided interactive exploration
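To make "callstack mining & clustering" concrete, here is a minimal sketch. It is not the algorithm of the tool described on these slides; it swaps in plain counting of frequent contiguous frame sequences, and every frame name and threshold is a hypothetical chosen for illustration.

```python
from collections import Counter

def contiguous_patterns(stack, min_len=2, max_len=3):
    """Yield every contiguous run of frames of length min_len..max_len."""
    for size in range(min_len, max_len + 1):
        for start in range(len(stack) - size + 1):
            yield tuple(stack[start:start + size])

def mine_patterns(stacks, min_support=2):
    """Count how many stacks contain each pattern; keep the frequent ones."""
    support = Counter()
    for stack in stacks:
        # set() so a pattern counts once per stack, however often it repeats
        support.update(set(contiguous_patterns(stack)))
    return {p: n for p, n in support.items() if n >= min_support}

def cluster_by_top_pattern(stacks, patterns):
    """Group stacks under the longest, then most frequent, pattern they contain."""
    clusters = {}
    for stack in stacks:
        found = [p for p in set(contiguous_patterns(stack)) if p in patterns]
        key = max(found, key=lambda p: (len(p), patterns[p], p)) if found else None
        clusters.setdefault(key, []).append(stack)
    return clusters

# Hypothetical call stacks (outermost frame first); all names are invented.
STACKS = [
    ["ui!OnClick", "fs!ReadFile", "av!ScanBuffer", "disk!Wait"],
    ["ui!OnPaint", "fs!ReadFile", "av!ScanBuffer", "disk!Wait"],
    ["ui!OnClick", "net!Send", "net!WaitAck"],
]

PATTERNS = mine_patterns(STACKS)
CLUSTERS = cluster_by_top_pattern(STACKS, PATTERNS)
```

Stacks that fall under the `None` key match no frequent pattern. In a setting like the one on the slides, those would be natural candidates for the "which trace should I investigate first?" question.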
14. Impact: Debugging Productivity Boost
“We believe that the MSRA tool is highly valuable and much more
efficient for mass trace (100+ traces) analysis. For 1000 traces, we
believe the tool saves us 4-6 weeks of time to create new signatures,
which is quite a significant productivity boost.”
Highly effective new issue discovery on Windows mini-hang
Continuous impact on future Windows versions
16. XIAO: Code Clone Analysis
• Motivation
– Copy-and-paste is a common developer behavior
– A real tool widely adopted internally and externally
• XIAO enables code clone analysis through
– High tunability
– High scalability
– High compatibility
– High explorability
17. High tunability – what you tune is what you get
• Intuitive similarity metric
– Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
– Statement similarity
– % of inserted/deleted/modified statements
– Balance between code structure and disordered statements
for (i = 0; i < n; i++) {
    a++;
    b++;
    c = foo(a, b);
    d = bar(a, b, c);
    e = a + c;
}

for (i = 0; i < n; i++) {
    c = foo(a, b);
    a++;
    b++;
    d = bar(a, b, c);
    e = a + d;
    e++;
}
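The tunable knobs above can be sketched as follows. This is an illustrative approximation, not XIAO's actual metric: statement similarity is taken from Python's difflib, statements are paired greedily regardless of order, and the parameter names `min_stmt_sim` and `max_changed` are hypothetical.

```python
from difflib import SequenceMatcher

def statement_similarity(a, b):
    """Character-level similarity of two statements, ignoring whitespace."""
    return SequenceMatcher(None, "".join(a.split()), "".join(b.split())).ratio()

def changed_fraction(first, second, min_stmt_sim=0.8):
    """Fraction of statements left unmatched after greedy, order-insensitive pairing."""
    remaining = list(second)
    matched = 0
    for stmt in first:
        best = max(remaining, key=lambda s: statement_similarity(stmt, s), default=None)
        if best is not None and statement_similarity(stmt, best) >= min_stmt_sim:
            remaining.remove(best)
            matched += 1
    total = max(len(first), len(second))
    return (total - matched) / total if total else 0.0

def is_clone(first, second, min_stmt_sim=0.8, max_changed=0.3):
    """Report a clone when few enough statements were inserted, deleted, or modified."""
    return changed_fraction(first, second, min_stmt_sim) <= max_changed

# Loop bodies of the two snippets on this slide.
FIRST = ["a++;", "b++;", "c = foo(a, b);", "d = bar(a, b, c);", "e = a + c;"]
SECOND = ["c = foo(a, b);", "a++;", "b++;", "d = bar(a, b, c);", "e = a + d;", "e++;"]
```

With the default thresholds the pair is reported as a clone: the reordered statements still pair up, `e = a + d;` counts as a modified match, and only `e++;` is left over. Raising `min_stmt_sim` to 0.9 rejects the modified statement and the pair is no longer reported, which is the "what you tune is what you get" behavior in miniature.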
18. High explorability
[Screenshot 1, numbered callouts]
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
[Screenshot 2, numbered callouts]
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
19. Scenarios & Solutions
Quality gates at milestones
• Architecture refactoring
• Code clone clean up
• Bug fixing
Post-release maintenance
• Security bug investigation
• Bug investigation for sustained engineering
Development and testing
• Checking for similar issues before check-in
• Reference info for code review
• Supporting tool for bug triage
Online code clone search
Offline code clone analysis
20. Impact: Benefiting developer community
Available in Visual Studio 2012 RC
Searching for similar snippets to fix a bug once
Finding refactoring opportunities
21. Impact: More secure Microsoft products
Code Clone Search service integrated into workflow of Microsoft Security Response Center
Over 590 million lines of code indexed across multiple products
Real security issues proactively identified and addressed
22. Example – MS Security Bulletin MS12-034
Combined Security Update for Microsoft Office, Windows, .NET Framework, and
Silverlight, published: Tuesday, May 08, 2012
Involves 3 publicly disclosed and 7 privately reported vulnerabilities. One of them was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
Insufficient bounds check within the font parsing subsystem of win32k.sys
Cloned copies in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal Viewer
Microsoft TechNet blog post about this bulletin:
However, we wanted to be sure to address the vulnerable code wherever it appeared
across the Microsoft code base. To that end, we have been working with Microsoft
Research to develop a “Cloned Code Detection” system that we can run for every
MSRC case to find any instance of the vulnerable code in any shipping product. This
system is the one that found several of the copies of CVE-2011-3402 that we are
now addressing with MS12-034.
24. Motivation
• Online services are increasingly popular & important
• High service quality is the key
Incident Management (IcM) is a critical task to
assure service quality
25. Incident Management: Workflow
Detect a service issue → Alert On-Call Engineers (OCEs) → Investigate the problem → Restore the service → Fix root cause via postmortem analysis
26. SAS: Incident management of online services
SAS was developed and deployed to reduce MTTR (Mean Time To Restore)
by automatically analyzing monitoring data
Design Principles of SAS
• Automating Analysis
• Handling Heterogeneity
• Accumulating Knowledge
• Supporting human-in-the-loop (HITL)
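For reference, MTTR as used on this slide is the average time from detecting an incident to restoring the service. A minimal sketch of the arithmetic follows; the incident timestamps are invented for illustration and are not data from SAS or Service X.

```python
from datetime import datetime, timedelta

def mean_time_to_restore(incidents):
    """Average of (restored - detected) over all incidents."""
    durations = [restored - detected for detected, restored in incidents]
    return sum(durations, timedelta()) / len(durations)

# Hypothetical (detected, restored) pairs.
INCIDENTS = [
    (datetime(2014, 1, 6, 9, 0), datetime(2014, 1, 6, 9, 40)),     # 40 minutes
    (datetime(2014, 1, 9, 22, 15), datetime(2014, 1, 9, 23, 45)),  # 90 minutes
    (datetime(2014, 1, 20, 3, 30), datetime(2014, 1, 20, 3, 50)),  # 20 minutes
]

MTTR = mean_time_to_restore(INCIDENTS)  # timedelta of 50 minutes
```

Because the "investigate" step sits between detection and restoration, shortening it through automated analysis lowers every term in this average.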
28. Industry Impact of SAS
Deployment
• SAS deployed to worldwide datacenters for Service X (serving hundreds of millions of users) since June 2011
• OCEs now heavily depend on SAS
Usage
• SAS helped successfully diagnose ~76% of the service incidents in which it was used
30. Code Hunt Competition for Students
https://www.codehunt.com/
Precursor: http://www.pex4fun.com/
31. A Fun and Engaging Game – Win by Writing Code
Supports Java and C#
Adapts to competitions as well as individual play
Users: 1,181,152
User Programs: 7,079,497
WWW.CODEHUNT.COM
32. Behind the Scenes of a Coding Duel
Secret Implementation
class Secret {
    public static int Puzzle(int x) {
        if (x <= 0) return 1;
        return x * Puzzle(x - 1);
    }
}
Player Implementation
class Player {
    public static int Puzzle(int x) {
        return x;
    }
}
class Test {
    // Throws when the player's result differs from the secret's for input x.
    public static void Driver(int x) {
        if (Secret.Puzzle(x) != Player.Puzzle(x))
            throw new Exception("Mismatch");
    }
}
behavior: Secret Impl == Player Impl
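The duel can be mimicked in a few lines. Code Hunt builds on Pex, which uses dynamic symbolic execution to find inputs on which the two implementations differ. The sketch below swaps in brute-force enumeration over a small, arbitrarily chosen input range instead, and `player_puzzle_fixed` is a hypothetical winning answer that does not appear on the slide.

```python
def secret_puzzle(x):
    """Secret implementation from the slide: factorial, with 1 for x <= 0."""
    if x <= 0:
        return 1
    return x * secret_puzzle(x - 1)

def player_puzzle(x):
    """The player's first attempt from the slide."""
    return x

def player_puzzle_fixed(x):
    """Hypothetical later attempt that matches the secret's behavior."""
    result = 1
    for factor in range(2, x + 1):
        result *= factor
    return result

def find_mismatch(secret, player, candidates):
    """Return the first input on which the implementations disagree, else None."""
    for x in candidates:
        if secret(x) != player(x):
            return x
    return None

CANDIDATES = range(-3, 8)  # assumed search range, for illustration only

first_hint = find_mismatch(secret_puzzle, player_puzzle, CANDIDATES)
after_fix = find_mismatch(secret_puzzle, player_puzzle_fixed, CANDIDATES)
```

A mismatching input is the feedback the player receives. The duel is won once no input distinguishes the two implementations, which enumeration can only confirm for the range it tries.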
33. Experience Reports on Successful Tool Transfer
• Nikolai Tillmann, Jonathan de Halleux, and Tao Xie. Transferring an Automated Test Generation Tool to Practice: From Pex to Fakes and Code Digger. In Proceedings of ASE 2014, Experience Papers. http://web.engr.illinois.edu/~taoxie/publications/ase14-pexexperiences.pdf
• Jian-Guang Lou, Qingwei Lin, Rui Ding, Qiang Fu, Dongmei Zhang, and Tao Xie. Software Analytics for Incident Management of Online Services: An Experience Report. In Proceedings of ASE 2013, Experience Papers. http://web.engr.illinois.edu/~taoxie/publications/ase13-sas.pdf
• Dongmei Zhang, Shi Han, Yingnong Dang, Jian-Guang Lou, Haidong Zhang, and Tao Xie. Software Analytics in Practice. IEEE Software, Special Issue on the Many Faces of Software Analytics, 2013. http://web.engr.illinois.edu/~taoxie/publications/ieeesoft13-softanalytics.pdf
• Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proceedings of ACSAC 2012. http://web.engr.illinois.edu/~taoxie/publications/acsac12-xiao.pdf
34. Ex: Human Consumption of Tool Outputs
• Developer: Your tool generated “0”
• Pex team: What did you expect?
• Developer: Marc
Invariant candidates:
this.getPrice() > 0
this.getPrice() >= 0
http://www.agitar.com/ http://research.microsoft.com/projects/pex/
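One reading of the invariant example above: when no test run observes a price of exactly zero, both candidates fit the data equally well, so the tool cannot choose between them and a person has to. The sketch below illustrates that reading only; it is not how any of the tools named on this slide infer invariants, and the observed prices are invented.

```python
# Candidate invariants from the slide, expressed as predicates over a price.
CANDIDATES = {
    "this.getPrice() > 0": lambda price: price > 0,
    "this.getPrice() >= 0": lambda price: price >= 0,
}

def surviving_candidates(observed_prices):
    """Keep the candidates that hold for every observed value."""
    return [name for name, holds in CANDIDATES.items()
            if all(holds(p) for p in observed_prices)]

without_zero = surviving_candidates([5, 12, 99])  # both candidates survive
with_zero = surviving_candidates([0, 5, 12])      # only ">= 0" survives
```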
35. Q & A
Contact: taoxie@illinois.edu
http://research.microsoft.com/en-us/groups/sa/
http://www.cs.illinois.edu/homes/taoxie/
Supported in part by a Microsoft Research Award, NSF grants CCF-1349666, CNS-1434582, CCF-1434596, CCF-1434590,
CNS-1439481, and the USA National Security Agency (NSA) Science of Security Lablet.