This document discusses whether big data analysis is more of a "systems" task or "human" task. It presents research showing that software defect prediction, even when conducted by top experts using the same datasets and algorithms over many years, shows little improvement and high variability. This suggests that human factors like biases are important. The document proposes using data mining on source code and social media to classify developers by expertise and identify groups who could share knowledge to reduce defects. It outlines an initial approach using parsers, classifiers like Naive Bayes to distinguish novices from experts, and seeking larger datasets from partners. The goal is to strengthen the "human" aspects of big data analysis.
7. Premise of Big Data
Analysis is a “systems” task?
• Better conclusions =
same algorithms + more
data + more cpu
• If so, then …
– No role for human error
– All insight is auto-generated
from CPUs.
Analysis is a “human” task?
• Current results on “software
analytics”
– A human-intensive process
7
8. Q: Is Big Data a “Systems” or “Human”-task?
A: Yes
8
9. Code used in my
last paper
(1100 LOC of Python
calling scikitlearn)
9
10. Use a Higher-Level languages?
• ECL solves this problem?
• But if you can write it quick,
– you can write it wrong, quick.
10
11. Is this really a problem?
• Q: What would we expect
to see if…
– Top experts, publishing in top
journals
– Many of the same data sets
– 8 years of trying
• A:
– Perhaps some upward
progress
– Perhaps a little less variance
11
So, what do
we see?
12. • Software analytics
– Defect prediction
– Many of the same learners,
– Many of the same data sets
• 42 papers,
top journals,
• 23 author groups
• 2002 to 2010
• Y-axis measures
mean performance
12
Researcher Bias: The Use of Machine Learning in Software Defect Prediction, Martin Shepperd,
David Bowes, and Tracy Hall, IEEE TRANS on Soft. Eng. , 40(6), JUNE 2014
14. A little theory
• James D. Herbsleb, CMU
• Socio-Technical Coordination
• A predictor for higher defects:
– Groups of programmers working
on similar functions then,
– but do not sharing that expertise
14
15. Q: How to find expertise groups
within the HPCC community?
A: using data mining
15
16. Static features and commit history
can act as a cue for expertise
● Our motivation
o “relation between embodiment and language
acquisition by locating the ‘minimal set of
necessary features’ that enable language of any
kind to be learned” - The Philosophy of Expertise
16
17. Software analytics results:
learn predictors for expertise
● “...counts of the cumulative number of different
developers changing a file over its lifetime can help
to improve defect predictions…”[1]
● “Quantify person's experience with a part of code
using change history of the code”[2]
● “RevFinder, a file location-based code-reviewer
recommendation approach” [3]
● “30% of its code entities has more than 0.3 of
similarity with at least one developer vocabulary”
[4]
17
[1] Ostrand, Thomas J., Elaine J. Weyuker, and Robert M. Bell.
"Programmer-based fault prediction." Proceedings of the 6th
International Conference on Predictive Models in Software Engineering.
ACM, 2010.
[2] Mockus, Audris, and James D. Herbsleb. "Expertise browser: a
quantitative approach to identifying expertise." Proceedings of the
24th international conference on software engineering. ACM, 2002.
[3] Thongtanunam, Patanamon, et al. "Who should review my code? A
file location-based code-reviewer recommendation approach for
Modern Code Review."Software Analysis, Evolution and Reengineering
(SANER), 2015 IEEE 22nd International Conference on. IEEE, 2015.
[4] Santos, Katyusco de F., Dalton DS Guerrero, and Jorge CA de
Figueiredo. "Using Developers Contributions on Software Vocabularies
to Identify Experts."Information Technology-New Generations (ITNG),
2015 12th International Conference on. IEEE, 2015.
18. Q: And what data mining suite will we
use to mine data about programmers?
• A: need you ask?
18
23. Data processing
1. Github repos (for code) ➔ Social media(for years of work)
2. Static code analysis: frequency counts of AST features
(e.g. count loops, returns, var comparisons, map, etc )
3. Bayes classifier
Early
career
Later career
24. Classification
- Features: Nodes of AST
- Algorithms Used: Simple Cart, Random
Forest, Naive Bayes etc.
- Can distinguish expert from novice
programmers
•precision= 78% early career
•precision = 74% later career
* Using Weka
25. Current status
The good news
• Can auto-find groups of
better programmers
• Can do that for very large
data sets
– The ECL advantages
The other news
• Seeking larger data sets
• Talking to HackerRank
• Looking at ways to
instrument the HPCC
forums
– Matchmaker tools
– Affinity groups
25