Big Data: the weakest link

Vivek Nair, Tim Menzies
{vivekaxl,tim.menzies}@gmail.com
HPCC Eng. Summit - Sept 29, 2015

Premise of Big Data
Analysis is a “systems” task?
• Better conclusions = same algorithms + more data + more CPU
• If so, then …
– No role for human error
– All insight is auto-generated from CPUs
Analysis is a “human” task?
• Current results on “software analytics”
– A human-intensive process

Q: Is Big Data a “Systems” or “Human”-task?
A: Yes

Code used in my last paper
(1,100 LOC of Python calling scikit-learn)

Use a higher-level language?
• ECL solves this problem?
• But if you can write it quick,
– you can write it wrong, quick.

Is this really a problem?
• Q: What would we expect to see if…
– Top experts, publishing in top journals
– Many of the same data sets
– 8 years of trying
• A:
– Perhaps some upward progress
– Perhaps a little less variance

So, what do we see?

• Software analytics
– Defect prediction
– Many of the same learners
– Many of the same data sets
• 42 papers, top journals
• 23 author groups
• 2002 to 2010
• Y-axis measures mean performance
Martin Shepperd, David Bowes, and Tracy Hall. "Researcher Bias: The Use of Machine Learning in Software Defect Prediction." IEEE Transactions on Software Engineering, 40(6), June 2014.

http://fivethirtyeight.com/features/science-isnt-broken/

A little theory
• James D. Herbsleb, CMU
• Socio-Technical Coordination
• A predictor for higher defects:
– Groups of programmers work on similar functions,
– but do not share that expertise

Q: How to find expertise groups within the HPCC community?
A: Using data mining

Static features and commit history can act as a cue for expertise
● Our motivation
o “relation between embodiment and language acquisition by locating the ‘minimal set of necessary features’ that enable language of any kind to be learned” - The Philosophy of Expertise

Software analytics results: learn predictors for expertise
● “...counts of the cumulative number of different developers changing a file over its lifetime can help to improve defect predictions…” [1]
● “Quantify person's experience with a part of code using change history of the code” [2]
● “RevFinder, a file location-based code-reviewer recommendation approach” [3]
● “30% of its code entities has more than 0.3 of similarity with at least one developer vocabulary” [4]
[1] Ostrand, Thomas J., Elaine J. Weyuker, and Robert M. Bell. "Programmer-based fault prediction." Proceedings of the 6th International Conference on Predictive Models in Software Engineering. ACM, 2010.
[2] Mockus, Audris, and James D. Herbsleb. "Expertise browser: a quantitative approach to identifying expertise." Proceedings of the 24th International Conference on Software Engineering. ACM, 2002.
[3] Thongtanunam, Patanamon, et al. "Who should review my code? A file location-based code-reviewer recommendation approach for Modern Code Review." Software Analysis, Evolution and Reengineering (SANER), 2015 IEEE 22nd International Conference on. IEEE, 2015.
[4] Santos, Katyusco de F., Dalton DS Guerrero, and Jorge CA de Figueiredo. "Using Developers Contributions on Software Vocabularies to Identify Experts." Information Technology-New Generations (ITNG), 2015 12th International Conference on. IEEE, 2015.
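
The first result quoted above (the number of different developers who have changed a file over its lifetime) is easy to compute once a commit history is available. The following is a minimal sketch only: the commit list, author names, and file paths are made-up illustrative data, and `developers_per_file` is a hypothetical helper, not code from the cited papers.

```python
from collections import defaultdict

def developers_per_file(commits):
    """Map each file path to the number of distinct authors who changed it.

    `commits` is an iterable of (author, [paths]) pairs; this shape is an
    assumption made for illustration.
    """
    seen = defaultdict(set)
    for author, paths in commits:
        for path in paths:
            seen[path].add(author)
    return {path: len(authors) for path, authors in seen.items()}

# Hypothetical commit history (not real data).
commits = [
    ("alice", ["core/io.py", "core/util.py"]),
    ("bob", ["core/io.py"]),
    ("alice", ["core/io.py"]),
    ("carol", ["docs/readme.md"]),
]

counts = developers_per_file(commits)
print(counts)
```

Repeated commits by the same author to the same file count once, which matches the "different developers" wording of the quote.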

Q: And what data mining suite will we use to mine data about programmers?
• A: need you ask?

But what are we clustering?
Developer products
• Lightweight parsing of source code
• Developer profiles, accessed via social media sites

Data processing
1. GitHub repos (for code) ➔ social media (for years of work)
2. Static code analysis: frequency counts of AST features (e.g. counts of loops, returns, variable comparisons, map, etc.)
3. Bayes classifier
[Figure labels: early career, later career]
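
Step 2 above (frequency counts of AST features) can be sketched as follows. This is an illustration under stated assumptions: the slides do not say which parser or source language was used, so Python's standard `ast` module stands in here, and the sample function and the chosen feature names are hypothetical.

```python
import ast
from collections import Counter

def ast_feature_counts(source: str) -> Counter:
    """Count how often each AST node type occurs in a piece of Python source."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))

# Hypothetical sample program to analyse.
SAMPLE = '''
def total(xs):
    s = 0
    for x in xs:
        if x > 0:
            s += x
    return s
'''

counts = ast_feature_counts(SAMPLE)

# Keep a small, fixed set of features (loops, returns, comparisons),
# so every program yields a vector of the same length.
FEATURES = ("For", "While", "Return", "Compare")
features = {name: counts.get(name, 0) for name in FEATURES}
print(features)
```

A fixed feature list gives each program the same vector shape, which is what a classifier such as the one in step 3 would need as input.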

Classification
- Features: nodes of the AST
- Algorithms used: Simple Cart, Random Forest, Naive Bayes, etc.
- Can distinguish expert from novice programmers
• Precision = 78% (early career)
• Precision = 74% (later career)
* Using Weka
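
For readers unfamiliar with the metric, per-class precision is the fraction of items predicted as a class that truly belong to it. The sketch below only illustrates that definition: the labels and predictions are invented, and it does not reproduce the 78% and 74% figures reported above, which came from Weka.

```python
def precision(actual, predicted, label):
    """Fraction of items predicted as `label` whose true class is `label`."""
    hits = [a for a, p in zip(actual, predicted) if p == label]
    if not hits:
        return 0.0  # nothing was predicted as this label
    return sum(a == label for a in hits) / len(hits)

# Invented labels, for illustration only.
actual    = ["early", "early", "later", "later", "early", "later"]
predicted = ["early", "later", "later", "later", "early", "early"]

p_early = precision(actual, predicted, "early")
p_later = precision(actual, predicted, "later")
print(f"early: {p_early:.2f}, later: {p_later:.2f}")
```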

Current status
The good news
• Can auto-find groups of better programmers
• Can do that for very large data sets
– The ECL advantages
The other news
• Seeking larger data sets
• Talking to HackerRank
• Looking at ways to instrument the HPCC forums
– Matchmaker tools
– Affinity groups

We can make that link stronger

Acknowledgements:
Thanks to funding from LexisNexis
