Often, analyzing more and more data doesn’t improve your results: you just make the same mistakes at a larger scale. Crimson Hexagon CTO Christopher Bingham discusses several techniques that leverage the quantity of data, increasing accuracy as you scale. Big data can thus lead to better analysis, not just bigger analysis.
3. ABOUT CRIMSON HEXAGON
• Founded 4 years ago; now 40+ employees in Boston
• Help companies make actionable business decisions
• Based on unique analysis of social media and internal data
• Customers include F100, agencies, UN
• Tech stack:
  • Java, with R for algorithms
  • Massive Lucene infrastructure with custom shard management
  • Distributed computing framework for analysis
  • Hadoop increasingly used
4. BIG DATA, BETTER DATA, BETTER ALGORITHMS
• World’s largest searchable social media archive
• >200 billion posts in 2012
• Adding 1 billion every 2-3 days
• Twitter, Facebook, blogs, forums, comments, news, etc.
5. BIG DATA, BETTER DATA, BETTER ALGORITHMS
• Who’s talking and listening?
  • Demographics
  • Interests
  • Relationships
• Trends and comparisons
  • Compared to yourself, over time
  • Compared to industry, competitors, etc.
• Human input
  • Define specific business question and possible answers
  • Provides focus and context
6. BIG DATA, BETTER DATA, BETTER ALGORITHMS
• Based on work by co-founder Gary King at Harvard
• Takes all those billions of posts, plus the human input
• Applies that human judgment at massive scale
• Quantitative answers to specific business questions
• Accurate in any language
8. MACHINE LEARNING
Let’s consider a typical data-analysis problem using machine learning.
How does having more data help (or hurt) us?
9. DEFINE CATEGORIES
Some set of user-defined categories (AKA topics, classes, etc.): A, B, C, D.
10. PROVIDE TRAINING
Training examples to map features to categories (A, B, C, D).
11. LEARN A MODEL
Algorithm classifies items into categories (A, B, C, D) based on training data.
12. CLASSIFY ITEMS
Incoming unknown items (w, x, y, z) to be classified into categories A, B, C, D.
13. OBTAIN RESULTS
Result: Items are classified, hopefully correctly!
• A: y
• B: w
• C: x, z
• D: (none)
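The steps on the last few slides (define categories, provide training, learn a model, classify items) can be sketched in a few lines. This is only an illustration under stated assumptions, not the algorithm from the talk: items are assumed to be short texts, features are lowercase words, the "model" is just a per-category word count, and the training phrases, item texts, and the `train` and `classify` names are all hypothetical. The texts were chosen so the output matches the assignments shown on the slide.

```python
from collections import Counter, defaultdict

def train(examples):
    """Build a per-category word-count model from (text, category) pairs."""
    model = defaultdict(Counter)
    for text, category in examples:
        model[category].update(text.lower().split())
    return model

def classify(model, text):
    """Pick the category whose training words overlap most with the text.

    Ties are broken alphabetically by category name.
    """
    words = text.lower().split()
    scores = {cat: sum(counts[w] for w in words) for cat, counts in model.items()}
    return max(sorted(scores), key=scores.get)

# Hypothetical training examples for the categories A to D.
training = [
    ("love this phone", "A"),
    ("hate this phone", "B"),
    ("battery life question", "C"),
    ("shipping delay complaint", "D"),
]
model = train(training)

# Hypothetical incoming items w, x, y, z.
items = {
    "w": "hate the screen",
    "x": "question about battery",
    "y": "love it",
    "z": "battery life ruined by delay",
}
results = {name: classify(model, text) for name, text in items.items()}
print(results)
```

With these made-up texts, z lands in C because two of its words match C's training example and only one matches D's.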
14. DID IT WORK?
Compare algorithm to human(s) to measure accuracy. Here “z” was incorrectly classified.
• Human: A: y; B: w; C: x; D: z
• Algorithm: A: y; B: w; C: x, z; D: (none)
15. ERROR RATE
We were wrong 25% of the time (75% correct, 25% wrong).
What happens when we add more data?
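The 25% figure follows directly from the comparison on the previous slide: one of the four items (z) disagrees with the human label. A small sketch of that arithmetic, reading the diagram as the human placing z in D and the algorithm placing it in C; the `error_rate` helper is a hypothetical name:

```python
# Labels as read from the slide: human judgment vs. algorithm output.
human = {"w": "B", "x": "C", "y": "A", "z": "D"}
algorithm = {"w": "B", "x": "C", "y": "A", "z": "C"}

def error_rate(truth, predicted):
    """Fraction of items whose predicted category differs from the human label."""
    wrong = [item for item in truth if predicted[item] != truth[item]]
    return len(wrong) / len(truth)

rate = error_rate(human, algorithm)
print(f"{1 - rate:.0%} correct, {rate:.0%} wrong")  # 75% correct, 25% wrong
```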
16. SCALE TO BIG DATA
We just make the same mistakes on a larger scale: still 75% correct and 25% wrong.
17. CAN MORE DATA HELP?
Can bigger data help us? In some ways.
• It can enable more types of analysis
• It can enable analysis of more categories
• It can provide more raw material for training and validation
What about accuracy?
18. HUMAN SCALE
More training usually improves accuracy, but we need not just more data but more humans.
Humans don’t scale.
19. FEEDBACK
For some applications, users can implicitly provide feedback through their use, e.g. ad placement, spam detection.
But this isn’t possible in all cases, and you can’t be too wrong to begin with.
20. BOOTSTRAPPING
We can also feed the classified items back into the training set (no human intervention).
Some incorrect classifications will become part of the training! But that doesn’t necessarily hurt.
21. BOOTSTRAPPING RESULT
The more data you have, the more you can classify.
The more you classify, the more training data you obtain.
The more training data, the more accurate the results.
And we didn’t have to scale the human involvement.
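The bootstrapping loop described here is often called self-training. A minimal sketch, assuming a toy word-overlap classifier and hypothetical seed and unlabeled texts; the rule "accept any item that shares at least one word with the training data" is an assumption made for the example, not something stated in the talk:

```python
from collections import Counter, defaultdict

def train(examples):
    """Build a per-category word-count model from (text, category) pairs."""
    model = defaultdict(Counter)
    for text, category in examples:
        model[category].update(text.lower().split())
    return model

def classify(model, text):
    """Return (best category, overlap score); ties break alphabetically."""
    words = text.lower().split()
    scores = {cat: sum(counts[w] for w in words) for cat, counts in model.items()}
    best = max(sorted(scores), key=scores.get)
    return best, scores[best]

def bootstrap(seed_examples, unlabeled, rounds=3):
    """Classify unlabeled items and feed them back into the training set."""
    training = list(seed_examples)
    remaining = list(unlabeled)
    for _ in range(rounds):
        model = train(training)  # retrain on everything labeled so far
        still_unknown = []
        for text in remaining:
            category, score = classify(model, text)
            if score > 0:  # assumed rule: needs some overlap with training words
                training.append((text, category))
            else:
                still_unknown.append(text)
        if len(still_unknown) == len(remaining):
            break  # nothing new was classified this round
        remaining = still_unknown
    return training, remaining

# Hypothetical human-labeled seed and unlabeled items.
seed = [("great", "A"), ("terrible", "B")]
unlabeled = ["great camera", "camera rocks", "terrible battery", "battery died"]

training_set, leftover = bootstrap(seed, unlabeled)
print(training_set)
print(leftover)
```

In this toy run, "camera rocks" and "battery died" share no words with the two human-labeled seeds, so they can only be classified in the second round, after "great camera" and "terrible battery" have been added to the training set.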
22. INDIVIDUAL VS. AGGREGATE
So far we’ve considered classification of individual items. This is the conventional machine-learning approach.
23. INDIVIDUAL VS. AGGREGATE
What if we want to know the size of each category, rather than which items are in which category?
e.g. epidemiology, polls, market research
• A: 25%
• B: 25%
• C: 50%
• D: 0%
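The simplest way to get category sizes is to classify every item and count. The sketch below does that for the four items from the earlier slides and reproduces the percentages shown; `category_proportions` is a hypothetical helper name:

```python
from collections import Counter

categories = ["A", "B", "C", "D"]

# Individual classifications from the earlier slides.
classified = {"w": "B", "x": "C", "y": "A", "z": "C"}

def category_proportions(labels, categories):
    """Share of items assigned to each category (classify, then count)."""
    counts = Counter(labels.values())
    total = len(labels)
    return {cat: counts[cat] / total for cat in categories}

proportions = category_proportions(classified, categories)
for cat, share in proportions.items():
    print(f"{cat}: {share:.0%}")
```

Note that these proportions inherit every individual mistake: the misclassified z inflates C and leaves D at 0%.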
24. INDIVIDUAL VS. AGGREGATE
When considered individually, there’s a limited amount of information we have about each item.
As a result, there will be limited correlation with the training data, and therefore poor accuracy.
[Diagram: each of w, x, y, z is assigned individually to A, B, C, or D; 75% correct, 25% wrong.]
25. INDIVIDUAL VS. AGGREGATE
When considered in the aggregate, there’s much more data correlating with the training data for each category.
As a result, we can make more accurate estimates of the category proportions.
[Diagram: W+X+Y+Z taken together = %A, %B, %C, %D; 85% correct, 15% wrong.]
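One well-known way to estimate a proportion directly, rather than by trusting each individual label, is the "adjusted count" correction from the quantification literature. It is shown here only as an illustration of aggregate estimation; it is not necessarily the method used in the talk. It assumes a two-way split (in the category or not) and that the classifier's true-positive and false-positive rates are known from the human-labeled data. All numbers below are hypothetical.

```python
def adjusted_proportion(observed_rate, true_positive_rate, false_positive_rate):
    """Correct a raw 'classified as positive' rate for known classifier errors.

    observed = p * tpr + (1 - p) * fpr, so p = (observed - fpr) / (tpr - fpr).
    The result is clipped to the valid range [0, 1].
    """
    estimate = (observed_rate - false_positive_rate) / (
        true_positive_rate - false_positive_rate
    )
    return min(1.0, max(0.0, estimate))

# Hypothetical numbers: the classifier is right 75% of the time on each class
# and flags 40% of a large batch as belonging to the category.
p = adjusted_proportion(
    observed_rate=0.40, true_positive_rate=0.75, false_positive_rate=0.25
)
print(round(p, 3))  # 0.3
```

The raw count would report 40%, while the corrected estimate is 30%: the individual errors are still there, but their net effect on the total is predictable and can be removed.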
26. INDIVIDUAL VS. AGGREGATE
Now, increasing the amount of data can actually increase the accuracy, with the same amount of human training data.
[Diagram: S+T+U+V+W+X+Y+Z taken together = %A, %B, %C, %D; 95% correct, 5% wrong.]
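A small simulation can illustrate the contrast. It assumes a two-way split, a classifier that is right 75% of the time on every item, and a true category share of 30%; these numbers and the `simulate` function are hypothetical, and the aggregate estimate uses a standard adjusted-count correction rather than the method from the talk. It does not reproduce the 85% and 95% figures on the slides.

```python
import random

def simulate(n_items, true_share=0.30, accuracy=0.75, seed=0):
    """Return (per-item accuracy, aggregate share estimate) for n_items."""
    rng = random.Random(seed)
    correct = 0
    flagged = 0
    for _ in range(n_items):
        in_category = rng.random() < true_share
        right = rng.random() < accuracy
        predicted = in_category if right else not in_category
        correct += right      # bools count as 0 or 1
        flagged += predicted
    observed = flagged / n_items
    # Adjusted count: observed = p*acc + (1 - p)*(1 - acc), solved for p.
    estimate = (observed - (1 - accuracy)) / (2 * accuracy - 1)
    return correct / n_items, estimate

for n in (100, 10_000, 100_000):
    per_item, share = simulate(n)
    print(f"n={n}: per-item accuracy {per_item:.3f}, estimated share {share:.3f}")
```

Per-item accuracy hovers around 75% no matter how many items are processed, while the estimated share tends to settle closer to the true 30% as the batch grows, with no additional human labeling.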
27. CONCLUSION
• Bigger data is important
• Better data is important
• Better algorithms are important
• The sweet spot is when one leverages the other
Bigger data can lead to better algorithms.