Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data

BETTER ALGORITHMS
FROM BIGGER DATA
Chris Bingham, CTO, Crimson Hexagon

April 26th, 2012

INTRODUCTION
Crimson Hexagon and me

ABOUT CRIMSON HEXAGON

• Founded 4 years ago; now 40+ employees in Boston

• Help companies make actionable business decisions

• Based on unique analysis of social media and internal data

• Customers include F100, agencies, UN

• Tech stack:
• Java, with R for algorithms
• Massive Lucene infrastructure with custom shard management
• Distributed computing framework for analysis
• Hadoop increasingly used

BIG DATA, BETTER DATA, BETTER ALGORITHMS

• World’s largest searchable social media archive

• >200 billion posts in 2012

• Adding 1 billion every 2-3 days

• Twitter, Facebook, blogs, forums, comments, news, etc.


• Who’s talking and listening?
• Demographics
• Interests
• Relationships

• Trends and comparisons
• Compared to yourself, over time
• Compared to industry, competitors, etc.

• Human input
• Define specific business question and possible answers
• Provides focus and context


• Based on work by co-founder Gary King at Harvard

• Takes all those billions of posts, plus the human input

• Applies that human judgment at massive scale

• Quantitative answers to specific business questions

• Accurate in any language

ALGORITHMS AND BIG DATA
The problem of leverage

MACHINE LEARNING

Let’s consider a typical data-analysis problem using machine learning.

How does having more data help (or hurt) us?

DEFINE CATEGORIES

Some set of user-defined categories (AKA topics, classes, etc.)

[Diagram: four categories A, B, C, D]

PROVIDE TRAINING

Training examples to map features to categories

[Diagram: categories A, B, C, D]

LEARN A MODEL

Algorithm classifies items into categories based on training data

[Diagram: categories A, B, C, D]

CLASSIFY ITEMS

Incoming unknown items to be classified

[Diagram: unknown items w, x, y, z alongside categories A, B, C, D]

OBTAIN RESULTS

Result: Items are classified, hopefully correctly!

[Diagram: A: y; B: w; C: x, z; D: (empty)]
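The steps so far (define categories, provide training, learn a model, classify items) can be sketched in a few lines. This is only a toy illustration, not the algorithm the slides refer to: the training texts, the word-overlap scoring, and the names `learn_model` and `classify` are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical hand-labeled training examples: (text, category).
TRAINING = [
    ("love this phone great battery", "A"),
    ("battery died terrible phone", "B"),
    ("how much does the phone cost", "C"),
    ("win a free phone click here", "D"),
]


def learn_model(examples):
    """Count how often each word appears in each category's training text."""
    model = defaultdict(Counter)
    for text, category in examples:
        model[category].update(text.split())
    return model


def classify(model, text):
    """Pick the category whose training words overlap the item the most.

    Ties are broken alphabetically by category name.
    """
    words = text.split()
    scores = {cat: sum(counts[w] for w in words) for cat, counts in model.items()}
    return max(sorted(scores), key=scores.get)


model = learn_model(TRAINING)
incoming = ["great battery", "terrible battery", "free click"]
results = {item: classify(model, item) for item in incoming}
print(results)
```

Any real system would use richer features and a proper learning algorithm; the point is only the shape of the pipeline.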

DID IT WORK?

Compare algorithm to human(s) to measure accuracy. Here “z” was incorrectly classified.

[Diagram, two panels. One: A: y; B: w; C: x; D: z. The other: A: y; B: w; C: x, z; D: (empty)]

ERROR RATE

We were wrong 25% of the time. What happens when we add more data?

[Chart: 75% correct, 25% wrong]
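The 25% figure follows directly from the four-item example. A minimal sketch, assuming the panel with z in category D is the human's labeling and the one with z in category C is the algorithm's:

```python
# Labels from the example: the human puts z in D, the algorithm puts z in C.
human = {"w": "B", "x": "C", "y": "A", "z": "D"}
algorithm = {"w": "B", "x": "C", "y": "A", "z": "C"}


def accuracy(predicted, reference):
    """Fraction of items where the algorithm agrees with the human."""
    matches = sum(predicted[item] == reference[item] for item in reference)
    return matches / len(reference)


acc = accuracy(algorithm, human)
print(f"{acc:.0%} correct, {1 - acc:.0%} wrong")  # prints "75% correct, 25% wrong"
```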

SCALE TO BIG DATA

We just make the same mistakes on a larger scale.

[Charts: two, each showing 75% correct, 25% wrong]

CAN MORE DATA HELP?

Can bigger data help us? In some ways.

• It can enable more types of analysis
• It can enable analysis of more categories
• It can provide more raw material for training and validation

What about accuracy?

[Diagram: categories A, B, C, D, E, F]

HUMAN SCALE

More training usually improves accuracy, but we need not just more data but more humans.

Humans don’t scale.

[Diagram: categories A, B, C, D]

FEEDBACK

For some applications, users can implicitly provide feedback through their use.

e.g. ad placement; spam detection

But this isn’t possible in all cases, and you can’t be too wrong to begin with.

[Diagram: A: y; B: w; C: x, z; D: (empty)]

BOOTSTRAPPING

We can also feed the classified items back into the training set (no human intervention).

Some incorrect classifications will become part of the training! But that doesn’t necessarily hurt.

[Diagram: A: y; B: w; C: x, z; D: (empty)]

BOOTSTRAPPING RESULT

The more data you have, the more you can classify.

The more you classify, the more training data you obtain.

The more training data, the more accurate the results.

And we didn’t have to scale the human involvement.

[Diagram: categories A, B, C, D filling up with more items, including new items r, s, t, u, v]
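The bootstrapping loop described here is commonly called self-training. The sketch below is one hedged way to write it, not the implementation behind the slides: the word-overlap classifier, the example texts, and the rule "accept a machine label whenever the score is above zero" are all assumptions chosen to keep the example small.

```python
from collections import Counter, defaultdict


def learn_model(examples):
    """Count how often each word appears in each category's training text."""
    model = defaultdict(Counter)
    for text, category in examples:
        model[category].update(text.split())
    return model


def classify(model, text):
    """Return (best category, its word-overlap score) for one item."""
    words = text.split()
    scores = {cat: sum(counts[w] for w in words) for cat, counts in model.items()}
    best = max(sorted(scores), key=scores.get)
    return best, scores[best]


def bootstrap(labeled, unlabeled, rounds=3):
    """Feed the model's own classifications back into the training set."""
    training = list(labeled)
    remaining = list(unlabeled)
    for _ in range(rounds):
        model = learn_model(training)
        still_unknown = []
        for text in remaining:
            category, score = classify(model, text)
            if score > 0:
                # Some overlap with training: accept the machine's label.
                training.append((text, category))
            else:
                # No evidence yet: try again in the next round.
                still_unknown.append(text)
        remaining = still_unknown
    return learn_model(training), remaining


# Two human-labeled items; everything else is labeled by the machine.
labeled = [("great battery", "A"), ("terrible screen", "B")]
unlabeled = ["great camera", "camera zoom", "terrible speaker", "zoom lens"]
model, leftover = bootstrap(labeled, unlabeled)
print(classify(model, "zoom lens"), leftover)
```

With only the two human-labeled items, "zoom lens" has no overlap with any category. After three rounds the vocabulary has grown through machine-labeled items, so it can be classified without any extra human input. A wrong early label would propagate in the same way, which is the risk the previous slide mentions.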

INDIVIDUAL VS. AGGREGATE

So far we’ve considered classification of individual items. This is the conventional machine-learning approach.

[Diagram: items w, x, y, z classified as A: y; B: w; C: x, z; D: (empty)]


What if we want to know the size of each category, rather than which items are in which category?

e.g. epidemiology, polls, market research

[Diagram: items w, x, y, z yield A: 25%; B: 25%; C: 50%; D: 0%]
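The simplest route from individual classifications to category sizes is to count them. A minimal sketch using the four items from the example (the helper name `category_shares` is made up for illustration):

```python
from collections import Counter

CATEGORIES = ["A", "B", "C", "D"]

# Per-item classifications from the example: y in A, w in B, x and z in C.
classified = {"w": "B", "x": "C", "y": "A", "z": "C"}


def category_shares(labels, categories):
    """Share of items assigned to each category, including empty ones."""
    counts = Counter(labels.values())
    total = len(labels)
    return {cat: counts[cat] / total for cat in categories}


shares = category_shares(classified, CATEGORIES)
print(shares)  # {'A': 0.25, 'B': 0.25, 'C': 0.5, 'D': 0.0}
```

Counting this way inherits every per-item mistake, so the shares are only as good as the individual classifications.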


When considered individually, there’s a limited amount of information we have about each item.

As a result, there will be limited correlation with the training data, and therefore poor accuracy.

[Diagram: each item w, x, y, z is matched on its own: A? B? C? D?]

[Chart: 75% correct, 25% wrong]


When considered in the aggregate, there’s much more data correlating with the training data for each category.

As a result, we can make more accurate estimates of the category proportions.

[Diagram: W+X+Y+Z = %A, %B, %C, %D]

[Chart: 85% correct, 15% wrong]


Now, increasing the amount of data can actually increase the accuracy, with the same amount of human training data.

[Diagram: S+T+U+V+W+X+Y+Z = %A, %B, %C, %D]

[Chart: 95% correct, 5% wrong]
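The slides do not spell out how the aggregate estimate is computed, so the following is a swapped-in, simplified stand-in rather than the method they describe: a two-category "adjusted count" that corrects the observed share using the classifier's known hit rate and false-alarm rate. The rates, the true share, and the simulated classifier are all hypothetical. It illustrates the claim that a classifier which is right only 75% of the time per item can still give an aggregate estimate that tightens as volume grows.

```python
import random


def noisy_classifier(is_positive, tpr, fpr, rng):
    """Simulated per-item classifier: True means it says 'positive'."""
    rate = tpr if is_positive else fpr
    return rng.random() < rate


def estimate_share(n_items, true_share, tpr, fpr, rng):
    """Estimate the share of positive items from aggregate counts only."""
    flagged = 0
    for _ in range(n_items):
        is_positive = rng.random() < true_share
        flagged += noisy_classifier(is_positive, tpr, fpr, rng)
    observed = flagged / n_items
    # Invert observed = tpr * p + fpr * (1 - p) to recover p.
    estimate = (observed - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, estimate))


# Hypothetical values: 30% of items are truly positive, and the classifier
# is right 75% of the time on each kind of item.
TRUE_SHARE, TPR, FPR = 0.30, 0.75, 0.25

rng = random.Random(0)
small = estimate_share(100, TRUE_SHARE, TPR, FPR, rng)
large = estimate_share(200_000, TRUE_SHARE, TPR, FPR, rng)
print(f"100 items: {small:.3f}   200,000 items: {large:.3f}   truth: {TRUE_SHARE}")
```

In practice `TPR` and `FPR` would be measured once on the human-labeled set, which is why the human effort stays fixed while the data volume grows.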

CONCLUSION

• Bigger data is important

• Better data is important

• Better algorithms are important

• The sweet spot is when one leverages the other

Bigger data can lead to better algorithms.
