The Artful Business of Data Mining: Distributed Schema-less Document-Based Databases

Wednesday 27 March 13

David Coallier
@davidcoallier


Data Scientist
At Engine Yard (.com)


Structure
Restrictions
Safety

id   name    age    address

1    david   1      315
2    divad   3      51
3    foo     41     31
4    bar     42     98
5    john    3315   85
6    jack    4      11
7    jill    8      66
...  ...     ...    ...


What If?


id   name    age   address   phone

1    david   26    IE        353
2    divad   27    US        1
3    foo     42    IE        353
4    bar     31    CA        1
5    john    17    NZ        131
6    jack    128   DK        311
7    jill    21    IE        353
...  ...     ...   ...       ...


Before
Moving on

What is JSON?


{
    "firstName": "David",
    "lastName": "Coallier",
    "age": 26,
    "address": {
        "streetAddress": "Mansfield House",
        "city": "Crosshaven"
    },
    "phoneNumbers": [
        {
            "type": "mobile",
            "number": "0863299999"
        }
    ]
}


What is HTTP?


What is a Schema?


Alternative


Schema-less


Does
NOT
Mean
Structure-less

Documents
and
K-V Buckets

CouchDB
Cluster of unreliable commodity hardware


Replication
Attachments
Generated "random" ids
Dictionary
Revisions?
JSON Objects
HTTP CRUD


Documents


{
    "_id": "131dafsd1vasd",
    "_rev": "12-fva32asdf",
    "firstName": "David",
    "lastName": "Coallier",
    "age": 26,
    "address": {
        "streetAddress": "Mansfield House",
        "city": "Crosshaven"
    },
    "phoneNumbers": [
        {
            "type": "mobile",
            "number": "0863299999"
        }
    ]
}


How do you
find
Anything?

Map/Reduce


Dynamo
Paper

CAP
Theorem

Key-Value
Buckets

Differences?


                 CouchDB                    Riak
Storage Model    append-only                bitcask
Access           HTTP                       HTTP, PB
Retrieval        Views (M/R)                M/R, Indexes, Search
Versioning       Eventual Consistency       Vector Clocks
Concurrency      No Locking                 Client Resolution
Replication      master/master/slave        replication, clustering
Scaling In/Out   BigCouch                   Built-in
Management       Futon/Fauxton              Riak Control
                 http://guide.couchdb.org   http://downloads.basho.com/papers/bitcask-intro.pdf


Mapper:
Executed on document

Reducer:
Receives output from mappers


{
    "_id": "...",
    "_rev": "...",
    "age": 26
}
{
    "_id": "...",
    "_rev": "...",
    "age": 32,
    "heads": 3
}
{
    "_id": "...",
    "_rev": "...",
    "age": 42
}
{
    "_id": "...",
    "_rev": "...",
    "age": 17
}


{
    "age": 32,
    "heads": 3
}


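Map: find-ages
// Emit the age of every document that has one.
// typeof yields a string, so it must be compared with 'undefined'.
function find_ages(doc) {
    if (typeof doc.age !== 'undefined') {
        emit(doc._id, doc.age);
    }
}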


Map: find-ages

26 32 42 17

Reduce: sum

// Renamed so it no longer shadows (and endlessly calls) the built-in sum().
function sum_ages(keys, values, rereduce) {
    return sum(values);
}


Map: find-ages

26 32 42 17

Reduce: sum
117
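
The find-ages / sum pipeline above can be tried outside CouchDB with a small stand-in harness. This is only a sketch: `runView`, the sample documents, and passing `emit` as a second argument are assumptions made for the illustration (CouchDB provides `emit` and `sum` itself), and the ages are the four values from the slides.

```javascript
// Stand-in for CouchDB's built-in sum() helper (assumption for local use).
function sum(values) {
  return values.reduce(function (acc, v) { return acc + v; }, 0);
}

// Hypothetical sample documents using the ages from the slides.
var docs = [
  { _id: "a", age: 26 },
  { _id: "b", age: 32, heads: 3 },
  { _id: "c", age: 42 },
  { _id: "d", age: 17 },
  { _id: "e", name: "no age here" }
];

// Hypothetical harness: runs the mapper on every document, collects the
// emitted rows, then hands all keys and values to the reducer once.
function runView(docs, mapFn, reduceFn) {
  var rows = [];
  var emit = function (key, value) { rows.push({ key: key, value: value }); };
  docs.forEach(function (doc) { mapFn(doc, emit); });
  var keys = rows.map(function (r) { return r.key; });
  var values = rows.map(function (r) { return r.value; });
  return { rows: rows, reduced: reduceFn(keys, values, false) };
}

// Mapper: executed on each document, emits the age when present.
function findAges(doc, emit) {
  if (typeof doc.age !== "undefined") {
    emit(doc._id, doc.age);
  }
}

// Reducer: receives the mapper output and sums it.
function sumAges(keys, values, rereduce) {
  return sum(values);
}

var result = runView(docs, findAges, sumAges);
console.log(result.reduced); // 117
```

The document without an age is skipped by the mapper, so only four rows reach the reducer.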

So
What?

The
Machines
They Lurn.

The
Problem

Statistics
Example

Mean,
Std. Deviation
Age

$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$$

$$\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}$$
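
A short derivation may help explain why the map/reduce code in this deck emits pairs of an age and its square: the standard deviation can be computed from two running sums. The numbers in the second part are the four ages from the find-ages example (26, 32, 42, 17).

```latex
\sigma^2
  = \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2
  = \frac{1}{n}\sum_{i=1}^{n}x_i^2 \;-\; 2\mu\cdot\frac{1}{n}\sum_{i=1}^{n}x_i \;+\; \mu^2
  = \frac{1}{n}\sum_{i=1}^{n}x_i^2 \;-\; \mu^2

% Worked example with x = (26, 32, 42, 17), n = 4:
\mu = \frac{117}{4} = 29.25, \qquad
\frac{1}{n}\sum x_i^2 = \frac{676 + 1024 + 1764 + 289}{4} = 938.25

\sigma^2 = 938.25 - 29.25^2 = 938.25 - 855.5625 = 82.6875, \qquad
\sigma \approx 9.09
```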


Mapper:
Retrieve values, pre-process

Reducer:
Receive, process further.


[
[ 26, 676],
[ 32, 1024],
[ 42, 1764],
[ 17, 289 ]
]

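/**
 * Our mapper function.
 * Emits [age, age squared] for each document.
 */
map: function(doc) {
    emit(null, [doc.age, doc.age * doc.age]);
}

/**
 * Our reducer...
 * Note: this version ignores the rereduce flag, so it is only correct
 * when every mapped row reaches a single reduce call.
 */
reduce: function(keys, values, rereduce) {
    var N = 0;
    var summed = 0;
    var summedSquare = 0;

    // Indexed loop: for...in is unreliable on arrays.
    for (var i = 0; i < values.length; i++) {
        N += 1;
        summed += values[i][0];
        summedSquare += values[i][1];
    }

    var mean = summed / N;
    var standard_deviation = Math.sqrt(
        (summedSquare / N) - (mean * mean)
    );

    return [mean, standard_deviation];
}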


/**
 * Our mapper function.
 */
map: function(doc) {
    emit(null, [doc.age, doc.age * doc.age]);
}

/**
 * Our reducer, rewritten with the built-in sum() helper.
 * As before, the rereduce flag is ignored.
 */
reduce: function(keys, values, rereduce) {
    var N = values.length;
    var summed = sum(values.map(function(v) { return v[0]; }));
    var summedSquares = sum(values.map(function(v) { return v[1]; }));

    var mean = summed / N;
    var standard_deviation = Math.sqrt(
        (summedSquares / N) - (mean * mean)
    );

    return [mean, standard_deviation];
}
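
The reducers above return [mean, standard deviation] directly, which cannot be combined again if the database reduces in several passes. One possible way around that, sketched here as an assumption rather than the deck's method, is to carry [count, sum, sum of squares] through the reduce and compute the statistics at the very end. The names `mapAge`, `reduceStats` and `finalize` are hypothetical, and `emit` is passed in explicitly so the sketch runs outside CouchDB.

```javascript
// Map step: emit [count, sum, sum of squares] for one document.
function mapAge(doc, emit) {
  if (typeof doc.age === "number") {
    emit(null, [1, doc.age, doc.age * doc.age]);
  }
}

// Reduce step: add the triples component-wise. The output has the same
// shape as the input, so the same code also serves a rereduce pass.
function reduceStats(keys, values, rereduce) {
  var total = [0, 0, 0];
  values.forEach(function (v) {
    total[0] += v[0];
    total[1] += v[1];
    total[2] += v[2];
  });
  return total;
}

// Final step (done by the caller): turn the totals into mean and
// population standard deviation.
function finalize(total) {
  var n = total[0];
  var mean = total[1] / n;
  var variance = total[2] / n - mean * mean;
  return { mean: mean, sd: Math.sqrt(Math.max(variance, 0)) };
}

// Simulate two partial reductions followed by a rereduce.
var triples = [];
[26, 32, 42, 17].forEach(function (age) {
  mapAge({ age: age }, function (key, value) { triples.push(value); });
});
var partA = reduceStats(null, triples.slice(0, 2), false);
var partB = reduceStats(null, triples.slice(2), false);
var stats = finalize(reduceStats(null, [partA, partB], true));
console.log(stats.mean, stats.sd);
```

Splitting the rows into two partial reductions gives the same totals as reducing them in one call, which is the property a multi-pass reduce relies on.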


Naive
Bayes

Real Life
Fraud

$P(x_j = k \mid y = \text{fraudulent})$
$P(x_j = k \mid y = \text{normal})$
$P(y)$


We need to:
Sum $x_j = k$, for each $y$
to calculate $P(x \mid y)$


We need:
More than 1 mapper.


We need

4
mappers

Mapper #1:
$\sum_{i} P(x_j = k \mid y = \text{fraudulent})$


Mapper #2:
$\sum_{i} P(x_j = k \mid y = \text{normal})$


Mapper #3:
$\sum_{i} P(y = \text{fraudulent})$


Mapper #4:
$\sum_{i} P(y = \text{normal})$


Reducer
Sums up
results for
parameters
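
The four mappers and the summing reducer can be sketched in plain JavaScript as counting jobs. Everything concrete here is an assumption for illustration: the records, the feature values, and the helper names `featureMapper`, `labelMapper` and `count` are hypothetical, and no smoothing is applied to the estimates.

```javascript
// Hypothetical training records: x is a feature vector, y the label.
var records = [
  { x: ["IE", "card"], y: "fraudulent" },
  { x: ["IE", "cash"], y: "normal" },
  { x: ["US", "card"], y: "normal" },
  { x: ["IE", "card"], y: "normal" },
  { x: ["US", "card"], y: "fraudulent" },
  { x: ["IE", "cash"], y: "normal" }
];

// Builds a mapper that emits 1 when x_j = k and y = label, else 0.
function featureMapper(j, k, label) {
  return function (rec) {
    return (rec.x[j] === k && rec.y === label) ? 1 : 0;
  };
}

// Builds a mapper that emits 1 when y = label, else 0.
function labelMapper(label) {
  return function (rec) { return rec.y === label ? 1 : 0; };
}

// Reducer: sums up the mapper results.
function reducer(values) {
  return values.reduce(function (a, b) { return a + b; }, 0);
}

function count(mapper) {
  return reducer(records.map(mapper));
}

var j = 1;
var k = "card";
var nFraudK = count(featureMapper(j, k, "fraudulent")); // mapper #1
var nNormalK = count(featureMapper(j, k, "normal"));    // mapper #2
var nFraud = count(labelMapper("fraudulent"));          // mapper #3
var nNormal = count(labelMapper("normal"));             // mapper #4

// Parameters estimated from the four counts.
var estimates = {
  pKGivenFraud: nFraudK / nFraud,
  pKGivenNormal: nNormalK / nNormal,
  pFraud: nFraud / records.length
};
console.log(estimates);
```

Each count is an independent sum, so the four jobs can run on different subsets of the data and be added together afterwards.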

Cluster
Analysis

k-means


Mapper:
Divide vectors into subgroups,
Calculate d(p,q) between
vectors, find centroids,
sum them up.

Reducer:
Sum up the sums,
get new centroids.
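
One iteration of the mapper/reducer split described above might look like the following. It is a minimal sketch under assumptions: squared Euclidean distance stands in for d(p,q), the starting centroids and the two subgroups are made-up sample data, and `kmeansMap` / `kmeansReduce` are hypothetical names.

```javascript
// Squared Euclidean distance between two vectors of equal length.
function dist2(p, q) {
  return p.reduce(function (acc, pi, i) {
    var d = pi - q[i];
    return acc + d * d;
  }, 0);
}

// Mapper: for one subgroup, assign each vector to its nearest centroid
// and keep a partial sum and count per centroid.
function kmeansMap(vectors, centroids) {
  var partial = centroids.map(function (c) {
    return { sum: c.map(function () { return 0; }), count: 0 };
  });
  vectors.forEach(function (v) {
    var best = 0;
    centroids.forEach(function (c, idx) {
      if (dist2(v, c) < dist2(v, centroids[best])) { best = idx; }
    });
    partial[best].count += 1;
    v.forEach(function (vi, i) { partial[best].sum[i] += vi; });
  });
  return partial;
}

// Reducer: sum up the partial sums and divide by the counts to get
// the new centroids. A centroid with no vectors keeps its old position.
function kmeansReduce(partials, centroids) {
  return centroids.map(function (old, idx) {
    var total = old.map(function () { return 0; });
    var count = 0;
    partials.forEach(function (p) {
      count += p[idx].count;
      p[idx].sum.forEach(function (s, i) { total[i] += s; });
    });
    return count === 0 ? old : total.map(function (s) { return s / count; });
  });
}

var centroids = [[0, 0], [10, 10]];
var subgroups = [
  [[1, 1], [9, 9]],
  [[1, 3], [11, 11]]
];
var partials = subgroups.map(function (g) { return kmeansMap(g, centroids); });
var updated = kmeansReduce(partials, centroids);
console.log(JSON.stringify(updated)); // [[1,2],[10,10]]
```

Repeating the map and reduce steps with the updated centroids until they stop moving gives the usual k-means loop.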

