Talk given on September 20 to the Bay Area data mining group. The basic idea: integrating map-reduce programs with the real world is easier than ever.
8. [Diagram: three cluster nodes, each running a task alongside a local NFS server.] Nodes are identical.
9. Sharded text indexing. Input documents flow to mappers, which assign documents to shards; reducers index text to local disk and then copy the index to the clustered index storage. A copy back to local disk is typically required before the search engine can load the index.
10. Conventional data flow. Failure of a reducer causes garbage to accumulate on local disk; failure of a search engine requires another download of the index from clustered storage.
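The copy step in this conventional flow is just a file transfer out of the distributed store, repeated on every search-engine restart. A minimal sketch using the standard Hadoop FileSystem API; the paths are illustrative and the configuration is assumed to point at the cluster:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShardDownload {
      public static void main(String[] args) throws Exception {
        // Assumes the default file system in the Configuration is the cluster store.
        FileSystem fs = FileSystem.get(new Configuration());
        // Illustrative paths; every search-engine restart repeats this download.
        fs.copyToLocalFile(new Path("/indexes/shard-0042"),
                           new Path("/data/local/shard-0042"));
      }
    }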
11. Simplified NFS data flows. Reducers index straight to the task work directory via NFS; the search engine reads the mirrored index directly from clustered index storage. Failure of a reducer is cleaned up by the map-reduce framework.
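With an NFS-mounted cluster file system, the shard can instead be written and read in place. A small self-contained sketch; the /mapr mount point and file names are assumptions, not paths from the talk:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class NfsShardWriter {
      public static void main(String[] args) throws IOException {
        // On MapR this path is cluster storage and a plain file system at the same time.
        Path shard = Paths.get("/mapr/cluster/indexes/shard-0042");
        Files.createDirectories(shard);

        // The reducer writes index files here with ordinary file I/O ...
        Files.write(shard.resolve("segment.dat"), "index data".getBytes());

        // ... and the search engine opens the very same files; no download step.
        System.out.println(new String(Files.readAllBytes(shard.resolve("segment.dat"))));
      }
    }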
12. K-means, the movie. [Diagram: input points are assigned to the nearest centroid; new centroids are aggregated and fed back as the centroids for the next pass.]
16. Old tricks, new dogs. Mapper: assign each point to a cluster; emit (cluster id, (1, point)). Combiner and reducer: sum counts and the weighted sum of points; emit (cluster id, (n, sum/n)). Centroids are read from HDFS to local disk by the distributed cache and read from local disk by the mapper; new centroids are written by map-reduce to HDFS.
17. Old tricks, new dogs. Same mapper, combiner, and reducer, but with MapR FS the centroids are simply read via NFS and the new centroids are written by map-reduce back to MapR FS; the distributed-cache copy step disappears.
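A hedged sketch of the k-means step the two slides above describe, using the stock Hadoop mapper/reducer API. The class names, the comma-separated point format, and the centroids.txt file are all illustrative; Mahout's real k-means uses its own vector and cluster types:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class KMeansStep {
      // Parse "x,y,..." into a point.
      static double[] parse(String s) {
        String[] f = s.split(",");
        double[] p = new double[f.length];
        for (int i = 0; i < f.length; i++) p[i] = Double.parseDouble(f[i]);
        return p;
      }

      static String format(double[] p) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < p.length; i++) {
          if (i > 0) sb.append(',');
          sb.append(p[i]);
        }
        return sb.toString();
      }

      /** Assign each point to the nearest centroid; emit (cluster id, "1<TAB>point"). */
      public static class AssignMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
        private final List<double[]> centroids = new ArrayList<>();

        @Override
        protected void setup(Context ctx) throws IOException {
          // "centroids.txt" is a stand-in: a distributed-cache local file on stock
          // Hadoop, or a plain NFS path on MapR, per slides 16 and 17.
          try (BufferedReader in = new BufferedReader(new FileReader("centroids.txt"))) {
            for (String line; (line = in.readLine()) != null; ) {
              centroids.add(parse(line));
            }
          }
        }

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          double[] p = parse(value.toString());
          int best = 0;
          double bestDist = Double.MAX_VALUE;
          for (int i = 0; i < centroids.size(); i++) {
            double d = 0;
            double[] c = centroids.get(i);
            for (int j = 0; j < p.length; j++) d += (p[j] - c[j]) * (p[j] - c[j]);
            if (d < bestDist) { bestDist = d; best = i; }
          }
          ctx.write(new IntWritable(best), new Text("1\t" + value));
        }
      }

      /** Sum counts and points; emit (cluster id, "n<TAB>mean"). Usable as the
          combiner too, because (n, sum/n) can be turned back into a sum. */
      public static class MeanReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
          long n = 0;
          double[] sum = null;
          for (Text v : values) {
            String[] parts = v.toString().split("\t", 2);
            long k = Long.parseLong(parts[0]);
            double[] mean = parse(parts[1]);
            if (sum == null) sum = new double[mean.length];
            for (int j = 0; j < mean.length; j++) sum[j] += k * mean[j];
            n += k;
          }
          for (int j = 0; j < sum.length; j++) sum[j] /= n;
          ctx.write(key, new Text(n + "\t" + format(sum)));
        }
      }
    }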
18. Poor man’s Pregel. Mapper pseudocode; the lines set in bold on the slide, the model reads and writes, can use conventional I/O via NFS:

    while not done:
        read and accumulate input models
        for each input:
            accumulate model
        write model
        synchronize
        reset input format
    emit summary
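One super-step of that loop can be made concrete with the model as a plain double[] exchanged through a shared NFS directory. The mount point, file naming, model size, and the stubbed-out barrier are all placeholders, not code from the talk:

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class PoorMansPregelStep {
      public static void main(String[] args) throws IOException {
        Path shared = Paths.get(args.length > 0 ? args[0] : "/mapr/models"); // assumed NFS mount
        String me = args.length > 1 ? args[1] : "task-0";
        Files.createDirectories(shared);

        // Read and accumulate the peers' models (conventional reads via NFS).
        double[] model = null;
        int n = 0;
        try (DirectoryStream<Path> dir = Files.newDirectoryStream(shared, "*.model")) {
          for (Path p : dir) {
            double[] m = read(p);
            if (model == null) model = new double[m.length];
            for (int i = 0; i < m.length; i++) model[i] += m[i];
            n++;
          }
        }
        if (model == null) model = new double[10];               // first pass: toy empty model
        else for (int i = 0; i < model.length; i++) model[i] /= n;

        // "for each input: accumulate model" would update the model from local input here.

        // Write this task's model back for the next super-step (conventional write via NFS).
        write(shared.resolve(me + ".model"), model);
        // synchronize: real code must wait here until every peer has written its model.
      }

      static double[] read(Path p) throws IOException {
        java.util.List<String> lines = Files.readAllLines(p);
        double[] m = new double[lines.size()];
        for (int i = 0; i < m.length; i++) m[i] = Double.parseDouble(lines.get(i));
        return m;
      }

      static void write(Path p, double[] m) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (double v : m) sb.append(v).append('\n');
        Files.write(p, sb.toString().getBytes());
      }
    }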
21. What is Mahout? Recommendations (people who x this also x that); clustering (segment data into groups of similar items); classification (learn decision making from examples); stuff (LDA, SVD, frequent item-set mining, math).
23. Classification in detail. Naive Bayes family: Hadoop-based training. Decision forests: Hadoop-based training. Logistic regression (aka SGD): fast on-line (sequential) training.
25. So what? Online training has low overhead for small and moderate-sized data sets. [Chart annotation: "big" starts here.]
27. And another.

From: Dr. Paul Acquah
Dear Sir,
Re: Proposal for over-invoice Contract Benevolence
Based on information gathered from the India hospital directory, I am pleased to propose a confidential business deal for our mutual benefit. I have in my possession, instruments (documentation) to transfer the sum of 33,100,000.00 EUR (thirty-three million one hundred thousand euros, only) into a foreign company's bank account for our favor.
...

Date: Thu, May 20, 2010 at 10:51 AM
From: George <george@fumble-tech.com>
Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon?
28. Mahout’s SGD. Learns on-line, one example at a time; O(1) memory; O(1) time per training example. The sequential implementation is fast, but not parallel.
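In code, on-line training with Mahout's SGD looks roughly like this (Mahout 0.x API; the toy data and parameter values are invented for illustration):

    import java.util.Random;
    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    public class OlrDemo {
      public static void main(String[] args) {
        // 2 categories, 3 features, L1 prior; memory is fixed up front: O(1) per example.
        OnlineLogisticRegression olr =
            new OnlineLogisticRegression(2, 3, new L1()).learningRate(0.1).lambda(1e-4);

        Random rand = new Random(42);
        for (int i = 0; i < 10000; i++) {
          // Toy data: the label is 1 when x0 + x1 > 1.
          double x0 = rand.nextDouble(), x1 = rand.nextDouble();
          int label = x0 + x1 > 1 ? 1 : 0;
          Vector v = new DenseVector(new double[] {1, x0, x1});  // element 0 is a bias term
          olr.train(label, v);                                   // one example at a time
        }
        // classifyScalar returns p(category = 1) in the two-category case.
        System.out.println(olr.classifyScalar(new DenseVector(new double[] {1, 0.9, 0.9})));
      }
    }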
29. Special features. Hashed feature encoding. Per-term annealing: learn the boring stuff once. Auto-magical learning-knob turning: learns the correct learning rate, learns the correct learning rate for learning the learning rate, and so on.
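Hashed feature encoding means terms are hashed straight into a fixed-size vector, so no dictionary has to be built or held in memory. A small sketch with Mahout's encoders; the vector size and probe count are arbitrary choices:

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    public class HashedEncodingDemo {
      public static void main(String[] args) {
        FeatureVectorEncoder words = new StaticWordValueEncoder("body");
        words.setProbes(2);                             // hash each term into 2 locations

        Vector v = new RandomAccessSparseVector(1000);  // fixed size, chosen up front
        for (String token : "pleased to propose a confidential business deal".split(" ")) {
          words.addToVector(token, v);                  // hashing, no dictionary lookup
        }
        System.out.println(v);
      }
    }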
37. And again. AdaptiveLogisticRegression: 20 x CrossFoldLearner; evolves good learning and regularization rates; 100 x more work than the basic learner, but still faster than disk + encoding.
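Using it looks almost the same as the basic learner; the evolutionary pool does the knob turning. A sketch against the Mahout 0.x API, again with toy data:

    import java.util.Random;
    import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
    import org.apache.mahout.classifier.sgd.CrossFoldLearner;
    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.ep.State;
    import org.apache.mahout.math.DenseVector;

    public class AdaptiveDemo {
      public static void main(String[] args) {
        AdaptiveLogisticRegression alr = new AdaptiveLogisticRegression(2, 3, new L1());

        Random rand = new Random(1);
        for (int i = 0; i < 10000; i++) {
          double x0 = rand.nextDouble(), x1 = rand.nextDouble();
          int label = x0 + x1 > 1 ? 1 : 0;
          alr.train(label, new DenseVector(new double[] {1, x0, x1}));
        }
        alr.close();   // finish training of every member of the pool

        // The best member, as judged on held-out folds inside each CrossFoldLearner.
        State<AdaptiveLogisticRegression.Wrapper, CrossFoldLearner> best = alr.getBest();
        if (best != null) {
          System.out.println("held-out AUC = " + best.getPayload().getLearner().auc());
        }
      }
    }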
38. A comparison. Traditional view: 400 x (read + OLR). Revised Mahout view: 1 x (read + mu x 100 x OLR) x eta, where mu = efficiency from killing losers early and eta = efficiency from stopping early.
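As an illustration only (these rates are assumptions, not measurements from the talk): with mu = 0.25 and eta = 0.5, the revised cost is 0.5 x (read + 25 x OLR), i.e. roughly half of one pass over the data instead of 400 passes. Since a single OLR update is cheap next to reading and encoding a record, the read term dominates and the revised view does hundreds of times less I/O.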
39. Click modeling architecture. Input flows through map-reduce for feature extraction and down-sampling, a data join brings in side data (now via NFS), and a sequential SGD learner trains on the result.
40. Click modeling architecture. Map-reduce stages cooperate with NFS: input goes through feature extraction and down-sampling, then a data join, feeding several sequential SGD learners running in parallel.
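A sketch of the down-sampling step feeding those sequential learners: keep every click, keep a small random fraction of non-clicks. The keep rate and record layout are invented for illustration:

    import java.io.IOException;
    import java.util.Random;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class DownSampleMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
      private final Random rand = new Random();
      private static final double KEEP_NON_CLICKS = 0.01;       // assumed rate

      @Override
      protected void map(LongWritable key, Text value, Context ctx)
          throws IOException, InterruptedException {
        boolean clicked = value.toString().startsWith("1\t");   // assumed record layout
        if (clicked || rand.nextDouble() < KEEP_NON_CLICKS) {
          // Reducers then partition the sample among the sequential SGD learners.
          ctx.write(NullWritable.get(), value);
        }
      }
    }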