Talk given on September 20 to the Bay Area data mining group. The basic idea: integrating map-reduce programs with the real world is easier than ever.
8. [Diagram: three cluster nodes, each running a task alongside a local NFS server.] Nodes are identical.
9. Sharded text indexing. Input documents flow to mappers, which assign documents to shards; reducers index text to local disk and then copy the index to the clustered index storage. A copy back to local disk is typically required before the search engine can load the index.
10. Conventional data flow. Failure of a reducer causes garbage to accumulate on local disk; failure of a search engine requires another download of the index from clustered storage.
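The copy step in this conventional flow is just a file transfer out of the distributed store, repeated on every search-engine restart. A minimal sketch using the standard Hadoop FileSystem API; the paths are illustrative and the configuration is assumed to point at the cluster:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShardDownload {
      public static void main(String[] args) throws Exception {
        // Assumes the default file system in the Configuration is the cluster store.
        FileSystem fs = FileSystem.get(new Configuration());
        // Illustrative paths; every search-engine restart repeats this download.
        fs.copyToLocalFile(new Path("/indexes/shard-0042"),
                           new Path("/data/local/shard-0042"));
      }
    }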
11. Simplified NFS data flows. Reducers index straight to the task work directory via NFS; the search engine reads the mirrored index directly from clustered index storage. Failure of a reducer is cleaned up by the map-reduce framework.
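With an NFS-mounted cluster file system, the shard can instead be written and read in place. A small self-contained sketch; the /mapr mount point and file names are assumptions, not paths from the talk:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class NfsShardWriter {
      public static void main(String[] args) throws IOException {
        // On MapR this path is cluster storage and a plain file system at the same time.
        Path shard = Paths.get("/mapr/cluster/indexes/shard-0042");
        Files.createDirectories(shard);

        // The reducer writes index files here with ordinary file I/O ...
        Files.write(shard.resolve("segment.dat"), "index data".getBytes());

        // ... and the search engine opens the very same files; no download step.
        System.out.println(new String(Files.readAllBytes(shard.resolve("segment.dat"))));
      }
    }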
12. K-means, the movie. [Diagram: input points are assigned to the nearest centroid; new centroids are aggregated and fed back as the centroids for the next pass.]
16. Old tricks, new dogs. Mapper: assign each point to a cluster; emit (cluster id, (1, point)). Combiner and reducer: sum counts and the weighted sum of points; emit (cluster id, (n, sum/n)). Centroids are read from HDFS to local disk by the distributed cache and read from local disk by the mapper; new centroids are written by map-reduce to HDFS.
17. Old tricks, new dogs. Same mapper, combiner, and reducer, but with MapR FS the centroids are simply read via NFS and the new centroids are written by map-reduce back to MapR FS; the distributed-cache copy step disappears.
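A hedged sketch of the k-means step the two slides above describe, using the stock Hadoop mapper/reducer API. The class names, the comma-separated point format, and the centroids.txt file are all illustrative; Mahout's real k-means uses its own vector and cluster types:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class KMeansStep {
      // Parse "x,y,..." into a point.
      static double[] parse(String s) {
        String[] f = s.split(",");
        double[] p = new double[f.length];
        for (int i = 0; i < f.length; i++) p[i] = Double.parseDouble(f[i]);
        return p;
      }

      static String format(double[] p) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < p.length; i++) {
          if (i > 0) sb.append(',');
          sb.append(p[i]);
        }
        return sb.toString();
      }

      /** Assign each point to the nearest centroid; emit (cluster id, "1<TAB>point"). */
      public static class AssignMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
        private final List<double[]> centroids = new ArrayList<>();

        @Override
        protected void setup(Context ctx) throws IOException {
          // "centroids.txt" is a stand-in: a distributed-cache local file on stock
          // Hadoop, or a plain NFS path on MapR, per slides 16 and 17.
          try (BufferedReader in = new BufferedReader(new FileReader("centroids.txt"))) {
            for (String line; (line = in.readLine()) != null; ) {
              centroids.add(parse(line));
            }
          }
        }

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          double[] p = parse(value.toString());
          int best = 0;
          double bestDist = Double.MAX_VALUE;
          for (int i = 0; i < centroids.size(); i++) {
            double d = 0;
            double[] c = centroids.get(i);
            for (int j = 0; j < p.length; j++) d += (p[j] - c[j]) * (p[j] - c[j]);
            if (d < bestDist) { bestDist = d; best = i; }
          }
          ctx.write(new IntWritable(best), new Text("1\t" + value));
        }
      }

      /** Sum counts and points; emit (cluster id, "n<TAB>mean"). Usable as the
          combiner too, because (n, sum/n) can be turned back into a sum. */
      public static class MeanReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
          long n = 0;
          double[] sum = null;
          for (Text v : values) {
            String[] parts = v.toString().split("\t", 2);
            long k = Long.parseLong(parts[0]);
            double[] mean = parse(parts[1]);
            if (sum == null) sum = new double[mean.length];
            for (int j = 0; j < mean.length; j++) sum[j] += k * mean[j];
            n += k;
          }
          for (int j = 0; j < sum.length; j++) sum[j] /= n;
          ctx.write(key, new Text(n + "\t" + format(sum)));
        }
      }
    }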
18. Poor man’s Pregel. Mapper pseudocode; the lines set in bold on the slide, the model reads and writes, can use conventional I/O via NFS:

    while not done:
        read and accumulate input models
        for each input:
            accumulate model
        write model
        synchronize
        reset input format
    emit summary
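One super-step of that loop can be made concrete with the model as a plain double[] exchanged through a shared NFS directory. The mount point, file naming, model size, and the stubbed-out barrier are all placeholders, not code from the talk:

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class PoorMansPregelStep {
      public static void main(String[] args) throws IOException {
        Path shared = Paths.get(args.length > 0 ? args[0] : "/mapr/models"); // assumed NFS mount
        String me = args.length > 1 ? args[1] : "task-0";
        Files.createDirectories(shared);

        // Read and accumulate the peers' models (conventional reads via NFS).
        double[] model = null;
        int n = 0;
        try (DirectoryStream<Path> dir = Files.newDirectoryStream(shared, "*.model")) {
          for (Path p : dir) {
            double[] m = read(p);
            if (model == null) model = new double[m.length];
            for (int i = 0; i < m.length; i++) model[i] += m[i];
            n++;
          }
        }
        if (model == null) model = new double[10];               // first pass: toy empty model
        else for (int i = 0; i < model.length; i++) model[i] /= n;

        // "for each input: accumulate model" would update the model from local input here.

        // Write this task's model back for the next super-step (conventional write via NFS).
        write(shared.resolve(me + ".model"), model);
        // synchronize: real code must wait here until every peer has written its model.
      }

      static double[] read(Path p) throws IOException {
        java.util.List<String> lines = Files.readAllLines(p);
        double[] m = new double[lines.size()];
        for (int i = 0; i < m.length; i++) m[i] = Double.parseDouble(lines.get(i));
        return m;
      }

      static void write(Path p, double[] m) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (double v : m) sb.append(v).append('\n');
        Files.write(p, sb.toString().getBytes());
      }
    }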
21. What is Mahout? Recommendations (people who x this also x that); clustering (segment data into groups of similar items); classification (learn decision making from examples); stuff (LDA, SVD, frequent item-set mining, math).
23. Classification in detail. Naive Bayes family: Hadoop-based training. Decision forests: Hadoop-based training. Logistic regression (aka SGD): fast on-line (sequential) training.
25. So what? Online training has low overhead for small and moderate-sized data sets. [Chart annotation: "big" starts here.]
27. And another.

From: Dr. Paul Acquah
Dear Sir,
Re: Proposal for over-invoice Contract Benevolence
Based on information gathered from the India hospital directory, I am pleased to propose a confidential business deal for our mutual benefit. I have in my possession, instruments (documentation) to transfer the sum of 33,100,000.00 EUR (thirty-three million one hundred thousand euros, only) into a foreign company's bank account for our favor.
...

Date: Thu, May 20, 2010 at 10:51 AM
From: George <george@fumble-tech.com>
Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon?
28. Mahout’s SGD. Learns on-line, one example at a time; O(1) memory; O(1) time per training example. The sequential implementation is fast, but not parallel.
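In code, on-line training with Mahout's SGD looks roughly like this (Mahout 0.x API; the toy data and parameter values are invented for illustration):

    import java.util.Random;
    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    public class OlrDemo {
      public static void main(String[] args) {
        // 2 categories, 3 features, L1 prior; memory is fixed up front: O(1) per example.
        OnlineLogisticRegression olr =
            new OnlineLogisticRegression(2, 3, new L1()).learningRate(0.1).lambda(1e-4);

        Random rand = new Random(42);
        for (int i = 0; i < 10000; i++) {
          // Toy data: the label is 1 when x0 + x1 > 1.
          double x0 = rand.nextDouble(), x1 = rand.nextDouble();
          int label = x0 + x1 > 1 ? 1 : 0;
          Vector v = new DenseVector(new double[] {1, x0, x1});  // element 0 is a bias term
          olr.train(label, v);                                   // one example at a time
        }
        // classifyScalar returns p(category = 1) in the two-category case.
        System.out.println(olr.classifyScalar(new DenseVector(new double[] {1, 0.9, 0.9})));
      }
    }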
29. Special features. Hashed feature encoding. Per-term annealing: learn the boring stuff once. Auto-magical learning-knob turning: learns the correct learning rate, learns the correct learning rate for learning the learning rate, and so on.
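Hashed feature encoding means terms are hashed straight into a fixed-size vector, so no dictionary has to be built or held in memory. A small sketch with Mahout's encoders; the vector size and probe count are arbitrary choices:

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    public class HashedEncodingDemo {
      public static void main(String[] args) {
        FeatureVectorEncoder words = new StaticWordValueEncoder("body");
        words.setProbes(2);                             // hash each term into 2 locations

        Vector v = new RandomAccessSparseVector(1000);  // fixed size, chosen up front
        for (String token : "pleased to propose a confidential business deal".split(" ")) {
          words.addToVector(token, v);                  // hashing, no dictionary lookup
        }
        System.out.println(v);
      }
    }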
37. And again. AdaptiveLogisticRegression: 20 x CrossFoldLearner; evolves good learning and regularization rates; 100 x more work than the basic learner, but still faster than disk + encoding.
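Using it looks almost the same as the basic learner; the evolutionary pool does the knob turning. A sketch against the Mahout 0.x API, again with toy data:

    import java.util.Random;
    import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
    import org.apache.mahout.classifier.sgd.CrossFoldLearner;
    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.ep.State;
    import org.apache.mahout.math.DenseVector;

    public class AdaptiveDemo {
      public static void main(String[] args) {
        AdaptiveLogisticRegression alr = new AdaptiveLogisticRegression(2, 3, new L1());

        Random rand = new Random(1);
        for (int i = 0; i < 10000; i++) {
          double x0 = rand.nextDouble(), x1 = rand.nextDouble();
          int label = x0 + x1 > 1 ? 1 : 0;
          alr.train(label, new DenseVector(new double[] {1, x0, x1}));
        }
        alr.close();   // finish training of every member of the pool

        // The best member, as judged on held-out folds inside each CrossFoldLearner.
        State<AdaptiveLogisticRegression.Wrapper, CrossFoldLearner> best = alr.getBest();
        if (best != null) {
          System.out.println("held-out AUC = " + best.getPayload().getLearner().auc());
        }
      }
    }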
38. A comparison. Traditional view: 400 x (read + OLR). Revised Mahout view: 1 x (read + mu x 100 x OLR) x eta, where mu = efficiency from killing losers early and eta = efficiency from stopping early.
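As an illustration only (these rates are assumptions, not measurements from the talk): with mu = 0.25 and eta = 0.5, the revised cost is 0.5 x (read + 25 x OLR), i.e. roughly half of one pass over the data instead of 400 passes. Since a single OLR update is cheap next to reading and encoding a record, the read term dominates and the revised view does hundreds of times less I/O.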
39. Click modeling architecture. Input flows through map-reduce for feature extraction and down-sampling, a data join brings in side data (now via NFS), and a sequential SGD learner trains on the result.
40. Click modeling architecture. Map-reduce stages cooperate with NFS: input goes through feature extraction and down-sampling, then a data join, feeding several sequential SGD learners running in parallel.
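A sketch of the down-sampling step feeding those sequential learners: keep every click, keep a small random fraction of non-clicks. The keep rate and record layout are invented for illustration:

    import java.io.IOException;
    import java.util.Random;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class DownSampleMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
      private final Random rand = new Random();
      private static final double KEEP_NON_CLICKS = 0.01;       // assumed rate

      @Override
      protected void map(LongWritable key, Text value, Context ctx)
          throws IOException, InterruptedException {
        boolean clicked = value.toString().startsWith("1\t");   // assumed record layout
        if (clicked || rand.nextDouble() < KEEP_NON_CLICKS) {
          // Reducers then partition the sample among the sequential SGD learners.
          ctx.write(NullWritable.get(), value);
        }
      }
    }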