Page1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Machine Learning with Apache Spark
Published in: Data & Analytics
  1. Machine Learning with Apache Spark
  2. Agenda
     • Machine Learning Overview
     • Spark
       – Spark Essentials
       – Sample Code
     • Machine Learning Libraries in Spark
     • MLlib
     • GraphX
     • Code Example
  3. Machine Learning Overview
     Architecting the Future of Big Data
  4. Machine Learning
     • Arthur Samuel (1959): machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.
       – Author of an early checkers-playing program
  5. Machine Learning
     • Supervised
       – Regression
       – Classification, e.g. SVM (Support Vector Machines)
     • Unsupervised
       – Clustering
       – Recommendation
       – Outlier detection
       – Affinity analysis
     • Learning theory
     • Reinforcement learning
  6. Supervised Learning
     Infer a target function from a labeled dataset. Examples: classification, regression.
     Labeled dataset:
       ID   Total$  Age  City  Target
       101  200     25   SF    2
       102  350     35   LA    2
       103  25      15   LA    1
       …    …       …    …
     Test data (the model predicts the Target):
       ID   Total$  Age  City
       105  234     22   NYC
       106  112     67   BOS
  7. Unsupervised Learning
     Identify naturally occurring patterns in data. Example: clustering.
     Dataset (no labels):
       ID   Total$  Age  City
       101  200     25   SF
       102  350     35   LA
       103  25      15   LA
       …    …       …    …
     The model finds naturally occurring hidden structure.
  8. Example: Email Spam Detection
     • 2 classes: Spam or Not-Spam
     • Features: words that appear (or not) in the email text
  9. Regression Analysis
     Labeled dataset:
       ID   Age  City  Target
       101  25   SF    $200
       102  35   LA    $350
       103  15   LA    $25
       …    …    …     …
     Test data:
       ID   Age  City  Target
       104  17   NYC   ?
     Techniques: linear regression, decision trees, and many more.
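To make the regression slide concrete, here is a minimal sketch that fits a one-variable least-squares line to the three labeled rows (Age against Target) and predicts the target for test ID 104 (age 17). Using only Age and ignoring City is a simplifying assumption, the helper name `fit_line` is hypothetical, and this is plain Python rather than MLlib.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and co-deviation of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Labeled rows from the slide: IDs 101, 102, 103.
ages = [25, 35, 15]
targets = [200.0, 350.0, 25.0]

slope, intercept = fit_line(ages, targets)

# Test row from the slide: ID 104, age 17 (City is ignored in this sketch).
predicted_104 = slope * 17 + intercept
print(round(slope, 2), round(predicted_104, 2))
```

With only three training rows the fit is purely illustrative; a real model would also encode City and use far more data.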
  10. Regression Example: Ad Click-Through Rate (CTR) Prediction
      Rank = bid * CTR
      Predict the CTR of each ad to determine its placement, based on:
      – Historical CTR
      – Keyword match
      – Etc.
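The ranking rule on this slide can be sketched in a few lines. The ad names, bids, and CTR values below are made up for illustration, and `Ad` and `rank_ads` are hypothetical names; in practice the CTR would come from a trained regression model.

```python
from typing import NamedTuple


class Ad(NamedTuple):
    name: str
    bid: float            # amount the advertiser bids per click (made-up values)
    predicted_ctr: float  # model's click-through-rate estimate, between 0 and 1


def rank_ads(ads):
    """Sort ads by rank = bid * CTR, highest rank first."""
    return sorted(ads, key=lambda ad: ad.bid * ad.predicted_ctr, reverse=True)


ads = [
    Ad("A", bid=2.00, predicted_ctr=0.01),
    Ad("B", bid=0.50, predicted_ctr=0.08),
    Ad("C", bid=1.00, predicted_ctr=0.03),
]

ranked = rank_ads(ads)
order = [ad.name for ad in ranked]
print(order)
```

Note that the highest bidder (A) does not win placement here: a low bid with a high predicted CTR can outrank it.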
  11. Example: Netflix Movie Recommendations
  12. Clustering
  13. Recommendation Engine
      Rows are users (the slide lists IDs 101, 102, 103, 104, 105, …), columns are movies.
      Observed ratings (? = not yet rated):
        Harry Potter  X-Men  Hobbit  Argo  Pirates
        5             2      4       ?     ?
        ?             ?      5       2     ?
        1             2      ?       ?     3
      Ratings after the missing entries are predicted:
        Harry Potter  X-Men  Hobbit  Argo  Pirates
        5             2      4       1     3
        4             1      5       2     3
        1             2      4       1     3
  14. Detecting Natural Groups
  15. Determining Class Labels
        ID   Total$  Age  City  Class
        101  $200    25   SF    2
        102  $350    35   LA    2
        103  $25     15   LA    1
        …    …       …    …
      N variables
      Some techniques:
      – K-means
      – Spectral clustering
      – DBSCAN
      – Hierarchical clustering
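Of the techniques listed, k-means is the simplest to sketch. The following is a minimal plain-Python version of Lloyd's algorithm, not MLlib's implementation; the six 2-D points are made up, and initializing centroids from the first k points is a simplifying assumption (real implementations use smarter seeding).

```python
import math


def kmeans(points, k, iterations=10):
    """Lloyd's algorithm: alternate an assignment step and a centroid update."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centroids, assignments


# Made-up data with two visually obvious groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.5)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

The returned labels play the role of the Class column on the slide: they are discovered from the data rather than supplied with it.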
  16. Detecting Outliers
      Figure label: outlier point
  17. Example: Credit Card Fraud Detection
  18. Task #6: Affinity Analysis
      Transactions (rows) against items (columns); Y = item present in the transaction:
              Item1  Item2  Item3  Item4  Item5  …
        Tx 1  Y      N      N      Y      N
        Tx 2  Y      N      N      Y      N
        Tx 3  Y      Y      N      Y      N
        Tx 4  N      N      Y      Y      Y
        Tx 5  …
      Goal: identify frequent itemsets
      Techniques: FP-Growth, Apriori
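As a rough illustration of the goal stated on this slide, the sketch below counts itemsets of size one and two in the four transactions shown and keeps those meeting a minimum support. It assumes the four Y/N rows correspond to Tx 1 through Tx 4, and it uses brute-force counting rather than Apriori's candidate pruning or FP-Growth's tree structure; `frequent_itemsets` is a hypothetical helper name.

```python
from itertools import combinations

# Rows Tx 1..Tx 4 from the slide; columns Item1..Item5 (Y = item present).
rows = ["YNNYN", "YNNYN", "YYNYN", "NNYYY"]
transactions = [
    {f"Item{i + 1}" for i, flag in enumerate(row) if flag == "Y"}
    for row in rows
]


def frequent_itemsets(transactions, min_support, max_size=2):
    """Count every itemset up to max_size and keep those seen often enough."""
    counts = {}
    for tx in transactions:
        for size in range(1, max_size + 1):
            # sorted() gives each itemset one canonical tuple representation.
            for itemset in combinations(sorted(tx), size):
                counts[itemset] = counts.get(itemset, 0) + 1
    return {s: c for s, c in counts.items() if c >= min_support}


# Keep itemsets that occur in at least 3 of the 4 transactions.
frequent = frequent_itemsets(transactions, min_support=3)
print(frequent)
```

Brute-force enumeration grows exponentially with basket size, which is exactly the cost that Apriori and FP-Growth are designed to avoid.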
  19. Example: Market Basket Analysis
      Use affinity analysis for:
      – Store layout design
      – Coupons
  20. Apache Spark
  21. Apache Spark
      • Apache Spark is an open source project for fast, large-scale data processing.
        – Simple and expressive programming model
        – Machine learning, graph computation, and streaming
        – In-memory compute for iterative workloads
      • It does most of its processing in memory
      • It supports the programming languages Java, Scala, and Python
      • It provides high-level modules:
        – MLlib
        – GraphX
        – Spark Streaming
        – Spark SQL
      • Cluster managers:
        – YARN (recommended)
        – Mesos
        – Spark's own standalone manager
  22. Brief History
  23. RDD
      • It is Spark's abstraction for a distributed collection of items
      • RDD stands for Resilient Distributed Dataset
      • It can be created:
        – from Hadoop InputFormats
        – by transforming other RDDs
  24. Actions and Transformations
      Actions
      – Return values
      – An action results in a DAG of operations
      – The DAG is compiled into stages, and each stage is executed as a series of tasks
      – Tasks are the fundamental units of work
      Transformations
      – Return pointers to new RDDs
      – Are lazy (not computed immediately)
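The difference between lazy transformations and actions can be illustrated without a cluster. The class below is a toy stand-in, not Spark's RDD: `LazyDataset` is a hypothetical name, it runs in a single process, and it only borrows the method names `map`, `filter`, `collect`, and `count` to mirror the idea that transformations record work while actions trigger it.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)  # pending transformations, in order

    # Transformations: return a new dataset and compute nothing yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    # Actions: run the recorded pipeline and return a value.
    def collect(self):
        items = iter(self._source)
        for kind, fn in self._steps:
            # Inside a method, map/filter refer to the Python builtins.
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def count(self):
        return len(self.collect())


calls = []


def square(x):
    calls.append(x)  # record when the work actually happens
    return x * x


# Two transformations: nothing is computed at this point.
squares = LazyDataset(range(5)).map(square).filter(lambda v: v % 2 == 0)
calls_before_action = len(calls)

# The action triggers the whole pipeline.
result = squares.collect()
print(calls_before_action, result)
```

Deferring work until an action is what lets an engine see the whole chain of operations before planning how to execute it.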
  25. Spark in a Cluster
      • Spark applications run as independent sets of processes on a cluster
      • The SparkContext object in the driver program initiates and coordinates them
        – A SparkContext is created when you start the spark-shell
        – It is accessible as “sc”
        – SparkContext(master: String, jobName: String)
        – master: the location of the cluster
      • The cluster manager allocates resources on the cluster
      • Spark acquires executors on the nodes
      • Spark sends your application code and tasks to the executors
  26. Spark Monitoring
  27. Sample Program (Java)
  28. Spark Sample
      Executing the Spark shell on a cluster:
        $ spark-shell --master yarn-client --num-executors 1 --driver-memory 512m --executor-memory 512m --executor-cores 1
      Executing the Spark Pi example:
        $ spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 /root/Spark11/lib/spark-examples*.jar
  29. Demo
      • Fire up a Spark VM with HDP
      • Start the spark-shell
      • Spark Pi
      • Word Count Example
  30. MLlib (Machine Learning Library)
  31. Spark Stack
  32. Machine Learning
      MLlib is Spark's scalable machine learning library. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction.
      Dependencies:
      • Breeze
        – A library for numerical processing
      • netlib-java and jblas
        – Numeric and matrix libraries for Java
      • gfortran runtime library
  33. MLlib
      • Basic Statistics
        – Correlations
        – Stratified sampling
        – Hypothesis testing
        – Random data generation
      • Classification and Regression
          Problem type               Supported methods
          Binary classification      Linear SVMs, logistic regression, decision trees, naive Bayes
          Multiclass classification  Decision trees, naive Bayes
          Regression                 Linear least squares, Lasso, ridge regression, decision trees
  34. MLlib – Collaborative Filtering
      • Collaborative filtering techniques aim to fill in the missing entries of a user-item association matrix.
      • MLlib currently supports model-based collaborative filtering, in which users and products are described by a small set of latent factors that can be used to predict missing entries.
      • MLlib uses the alternating least squares (ALS) algorithm to learn these latent factors.
      • Reference: Large-scale Parallel Collaborative Filtering for the Netflix Prize
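To show what "alternating" means in ALS, here is a deliberately small sketch with a single latent factor per user and per item, run on the three rating rows from the recommendation-engine slide. It is plain Python, not MLlib's ALS: the rank of 1, the regularization value, the number of sweeps, and the names `als_rank1` and `objective` are all assumptions made for illustration.

```python
# Observed ratings from the recommendation-engine slide; None marks a "?".
ratings = [
    [5, 2, 4, None, None],
    [None, None, 5, 2, None],
    [1, 2, None, None, 3],
]


def objective(ratings, user_f, item_f, reg):
    """Squared error on observed entries plus L2 regularization."""
    err = sum(
        (r - user_f[u] * item_f[i]) ** 2
        for u, row in enumerate(ratings)
        for i, r in enumerate(row)
        if r is not None
    )
    return err + reg * (sum(x * x for x in user_f) + sum(x * x for x in item_f))


def als_rank1(ratings, reg=0.1, sweeps=20):
    """Alternating least squares with one latent factor per user and item."""
    n_users, n_items = len(ratings), len(ratings[0])
    user_f = [1.0] * n_users
    item_f = [1.0] * n_items
    history = [objective(ratings, user_f, item_f, reg)]
    for _ in range(sweeps):
        # Fix item factors; each user's factor is a 1-D ridge regression.
        for u in range(n_users):
            obs = [(i, r) for i, r in enumerate(ratings[u]) if r is not None]
            user_f[u] = sum(r * item_f[i] for i, r in obs) / (
                sum(item_f[i] ** 2 for i, _ in obs) + reg
            )
        # Fix user factors; each item's factor is a 1-D ridge regression.
        for i in range(n_items):
            obs = [
                (u, ratings[u][i])
                for u in range(n_users)
                if ratings[u][i] is not None
            ]
            item_f[i] = sum(r * user_f[u] for u, r in obs) / (
                sum(user_f[u] ** 2 for u, _ in obs) + reg
            )
        history.append(objective(ratings, user_f, item_f, reg))
    return user_f, item_f, history


user_f, item_f, history = als_rank1(ratings)

# Predicted rating for every (user, item) pair, including the "?" entries.
predicted = [[user_f[u] * item_f[i] for i in range(5)] for u in range(3)]
print(round(history[0], 2), round(history[-1], 2))
```

Because each half-step solves its subproblem exactly, the regularized objective never increases from one sweep to the next. A single factor is too coarse to reproduce the completed matrix on the slide; practical models use a larger rank.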
  35. MLlib – Clustering
      • Clustering is an unsupervised learning problem
      • MLlib supports k-means clustering
  36. MLlib - Dimensionality Reduction
      Dimensionality reduction is the process of reducing the number of variables under consideration. It can be used to extract latent features from raw and noisy features, or to compress data while maintaining its structure. MLlib provides support for dimensionality reduction on the RowMatrix class.
  37. MLlib - Feature Extraction and Transformation
      • TF-IDF: term frequency-inverse document frequency is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus.
      • Word2Vec: computes distributed vector representations of words. The main advantage of distributed representations is that similar words are close in the vector space, which makes generalization to novel patterns easier and model estimation more robust.
      • StandardScaler
      • Normalizer
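The TF-IDF weighting described above can be sketched in plain Python. This version uses the raw term count as the term frequency and the smoothed inverse document frequency log((N + 1) / (df + 1)); treat that particular formula as an assumption, since several variants exist. The three example documents and the function name `tf_idf` are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    # Smoothed IDF; a term present in every document gets weight 0.
    idf = {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}
    # Term frequency here is the raw count of the term in the document.
    return [
        {t: count * idf[t] for t, count in Counter(doc).items()}
        for doc in documents
    ]


docs = [
    "spark makes machine learning scalable".split(),
    "spark runs on yarn".split(),
    "machine learning with spark".split(),
]
weights = tf_idf(docs)
print(weights[1])
```

The word "spark" appears in every document, so it carries no weight, while a word unique to one document ("yarn") scores highest there.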
  38. Machine Learning Demo
      • Movie Rating
  39. GraphX
  40. GraphX
      • Spark API for graphs and graph-parallel computation
  41. GraphX - Demo
      http://ampcamp.berkeley.edu/4/exercises/graph-analytics-with-graphx.html
  42. Questions
