1. © Cloudera, Inc. All rights reserved.
Hadoop for the Data Scientist:
Spark in Cloudera 5.5
Anand Iyer | Senior Product Manager | Cloudera
Sandy Ryza | Senior Data Scientist | Cloudera
2.
Agenda
• Apache Spark Overview
• Machine Learning with Hadoop and Spark
• Machine Learning Use Cases
• What’s Next
3.
Cloudera Enterprise
Making Hadoop Fast, Easy, and Secure
A new kind of data platform:
• One place for unlimited data
• Unified, multi-framework data access
Cloudera makes it:
• Fast for business
• Easy to manage
• Secure without compromise
[Diagram: platform layers]
• OPERATIONS
• DATA MANAGEMENT
• INTEGRATE: STRUCTURED; UNSTRUCTURED
• PROCESS, ANALYZE, SERVE: BATCH; STREAM; SQL; SEARCH; SDK
• UNIFIED SERVICES: RESOURCE MANAGEMENT; SECURITY
• STORE: FILESYSTEM; RELATIONAL; NoSQL
4.
One Platform, Many Workloads
Batch, Interactive, and Real-Time.
Leading performance and usability in one platform.
• End-to-end analytic workflows
• Access more data
• Work with data in new ways
• Enable new users
[Diagram: platform components]
• OPERATIONS: Cloudera Manager, Cloudera Director
• DATA MANAGEMENT: Cloudera Navigator, Encrypt and KeyTrustee, Optimizer
• INTEGRATE: STRUCTURED (Sqoop); UNSTRUCTURED (Kafka, Flume)
• PROCESS, ANALYZE, SERVE: BATCH (Spark, Hive, Pig, MapReduce); STREAM (Spark); SQL (Impala); SEARCH (Solr); SDK (Kite)
• UNIFIED SERVICES: RESOURCE MANAGEMENT (YARN); SECURITY (Sentry, RecordService)
• STORE: FILESYSTEM (HDFS); RELATIONAL (Kudu); NoSQL (HBase)
5.
Apache Spark
Flexible, in-memory data processing for Hadoop
Easy Development
• Rich APIs for Scala, Java, and Python
• Interactive shell
Flexible, Extensible API
• APIs for different types of workloads: Batch, Streaming, Machine Learning, Graph
Fast Batch & Stream Processing
• In-memory processing and caching
6.
The Spark Ecosystem & Hadoop
• BATCH & STREAM: Spark (Spark Streaming, Spark SQL, DataFrames, MLlib, …)
• SQL: Impala
• SEARCH: Solr
• SDK: Kite
• INTEGRATE: STRUCTURED (Sqoop); UNSTRUCTURED (Kafka, Flume)
• UNIFIED SERVICES: RESOURCE MANAGEMENT (YARN); SECURITY (Sentry, RecordService)
• STORE: FILESYSTEM (HDFS); RELATIONAL (Kudu); NoSQL (HBase)
7.
Easy Machine Learning
on data distributed over a large cluster of machines
8.
Process Flow for ML Development
• Phases: Traditional Data Management; Model Development Phase; Production Modeling; Production Scoring*
• Supporting layers: Metadata Management; Development Tools (IDEs, source control, notebooks); Scheduling, Workflow, Publishing
• Steps: Data Ingest; Data Prep; Feature Engineering; Visualization; Modeling (incl. hyperparameter search & model validation); Feature Generation + Model Building; Model Quality, Usage, Perf. Metrics; Experiments; Batch Scorer; Online Model Update Server + Scoring
*There may be further steps after scoring, such as aggregations, visualizations, reporting, etc.
9.
What is MLlib?
Library of machine learning and data mining algorithms and utilities
• Implemented in Spark
• Invoked within Java, Scala, or Python Spark applications
MLlib applications are Spark applications
• Running them effectively requires Spark knowledge
• Recommended deployment on YARN
• MLlib apps require the same set of parameters that Spark applications require (number of executors, memory per executor, etc.)
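The parameters listed above correspond to standard spark-submit options. The following is a minimal sketch of how such a command line could be assembled for a YARN deployment. The application file name train_model.py and the resource values are hypothetical, chosen only for illustration. Depending on the Spark version, the master may need to be written as yarn-client or yarn-cluster instead.

```python
import shlex


def build_spark_submit(app, num_executors, executor_memory, executor_cores):
    """Assemble a spark-submit argument list for a YARN deployment.

    The flags are standard spark-submit options; the application path and
    the resource values passed in are illustrative assumptions.
    """
    return [
        "spark-submit",
        "--master", "yarn",
        "--num-executors", str(num_executors),
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        app,
    ]


# Hypothetical MLlib application and sizing, for illustration only.
cmd = build_spark_submit("train_model.py", 10, "4g", 2)
command_line = " ".join(shlex.quote(part) for part in cmd)
print(command_line)
```

The same sizing questions (how many executors, how much memory each) apply to an MLlib job as to any other Spark job.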
10.
What Does MLlib Contain?
• Machine learning models for classification and regression
• Recommender System
• Clustering Algorithms
• Feature Engineering Algorithms and Utilities
• Data Mining Algorithms & Basic Statistical Analysis Utilities
12.
Classification & Regression
Traditional Models
• Linear and Logistic Regression
• Naïve Bayes
• Decision Trees
• Support Vector Machines
Next-Gen Models
• Gradient Boosted Trees
• Random Forests
14.
Clustering Algorithms
• K-Means
• Power Iteration Clustering (PIC)
• Gaussian Mixture Model
• Streaming K-Means
Textual data clustering, i.e. identifying “topics” in a corpus of documents:
• Latent Dirichlet Allocation (LDA)
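To make the first algorithm in the list concrete, here is a minimal single-machine sketch of K-Means (Lloyd's algorithm) in plain Python. It is not MLlib's distributed implementation. The toy points and the initial centers are made up for illustration.

```python
def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm on a list of (x, y) tuples with given initial centers."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each center to the mean of its assigned points.
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster is empty
                centers[i] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centers


# Toy data (made up for illustration): two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)
```

Each iteration alternates an assignment step and an update step. In a distributed setting, the assignment step is the part that parallelizes naturally over partitions of the data.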
15.
Collaborative Filtering
For Building Recommender Systems
• Predicting the interests of a user by collecting a partial list of preferences from many users
• Predicting missing items of a user-item association matrix
• Algorithm used: Alternating Least Squares
• Admittedly limited choice of algorithms
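The idea of filling in missing entries of a user-item matrix with Alternating Least Squares can be sketched in a few lines. The version below is a deliberately simplified rank-1 variant in plain Python, not MLlib's ALS. The ratings, the regularization value and the iteration count are assumptions made up for illustration.

```python
def als_rank1(ratings, n_users, n_items, reg=0.01, iterations=20):
    """Rank-1 alternating least squares on observed (user, item) -> rating entries."""
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(iterations):
        # Fix the item factors and solve a regularized least-squares problem per user.
        for i in range(n_users):
            num = sum(r * v[j] for (a, j), r in ratings.items() if a == i)
            den = reg + sum(v[j] ** 2 for (a, j) in ratings if a == i)
            u[i] = num / den
        # Fix the user factors and solve per item.
        for j in range(n_items):
            num = sum(r * u[i] for (i, b), r in ratings.items() if b == j)
            den = reg + sum(u[i] ** 2 for (i, b) in ratings if b == j)
            v[j] = num / den
    return u, v


def squared_error(ratings, u, v):
    """Sum of squared differences between observed and predicted ratings."""
    return sum((r - u[i] * v[j]) ** 2 for (i, j), r in ratings.items())


# Made-up ratings for 3 users and 2 items; the entry (1, 1) is unobserved.
ratings = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (2, 0): 3.0, (2, 1): 6.0}
u, v = als_rank1(ratings, n_users=3, n_items=2)
predicted_missing = u[1] * v[1]
print(predicted_missing)
```

Because the observed toy ratings follow a rank-1 pattern, the prediction for the missing entry lands close to 4, the value that pattern implies. Real rating data would need a higher rank and tuned regularization.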
18.
Feature Engineering & Modeling Utilities
• Feature Scaling & Normalization
• Statistical Correlation Functions (Pearson & Spearman’s)
• Tests of Statistical Significance
• Chi-Squared independence test for feature selection
• Evaluation metrics: Precision, Recall, AUROC, F1-Measure, etc.
Dimensionality Reduction:
• Principal Component Analysis (PCA)
• Singular Value Decomposition (SVD)
Textual Feature Generation:
• Word2Vec
• Term Frequency – Inverse Document Frequency (TF-IDF)
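As an illustration of textual feature generation, the sketch below computes TF-IDF weights in plain Python. It assumes raw term counts for TF and a smoothed IDF of the form log((N + 1) / (df + 1)). Both choices are assumptions made for this example, and the three documents are invented.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Uses raw term counts for TF and the smoothed IDF
    log((N + 1) / (df + 1)), where N is the number of documents and
    df the number of documents containing the term.
    """
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: count * math.log((n_docs + 1) / (doc_freq[term] + 1))
            for term, count in counts.items()
        })
    return weights


# Invented mini-corpus, already tokenized by whitespace.
docs = [
    "spark makes machine learning easy".split(),
    "spark runs on hadoop".split(),
    "hadoop stores data".split(),
]
weights = tf_idf(docs)
print(weights[0])
```

Terms that appear in many documents (here "spark") receive a lower weight than terms confined to a single document (here "machine").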
20.
Data Mining: Frequent Pattern Mining
Data Mining Urban Legend:
A Frequent Pattern Mining algorithm run on supermarket purchase data revealed that “Men who buy diapers have a very high likelihood of buying beer!”
Algorithms in MLlib:
• Frequent Pattern-Growth
• Association Rule Mining
• PrefixSpan
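The two quantities behind an association rule such as "diapers → beer" are support and confidence. The plain-Python sketch below computes both by brute force on a handful of made-up baskets. It is not real purchase data and does not use the FP-Growth algorithm itself.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Made-up baskets, purely illustrative (not real purchase data).
baskets = [
    ["diapers", "beer", "chips"],
    ["diapers", "beer"],
    ["diapers", "milk"],
    ["beer", "chips"],
    ["milk", "bread"],
]
rule_support = support(baskets, ["diapers", "beer"])
rule_confidence = confidence(baskets, ["diapers"], ["beer"])
print(rule_support, rule_confidence)
```

Frequent pattern mining algorithms exist to find the high-support itemsets without enumerating every candidate, which the brute-force counting here would not scale to.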
21.
What about “Deep Learning”?
Deep Learning is an umbrella term for large, complex Multi-Layer Neural Networks
• MLlib contains a robust Multilayer Neural Network implementation
24.
Pipeline API: Hooking the Pieces Together
• Inspired by scikit-learn pipelines
• ML involves running multiple sequential steps
E.g. Text Classification Pipeline: Tokenize → Bag of Words → TF-IDF → LDA → Scale & Normalize Features → Train Classifier
• Sequence is repeated during Training and Scoring
• Hyper-Parameter Tuning: repeat the sequence with different parameter values
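Repeating the sequence with different parameter values amounts to a grid search. The sketch below shows the idea in plain Python. The parameter names reg and num_topics and the run_sequence scoring function are hypothetical stand-ins for training the full sequence and evaluating it on held-out data.

```python
from itertools import product


def param_grid(grid):
    """Expand {name: [values]} into a list of {name: value} combinations."""
    names = sorted(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


def grid_search(grid, run_sequence):
    """Run the whole sequence once per combination and keep the best score."""
    results = [(run_sequence(params), params) for params in param_grid(grid)]
    return max(results, key=lambda pair: pair[0])


# Hypothetical stand-in for "train the sequence, then score on held-out data".
def run_sequence(params):
    return -abs(params["reg"] - 0.1) - abs(params["num_topics"] - 20) / 100.0


grid = {"reg": [0.01, 0.1, 1.0], "num_topics": [10, 20]}
best_score, best_params = grid_search(grid, run_sequence)
print(best_params)
```

Every combination reruns all steps, which is why expressing the steps as one reusable sequence matters for tuning.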
25.
Overview of Pipeline API
• Create Pipeline as a sequence of Stages:
• Transformers: Transform or augment features
• Estimators: Fit a model
• Re-use Pipeline
• Basic save and load functionality available
• Invoke Pipeline with a different set of parameters, passed as a ParamMap
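The Transformer/Estimator split can be illustrated without Spark. The plain-Python sketch below mimics the pattern only. The class names Scaler, ScalerModel, Square and PipelineModel are hypothetical and are not the MLlib API. An Estimator's fit() returns a fitted Transformer, and the fitted pipeline replays the same stages at scoring time.

```python
class Scaler:
    """Estimator: fit() learns the maximum and returns a Transformer."""

    def fit(self, rows):
        return ScalerModel(max(rows))


class ScalerModel:
    """Transformer: rescales values by the maximum learned during fit()."""

    def __init__(self, largest):
        self.largest = largest

    def transform(self, rows):
        return [r / self.largest for r in rows]


class Square:
    """Transformer with nothing to learn."""

    def transform(self, rows):
        return [r * r for r in rows]


class Pipeline:
    """Sequence of stages; fitting it yields a PipelineModel."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # Estimator -> fitted Transformer
            rows = stage.transform(rows)  # feed transformed data to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


class PipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


model = Pipeline([Square(), Scaler()]).fit([1.0, 2.0, 4.0])  # training data
scored = model.transform([2.0, 8.0])  # new data, same sequence of steps
print(scored)
```

The scaling factor is learned once from the training data and reused unchanged on new data, which is what keeps training and scoring consistent.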
31.
Process Flow for ML Development
Score streaming events in
Spark Streaming.
33.
Predicting Influencers at a Large Telco
• Customer loyalty is difficult and expensive to maintain
• Aggressive competition
34.
Social Churn
• Churn is not an isolated event!
• When influential subscribers leave, they take their friends with them
35.
Casting This as a Data Science Problem
• Can we quantify: Which lost users were the most influential?
• Can we predict: Which current subscribers have as much influence?
36.
The Challenge: Lots of Customers, Lots of Data
• Over 100 million customers
• Over 1 billion connections
38.
Calculating Influencer Scores
• Connection: pair of users with communication both ways
• Influencer score: number of connected users that churn after user X churns
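The two definitions above translate directly into code. The sketch below is a single-machine illustration in plain Python, not the Spark job used in the project. The call records and churn days are invented, and "after" is assumed to mean any later day, since the slides do not state a time window.

```python
from collections import defaultdict


def connections(calls):
    """Map each user to the users they have two-way communication with."""
    directed = set(calls)
    graph = defaultdict(set)
    for caller, callee in directed:
        if (callee, caller) in directed:
            graph[caller].add(callee)
            graph[callee].add(caller)
    return graph


def influencer_scores(calls, churn_day):
    """Count, for each churned user, the connected users who churned later."""
    graph = connections(calls)
    return {
        user: sum(
            1 for friend in graph[user]
            if friend in churn_day and churn_day[friend] > day
        )
        for user, day in churn_day.items()
    }


# Invented call records (caller, callee); a -> d is one-way, so not a connection.
calls = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "a"),
         ("a", "d"), ("d", "b"), ("b", "d")]
# Invented churn days per user.
churn_day = {"a": 3, "b": 15, "c": 5, "d": 20}
scores = influencer_scores(calls, churn_day)
print(scores)
```

At the scale quoted earlier (over 100 million customers and over 1 billion connections), the same logic would be expressed as joins over distributed user and connection tables.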
39.
Predicting Influencer Scores
MLlib!
• Regression model
• Linear regression
• Random forests
• Features
• # of connections, # of calls to connections
• Internal vs. External
40.
Breaking Down the Work
• Building User and Connection Tables
• Computing Historical Influencer Scores
• Feature Generation
• Model Fitting
• Model Evaluation
42.
Roadmap Update
MANAGEMENT
• Initial Spark-on-YARN integration for shared resource management
• New metrics for easier diagnosis
• Improved Spark-on-YARN for better multi-tenancy, performance, ease of use
• Automated configurations to optimize over time
• Visibility into resource utilization
• Improved PySpark integration for Python access
SECURITY
• Kerberos-based authorization
• Fine-grained access control
• Auditing and lineage (Governance)
• Integration with Intel’s Advanced Encryption libraries
• Full PCI compliance
SCALE
• Improved integration with HDFS to enable scheduling
• Reduced memory pressure on larger jobs
• Dynamic resource utilization and prioritization
• Stress test at scale with mixed multi-tenant workloads
STREAMING
• Spark Streaming resiliency for zero data loss
• Data ingest integration for Kafka and Flume
• Improved state management for better performance
• Higher-level language extensions
44.
Data Science & Spark Training Courses
university.cloudera.com
46.
Spark Resources
• Learn Spark
• O’Reilly Advanced Analytics with Spark eBook (written by Clouderans)
• Cloudera Developer Blog: blog.cloudera.com/spark
• Spark Page: cloudera.com/spark
• Get Trained
• Cloudera Spark Training: university.cloudera.com
• Try it Out
• Cloudera Live Spark Tutorial: cloudera.com/live
• Download Cloudera 5.5: cloudera.com/downloads