Prerequisites:
• Spark RDD: Resilient Distributed Datasets
• Spark Streaming
1
The registration link is http://go2.valorem.com/s0TC0X00NAtL0V050h20c0Y. It’s best to
email Kay, so she can work to whitelist you.
Venture Café: http://www.vencafstl.org/event/the-venture-cafe-gathering-4/?instance_id=17473.
It’s a place for us to hang out!
3
Many Coursera and edX courses, such as
https://www.coursera.org/learn/big-data-integration-processing/lecture/uW2js/spark-streaming,
are good resources. I also used Safari books to develop the content.
HDInsight link:
https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-hdinsight-hadoop-use-portal
Spark feature engineering link: https://spark.apache.org/docs/2.1.0/ml-features.html
4
Microsoft Cloud Data Platform & some of the things I care about. We’ll talk about some of
these highlighted boxes. The R Services will be covered in the R UG meetup on 6/9.
5
Intelligent Cloud.
This is not only a pretty picture of the components of the Cortana Intelligence Suite; it
serves as an architecture as well.
6
https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-hdinsight-hadoop-use-portal
I’ll show you LIVE! how to provision an HDInsight Spark cluster using the Azure Portal. The
demo covers what you need to do to provision a new cluster on an existing Data Lake
Store. The key is to know how to hook up Data Lake Store correctly, with the appropriate
permissions assigned. I’ll also show you the other Hadoop cluster types that HDInsight
supports, the HDInsight dashboards, Jupyter Notebook, and how to ssh to the server.
The beauty of HDInsight is that it’s a Platform as a Service (PaaS) Hadoop. It separates the
data store from the compute, so the data persists independently even when the
compute is deleted. This separation helps to minimize consumption.
Please sign up to use Azure for free, use it via MSDN, or use your organization’s subscription.
7
Other interesting questions for you to find out.
• How do I work with data in Spark?
• How do I write Spark programs?
• What are Notebooks?
• How do I query data in Spark using SQL?
8
9
10
We demo this live.
11
12
13
Apache Spark runs many types of machine learning algorithms through MLlib.
The driver program delegates the work to the executors. Each executor processes parts
of the files that it reads from external storage, and stores the file data, as well as the
transformed data, in cluster memory.
The machine learning algorithm in this case is designed to run in a highly distributed,
fault-tolerant environment, where the data itself is stored across the cluster and the
operations of the algorithm are distributed across the Spark cluster too.
14
These are the concepts we will cover today, mostly briefly.
Spark MLlib covers pretty much all of the major groups of ML algorithms,
whether that’s classification, regression, recommender systems, or
unsupervised learning and clustering.
<click>
15
(We’ll go into detail about logistic regression for binary classification because it’s
our demo.)
We’ll explain the main components of an ML workflow and an ML pipeline, and the
motivations behind them.
We’ll talk about the code framework for how ML and streaming data come together
in Spark, and in fact how convenient it is to use Spark in the streaming
environment. It makes ML that much more fun and that much more bleeding edge!
<click>
16
Binary Classification
17
18
https://www.quora.com/What-is-logistic-regression
logit(p) = log(p / (1 - p)) = log(probability of event happening / probability of event not
happening) = log(odds)
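To make the logit concrete, here is a minimal sketch in plain Python (no Spark needed). The function names logit and sigmoid are my own for illustration; sigmoid is simply the inverse of logit, which is how logistic regression turns a score back into a probability.

```python
import math

def logit(p):
    """Log-odds of a probability p, for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of logit: maps a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# A probability of 0.5 means odds of 1, so the log-odds are 0.
even_odds = logit(0.5)

# Going from probability to log-odds and back should recover the probability.
round_trip = sigmoid(logit(0.8))
```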
19
We’re not going to demo regression, but the slides are here for your reference.
Let’s talk about regression.
<click>
20
Understanding Regression
What are we trying to do with regression?
We are trying to predict the line of best fit between one or many variables from a scatter
plot of points
In order to find the line of best fit we need to calculate a couple of things about the line.
<CLICK>
We need to calculate the slope of the line m
<CLICK>
We also need to calculate the intercept with the y axis c
<CLICK>
In order to do this we need to measure the y distance of each of the points from a line of
best fit and then sum the error margin (i.e. the distance to the line)
<CLICK>
So we begin with the equation of the line, y = mx + c. Remember it from school?
We use the concept of Ordinary Least Squares by summing the squares of the y-distances
between the points and the line.
A bit of maths: we can rearrange the formula to give us beta (or m) in terms of the
number of points n, x and y. This assumes that minimising the mean error between
the line and the points gives the best predictor for all of the points in the training set
and for future feature vectors.
As such, we will predict y_{n+1} from x_{n+1}, where x_{n+1} is a feature vector.
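As a rough illustration of the least squares idea for a single variable, here is a small plain-Python sketch. The function name fit_line and the sample points are made up for this example; the slope and intercept use the usual closed-form least squares formulas.

```python
def fit_line(xs, ys):
    """Return slope m and intercept c of the least squares line y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # The fitted line always passes through the point of means.
    c = mean_y - m * mean_x
    return m, c

# Made-up points that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
m, c = fit_line(xs, ys)

# Predict the next y from a new x, as in the note above.
next_y = m * 5.0 + c
```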
21
Understanding Linear Regression
We derive a “cost function” which is used in conjunction with the feature vector
We can apply Stochastic Gradient Descent to iteratively find the best fit for the line
A technique here is that we take several features which exist in n dimensions and we “fit”
them to our linear regression model, which enables us to use SGD.
SGD allows us to “converge” on the “minimum”, and this determines the
multidimensional line of best fit.
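Here is a minimal sketch of SGD for linear regression in plain Python, assuming a squared-error cost. The function name, the step size, the number of epochs, and the synthetic data are all assumptions chosen for illustration; this is not how Spark implements it internally.

```python
import random

def sgd_linear_regression(points, step_size=0.05, epochs=200, seed=0):
    """Fit weights and bias by stepping against the gradient one point at a time."""
    rng = random.Random(seed)
    weights = [0.0] * len(points[0][0])
    bias = 0.0
    data = list(points)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the points in a random order each pass
        for features, target in data:
            prediction = bias + sum(w * x for w, x in zip(weights, features))
            error = prediction - target
            # Gradient of 0.5 * error**2 with respect to each weight is error * x.
            weights = [w - step_size * error * x for w, x in zip(weights, features)]
            bias -= step_size * error
    return weights, bias

# Synthetic, noise-free data generated from y = 3*x1 - 2*x2 + 1.
points = [([float(x1), float(x2)], 3.0 * x1 - 2.0 * x2 + 1.0)
          for x1 in range(-2, 3) for x2 in range(-2, 3)]
weights, bias = sgd_linear_regression(points)
```

With noise-free data like this, the weights should settle close to 3 and -2 and the bias close to 1.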
22
Types of regression
Here are a few regression algorithms. They have been selected based on their popularity.
This is not an exhaustive list. A larger list is below.
Least square linear regression (LR)
Decision trees (TREE)
Bagging trees (BAGTREE)
Boosting trees (BOOST)
Neural networks (NN)
Extreme Learning Machines (ELM)
Support Vector Regression (SVR)
Kernel Ridge Regression (KRR), aka Least Squares SVM
Relevance Vector Machine (RVM)
Gaussian Process Regression (GPR)
Variational Heteroscedastic Gaussian Process Regression (VHGPR)
23
24
Looking at R code
<CLICK>
Read in a file – remember this is read into the memory of the current process
<CLICK>
Convert to a dataframe: this needs more memory, so hope we don’t run out
<CLICK>
Run a linear regression – this may take 10-15 minutes depending on the weight data
<CLICK>
25
This code is Scala and you would find this executing in Apache Spark
<CLICK>
This reads a file into an RDD – if the file is large it will be read in across a number of worker
nodes
<CLICK>
We’ve skipped a step here for brevity, but it would need detailing: using a classifier
26
27
Vectors can be dense or sparse.
RDD: Resilient Distributed Datasets
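To show the difference between dense and sparse, here is a small plain-Python sketch. The helper names to_sparse and to_dense are my own; the (size, indices, values) layout mirrors the arguments that Spark's Vectors.sparse takes.

```python
def to_sparse(dense):
    """Keep only the non-zero entries as (size, indices, values)."""
    indices = [i for i, value in enumerate(dense) if value != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values

def to_dense(size, indices, values):
    """Rebuild the full list, filling the missing positions with zeros."""
    dense = [0.0] * size
    for i, value in zip(indices, values):
        dense[i] = value
    return dense

dense = [1.0, 0.0, 0.0, 3.0]
sparse = to_sparse(dense)       # (4, [0, 3], [1.0, 3.0])
restored = to_dense(*sparse)    # same values as the original dense list
```

A sparse vector pays off when most entries are zero, which is common after categorical variables are encoded.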
28
We’ll show this live using Azure ML Notebook.
Approx 32,000 rows and 15 columns. We’ll use a subset of columns in our demo.
The problem is to learn from the training data to predict whether the income is more
than $50K a year or not. It’s a binary classification problem in ML.
The data is assumed to be transformed into the format that Spark ML understands,
that is, vectors. The categorical variables are replaced by indexes.
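As a rough sketch of replacing a categorical variable by indexes, here is a plain-Python version. The function name build_index and the sample values are made up; ordering by frequency, most frequent first, follows the default behaviour of Spark's StringIndexer, but this is only an illustration, not the demo code.

```python
from collections import Counter

def build_index(values):
    """Map each category to an index, most frequent first (ties alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda value: (-counts[value], value))
    return {value: float(i) for i, value in enumerate(ordered)}

# Made-up values for one categorical column.
column = ["Private", "State-gov", "Private", "Self-emp", "Private", "State-gov"]
index = build_index(column)
encoded = [index[value] for value in column]
```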
29
30
https://spark.apache.org/docs/2.1.0/ml-features.html
These are the first few rows of the actual training data. The red box has the labels, or
truth. The green box has the features, also called independent variables or predictors.
31
This is the actual PySpark code in the demo.
32
This is the definition needed to parse the input data into LabeledPoint.
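The slide itself is not reproduced in these notes, so here is only a hedged stand-in in plain Python. It assumes each input line is comma-separated with the label first and numeric features after it; LabeledRow is a hypothetical substitute for Spark's LabeledPoint, and the sample line is made up.

```python
from collections import namedtuple

# Hypothetical stand-in for LabeledPoint: a label plus a list of features.
LabeledRow = namedtuple("LabeledRow", ["label", "features"])

def parse_line(line):
    """Parse 'label,f1,f2,...' into a LabeledRow of floats (assumed format)."""
    parts = [float(field) for field in line.strip().split(",")]
    return LabeledRow(label=parts[0], features=parts[1:])

row = parse_line("1.0,39.0,0.0,13.0,40.0\n")
```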
33
34
Spark ML Pipelines
The 4-stage process can be broken down into the following components. This forms the
machine learning workflow and is used to determine the best fit for a model and work out
its efficacy.
Step 1: Ingest data from a source – this is usually a file-based source
<CLICK>
Step 2: Extract features – this will involve preprocessing of the file and determination of
which features in what form are necessary for the machine learning
<CLICK>
Step 3: Train model where you take training data and build a model from this to enable
future predictions
<CLICK>
Step 4: Validate the model to determine whether you can predict using new data of some
sort
35
36
Spark ML Pipelines
A Transformer is an algorithm which can transform one DataFrame into
another DataFrame. For example, an ML model is a Transformer which takes in a DataFrame
of feature vectors and outputs a DataFrame with a set of predictions
<CLICK>
An Estimator is a machine learning algorithm which can be fit on a DataFrame to
“learn” and produce a model, which is itself a Transformer
<CLICK>
An evaluator is the use of metrics to extract and test a model to see whether it is good or
bad
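To show how the three roles fit together, here is a toy sketch in plain Python, with lists of dicts standing in for DataFrames. All class and function names, the threshold rule, and the data are made up for illustration; real Spark pipelines use the classes in pyspark.ml instead.

```python
class ThresholdModel:
    """Transformer: takes rows with a feature and returns rows with a prediction."""
    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [dict(row, prediction=1.0 if row["feature"] > self.threshold else 0.0)
                for row in rows]

class ThresholdEstimator:
    """Estimator: fit() learns from labeled rows and returns a Transformer."""
    def fit(self, rows):
        positives = [r["feature"] for r in rows if r["label"] == 1.0]
        negatives = [r["feature"] for r in rows if r["label"] == 0.0]
        # Put the threshold halfway between the two class means.
        midpoint = (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2.0
        return ThresholdModel(midpoint)

def accuracy(rows):
    """Evaluator: fraction of rows where the prediction matches the label."""
    return sum(1 for r in rows if r["prediction"] == r["label"]) / len(rows)

training = [{"feature": 1.0, "label": 0.0}, {"feature": 2.0, "label": 0.0},
            {"feature": 8.0, "label": 1.0}, {"feature": 9.0, "label": 1.0}]
held_out = [{"feature": 3.0, "label": 0.0}, {"feature": 7.0, "label": 1.0}]

model = ThresholdEstimator().fit(training)   # train
score = accuracy(model.transform(held_out))  # validate on new data
```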
37
This is where we enjoy the convenience of the streaming environment that Spark provides.
38
39
Place the training and testing folders (and their sub-folders) under the root folder
depending on WASB or ADL.
Root folder:
https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-hdinsight-hadoop-use-portal
SGD: Stochastic gradient descent.
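StreamingLogisticRegressionWithSGD inherits methods from
org.apache.spark.mllib.regression.StreamingLinearAlgorithm
import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
val model = new StreamingLogisticRegressionWithSGD()
  .setStepSize(0.5)
  .setNumIterations(10)
  .setInitialWeights(Vectors.dense(...))
// trainOn returns Unit, so call it separately; trainingData is a placeholder DStream[LabeledPoint]
model.trainOn(trainingData)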
You can load the latest model to start or you can set the initial weights.
To load the latest model, use .latestModel()
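To give a feel for what training on a stream means, here is a simplified plain-Python sketch: the weights carry over from one batch to the next, and each arriving batch gets a few gradient steps. The step size of 0.5 and the 10 iterations echo the settings above; the function names and the tiny batches are made up, and this is not Spark's actual implementation.

```python
import math

def predict(weights, features):
    """Probability of the positive class under logistic regression."""
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def train_on_batch(weights, batch, step_size=0.5, num_iterations=10):
    """Take a few gradient steps on one batch, starting from the current weights."""
    for _ in range(num_iterations):
        gradient = [0.0] * len(weights)
        for label, features in batch:
            error = predict(weights, features) - label
            for i, x in enumerate(features):
                gradient[i] += error * x
        weights = [w - step_size * g / len(batch) for w, g in zip(weights, gradient)]
    return weights

# Two made-up batches of (label, features); the first feature is a constant 1.0 bias term.
batches = [
    [(1.0, [1.0, 2.0]), (0.0, [1.0, -2.0])],
    [(1.0, [1.0, 1.5]), (0.0, [1.0, -1.0])],
]
weights = [0.0, 0.0]  # the initial weights
for batch in batches:
    weights = train_on_batch(weights, batch)  # the model keeps improving as data arrives
```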
40
ssc.start()
ssc.awaitTerminationOrTimeout(10) is an alternative.
ssc.stop()
Instead of using ssc.stop() at the console, you can also use ctrl-c to interrupt the process
and exit() to quit PySpark. Type exit one more time (at the Linux prompt) to exit the ssh
console properly.
41
This is Scala code for the income prediction demo.
42
43
An example of k-means and streaming.
<CLICK>
We make an input stream of vectors for training, as well as a stream of labelled data points
for testing - this isn't shown in the code segment below. We assume a StreamingContext
ssc has been created already.
<CLICK>
We create a model with random clusters and specify the number of clusters to find.
<CLICK>
Now register the streams for training and testing and start the job, printing the predicted
cluster assignments on new data points as they arrive.
<CLICK>
As you add new text files with data the cluster centers will update. Each training point
should be formatted as [x1, x2, x3], and each test data point should be formatted as (y, [x1,
x2, x3]), where y is some useful label or identifier (e.g. a cluster assignment). Anytime a
text file is placed in training dir the model will update. Anytime a text file is placed in test
dir you will see predictions. With new data, the cluster centers will change.
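To illustrate how the cluster centers move as new data arrives, here is a simplified plain-Python sketch of a mini-batch center update. The function names, the decay parameter, and the sample points are assumptions for this example; it only approximates the idea and is not Spark's StreamingKMeans code.

```python
def nearest(centers, point):
    """Index of the center closest to the point (squared Euclidean distance)."""
    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(center, point))
    return min(range(len(centers)), key=lambda i: squared_distance(centers[i]))

def update_centers(centers, counts, batch, decay=1.0):
    """Blend each center with the mean of the new points assigned to it."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    batch_counts = [0] * len(centers)
    for point in batch:
        i = nearest(centers, point)
        batch_counts[i] += 1
        for d in range(dim):
            sums[i][d] += point[d]
    new_centers, new_counts = [], []
    for i, center in enumerate(centers):
        m = batch_counts[i]
        if m == 0:
            # No new points for this cluster, so it stays where it is.
            new_centers.append(list(center))
            new_counts.append(counts[i])
            continue
        weight = counts[i] * decay  # how much the old center still counts
        new_centers.append([(center[d] * weight + sums[i][d]) / (weight + m)
                            for d in range(dim)])
        new_counts.append(counts[i] + m)
    return new_centers, new_counts

centers = [[0.0, 0.0], [10.0, 10.0]]  # initial guesses
counts = [0, 0]                       # no points seen yet
training_batch = [[1.0, 1.0], [1.0, 3.0], [9.0, 9.0]]
centers, counts = update_centers(centers, counts, training_batch)

# A test point is assigned to whichever updated center is closest.
cluster = nearest(centers, [2.0, 2.0])
```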
44
We’ll run the entire ML system live. You’ll see how to process training and testing, and how
to understand the output from the console as well as the predictions (files).
45
46

Weitere ähnliche Inhalte

Was ist angesagt?

Visualizing the Model Selection Process
Visualizing the Model Selection ProcessVisualizing the Model Selection Process
Visualizing the Model Selection ProcessBenjamin Bengfort
 
Machine Learning with Mahout
Machine Learning with MahoutMachine Learning with Mahout
Machine Learning with Mahoutbigdatasyd
 
Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on HadoopVivian S. Zhang
 
Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache SparkCloudera, Inc.
 
Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013MLconf
 
dmapply: A functional primitive to express distributed machine learning algor...
dmapply: A functional primitive to express distributed machine learning algor...dmapply: A functional primitive to express distributed machine learning algor...
dmapply: A functional primitive to express distributed machine learning algor...Bikash Chandra Karmokar
 
A Primer on Entity Resolution
A Primer on Entity ResolutionA Primer on Entity Resolution
A Primer on Entity ResolutionBenjamin Bengfort
 
Machine Learning With R
Machine Learning With RMachine Learning With R
Machine Learning With RDavid Chiu
 
Visual diagnostics for more effective machine learning
Visual diagnostics for more effective machine learningVisual diagnostics for more effective machine learning
Visual diagnostics for more effective machine learningBenjamin Bengfort
 
PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" Joshua Bloom
 
Generalized Linear Models in Spark MLlib and SparkR
Generalized Linear Models in Spark MLlib and SparkRGeneralized Linear Models in Spark MLlib and SparkR
Generalized Linear Models in Spark MLlib and SparkRDatabricks
 
Application of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingApplication of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingMohammad Mustaqeem
 
The Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphXThe Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphXAndrea Iacono
 
Unit 3 writable collections
Unit 3 writable collectionsUnit 3 writable collections
Unit 3 writable collectionsvishal choudhary
 
Introduction of Feature Hashing
Introduction of Feature HashingIntroduction of Feature Hashing
Introduction of Feature HashingWush Wu
 

Was ist angesagt? (20)

Visualizing the Model Selection Process
Visualizing the Model Selection ProcessVisualizing the Model Selection Process
Visualizing the Model Selection Process
 
Machine Learning with Mahout
Machine Learning with MahoutMachine Learning with Mahout
Machine Learning with Mahout
 
Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on Hadoop
 
Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache Spark
 
Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013
 
dmapply: A functional primitive to express distributed machine learning algor...
dmapply: A functional primitive to express distributed machine learning algor...dmapply: A functional primitive to express distributed machine learning algor...
dmapply: A functional primitive to express distributed machine learning algor...
 
A Primer on Entity Resolution
A Primer on Entity ResolutionA Primer on Entity Resolution
A Primer on Entity Resolution
 
Machine Learning With R
Machine Learning With RMachine Learning With R
Machine Learning With R
 
Visual diagnostics for more effective machine learning
Visual diagnostics for more effective machine learningVisual diagnostics for more effective machine learning
Visual diagnostics for more effective machine learning
 
R- Introduction
R- IntroductionR- Introduction
R- Introduction
 
PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning"
 
Generalized Linear Models in Spark MLlib and SparkR
Generalized Linear Models in Spark MLlib and SparkRGeneralized Linear Models in Spark MLlib and SparkR
Generalized Linear Models in Spark MLlib and SparkR
 
Application of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingApplication of MapReduce in Cloud Computing
Application of MapReduce in Cloud Computing
 
Unit 2 part-2
Unit 2 part-2Unit 2 part-2
Unit 2 part-2
 
The Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphXThe Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphX
 
Data Product Architectures
Data Product ArchitecturesData Product Architectures
Data Product Architectures
 
Map/Reduce intro
Map/Reduce introMap/Reduce intro
Map/Reduce intro
 
Unit 3 writable collections
Unit 3 writable collectionsUnit 3 writable collections
Unit 3 writable collections
 
Introduction of Feature Hashing
Introduction of Feature HashingIntroduction of Feature Hashing
Introduction of Feature Hashing
 
Networkx tutorial
Networkx tutorialNetworkx tutorial
Networkx tutorial
 

Ähnlich wie Spark ml streaming

Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsRecent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsDatabricks
 
Technical_Report_on_ML_Library
Technical_Report_on_ML_LibraryTechnical_Report_on_ML_Library
Technical_Report_on_ML_LibrarySaurabh Chauhan
 
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™Understanding Parallelization of Machine Learning Algorithms in Apache Spark™
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™Databricks
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQLYousun Jeong
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
 
Scaling Analytics with Apache Spark
Scaling Analytics with Apache SparkScaling Analytics with Apache Spark
Scaling Analytics with Apache SparkQuantUniversity
 
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Spark Summit
 
Nose Dive into Apache Spark ML
Nose Dive into Apache Spark MLNose Dive into Apache Spark ML
Nose Dive into Apache Spark MLAhmet Bulut
 
Proposal for google summe of code 2016
Proposal for google summe of code 2016 Proposal for google summe of code 2016
Proposal for google summe of code 2016 Mahesh Dananjaya
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learningPaco Nathan
 
Multiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier DominguezMultiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier DominguezBig Data Spain
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with SparkMd. Mahedi Kaysar
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaChetan Khatri
 
MACHINE LEARNING FOR OPTIMIZING SEARCH RESULTS WITH DRUPAL & APACHE SOLR
MACHINE LEARNING FOR OPTIMIZING SEARCH RESULTS WITH DRUPAL & APACHE SOLRMACHINE LEARNING FOR OPTIMIZING SEARCH RESULTS WITH DRUPAL & APACHE SOLR
MACHINE LEARNING FOR OPTIMIZING SEARCH RESULTS WITH DRUPAL & APACHE SOLRDrupalCamp Kyiv
 
Machine Learning - Simple Linear Regression
Machine Learning - Simple Linear RegressionMachine Learning - Simple Linear Regression
Machine Learning - Simple Linear RegressionSiddharth Shrivastava
 
Big Data, Bigger Analytics
Big Data, Bigger AnalyticsBig Data, Bigger Analytics
Big Data, Bigger AnalyticsItzhak Kameli
 
Intro to Spark development
 Intro to Spark development  Intro to Spark development
Intro to Spark development Spark Summit
 

Ähnlich wie Spark ml streaming (20)

Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsRecent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced Analytics
 
Technical_Report_on_ML_Library
Technical_Report_on_ML_LibraryTechnical_Report_on_ML_Library
Technical_Report_on_ML_Library
 
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™Understanding Parallelization of Machine Learning Algorithms in Apache Spark™
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
Scaling Analytics with Apache Spark
Scaling Analytics with Apache SparkScaling Analytics with Apache Spark
Scaling Analytics with Apache Spark
 
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
 
Nose Dive into Apache Spark ML
Nose Dive into Apache Spark MLNose Dive into Apache Spark ML
Nose Dive into Apache Spark ML
 
Proposal for google summe of code 2016
Proposal for google summe of code 2016 Proposal for google summe of code 2016
Proposal for google summe of code 2016
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
 
Multiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier DominguezMultiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier Dominguez
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with Spark
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
 
MACHINE LEARNING FOR OPTIMIZING SEARCH RESULTS WITH DRUPAL & APACHE SOLR
MACHINE LEARNING FOR OPTIMIZING SEARCH RESULTS WITH DRUPAL & APACHE SOLRMACHINE LEARNING FOR OPTIMIZING SEARCH RESULTS WITH DRUPAL & APACHE SOLR
MACHINE LEARNING FOR OPTIMIZING SEARCH RESULTS WITH DRUPAL & APACHE SOLR
 
Oct.22nd.Presentation.Final
Oct.22nd.Presentation.FinalOct.22nd.Presentation.Final
Oct.22nd.Presentation.Final
 
Spark1
Spark1Spark1
Spark1
 
Machine Learning - Simple Linear Regression
Machine Learning - Simple Linear RegressionMachine Learning - Simple Linear Regression
Machine Learning - Simple Linear Regression
 
Big Data, Bigger Analytics
Big Data, Bigger AnalyticsBig Data, Bigger Analytics
Big Data, Bigger Analytics
 
Intro to Spark development
 Intro to Spark development  Intro to Spark development
Intro to Spark development
 

Mehr von Adam Doyle

Data Engineering Roles
Data Engineering RolesData Engineering Roles
Data Engineering RolesAdam Doyle
 
Managed Cluster Services
Managed Cluster ServicesManaged Cluster Services
Managed Cluster ServicesAdam Doyle
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architectureAdam Doyle
 
Great Expectations Presentation
Great Expectations PresentationGreat Expectations Presentation
Great Expectations PresentationAdam Doyle
 
May 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflowMay 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflowAdam Doyle
 
Automate your data flows with Apache NIFI
Automate your data flows with Apache NIFIAutomate your data flows with Apache NIFI
Automate your data flows with Apache NIFIAdam Doyle
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAAdam Doyle
 
Localized Hadoop Development
Localized Hadoop DevelopmentLocalized Hadoop Development
Localized Hadoop DevelopmentAdam Doyle
 
The new big data
The new big dataThe new big data
The new big dataAdam Doyle
 
Feature store Overview St. Louis Big Data IDEA Meetup aug 2020
Feature store Overview   St. Louis Big Data IDEA Meetup aug 2020Feature store Overview   St. Louis Big Data IDEA Meetup aug 2020
Feature store Overview St. Louis Big Data IDEA Meetup aug 2020Adam Doyle
 
Snowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleSnowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleAdam Doyle
 
Operationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEAOperationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEAAdam Doyle
 
Retooling on the Modern Data and Analytics Tech Stack
Retooling on the Modern Data and Analytics Tech StackRetooling on the Modern Data and Analytics Tech Stack
Retooling on the Modern Data and Analytics Tech StackAdam Doyle
 
Stl meetup cloudera platform - january 2020
Stl meetup   cloudera platform  - january 2020Stl meetup   cloudera platform  - january 2020
Stl meetup cloudera platform - january 2020Adam Doyle
 
How stlrda does data
How stlrda does dataHow stlrda does data
How stlrda does dataAdam Doyle
 
Tailoring machine learning practices to support prescriptive analytics
Tailoring machine learning practices to support prescriptive analyticsTailoring machine learning practices to support prescriptive analytics
Tailoring machine learning practices to support prescriptive analyticsAdam Doyle
 
Synthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-makingSynthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-makingAdam Doyle
 
Big Data IDEA 101 2019
Big Data IDEA 101 2019Big Data IDEA 101 2019
Big Data IDEA 101 2019Adam Doyle
 
Data Engineering and the Data Science Lifecycle
Data Engineering and the Data Science LifecycleData Engineering and the Data Science Lifecycle
Data Engineering and the Data Science LifecycleAdam Doyle
 

Mehr von Adam Doyle (20)

ML Ops.pptx
ML Ops.pptxML Ops.pptx
ML Ops.pptx
 
Data Engineering Roles
Data Engineering RolesData Engineering Roles
Data Engineering Roles
 
Managed Cluster Services
Managed Cluster ServicesManaged Cluster Services
Managed Cluster Services
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
Great Expectations Presentation
Great Expectations PresentationGreat Expectations Presentation
Great Expectations Presentation
 
May 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflowMay 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflow
 
Automate your data flows with Apache NIFI
Automate your data flows with Apache NIFIAutomate your data flows with Apache NIFI
Automate your data flows with Apache NIFI
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
 
Localized Hadoop Development
Localized Hadoop DevelopmentLocalized Hadoop Development
Localized Hadoop Development
 
The new big data
The new big dataThe new big data
The new big data
 
Feature store Overview St. Louis Big Data IDEA Meetup aug 2020
Feature store Overview   St. Louis Big Data IDEA Meetup aug 2020Feature store Overview   St. Louis Big Data IDEA Meetup aug 2020
Feature store Overview St. Louis Big Data IDEA Meetup aug 2020
 
Snowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleSnowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at Scale
 
Operationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEAOperationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEA
 
Retooling on the Modern Data and Analytics Tech Stack
Retooling on the Modern Data and Analytics Tech StackRetooling on the Modern Data and Analytics Tech Stack
Retooling on the Modern Data and Analytics Tech Stack
 
Stl meetup cloudera platform - january 2020
Stl meetup   cloudera platform  - january 2020Stl meetup   cloudera platform  - january 2020
Stl meetup cloudera platform - january 2020
 
How stlrda does data
How stlrda does dataHow stlrda does data
How stlrda does data
 
Tailoring machine learning practices to support prescriptive analytics
Tailoring machine learning practices to support prescriptive analyticsTailoring machine learning practices to support prescriptive analytics
Tailoring machine learning practices to support prescriptive analytics
 
Synthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-makingSynthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-making
 
Big Data IDEA 101 2019
Big Data IDEA 101 2019Big Data IDEA 101 2019
Big Data IDEA 101 2019
 
Data Engineering and the Data Science Lifecycle
Data Engineering and the Data Science LifecycleData Engineering and the Data Science Lifecycle
Data Engineering and the Data Science Lifecycle
 

Spark ML Streaming

  • 1. Prerequisites:
      • Spark RDD: Resilient Distributed Datasets
      • Spark Streaming
  • 2. The registration link is http://go2.valorem.com/s0TC0X00NAtL0V050h20c0Y. It’s best to email Kay so that she can whitelist you. Venture Café: http://www.vencafstl.org/event/the-venture-cafe-gathering-4/?instance_id=17473. It’s a place for us to hang out!
  • 3. Many Coursera and edX courses, such as https://www.coursera.org/learn/big-data-integration-processing/lecture/uW2js/spark-streaming, are good resources. I also used Safari Books to develop this content. HDInsight link: https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-hdinsight-hadoop-use-portal Spark feature engineering link: https://spark.apache.org/docs/2.1.0/ml-features.html
  • 4. The Microsoft Cloud Data Platform and some of the things I care about. We’ll talk about some of the highlighted boxes. R Services will be covered in the R UG meetup on 6/9.
  • 5. Intelligent Cloud. This is not only a pretty picture of the components of the Cortana Intelligence Suite; it also serves as an architecture.
  • 6. https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-hdinsight-hadoop-use-portal. I’ll show you live how to provision an HDInsight Spark cluster using the Azure Portal. The demo covers what you need to do to provision a new cluster on an existing Data Lake Store. The key is knowing how to hook up Data Lake Store correctly, with the appropriate permissions assigned. I’ll also show you the other Hadoop cluster types that HDInsight supports, the HDInsight dashboards, Jupyter Notebook, and ssh access to the server. The beauty of HDInsight is that it is Hadoop delivered as a Platform as a Service (PaaS). It separates the data store from the compute, so the data persists independently even when the compute is deleted. This separation helps to minimize compute consumption. Please sign up to use Azure for free, or use it through MSDN or your organization’s subscription.
  • 7. Other interesting questions for you to explore:
      • How do I work with data in Spark?
      • How do I write Spark programs?
      • What are Notebooks?
      • How do I query data in Spark using SQL?
  • 10. We demo this live.
  • 13. Apache Spark runs the major types of machine learning algorithms through MLlib. The driver program delegates the work to the executors. Each executor processes the parts of the files that it reads from external storage, and keeps both the file data and the transformed data in cluster memory. The machine learning algorithms here are designed to run in a highly distributed, fault-tolerant environment: the data itself is stored across the cluster, and the operations that the algorithms perform are distributed across the Spark cluster too.
  • 14. These are the concepts we will cover today, most of them briefly. Spark MLlib covers pretty much all the major groups of ML algorithms: classification, regression, recommender systems, and unsupervised learning such as clustering. <click>
  • 15. (We’ll go into detail about logistic regression for binary classification because it’s our demo.) We’ll explain the main components of an ML workflow and an ML pipeline, and the motivations behind them. We’ll talk about the code framework for how ML and streaming data come together in Spark, and how convenient it is to use Spark in a streaming environment. It makes ML that much more fun and that much more bleeding edge! <click>
  • 18. https://www.quora.com/What-is-logistic-regression Logit = log(p / (1 - p)) = log(probability of the event happening / probability of the event not happening) = log(odds)
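To make the logit definition concrete, here is a minimal plain-Python sketch (not Spark code). The function names `logit` and `sigmoid` are our own choices for illustration; the sigmoid is simply the inverse of the logit, which is how logistic regression turns a log-odds score back into a probability.

```python
import math


def logit(p):
    """Log-odds of a probability p, valid for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Inverse of logit: maps a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# A probability of 0.8 corresponds to odds of 4 to 1.
odds = 0.8 / (1.0 - 0.8)
log_odds = logit(0.8)
recovered = sigmoid(log_odds)  # maps back to the original probability
```

A probability of 0.5 gives odds of 1 and therefore a logit of 0, which is why 0.5 is the natural decision threshold for binary classification.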
  • 19. We’re not going to demo regression, but the slides are here for your reference. Let’s talk about regression. <click>
  • 20. Understanding Regression. What are we trying to do with regression? We are trying to find the line of best fit through a scatter plot of points that relates one or more input variables to an output. To find the line of best fit we need to calculate a couple of things about the line. <CLICK> We need to calculate the slope of the line, m. <CLICK> We also need to calculate the intercept with the y-axis, c. <CLICK> To do this we measure the y-distance of each point from a candidate line and then sum the errors (i.e. the distances to the line). <CLICK> So we begin with the equation of the line, y = mx + c; remember it from school? We use Ordinary Least Squares, which sums the squares of the y-distances between the points and the line. A bit of maths: we can rearrange the formula to give us beta (or m) in terms of the number of points n, x and y. This minimises the sum of squared errors between the line and the points, so it is the best linear fit for the points in the training set, and we then use it for future feature vectors. As such, we will predict y_{n+1} from x_{n+1}, where x_{n+1} is a feature vector.
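As a sketch of the Ordinary Least Squares step for a single input variable, the slope and intercept can be computed in closed form. This is plain Python rather than the code on the slides; `fit_line` and the sample points are hypothetical and chosen so that the answer is easy to check by eye.

```python
def fit_line(xs, ys):
    """Closed-form OLS fit of y = m*x + c for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    c = mean_y - m * mean_x
    return m, c


# Illustrative points lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
m, c = fit_line(xs, ys)

# Predict y for a new, unseen x.
y_next = m * 5.0 + c
```

With noisy data the same formulas return the line that minimises the sum of squared y-distances, rather than a line through every point.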
  • 22. Understanding Linear Regression. We derive a “cost function”, which is evaluated on the feature vectors. We can apply Stochastic Gradient Descent (SGD) to iteratively find the best fit for the line. The technique is to take several features, which exist in n dimensions, and “fit” them to our linear regression model using SGD. SGD lets us “converge” on the “minimum” of the cost function, and this determines the multidimensional line of best fit.
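The SGD idea can be sketched in a few lines of plain Python. This is an illustration under our own assumptions (squared-error cost, a fixed step size, one point per update, synthetic noise-free data), not the optimizer that Spark uses internally; the function name and parameters are hypothetical.

```python
import random


def sgd_linear_regression(points, step_size=0.05, epochs=200, seed=0):
    """Fit weights and an intercept by stochastic gradient descent.

    points is a list of (features, label) pairs. The cost for one point
    is 0.5 * (prediction - label) ** 2.
    """
    rng = random.Random(seed)
    dim = len(points[0][0])
    weights = [0.0] * dim
    intercept = 0.0
    data = list(points)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the points in a random order each epoch
        for features, label in data:
            prediction = intercept + sum(w * x for w, x in zip(weights, features))
            error = prediction - label
            # Step against the gradient of the cost for this single point.
            weights = [w - step_size * error * x for w, x in zip(weights, features)]
            intercept -= step_size * error
    return weights, intercept


# Synthetic data generated from y = 3*x1 - 2*x2 + 1 on a small grid.
points = [
    ([float(x1), float(x2)], 3.0 * x1 - 2.0 * x2 + 1.0)
    for x1 in range(-2, 3)
    for x2 in range(-2, 3)
]
weights, intercept = sgd_linear_regression(points)
```

Because each update only needs one point (or a small batch), the same loop works when the data is spread across a cluster or arrives as a stream.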
  • 23. Types of regression. Here are a few regression algorithms, selected based on their popularity. This is not an exhaustive list. A larger list is below:
      • Least squares linear regression (LR)
      • Decision trees (TREE)
      • Bagging trees (BAGTREE)
      • Boosting trees (BOOST)
      • Neural networks (NN)
      • Extreme Learning Machines (ELM)
      • Support Vector Regression (SVR)
      • Kernel Ridge Regression (KRR), aka Least Squares SVM
      • Relevance Vector Machine (RVM)
      • Gaussian Process Regression (GPR)
      • Variational Heteroscedastic Gaussian Process Regression (VHGPR)
  • 25. Looking at R code. <CLICK> Read in a file; remember, this is read into the memory of the current process. <CLICK> Convert to a dataframe; this needs more memory, and we hope we don’t run out. <CLICK> Run a linear regression; this may take 10-15 minutes depending on the weight data. <CLICK>
  • 26. This code is Scala, and you would find it executing in Apache Spark. <CLICK> This reads a file into an RDD; if the file is large, it will be read in across a number of worker nodes. <CLICK> We’ve left out a step here for brevity, the one that uses a classifier, but we would look to detail it.
  • 28. Vectors can be dense or sparse. RDD: Resilient Distributed Datasets
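To illustrate the difference between dense and sparse vectors, here is a small plain-Python sketch. It mirrors the (size, indices, values) layout that sparse vectors commonly use, but the helper names `to_sparse` and `to_dense` are our own and this is not Spark's vector class.

```python
def to_sparse(dense):
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_dense(size, indices, values):
    """Rebuild the full list of values from a sparse triple."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


dense = [1.0, 0.0, 0.0, 3.0]
sparse = to_sparse(dense)       # stores 2 values instead of 4
round_trip = to_dense(*sparse)  # recovers the original dense vector
```

The sparse form pays off when most entries are zero, which is typical after categorical variables are expanded into indicator columns.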
  • 29. We’ll show this live using Azure ML Notebook. The data has approximately 32,000 rows and 15 columns; we’ll use a subset of the columns in our demo. The problem is to learn from the training data to predict whether income is more than $50K a year or not. It’s a binary classification problem in ML. The data is assumed to have been transformed into the format that Spark ML understands, that is, vectors. The categorical variables are replaced by indexes.
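Replacing categorical variables by indexes can be sketched as follows. This is a plain-Python illustration, not the demo code: the column name `workclass` and its values are made up for the example, and the ordering rule (most frequent category gets index 0, ties broken alphabetically) is an assumption we chose so that the mapping is deterministic.

```python
from collections import Counter


def fit_string_index(values):
    """Map each category to a numeric index, most frequent first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


# Hypothetical categorical column.
workclass = ["Private", "State-gov", "Private", "Self-emp", "Private", "State-gov"]
index = fit_string_index(workclass)
encoded = [index[v] for v in workclass]
```

The mapping has to be learned once on the training data and then reused unchanged on the test data, otherwise the same category could end up with different indexes.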
  • 31. https://spark.apache.org/docs/2.1.0/ml-features.html These are the first few rows of the actual training data. The red box contains the labels, or ground truth. The green box contains the features, also called independent variables or predictors.
  • 32. This is the actual PySpark code in the demo.
  • 33. This is the definition needed to parse the input data into LabeledPoint.
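The parsing function on the slide is not reproduced in these notes, so the following is only a sketch of what such a definition might look like. It assumes a hypothetical input format of one comma-separated line per row with the label first, and it uses a namedtuple as a stand-in for Spark's LabeledPoint so that it runs without a cluster.

```python
from collections import namedtuple

# Stand-in for Spark's LabeledPoint: a label plus a list of feature values.
LabeledPoint = namedtuple("LabeledPoint", ["label", "features"])


def parse_point(line):
    """Parse 'label,f1,f2,...' into a LabeledPoint (assumed line format)."""
    fields = [float(token) for token in line.strip().split(",")]
    return LabeledPoint(label=fields[0], features=fields[1:])


# Illustrative row: label 1.0 followed by three numeric features.
point = parse_point("1.0,39.0,13.0,40.0\n")
```

In a streaming job the same function would be mapped over every line that arrives, so it should be cheap and should not depend on any state outside the line itself.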
  • 35. Spark ML Pipelines. The four-stage process breaks down into the following components. Together they form the machine learning workflow, which is used to determine the best fit for a model and work out its efficacy.
      • Step 1: Ingest data from a source; this is usually a file-based source. <CLICK>
      • Step 2: Extract features; this involves preprocessing the file and determining which features, in what form, are necessary for the machine learning. <CLICK>
      • Step 3: Train the model, where you take training data and build a model from it to enable future predictions. <CLICK>
      • Step 4: Validate the model to determine whether it can predict well on new data.
  • 37. Spark ML Pipelines. A Transformer is an algorithm that transforms one DataFrame into another DataFrame. For example, an ML model is a Transformer that takes a DataFrame of feature vectors and outputs a DataFrame with predictions. <CLICK> An Estimator is a machine learning algorithm that is fit on a DataFrame to “learn” and produce a model, which is itself a Transformer. <CLICK> An Evaluator uses metrics to test a model’s predictions and see whether the model is good or bad.
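The relationship between the three roles can be sketched with toy classes in plain Python. These are not Spark classes: `MeanThresholdEstimator`, `ThresholdModel` and `accuracy` are hypothetical names, the "DataFrame" is just a list of tuples, and the learning rule is deliberately trivial so the pattern stays visible.

```python
class ThresholdModel:
    """Toy Transformer: appends a prediction to every (feature, label) row."""

    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [(x, y, 1.0 if x > self.threshold else 0.0) for x, y in rows]


class MeanThresholdEstimator:
    """Toy Estimator: fit() learns from rows and returns a Transformer."""

    def fit(self, rows):
        positives = [x for x, y in rows if y == 1.0]
        negatives = [x for x, y in rows if y == 0.0]
        # Place the threshold midway between the two class means.
        mean_pos = sum(positives) / len(positives)
        mean_neg = sum(negatives) / len(negatives)
        return ThresholdModel((mean_pos + mean_neg) / 2.0)


def accuracy(predicted_rows):
    """Toy Evaluator: fraction of rows where the prediction matches the label."""
    correct = sum(1 for _, label, prediction in predicted_rows if label == prediction)
    return correct / len(predicted_rows)


train = [(1.0, 0.0), (2.0, 0.0), (8.0, 1.0), (9.0, 1.0)]
test = [(3.0, 0.0), (7.0, 1.0), (6.0, 0.0)]

model = MeanThresholdEstimator().fit(train)  # Estimator -> Transformer
score = accuracy(model.transform(test))      # Evaluator scores the output
```

The point of the split is that the fitted model is just another Transformer, so it can be chained with feature-extraction Transformers into a single pipeline.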
  • 38. This is where we enjoy the convenience of the streaming environment that Spark provides.
  • 40. Place the training and testing folders (and their sub-folders) under the root folder, which depends on whether you use WASB or ADL. Root folder: https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-hdinsight-hadoop-use-portal SGD stands for stochastic gradient descent. StreamingLogisticRegressionWithSGD inherits methods from org.apache.spark.mllib.regression.StreamingLinearAlgorithm:
        val model = new StreamingLogisticRegressionWithSGD()
          .setStepSize(0.5)
          .setNumIterations(10)
          .setInitialWeights(Vectors.dense(...))
        // trainOn returns Unit, so call it separately to keep a handle on the model
        model.trainOn(DStream)
    You can load the latest model to start from, or you can set the initial weights. To load the latest model, use .latestModel().
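To show what "training on a stream" means, here is a simplified plain-Python sketch of the idea: the model keeps its weights between batches, and every new batch runs a fixed number of gradient steps starting from the current weights. This is our own illustration and does not reproduce Spark's optimizer; the class name, the batch contents and the convention of using a constant 1.0 as the first feature (for the intercept) are all assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class StreamingLogisticSketch:
    """Logistic regression that is updated one batch at a time."""

    def __init__(self, initial_weights, step_size=0.5, num_iterations=10):
        self.weights = list(initial_weights)
        self.step_size = step_size
        self.num_iterations = num_iterations

    def train_on_batch(self, batch):
        """batch is a list of (label, features); start from current weights."""
        if not batch:
            return
        for _ in range(self.num_iterations):
            gradient = [0.0] * len(self.weights)
            for label, features in batch:
                margin = sum(w * x for w, x in zip(self.weights, features))
                error = sigmoid(margin) - label
                for j, x in enumerate(features):
                    gradient[j] += error * x
            # Average the gradient over the batch and take one step.
            self.weights = [
                w - self.step_size * g / len(batch)
                for w, g in zip(self.weights, gradient)
            ]

    def predict(self, features):
        margin = sum(w * x for w, x in zip(self.weights, features))
        return 1.0 if sigmoid(margin) > 0.5 else 0.0


model = StreamingLogisticSketch(initial_weights=[0.0, 0.0])

# Two batches arriving over time; the first feature is a constant 1.0.
batches = [
    [(1.0, [1.0, 2.0]), (0.0, [1.0, -2.0])],
    [(1.0, [1.0, 3.0]), (0.0, [1.0, -1.0])],
]
for batch in batches:
    model.train_on_batch(batch)
```

Setting the initial weights plays the same role as in the slide: it decides where the first batch starts, and every later batch starts from wherever the previous one finished.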
  • 41. ssc.start(). ssc.awaitTerminationOrTimeout(10) is an alternative. ssc.stop(). Instead of using ssc.stop() at the console, you can also press Ctrl-C to interrupt the process and then use exit() to quit PySpark. Type exit one more time (at the Linux prompt) to exit the ssh console properly.
  • 42. This is the Scala code for the income prediction demo.
  • 44. An example of k-means and streaming. <CLICK> We make an input stream of vectors for training, as well as a stream of labelled data points for testing; this isn't shown in the code segment below. We assume a StreamingContext ssc has been created already. <CLICK> We create a model with random initial clusters and specify the number of clusters to find. <CLICK> Now we register the streams for training and testing and start the job, printing the predicted cluster assignments on new data points as they arrive. <CLICK> As you add new text files with data, the cluster centers will update. Each training point should be formatted as [x1, x2, x3], and each test data point should be formatted as (y, [x1, x2, x3]), where y is some useful label or identifier (e.g. a cluster assignment). Any time a text file is placed in the training directory, the model will update. Any time a text file is placed in the test directory, you will see predictions. With new data, the cluster centers will change.
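The two line formats and the way cluster centers move with each new batch can be sketched in plain Python. This is an illustration, not the demo code: the function names are ours, the `decay` parameter is an assumption borrowed from the usual description of streaming k-means (1.0 means old data is never forgotten), and the starting centers are fixed rather than random so the result is reproducible.

```python
import ast


def parse_training(line):
    """Parse a training line such as '[x1, x2, x3]' into a list of floats."""
    return [float(v) for v in ast.literal_eval(line.strip())]


def parse_test(line):
    """Parse a test line such as '(y, [x1, x2, x3])' into (label, features)."""
    label, features = ast.literal_eval(line.strip())
    return label, [float(v) for v in features]


def closest(centers, point):
    """Index of the center nearest to point (squared Euclidean distance)."""
    def sq_dist(center):
        return sum((c - x) ** 2 for c, x in zip(center, point))
    return min(range(len(centers)), key=lambda k: sq_dist(centers[k]))


def update(centers, weights, batch, decay=1.0):
    """Blend each old center with the mean of the batch points assigned to it."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for point in batch:
        k = closest(centers, point)
        counts[k] += 1
        sums[k] = [s + x for s, x in zip(sums[k], point)]
    new_centers, new_weights = [], []
    for center, weight, total, count in zip(centers, weights, sums, counts):
        old = weight * decay  # discounted influence of earlier batches
        if count == 0:
            new_centers.append(list(center))
            new_weights.append(old)
            continue
        new_centers.append([(c * old + s) / (old + count) for c, s in zip(center, total)])
        new_weights.append(old + count)
    return new_centers, new_weights


# Starting centers with zero weight, so the first batch fully determines them.
centers = [[0.0, 0.0], [10.0, 10.0]]
weights = [0.0, 0.0]

batch = [parse_training(line) for line in ["[1.0, 1.0]", "[3.0, 3.0]", "[9.0, 9.0]"]]
centers, weights = update(centers, weights, batch)

label, features = parse_test("(1, [2.5, 2.5])")
predicted = closest(centers, features)
```

Each new training file corresponds to one call to `update`, and each new test file corresponds to calling `closest` on every parsed point and printing the result next to its label.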
  • 45. We’ll run the entire ML system live. You’ll see how to process training and testing data, and how to understand the output from the console as well as the predictions (files).