2. The registration link is http://go2.valorem.com/s0TC0X00NAtL0V050h20c0Y. It’s best to
email Kay, so she can work to whitelist you.
Venture Café: http://www.vencafstl.org/event/the-venture-cafe-gathering-4/?instance_id=17473.
It’s a place for us to hang out!
3. Many Coursera and edX courses, such as
https://www.coursera.org/learn/big-data-integration-processing/lecture/uW2js/spark-streaming,
are good resources. I also used Safari Books to develop the content.
HDInsight link: https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-hdinsight-hadoop-use-portal
Spark feature engineering link: https://spark.apache.org/docs/2.1.0/ml-features.html
4. Microsoft Cloud Data Platform & some of the things I care about. We’ll talk about some of
these highlighted boxes. The R Services will be covered in the R UG meetup on 6/9.
5. Intelligent Cloud.
This is not only a pretty picture of the components of the Cortana Intelligence Suite; it also
serves as an architecture diagram.
6. https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-hdinsight-hadoop-use-portal.
I’ll show you LIVE how to provision an HDInsight Spark cluster using the Azure Portal. The
demo covers what you need to do to provision a new cluster that uses an existing Data Lake
Store. The key is knowing how to hook up Data Lake Store correctly, with the appropriate
permissions assigned. I’ll also show you the other cluster types that HDInsight supports,
the HDInsight dashboards, Jupyter Notebook, and how to ssh to the server.
The beauty of HDInsight is that it’s a Platform as a Service (PaaS) Hadoop. It separates the
data store from the compute, so the data persists independently even after the compute is
deleted. This separation helps to minimize consumption, because the cluster can be deleted
when it is not in use.
Please sign up to use Azure for free, or use it through MSDN or your organization’s subscription.
7. Other interesting questions for you to find out.
• How do I work with data in Spark?
• How do I write Spark programs?
• What are Notebooks?
• How do I query data in Spark using SQL?
13. Apache Spark runs a wide range of machine learning algorithms through MLlib.
The driver program delegates the work to the executors. Each executor processes the parts of
the files that it reads from external storage, and stores both the file data and the
transformed data in cluster memory.
The machine learning algorithm in this case is designed to run in a highly distributed,
fault-tolerant environment, where the data itself is stored across the cluster and the
operations of the algorithm are distributed across the Spark cluster too.
14. These are the concepts we will cover, mostly briefly, today.
Spark MLlib covers pretty much all of the major groups of ML algorithms, whether
that’s classification, regression, recommender systems, or unsupervised learning
such as clustering.
<click>
15. (We’ll go into detail about logistic regression for binary classification because it’s
our demo.)
We’ll explain the main components of an ML workflow and ML pipeline and the
motivations behind these.
We’ll talk about the code framework of how ML and streaming data come together
in Spark and, in fact, how convenient it is to use Spark in the streaming
environment. It makes ML that much more fun and that much more bleeding edge!
<click>
19. We’re not going to demo regression, but the slides are here for your reference.
Let’s talk about regression.
<click>
20. Understanding Regression
What are we trying to do with regression?
We are trying to find the line of best fit through a scatter plot of points that relates one
or many variables.
In order to find the line of best fit, we need to calculate a couple of things about the line.
<CLICK>
We need to calculate the slope of the line m
<CLICK>
We also need to calculate the intercept with the y axis c
<CLICK>
In order to do this, we need to measure the y-distance of each of the points from a candidate
line and then sum the errors (i.e. the distances to the line).
<CLICK>
So we begin with the equation of the line, y = mx + c. Remember it from school?
We use the concept of Ordinary Least Squares by summing the squares of the y-distances
between the points and the line.
A bit of maths: we can rearrange the formula to give us beta (or m) in terms of the
number of points n, x and y. The resulting line minimises the mean squared error between
the line and the points, and will be the best predictor for all of the points in the training set
21. and future feature vectors.
As such, we will predict y_{n+1} from x_{n+1}, where x_{n+1} is a feature vector.
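The least-squares calculation described above can be sketched in a few lines of plain Python. This is only an illustration for the single-feature case, not the code from the slides: the function names and the sample points are made up, and the slope is computed as the covariance of x and y divided by the variance of x.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + c with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope m: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept c: the fitted line passes through the point of means.
    c = mean_y - m * mean_x
    return m, c


def predict(m, c, x_new):
    """Predict y for a new x using the fitted line."""
    return m * x_new + c


# Made-up points that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
m, c = fit_line(xs, ys)
y_next = predict(m, c, 5.0)  # prediction for the next, unseen x
```

With points that sit exactly on a line, the fit recovers that line's slope and intercept; with noisy points it returns the line that minimises the summed squared y-distances.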
22. Understanding Linear Regression
We derive a “cost function”, which is used in conjunction with the feature vector.
We can apply Stochastic Gradient Descent (SGD) to iteratively find the best fit for the line.
The technique here is that we take several features, which exist in n dimensions, and we “fit”
them to our linear regression model, which enables us to use SGD.
SGD allows us to “converge” on the “minimum”, and this leads us to the multidimensional
line of best fit.
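To make the idea of converging on the minimum concrete, here is a minimal plain-Python sketch of SGD for linear regression with a squared-error cost. It is not Spark's implementation: the function name, step size, epoch count and the made-up data set are all assumptions chosen for illustration.

```python
import random


def sgd_linear_regression(rows, step_size=0.05, epochs=500, seed=0):
    """rows is a list of (features, y) pairs; returns (weights, bias)."""
    rng = random.Random(seed)
    rows = list(rows)  # copy so shuffling does not reorder the caller's list
    weights = [0.0] * len(rows[0][0])
    bias = 0.0
    for _ in range(epochs):
        rng.shuffle(rows)  # visit the points in a new random order each epoch
        for features, y in rows:
            prediction = bias + sum(w * x for w, x in zip(weights, features))
            error = prediction - y
            # Gradient of the cost 0.5 * error**2 for this single point.
            weights = [w - step_size * error * x
                       for w, x in zip(weights, features)]
            bias -= step_size * error
    return weights, bias


# Made-up, noise-free data generated from y = 3*x1 - 2*x2 + 1.
data = [((x1, x2), 3 * x1 - 2 * x2 + 1)
        for x1 in (0.0, 1.0, 2.0) for x2 in (0.0, 1.0, 2.0)]
weights, bias = sgd_linear_regression(data)
```

Each update nudges the weights against the gradient for one point, so the weights drift towards the values that minimise the cost over the whole training set.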
23. Types of regression
Here are a few regression algorithms, selected based on their popularity.
This is not an exhaustive list. A larger list is below.
Least squares linear regression (LR)
Decision trees (TREE)
Bagging trees (BAGTREE)
Boosting trees (BOOST)
Neural networks (NN)
Extreme Learning Machines (ELM)
Support Vector Regression (SVR)
Kernel Ridge Regression (KRR), aka Least Squares SVM
Relevance Vector Machine (RVM)
Gaussian Process Regression (GPR)
Variational Heteroscedastic Gaussian Process Regression (VHGPR)
25. Looking at R code
<CLICK>
Read in a file – remember this is read into the memory of the current process
<CLICK>
Convert to a dataframe: this needs more memory, so hope we don’t run out
<CLICK>
Run a linear regression – this may take 10-15 minutes depending on the weight data
<CLICK>
26. This code is Scala and you would find this executing in Apache Spark
<CLICK>
This reads a file into an RDD – if the file is large it will be read in across a number of worker
nodes
<CLICK>
We’ve skipped a step here for brevity, but we would normally detail it: using a classifier
28. Vectors can be dense or sparse.
RDD: Resilient Distributed Dataset
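The dense versus sparse distinction can be illustrated without Spark. In this rough sketch a dense vector is a plain list, and a sparse vector is a (size, indices, values) triple that stores only the non-zero entries, which is the same idea behind Spark's sparse vectors. The helper names are made up for illustration.

```python
def to_sparse(dense):
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_dense(size, indices, values):
    """Rebuild the full list, filling the missing positions with zeros."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


def sparse_dot(sparse, weights):
    """Dot product that only touches the stored (non-zero) entries."""
    _, indices, values = sparse
    return sum(v * weights[i] for i, v in zip(indices, values))


dense = [1.0, 0.0, 0.0, 3.0, 0.0]
sparse = to_sparse(dense)
```

Sparse storage pays off when most entries are zero, for example after categorical variables have been expanded into many indicator columns.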
29. We’ll show this live using Azure ML Notebook.
Approx 32,000 rows and 15 columns. We’ll use a subset of columns in our demo.
The problem is to learn from the training data to predict whether the income is more
than $50K a year or not. It’s a binary classification problem in ML.
The data is assumed to be transformed into the format that Spark ML understands, that
is, vectors. The categorical variables will be replaced by indexes.
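As a rough picture of what "replaced by indexes" means, the plain-Python sketch below maps each category to a number (most frequent category first) and then turns each row into a label plus a numeric feature vector. The column names and the three rows are made up for illustration and are not the demo data; the demo itself does this step with Spark.

```python
from collections import Counter


def fit_index(values):
    """Map each category to an index, most frequent category first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical rows; "age", "education" and "income" are assumed column names.
rows = [
    {"age": 39, "education": "Bachelors", "income": "<=50K"},
    {"age": 50, "education": "HS-grad", "income": ">50K"},
    {"age": 38, "education": "HS-grad", "income": "<=50K"},
]

education_index = fit_index(row["education"] for row in rows)


def to_labeled_point(row):
    """Return (label, features): label 1.0 means income above $50K."""
    label = 1.0 if row["income"] == ">50K" else 0.0
    features = [float(row["age"]), education_index[row["education"]]]
    return label, features


points = [to_labeled_point(row) for row in rows]
```

After this step every row is a label and a list of numbers, which is the shape a binary classifier such as logistic regression expects.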
35. Spark ML Pipelines
The 4-stage process can be broken down into the following components. This forms the
machine learning workflow and is used to determine the best fit for a model and work out
its efficacy.
Step 1: Ingest data from a source – this is usually a file-based source
<CLICK>
Step 2: Extract features – this will involve preprocessing of the file and determination of
which features in what form are necessary for the machine learning
<CLICK>
Step 3: Train model where you take training data and build a model from this to enable
future predictions
<CLICK>
Step 4: Validate the model to determine whether you can predict using new data of some
sort
37. Spark ML Pipelines
A Transformer is an algorithm which can transform one DataFrame into
another DataFrame. For example, an ML model is a Transformer which takes in a DataFrame
of feature vectors and outputs a DataFrame with a set of predictions.
<CLICK>
An Estimator is a machine learning algorithm which can be fit on a DataFrame to
“learn” and produce a model; the model it produces is itself a Transformer.
<CLICK>
An Evaluator uses metrics to test a model’s predictions and see whether the model is good or
bad.
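The three roles can be mimicked in plain Python to show how they fit together. This is a toy sketch and not the Spark ML API: a list of dicts stands in for a DataFrame, and the class names, the threshold "algorithm" and the data are all invented for illustration.

```python
class Scaler:
    """Transformer: adds a 'scaled' column to every row."""

    def __init__(self, factor):
        self.factor = factor

    def transform(self, rows):
        return [dict(row, scaled=row["x"] / self.factor) for row in rows]


class ThresholdModel:
    """Transformer produced by the estimator: adds a 'prediction' column."""

    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [dict(row, prediction=1.0 if row["scaled"] > self.threshold else 0.0)
                for row in rows]


class ThresholdEstimator:
    """Estimator: learns a threshold from labelled rows and returns a model."""

    def fit(self, rows):
        positives = [row["scaled"] for row in rows if row["label"] == 1.0]
        negatives = [row["scaled"] for row in rows if row["label"] == 0.0]
        # Put the threshold halfway between the two class means.
        threshold = (sum(positives) / len(positives)
                     + sum(negatives) / len(negatives)) / 2
        return ThresholdModel(threshold)


class AccuracyEvaluator:
    """Evaluator: fraction of rows where the prediction equals the label."""

    def evaluate(self, rows):
        correct = sum(1 for row in rows if row["prediction"] == row["label"])
        return correct / len(rows)


training = [{"x": 10, "label": 0.0}, {"x": 20, "label": 0.0},
            {"x": 80, "label": 1.0}, {"x": 90, "label": 1.0}]
testing = [{"x": 30, "label": 0.0}, {"x": 70, "label": 1.0},
           {"x": 40, "label": 1.0}]

scaler = Scaler(factor=100)
model = ThresholdEstimator().fit(scaler.transform(training))
accuracy = AccuracyEvaluator().evaluate(model.transform(scaler.transform(testing)))
```

The point of the split is that the same transform steps run on training and testing data, while only the estimator ever looks at the training labels to learn.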
38. This is where we enjoy the convenience of the streaming environment that Spark provides.
40. Place the training and testing folders (and their sub-folders) under the root folder,
which depends on whether you use WASB or ADL.
Root folder: https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-hdinsight-hadoop-use-portal
SGD: Stochastic gradient descent.
StreamingLogisticRegressionWithSGD inherits methods from
org.apache.spark.mllib.regression.StreamingLinearAlgorithm
import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
val model = new StreamingLogisticRegressionWithSGD()
  .setStepSize(0.5)
  .setNumIterations(10)
  .setInitialWeights(Vectors.dense(...))
// trainOn returns Unit, so call it on the model rather than chaining it into the val.
// DStream stands for your stream of labelled training points.
model.trainOn(DStream)
You can load the latest model to start or you can set the initial weights.
To load the latest model, use .latestModel()
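To show what training on a stream amounts to, here is a plain-Python sketch of a logistic regression model that is updated one batch at a time. It is an illustration only and does not reproduce Spark's optimizer: it takes plain gradient steps on each arriving batch, and the class name, the batches and the leading 1.0 used as a bias feature are all assumptions. The step size and iteration count mirror the values set in the snippet above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class OnlineLogisticRegression:
    """Keeps one weight vector and refines it on every incoming batch."""

    def __init__(self, initial_weights, step_size=0.5, num_iterations=10):
        self.weights = list(initial_weights)
        self.step_size = step_size
        self.num_iterations = num_iterations

    def train_on_batch(self, batch):
        """batch is a list of (label, features) pairs with labels 0.0 or 1.0."""
        if not batch:
            return  # an empty batch leaves the model unchanged
        for _ in range(self.num_iterations):
            gradient = [0.0] * len(self.weights)
            for label, features in batch:
                z = sum(w * x for w, x in zip(self.weights, features))
                error = sigmoid(z) - label
                for j, x in enumerate(features):
                    gradient[j] += error * x
            # Step against the average gradient of the log-loss over the batch.
            self.weights = [w - self.step_size * g / len(batch)
                            for w, g in zip(self.weights, gradient)]

    def predict(self, features):
        z = sum(w * x for w, x in zip(self.weights, features))
        return 1.0 if sigmoid(z) >= 0.5 else 0.0


# Made-up batches; each feature vector starts with 1.0 to act as a bias term.
batches = [
    [(0.0, [1.0, -2.0]), (1.0, [1.0, 2.0])],
    [(0.0, [1.0, -1.0]), (1.0, [1.0, 1.5])],
]
model = OnlineLogisticRegression(initial_weights=[0.0, 0.0])
for batch in batches:
    model.train_on_batch(batch)  # the weights carry over between batches
```

Because the weights carry over from one batch to the next, the model keeps learning as new files arrive, which is why you either set initial weights or start from the latest model.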
41. ssc.start()
ssc.awaitTerminationOrTimeout(10) is an alternative.
ssc.stop()
Instead of using ssc.stop() at the console, you can also use ctrl-c to interrupt the process
and exit() to quit PySpark. Type exit one more time (at the Linux prompt) to exit the ssh
console properly.
44. An example of Kmeans & streaming.
<CLICK>
We make an input stream of vectors for training, as well as a stream of labelled data points
for testing - this isn't shown in the code segment below. We assume a StreamingContext
ssc has been created already.
<CLICK>
We create a model with random clusters and specify the number of clusters to find.
<CLICK>
Now register the streams for training and testing and start the job, printing the predicted
cluster assignments on new data points as they arrive.
<CLICK>
As you add new text files with data the cluster centers will update. Each training point
should be formatted as [x1, x2, x3], and each test data point should be formatted as (y, [x1,
x2, x3]), where y is some useful label or identifier (e.g. a cluster assignment). Anytime a
text file is placed in training dir the model will update. Anytime a text file is placed in test
dir you will see predictions. With new data, the cluster centers will change.
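The two text formats and the way the centers move with each new file can be sketched in plain Python. This is an illustration, not the Spark code from the slide: the update is modeled on the weighted-mean rule described in Spark's streaming k-means documentation, and the function names, the decay parameter (left at 1.0, meaning no forgetting) and the sample points are assumptions.

```python
def parse_training_line(line):
    """'[x1, x2, x3]' -> [x1, x2, x3]"""
    return [float(token) for token in line.strip().strip("[]").split(",")]


def parse_test_line(line):
    """'(y, [x1, x2, x3])' -> (y as text, [x1, x2, x3])"""
    label_text, vector_text = line.strip().strip("()").split(",", 1)
    return label_text.strip(), parse_training_line(vector_text)


def nearest(centers, point):
    """Index of the center closest to the point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda k: sum((c - p) ** 2 for c, p in zip(centers[k], point)))


def update_centers(centers, counts, batch, decay=1.0):
    """Fold one batch of points into the centers; returns (centers, counts)."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    batch_counts = [0] * len(centers)
    for point in batch:
        j = nearest(centers, point)
        batch_counts[j] += 1
        for d in range(dim):
            sums[j][d] += point[d]
    new_centers, new_counts = [], []
    for j, center in enumerate(centers):
        old_weight = counts[j] * decay
        if batch_counts[j] == 0:
            # No new points for this cluster: keep its center as it is.
            new_centers.append(list(center))
            new_counts.append(old_weight)
            continue
        total = old_weight + batch_counts[j]
        # Weighted mean of the old center and the new points assigned to it.
        new_centers.append([(center[d] * old_weight + sums[j][d]) / total
                            for d in range(dim)])
        new_counts.append(total)
    return new_centers, new_counts


# Two made-up training "files", each a list of text lines.
file_1 = ["[1.0, 1.0]", "[3.0, 3.0]", "[9.0, 9.0]"]
file_2 = ["[2.0, 4.0]", "[11.0, 11.0]"]

centers, counts = [[0.0, 0.0], [10.0, 10.0]], [0.0, 0.0]
for lines in (file_1, file_2):
    batch = [parse_training_line(line) for line in lines]
    centers, counts = update_centers(centers, counts, batch)

label, vector = parse_test_line("(1.0, [0.1, 0.2])")
predicted_cluster = nearest(centers, vector)
```

Each new training file pulls the centers towards its points, weighted by how many points each cluster has already seen, while a test file only reads the current centers to assign clusters.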
45. We’ll run the entire ML system live. You’ll see how to process training and testing, and how
to understand the output from the console as well as the predictions (files).