© 2016 MapR Technologies 10-1
Machine Learning with Apache Spark
Agenda
• Brief overview of
• Classification
• Clustering
• Collaborative Filtering
• Predicting Flight Delays using a Decision Tree
What is MLlib?
Spark SQL
• Structured data
• Querying with SQL/HQL
• DataFrames
Spark Streaming
• Processing of live streams
• Micro-batching
MLlib
• Machine learning
• Multiple types of ML algorithms
GraphX
• Graph processing
• Graph-parallel computations
Spark Core
• RDD transformations and actions
• Task scheduling
• Memory management
• Fault recovery
• Interacting with storage systems
MLlib Algorithms and Utilities
• Basic statistics: summary statistics, correlations, hypothesis testing, random data generation
• Classification and regression: linear models, decision trees, and Naïve Bayes
• Collaborative filtering: model-based collaborative filtering using the alternating least squares (ALS) algorithm
• Clustering: K-means clustering
• Dimensionality reduction: singular value decomposition (SVD) and principal component analysis (PCA) on the RowMatrix class
• Feature extraction and transformation: several classes for common feature transformations
Examples of ML Algorithms
Machine Learning
Supervised
• Classification
– Naïve Bayes
– SVM
– Random Decision Forests
• Regression
– Linear
– Logistic
Unsupervised
• Clustering
– K-means
• Dimensionality reduction
– Principal Component Analysis
– SVD
Three Categories of Techniques for Machine Learning
• Classification
• Clustering
• Collaborative Filtering (Recommendation)
Machine Learning: Classification
Classification identifies the category for an item.
Classification: Definition
A form of machine learning that:
• Identifies which category an item belongs to
• Uses supervised learning algorithms
– The training data is labeled
• Example: sentiment classification
If It Walks/Swims/Quacks Like a Duck ... Then It Must Be a Duck
Features: walks, swims, quacks
Building and Deploying a Classifier Model
Image reference: O'Reilly, Learning Spark
Training Data
• Spam: "free money now!", "get this money", "free savings $$$"
• Non-spam: "how are you?", "that Spark job", "lunch plans"
Training Data -> Featurization -> Feature Vectors -> Training -> Model -> Model Evaluation -> Best Model
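The featurization step in the pipeline above can be sketched in plain Scala. This is a minimal illustration, not the slides' code: the `Featurize` object, the `featurize` function, and the choice of 16 buckets are all assumptions, and the hashing trick used here is only one of several ways to turn text into a fixed-length vector.

```scala
object Featurize {
  // Map a message to a fixed-length term-frequency vector using the hashing trick.
  // numFeatures is an arbitrary illustrative choice, not a value from the slides.
  def featurize(text: String, numFeatures: Int = 16): Array[Double] = {
    val vec = Array.fill(numFeatures)(0.0)
    text.toLowerCase.split("\\s+").filter(_.nonEmpty).foreach { token =>
      // Non-negative bucket index derived from the token's hash code
      val idx = ((token.hashCode % numFeatures) + numFeatures) % numFeatures
      vec(idx) += 1.0
    }
    vec
  }

  // (label, features) pairs built from the slide's messages: 1.0 = spam, 0.0 = non-spam
  val training: Seq[(Double, Array[Double])] = Seq(
    "free money now!" -> 1.0,
    "get this money" -> 1.0,
    "free savings $$$" -> 1.0,
    "how are you?" -> 0.0,
    "that Spark job" -> 0.0,
    "lunch plans" -> 0.0
  ).map { case (text, label) => (label, featurize(text)) }
}
```

Each labeled vector in `Featurize.training` is the kind of input a training algorithm would consume in the next pipeline stage.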
Machine Learning: Clustering
Clustering: Definition
• Unsupervised learning task
• Groups objects into clusters of high similarity
– Search results grouping
– Grouping of customers
– Anomaly detection
– Text categorization
Clustering: Example
• Group similar objects
• Use MLlib K-means algorithm
1. Initialize the cluster centers (centroids)
2. Assign all points to the nearest centroid
3. Update each centroid to the center of its assigned points
4. Repeat until the stopping conditions are met
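The four steps above can be sketched in plain Scala on 2-D points. This is an illustrative sketch rather than MLlib's implementation: the `KMeansSketch` object and its function names are hypothetical, the initial centroids (step 1) are passed in by the caller, and the loop stops when the centroids no longer change or after an assumed `maxIter` iterations.

```scala
object KMeansSketch {
  type Point = (Double, Double)

  // Squared Euclidean distance between two points
  private def dist2(a: Point, b: Point): Double = {
    val dx = a._1 - b._1
    val dy = a._2 - b._2
    dx * dx + dy * dy
  }

  // Step 2: index of the nearest centroid for a point
  def nearest(p: Point, centroids: Seq[Point]): Int =
    centroids.indices.minBy(i => dist2(p, centroids(i)))

  // Steps 2-4: assign, update, repeat until nothing changes or maxIter is reached
  def run(points: Seq[Point], init: Seq[Point], maxIter: Int = 20): Seq[Point] = {
    var centroids = init
    var changed = true
    var iter = 0
    while (changed && iter < maxIter) {
      val groups = points.groupBy(p => nearest(p, centroids))
      // Step 3: move each centroid to the mean of its assigned points
      val updated = centroids.indices.map { i =>
        groups.get(i) match {
          case Some(ps) => (ps.map(_._1).sum / ps.size, ps.map(_._2).sum / ps.size)
          case None     => centroids(i) // an empty cluster keeps its centroid
        }
      }
      changed = updated != centroids
      centroids = updated
      iter += 1
    }
    centroids
  }
}
```

For example, `KMeansSketch.run(points, Seq((0.0, 0.0), (10.0, 10.0)))` on two well-separated groups of points should return one centroid near each group.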
Collaborative Filtering with Spark
• Recommend items (filtering)
• Based on user preference data (collaborative)
User Item Rating Matrix (- = no rating, ? = rating to predict)
         A    B    C
Ted      4    5    5
Carol    -    5    5
Bob      -    5    ?
Train a Model to Make Predictions
• Ted and Carol like movies B and C
• Bob likes movie B; what might he like?
• Bob likes movie B, so predict C
Training Data -> Algorithm -> Model
New Data -> Model -> Predictions
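The reasoning "Bob likes B, Ted and Carol like B and C, so predict C" can be sketched with a simple co-occurrence score in plain Scala. This is deliberately not the ALS algorithm that MLlib uses; it is a much simpler stand-in chosen to make the idea visible. The `CoOccurrence` object and `recommend` function are hypothetical names, and the placement of Carol's and Bob's ratings under B and C is inferred from the slide text.

```scala
object CoOccurrence {
  // User -> (item -> rating), following the rating matrix on the slide.
  // Assumption: Carol's ratings belong to B and C, and Bob's to B.
  val ratings: Map[String, Map[String, Int]] = Map(
    "Ted"   -> Map("A" -> 4, "B" -> 5, "C" -> 5),
    "Carol" -> Map("B" -> 5, "C" -> 5),
    "Bob"   -> Map("B" -> 5)
  )

  // Score every item the user has not rated by summing the ratings given by
  // other users who share at least one rated item with that user.
  def recommend(user: String): Option[String] = {
    val mine = ratings(user).keySet
    val scores = for {
      (other, theirs) <- ratings.toSeq if other != user && theirs.keySet.exists(mine)
      (item, rating) <- theirs.toSeq if !mine(item)
    } yield (item, rating)
    val totals = scores.groupBy(_._1).map { case (item, rs) => item -> rs.map(_._2).sum }
    if (totals.isEmpty) None else Some(totals.maxBy(_._2)._1)
  }
}
```

Calling `CoOccurrence.recommend("Bob")` scores A with 4 (from Ted) and C with 10 (from Ted and Carol), so C is recommended, matching the slide's prediction.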
Predict Flight Delays
Use Case: Flight Data
• Predict whether a flight is going to be delayed
• Use a decision tree for prediction
– Decision trees are used for classification and regression
– The model is a tree of nodes, with a binary decision at each node
Flight Data
Parse Input
// Define the schema
case class Flight(dofM: String, dofW: String, carrier: String, tailnum: String,
  flnum: Int, org_id: String, origin: String, dest_id: String, dest: String,
  crsdeptime: Double, deptime: Double, depdelaymins: Double, crsarrtime: Double,
  arrtime: Double, arrdelay: Double, crselapsedtime: Double, dist: Int)
// Parse one comma-separated line into a Flight
def parseFlight(str: String): Flight = {
  val line = str.split(",")
  Flight(line(0), line(1), line(2), line(3), line(4).toInt, line(5), line(6),
    line(7), line(8), line(9).toDouble, line(10).toDouble, line(11).toDouble,
    line(12).toDouble, line(13).toDouble, line(14).toDouble, line(15).toDouble,
    line(16).toInt)
}
// Load the file into an RDD (sc is the SparkContext provided by spark-shell)
val rdd = sc.textFile("flights.csv")
// Create an RDD of Flight objects
val flightsRDD = rdd.map(parseFlight).cache()
flightsRDD.take(1)
// Array(Flight(1,3,AA,N338AA,1,12478,JFK,12892,LAX,900.0,914.0,14.0,1225.0,1238.0,13.0,385.0,2475))
Building and Deploying a Classifier Model
Training Data
• Delayed: Friday, LAX, AA
• Not delayed: Wednesday, BNA, Delta
Training Data -> Featurization -> Feature Vectors
Classification Learning Problem - Features
Label: delayed or not delayed (delayed if delay > 40 minutes)
Features: {day_of_month, weekday, crsdeptime, crsarrtime, carrier, crselapsedtime, origin, dest}
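Transform non-numeric features into numeric values
// Create a map of airline carrier -> number
var carrierMap: Map[String, Int] = Map()
var index: Int = 0
flightsRDD.map(flight => flight.carrier).distinct.collect.foreach(
  x => { carrierMap += (x -> index); index += 1 }
)
carrierMap.toString
// String = Map(DL -> 5, US -> 9, AA -> 6, UA -> 4...)
// Create a map of origin airport -> number.
// originMap is used in the feature step but its definition is not shown on the slide;
// it is assumed here to be built the same way as the other two maps.
var originMap: Map[String, Int] = Map()
var index1: Int = 0
flightsRDD.map(flight => flight.origin).distinct.collect.foreach(
  x => { originMap += (x -> index1); index1 += 1 }
)
// Create a map of destination airport -> number
var destMap: Map[String, Int] = Map()
var index2: Int = 0
flightsRDD.map(flight => flight.dest).distinct.collect.foreach(
  x => { destMap += (x -> index2); index2 += 1 }
)
destMap.toString
// Map(JFK -> 214, LAX -> 294, ATL -> 273, MIA -> 175 ...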
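The same category-to-number mapping can be sketched without mutable counters. The `CategoryIndex` object below is a hypothetical plain-Scala helper, not part of the slides or of MLlib. It sorts the distinct values first, which is an added assumption: it makes the assigned indices reproducible, whereas the order of values coming back from a distributed `distinct` may differ between runs.

```scala
object CategoryIndex {
  // Assign each distinct value an index, in sorted order so the result is reproducible
  def index(values: Seq[String]): Map[String, Int] =
    values.distinct.sorted.zipWithIndex.toMap
}
```

For instance, `CategoryIndex.index(Seq("DL", "AA", "UA", "AA"))` yields a map with AA at 0, DL at 1, and UA at 2. On an RDD, the same chain could be applied to the collected array of distinct values.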
MLlib data types:
• Vector: contains the feature data points
• LabeledPoint: contains a feature vector and a label
Define the features, Create LabeledPoint with Vector
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
// Define the features array: the label comes first, followed by the eight features
val mlprep = flightsRDD.map(flight => {
  val monthday = flight.dofM.toInt - 1 // category
  val weekday = flight.dofW.toInt - 1 // category
  val crsdeptime1 = flight.crsdeptime.toInt
  val crsarrtime1 = flight.crsarrtime.toInt
  val carrier1 = carrierMap(flight.carrier) // category
  val crselapsedtime1 = flight.crselapsedtime
  val origin1 = originMap(flight.origin) // category
  val dest1 = destMap(flight.dest) // category
  // Label: 1.0 if the departure delay exceeds 40 minutes, otherwise 0.0
  val delayed = if (flight.depdelaymins > 40) 1.0 else 0.0
  Array(delayed, monthday.toDouble, weekday.toDouble, crsdeptime1.toDouble, crsarrtime1.toDouble,
    carrier1.toDouble, crselapsedtime1, origin1.toDouble, dest1.toDouble)
})
mlprep.take(1)
// Array(Array(0.0, 0.0, 2.0, 900.0, 1225.0, 6.0, 385.0, 214.0, 294.0))
// Wrap each array as a LabeledPoint: element 0 is the label, elements 1 to 8 are the features
val mldata = mlprep.map(x => LabeledPoint(x(0), Vectors.dense(x(1), x(2), x(3), x(4), x(5), x(6),
  x(7), x(8))))
mldata.take(1)
// Array[LabeledPoint] = Array((0.0,[0.0,2.0,900.0,1225.0,6.0,385.0,214.0,294.0]))
ML Cross-Validation Process
Data -> Training Set -> Model Training/Building -> Model
Test Set -> Test Model -> Predictions
Train/Test loop:
• Train the algorithm with the training dataset
• Use the test dataset on the trained algorithm
Build Model
Split data into:
• Training data RDD (80%)
• Test data RDD (20%)
Split Data
// Randomly split the RDD into a training data RDD (80%) and a test data RDD (20%)
val splits = mldata.randomSplit(Array(0.8, 0.2))
val trainingData = splits(0).cache()
val testData = splits(1).cache()
testData.take(1)
// Array[LabeledPoint] = Array((0.0,[18.0,6.0,900.0,1225.0,6.0,385.0,214.0,294.0]))
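A reproducible 80/20 split can be sketched in plain Scala as follows. The `SplitSketch` object and its `split` function are hypothetical names used only for illustration. Unlike `randomSplit`, which samples each element independently and so gives only approximately 80/20 sizes, this sketch shuffles with a fixed seed and cuts at an exact position. `randomSplit` also accepts an optional seed as its second argument if repeatable splits are needed on an RDD.

```scala
import scala.util.Random

object SplitSketch {
  // Shuffle with a fixed seed, then cut at the requested training fraction
  def split[A](data: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
    val shuffled = new Random(seed).shuffle(data)
    val cut = math.round(data.size * trainFraction).toInt
    shuffled.splitAt(cut)
  }
}
```

With ten elements and a training fraction of 0.8, the first part holds eight elements and the second holds two, and the same seed always produces the same partition.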
Build Model
• Use the training set, with labels, to build a model
Build Model
import org.apache.spark.mllib.tree.DecisionTree
// Set the number of categories for each categorical feature (feature index -> category count)
var categoricalFeaturesInfo = Map[Int, Int]()
categoricalFeaturesInfo += (0 -> 31) // dofM: 31 categories
categoricalFeaturesInfo += (1 -> 7) // dofW: 7 categories
categoricalFeaturesInfo += (4 -> carrierMap.size) // number of carriers
categoricalFeaturesInfo += (6 -> originMap.size) // number of origin airports
categoricalFeaturesInfo += (7 -> destMap.size) // number of destination airports
val numClasses = 2
val impurity = "gini"
val maxDepth = 9
val maxBins = 7000
// Call DecisionTree.trainClassifier with trainingData; it returns the model
val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  impurity, maxDepth, maxBins)
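The `"gini"` impurity setting refers to Gini impurity, which for class proportions $p_k$ is $1 - \sum_k p_k^2$. A minimal plain-Scala sketch of how a candidate binary split could be scored with it follows. The `Gini` object and its function names are hypothetical, and this is an illustration of the measure rather than MLlib's internal code.

```scala
object Gini {
  // Gini impurity of a set of class labels: 1 - sum over classes of p_k^2
  def impurity(labels: Seq[Double]): Double =
    if (labels.isEmpty) 0.0
    else {
      val n = labels.size.toDouble
      1.0 - labels.groupBy(identity).values.map(g => math.pow(g.size / n, 2)).sum
    }

  // Size-weighted impurity of the two sides of a binary split (lower is better)
  def splitImpurity(left: Seq[Double], right: Seq[Double]): Double = {
    val n = (left.size + right.size).toDouble
    if (n == 0) 0.0
    else (left.size / n) * impurity(left) + (right.size / n) * impurity(right)
  }
}
```

An evenly mixed node such as `Seq(0.0, 0.0, 1.0, 1.0)` has impurity 0.5, while a split that separates the two classes perfectly has a weighted impurity of 0.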
Build Model
// Print out the decision tree
model.toDebugString
// Feature indices: 0 = dofM, 3 = crsarrtime1, 4 = carrier, 6 = origin
// res20: String =
// DecisionTreeModel classifier of depth 9 with 919 nodes
//   If (feature 0 in {11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0,20.0,21.0,22.0,23.0,24.0,25.0,26.0,27.0,30.0})
//    If (feature 4 in {0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,13.0})
//     If (feature 3 <= 1603.0)
//      If (feature 0 in {11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0})
//       If (feature 6 in {0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,10.0,11.0,12.0,13.0...
Get Predictions
Test Data (without label) -> Model -> Predict: Delay or Not
Get Predictions
// Get predictions: create an RDD of (test label, test prediction) pairs
val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
labelAndPreds.take(1)
// (Label, Prediction)
// Array((0.0,0.0))
To Learn More:
• Download example code
– https://github.com/caroljmcdonald/sparkmldecisiontree
• Read explanation of example code
– https://www.mapr.com/blog/apache-spark-machine-learning-tutorial
• Engage with us!
– https://www.mapr.com/blog/author/carol-mcdonald
– https://community.mapr.com
Test Model
// Get the instances where label != prediction
val wrongPrediction = labelAndPreds.filter {
  case (label, prediction) => label != prediction
}
val wrong = wrongPrediction.count()
// res35: Long = 11040
val ratioWrong = wrong.toDouble / testData.count()
// ratioWrong: Double = 0.3157443157443157
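The error ratio above can be complemented with a per-class breakdown. The sketch below works on an in-memory sequence of (label, prediction) pairs in plain Scala. The `Evaluate` object and its function names are hypothetical and only illustrate the arithmetic; they are not the slides' code. An error ratio of about 0.316, as reported above, corresponds to an accuracy of roughly 0.684.

```scala
object Evaluate {
  // Fraction of (label, prediction) pairs that disagree
  def errorRate(pairs: Seq[(Double, Double)]): Double =
    if (pairs.isEmpty) 0.0
    else pairs.count { case (label, pred) => label != pred }.toDouble / pairs.size

  // Accuracy is the complement of the error rate
  def accuracy(pairs: Seq[(Double, Double)]): Double = 1.0 - errorRate(pairs)

  // Confusion counts keyed by (label, prediction)
  def confusion(pairs: Seq[(Double, Double)]): Map[(Double, Double), Int] =
    pairs.groupBy(identity).map { case (key, group) => key -> group.size }
}
```

The confusion counts show whether the errors are mostly delayed flights predicted as on time or the reverse, which a single error ratio cannot reveal.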

Weitere ähnliche Inhalte

Was ist angesagt?

Apache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesApache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesCarol McDonald
 
Fast Cars, Big Data How Streaming can help Formula 1
Fast Cars, Big Data How Streaming can help Formula 1Fast Cars, Big Data How Streaming can help Formula 1
Fast Cars, Big Data How Streaming can help Formula 1Carol McDonald
 
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...Codemotion
 
Advanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataAdvanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataCarol McDonald
 
Applying Machine Learning to Live Patient Data
Applying Machine Learning to  Live Patient DataApplying Machine Learning to  Live Patient Data
Applying Machine Learning to Live Patient DataCarol McDonald
 
Cognitive computing with big data, high tech and low tech approaches
Cognitive computing with big data, high tech and low tech approachesCognitive computing with big data, high tech and low tech approaches
Cognitive computing with big data, high tech and low tech approachesTed Dunning
 
How Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareHow Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareCarol McDonald
 
Streaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APIStreaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APICarol McDonald
 
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...Carol McDonald
 
Getting Started with HBase
Getting Started with HBaseGetting Started with HBase
Getting Started with HBaseCarol McDonald
 
Streaming patterns revolutionary architectures
Streaming patterns revolutionary architectures Streaming patterns revolutionary architectures
Streaming patterns revolutionary architectures Carol McDonald
 
Real time-hadoop
Real time-hadoopReal time-hadoop
Real time-hadoopTed Dunning
 
Predicting Flight Delays with Spark Machine Learning
Predicting Flight Delays with Spark Machine LearningPredicting Flight Delays with Spark Machine Learning
Predicting Flight Delays with Spark Machine LearningCarol McDonald
 
On-Prem Solution for the Selection of Wind Energy Models
On-Prem Solution for the Selection of Wind Energy ModelsOn-Prem Solution for the Selection of Wind Energy Models
On-Prem Solution for the Selection of Wind Energy ModelsDatabricks
 
What's new in Apache Mahout
What's new in Apache MahoutWhat's new in Apache Mahout
What's new in Apache MahoutTed Dunning
 
Apache Spark streaming and HBase
Apache Spark streaming and HBaseApache Spark streaming and HBase
Apache Spark streaming and HBaseCarol McDonald
 
NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill Carol McDonald
 

Was ist angesagt? (20)

Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
 
Apache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesApache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision Trees
 
Fast Cars, Big Data How Streaming can help Formula 1
Fast Cars, Big Data How Streaming can help Formula 1Fast Cars, Big Data How Streaming can help Formula 1
Fast Cars, Big Data How Streaming can help Formula 1
 
MapR & Skytree:
MapR & Skytree: MapR & Skytree:
MapR & Skytree:
 
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
 
MapR 5.2 Product Update
MapR 5.2 Product UpdateMapR 5.2 Product Update
MapR 5.2 Product Update
 
Advanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataAdvanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming Data
 
Applying Machine Learning to Live Patient Data
Applying Machine Learning to  Live Patient DataApplying Machine Learning to  Live Patient Data
Applying Machine Learning to Live Patient Data
 
Cognitive computing with big data, high tech and low tech approaches
Cognitive computing with big data, high tech and low tech approachesCognitive computing with big data, high tech and low tech approaches
Cognitive computing with big data, high tech and low tech approaches
 
How Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareHow Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health Care
 
Streaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APIStreaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka API
 
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
 
Getting Started with HBase
Getting Started with HBaseGetting Started with HBase
Getting Started with HBase
 
Streaming patterns revolutionary architectures
Streaming patterns revolutionary architectures Streaming patterns revolutionary architectures
Streaming patterns revolutionary architectures
 
Real time-hadoop
Real time-hadoopReal time-hadoop
Real time-hadoop
 
Predicting Flight Delays with Spark Machine Learning
Predicting Flight Delays with Spark Machine LearningPredicting Flight Delays with Spark Machine Learning
Predicting Flight Delays with Spark Machine Learning
 
On-Prem Solution for the Selection of Wind Energy Models
On-Prem Solution for the Selection of Wind Energy ModelsOn-Prem Solution for the Selection of Wind Energy Models
On-Prem Solution for the Selection of Wind Energy Models
 
What's new in Apache Mahout
What's new in Apache MahoutWhat's new in Apache Mahout
What's new in Apache Mahout
 
Apache Spark streaming and HBase
Apache Spark streaming and HBaseApache Spark streaming and HBase
Apache Spark streaming and HBase
 
NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill
 

Ähnlich wie Free Code Friday - Machine Learning with Apache Spark

Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezMapR Technologies
 
Big Data Everywhere Chicago: SQL on Hadoop
Big Data Everywhere Chicago: SQL on Hadoop Big Data Everywhere Chicago: SQL on Hadoop
Big Data Everywhere Chicago: SQL on Hadoop BigDataEverywhere
 
Putting Apache Drill into Production
Putting Apache Drill into ProductionPutting Apache Drill into Production
Putting Apache Drill into ProductionMapR Technologies
 
Analysis of Popular Uber Locations using Apache APIs: Spark Machine Learning...
Analysis of Popular Uber Locations using Apache APIs:  Spark Machine Learning...Analysis of Popular Uber Locations using Apache APIs:  Spark Machine Learning...
Analysis of Popular Uber Locations using Apache APIs: Spark Machine Learning...Carol McDonald
 
Cleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkCleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkVince Gonzalez
 
Apache SystemML Optimizer and Runtime techniques by Arvind Surve and Matthias...
Apache SystemML Optimizer and Runtime techniques by Arvind Surve and Matthias...Apache SystemML Optimizer and Runtime techniques by Arvind Surve and Matthias...
Apache SystemML Optimizer and Runtime techniques by Arvind Surve and Matthias...Arvind Surve
 
Apache SystemML Optimizer and Runtime techniques by Arvind Surve and Matthias...
Apache SystemML Optimizer and Runtime techniques by Arvind Surve and Matthias...Apache SystemML Optimizer and Runtime techniques by Arvind Surve and Matthias...
Apache SystemML Optimizer and Runtime techniques by Arvind Surve and Matthias...Arvind Surve
 
PGQL: A Language for Graphs
PGQL: A Language for GraphsPGQL: A Language for Graphs
PGQL: A Language for GraphsJean Ihm
 
Analyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillAnalyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillTomer Shiran
 
Querying Network Packet Captures with Spark and Drill
Querying Network Packet Captures with Spark and DrillQuerying Network Packet Captures with Spark and Drill
Querying Network Packet Captures with Spark and DrillVince Gonzalez
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaData Science Thailand
 
Apache Spark Machine Learning
Apache Spark Machine LearningApache Spark Machine Learning
Apache Spark Machine LearningCarol McDonald
 
Scrap Your MapReduce - Apache Spark
 Scrap Your MapReduce - Apache Spark Scrap Your MapReduce - Apache Spark
Scrap Your MapReduce - Apache SparkIndicThreads
 
Analyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillAnalyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache Drilltshiran
 
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!Tugdual Grall
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkCloudera, Inc.
 
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0vithakur
 

Ähnlich wie Free Code Friday - Machine Learning with Apache Spark (20)

Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco Vasquez
 
Spark graphx
Spark graphxSpark graphx
Spark graphx
 
Big Data Everywhere Chicago: SQL on Hadoop
Big Data Everywhere Chicago: SQL on Hadoop Big Data Everywhere Chicago: SQL on Hadoop
Big Data Everywhere Chicago: SQL on Hadoop
 
Apache Spark & Hadoop
Apache Spark & HadoopApache Spark & Hadoop
Apache Spark & Hadoop
 
Putting Apache Drill into Production
Putting Apache Drill into ProductionPutting Apache Drill into Production
Putting Apache Drill into Production
 
Analysis of Popular Uber Locations using Apache APIs: Spark Machine Learning...
Analysis of Popular Uber Locations using Apache APIs:  Spark Machine Learning...Analysis of Popular Uber Locations using Apache APIs:  Spark Machine Learning...
Analysis of Popular Uber Locations using Apache APIs: Spark Machine Learning...
 
Cleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkCleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - Spark
 
Apache SystemML Optimizer and Runtime techniques by Arvind Surve and Matthias...
Apache SystemML Optimizer and Runtime techniques by Arvind Surve and Matthias...Apache SystemML Optimizer and Runtime techniques by Arvind Surve and Matthias...
Apache SystemML Optimizer and Runtime techniques by Arvind Surve and Matthias...
 
Apache SystemML Optimizer and Runtime techniques by Arvind Surve and Matthias...
Apache SystemML Optimizer and Runtime techniques by Arvind Surve and Matthias...Apache SystemML Optimizer and Runtime techniques by Arvind Surve and Matthias...
Apache SystemML Optimizer and Runtime techniques by Arvind Surve and Matthias...
 
PGQL: A Language for Graphs
PGQL: A Language for GraphsPGQL: A Language for Graphs
PGQL: A Language for Graphs
 
Analyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillAnalyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache Drill
 
Querying Network Packet Captures with Spark and Drill
Querying Network Packet Captures with Spark and DrillQuerying Network Packet Captures with Spark and Drill
Querying Network Packet Captures with Spark and Drill
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data Sciencea
 
Apache Spark Machine Learning
Apache Spark Machine LearningApache Spark Machine Learning
Apache Spark Machine Learning
 
Scrap Your MapReduce - Apache Spark
 Scrap Your MapReduce - Apache Spark Scrap Your MapReduce - Apache Spark
Scrap Your MapReduce - Apache Spark
 
Analyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillAnalyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache Drill
 
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache Spark
 
Is Spark Replacing Hadoop
Is Spark Replacing HadoopIs Spark Replacing Hadoop
Is Spark Replacing Hadoop
 
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
 

Mehr von MapR Technologies

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscapeMapR Technologies
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationMapR Technologies
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataMapR Technologies
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureMapR Technologies
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...MapR Technologies
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsMapR Technologies
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMapR Technologies
 
Data Warehouse Modernization: Accelerating Time-To-Action
Free Code Friday - Machine Learning with Apache Spark

  • 1. Machine Learning with Apache Spark (© 2016 MapR Technologies)
  • 3. Agenda
      – Brief overview of classification, clustering, and collaborative filtering
      – Predicting flight delays using a decision tree
  • 4. What is MLlib?
      – Spark SQL: structured data; querying with SQL/HQL; DataFrames
      – Spark Streaming: processing of live streams; micro-batching
      – MLlib: machine learning; multiple types of ML algorithms
      – GraphX: graph processing; graph-parallel computations
      – Spark Core: RDD transformations and actions; task scheduling; memory management; fault recovery; interacting with storage systems
  • 5. MLlib Algorithms and Utilities
      – Basic statistics: summary statistics, correlations, hypothesis testing, random data generation
      – Classification and regression: methods for linear models, decision trees, and Naïve Bayes
      – Collaborative filtering: model-based collaborative filtering using the alternating least squares (ALS) algorithm
      – Clustering: K-means clustering
      – Dimensionality reduction: singular value decomposition (SVD) and principal component analysis (PCA) on the RowMatrix class
      – Feature extraction and transformation: several classes for common feature transformations
  • 6. Examples of ML Algorithms
      – Supervised
          Classification: Naïve Bayes, SVM, random decision forests
          Regression: linear, logistic
      – Unsupervised
          Clustering: K-means
          Dimensionality reduction: principal component analysis, SVD
  • 9. Three Categories of Techniques for Machine Learning: classification, clustering, and collaborative filtering (recommendation)
  • 10. Machine Learning: Classification. Classification identifies the category for an item.
  • 11. Classification: Definition. A form of ML that:
      – Identifies which category an item belongs to
      – Uses supervised learning algorithms, so the data is labeled
      – Example shown on the slide: sentiment
  • 12. If It Walks/Swims/Quacks Like a Duck, Then It Must Be a Duck. Features: walks, swims, quacks.
  • 13. Building and Deploying a Classifier Model (image reference: O'Reilly, Learning Spark)
      – Training data, spam: "free money now!", "get this money", "free savings $$$"
      – Training data, non-spam: "how are you?", "that Spark job", "lunch plans"
  • 14. Building and Deploying a Classifier Model: featurization turns the training data into feature vectors.
  • 15. Building and Deploying a Classifier Model: training on the feature vectors produces a model.
  • 16. Building and Deploying a Classifier Model: model evaluation selects the best model.
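The featurization step in the classifier slides turns raw text into numeric feature vectors. A minimal sketch of one way to do that, assuming a hashed bag-of-words scheme and using plain Scala collections rather than Spark: the names `HashingFeaturizer`, `featurize`, and `numFeatures` are hypothetical and do not come from the slides, and this is not MLlib's own featurizer.

```scala
object HashingFeaturizer {
  // Map each lowercase token to a bucket in [0, numFeatures) and count occurrences.
  // Distinct tokens may collide in the same bucket; that is the usual trade-off of hashing.
  def featurize(text: String, numFeatures: Int): Array[Double] = {
    val vec = Array.fill(numFeatures)(0.0)
    text.toLowerCase.split("\\s+").filter(_.nonEmpty).foreach { token =>
      val bucket = Math.floorMod(token.hashCode, numFeatures)
      vec(bucket) += 1.0
    }
    vec
  }
}
```

For instance, `HashingFeaturizer.featurize("free money now!", 16)` yields a 16-element vector whose entries sum to 3.0, one count per token.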
  • 17. Machine Learning: Clustering
  • 18. Clustering: Definition
      – Unsupervised learning task
      – Groups objects into clusters of high similarity
  • 19. Clustering: Definition (continued). Example uses:
      – Search results grouping
      – Grouping of customers
      – Anomaly detection
      – Text categorization
  • 20. Clustering: Example. Group similar objects using the MLlib K-means algorithm:
      1. Initialize the cluster centers (centroids)
      2. Assign all points to the nearest centroid
      3. Update each centroid to the center of its assigned points
      4. Repeat until the stopping conditions are met
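The four K-means steps on slide 20 can be sketched in plain Scala on 2-D points. This is an illustrative sketch only, not MLlib's implementation: `KMeansSketch`, `nearest`, `run`, and `maxIter` are hypothetical names, the initial centroids are supplied by the caller, and "conditions met" is assumed to mean that the centroids stop moving or an iteration cap is reached.

```scala
object KMeansSketch {
  type Point = (Double, Double)

  // Squared Euclidean distance between two points.
  private def sqDist(a: Point, b: Point): Double = {
    val dx = a._1 - b._1
    val dy = a._2 - b._2
    dx * dx + dy * dy
  }

  // Step 2: index of the nearest centroid for a point.
  def nearest(p: Point, centroids: Seq[Point]): Int =
    centroids.indices.minBy(i => sqDist(p, centroids(i)))

  // Steps 2 to 4: assign, update, and repeat until nothing moves or maxIter is hit.
  def run(points: Seq[Point], initial: Seq[Point], maxIter: Int): Seq[Point] = {
    var centroids = initial
    var changed = true
    var iter = 0
    while (changed && iter < maxIter) {
      val groups = points.groupBy(p => nearest(p, centroids))
      val updated = centroids.indices.map { i =>
        groups.get(i) match {
          case Some(ps) => (ps.map(_._1).sum / ps.size, ps.map(_._2).sum / ps.size)
          case None     => centroids(i) // keep a centroid that attracted no points
        }
      }
      changed = updated != centroids
      centroids = updated
      iter += 1
    }
    centroids
  }
}
```

Starting from centroids (1, 1) and (9, 1), the four points (0, 0), (0, 2), (10, 0), (10, 2) settle on centroids (0, 1) and (10, 1) after one update.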
  • 21. Three Categories of Techniques for Machine Learning: classification, clustering, and collaborative filtering (recommendation)
  • 22. Collaborative Filtering with Spark
      – Recommends items (filtering)
      – Based on user preference data (collaborative)
      – User-item rating matrix (users Ted, Carol, Bob; items A, B, C): Ted rated A = 4, B = 5, C = 5; Carol rated B = 5, C = 5; Bob rated B = 5, and his rating for C is unknown (?)
  • 23. Train a Model to Make Predictions
      – Training data: Ted and Carol like movies B and C
      – New data: Bob likes movie B; what might he like?
      – Prediction: Bob likes movie B, so predict C
      – Flow: training data → algorithm → model; new data → model → predictions
  • 24. Predict Flight Delays
  • 25. Use Case: Flight Data
      – Predict if a flight is going to be delayed
      – Use a decision tree for prediction
      – Decision trees are used for classification and regression
      – A tree is represented with nodes, with a binary decision at each node
  • 26. Flight Data
  • 27. Parse Input
      // Define the schema
      case class Flight(dofM: String, dofW: String, carrier: String, tailnum: String,
        flnum: Int, org_id: String, origin: String, dest_id: String, dest: String,
        crsdeptime: Double, deptime: Double, depdelaymins: Double, crsarrtime: Double,
        arrtime: Double, arrdelay: Double, crselapsedtime: Double, dist: Int)

      // Parse one comma-separated line into a Flight
      def parseFlight(str: String): Flight = {
        val line = str.split(",")
        Flight(line(0), line(1), line(2), line(3), line(4).toInt, line(5), line(6),
          line(7), line(8), line(9).toDouble, line(10).toDouble, line(11).toDouble,
          line(12).toDouble, line(13).toDouble, line(14).toDouble, line(15).toDouble,
          line(16).toInt)
      }

      // load file into an RDD (sc is the SparkContext, as provided by spark-shell)
      val rdd = sc.textFile("flights.csv")
      // create an RDD of Flight objects
      val flightsRDD = rdd.map(parseFlight).cache()
      // Array(Flight(1,3,AA,N338AA,1,12478,JFK,12892,LAX,900.0,914.0,14.0,1225.0,1238.0,13.0,385.0,2475))
  • 28. Building and Deploying a Classifier Model: featurization turns the flight training data into feature vectors.
      – Delayed: Friday, LAX, AA
      – Not delayed: Wednesday, BNA, Delta
  • 29. Classification Learning Problem - Features
      – Label: delayed or not delayed; a flight is delayed if delay > 40 minutes
      – Features: {day_of_month, weekday, crsdeptime, crsarrtime, carrier, crselapsedtime, origin, dest}
  • 30. Transform non-numeric features into numeric values
      // create map of airline -> number
      var carrierMap: Map[String, Int] = Map()
      var index: Int = 0
      flightsRDD.map(flight => flight.carrier).distinct.collect.foreach(
        x => { carrierMap += (x -> index); index += 1 })
      carrierMap.toString
      // String = Map(DL -> 5, US -> 9, AA -> 6, UA -> 4...)

      // create map of origin airport -> number, built the same way;
      // the slide omits this step, but originMap is used on later slides
      var originMap: Map[String, Int] = Map()
      var index1: Int = 0
      flightsRDD.map(flight => flight.origin).distinct.collect.foreach(
        x => { originMap += (x -> index1); index1 += 1 })

      // create map of destination airport -> number
      var destMap: Map[String, Int] = Map()
      var index2: Int = 0
      flightsRDD.map(flight => flight.dest).distinct.collect.foreach(
        x => { destMap += (x -> index2); index2 += 1 })
      destMap.toString
      // Map(JFK -> 214, LAX -> 294, ATL -> 273, MIA -> 175 ...
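The category maps on slide 30 are built by mutating a `var` inside `foreach`. A possible alternative, sketched here on a plain Scala `Seq`, builds the same kind of map immutably with `zipWithIndex`. The name `CategoryIndex` is hypothetical, and sorting the values first is an added assumption that makes the codes reproducible between runs; the resulting codes will therefore differ from the ones printed on the slide.

```scala
object CategoryIndex {
  // Assign each distinct value an integer code, in sorted order.
  def build(values: Seq[String]): Map[String, Int] =
    values.distinct.sorted.zipWithIndex.toMap
}
```

With the RDD from the slides, the same idea would read roughly `flightsRDD.map(_.carrier).distinct.collect.sorted.zipWithIndex.toMap`, since `collect` returns a local array.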
  • 31. Classification Learning Problem - Features
      – Label: delayed or not delayed; a flight is delayed if delay > 40 minutes
      – Features: {day_of_month, weekday, crsdeptime, crsarrtime, carrier, crselapsedtime, origin, dest}
      – MLlib data types: Vector contains the feature data points; LabeledPoint contains the feature vector and the label
  • 32. Define the features, Create LabeledPoint with Vector
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.regression.LabeledPoint

      // Defining the features array: label first, then the eight features
      val mlprep = flightsRDD.map(flight => {
        val monthday = flight.dofM.toInt - 1 // category
        val weekday = flight.dofW.toInt - 1 // category
        val crsdeptime1 = flight.crsdeptime.toInt
        val crsarrtime1 = flight.crsarrtime.toInt
        val carrier1 = carrierMap(flight.carrier) // category
        val crselapsedtime1 = flight.crselapsedtime.toDouble
        val origin1 = originMap(flight.origin) // category
        val dest1 = destMap(flight.dest) // category
        val delayed = if (flight.depdelaymins.toDouble > 40) 1.0 else 0.0
        Array(delayed.toDouble, monthday.toDouble, weekday.toDouble, crsdeptime1.toDouble,
          crsarrtime1.toDouble, carrier1.toDouble, crselapsedtime1.toDouble,
          origin1.toDouble, dest1.toDouble)
      })
      mlprep.take(1)
      // Array(Array(0.0, 0.0, 2.0, 900.0, 1225.0, 6.0, 385.0, 214.0, 294.0))

      // wrap each row as LabeledPoint(label, feature vector)
      val mldata = mlprep.map(x => LabeledPoint(x(0), Vectors.dense(x(1), x(2), x(3), x(4),
        x(5), x(6), x(7), x(8))))
      mldata.take(1)
      // Array[LabeledPoint] = Array((0.0,[0.0,2.0,900.0,1225.0,6.0,385.0,214.0,294.0]))
  • 33. ML Cross-Validation Process: the data is split into a training set and a test set; the training set feeds model training/building, the test set is run through the model to produce predictions, and the train/test loop repeats.
  • 34. ML Cross-Validation Process
      – Train the algorithm with the training dataset
      – Use the test dataset on the trained algorithm
  • 36. Build Model. Split the data into:
      – Training data RDD (80%)
      – Test data RDD (20%)
  • 37. Split Data
      // Randomly split RDD into training data RDD (80%) and test data RDD (20%)
      val splits = mldata.randomSplit(Array(0.8, 0.2))
      // named trainingData and testData to match their use on the following slides
      val trainingData = splits(0).cache()
      val testData = splits(1).cache()
      testData.take(1)
      // Array[LabeledPoint] = Array((0.0,[18.0,6.0,900.0,1225.0,6.0,385.0,214.0,294.0]))
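The 80/20 split on slide 37 is random, so the two sets differ from run to run unless a seed is fixed. A minimal sketch of the idea on a plain Scala `Seq`, assuming each element is sent independently to the training side with probability `trainFraction`: `SplitSketch`, `split`, `trainFraction`, and `seed` are hypothetical names, and this is not how Spark implements `randomSplit` internally.

```scala
import scala.util.Random

object SplitSketch {
  // Send each element to the first (training) sequence with probability trainFraction,
  // otherwise to the second (test) sequence. A fixed seed makes the split repeatable.
  def split[A](data: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
    val rng = new Random(seed)
    data.partition(_ => rng.nextDouble() < trainFraction)
  }
}
```

The set sizes are only approximately 80% and 20%, which is also true of a probabilistic split on an RDD. Spark's `randomSplit` takes an optional seed as its second argument, which serves the same purpose of making the split repeatable.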
  • 38. Build Model: use the training set, with labels, to build a model.
  • 39. Use Case: Flight Data
      – Predict if a flight is going to be delayed
      – Use a decision tree for prediction
      – Decision trees are used for classification and regression
      – A tree is represented with nodes
      – Binary decision at each node
  • 40. Build Model
      import org.apache.spark.mllib.tree.DecisionTree

      // set ranges for categorical features (feature index -> number of categories)
      var categoricalFeaturesInfo = Map[Int, Int]()
      categoricalFeaturesInfo += (0 -> 31) // dofM: 31 categories
      categoricalFeaturesInfo += (1 -> 7) // dofW: 7 categories
      categoricalFeaturesInfo += (4 -> carrierMap.size) // number of carriers
      categoricalFeaturesInfo += (6 -> originMap.size) // number of origin airports
      categoricalFeaturesInfo += (7 -> destMap.size) // number of dest airports
      val numClasses = 2
      val impurity = "gini"
      val maxDepth = 9
      val maxBins = 7000
      // call DecisionTree.trainClassifier with trainingData, which returns the model
      val model = DecisionTree.trainClassifier(trainingData, numClasses,
        categoricalFeaturesInfo, impurity, maxDepth, maxBins)
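Slide 40 sets the impurity measure to "gini" without saying what it computes. Gini impurity of a set of labels is one minus the sum of the squared class fractions; a candidate split is better when it leaves purer child nodes. The sketch below computes it in plain Scala. `GiniSketch` and `gini` are hypothetical names, and this illustrates the standard formula rather than MLlib's internal code.

```scala
object GiniSketch {
  // Gini impurity: 1 - sum over classes of p_k^2,
  // where p_k is the fraction of labels that belong to class k.
  def gini(labels: Seq[Double]): Double =
    if (labels.isEmpty) 0.0
    else {
      val n = labels.size.toDouble
      val sumSq = labels.groupBy(identity).values.map { g =>
        val p = g.size / n
        p * p
      }.sum
      1.0 - sumSq
    }
}
```

A node holding only one class has impurity 0.0, and a node split evenly between the two classes (delayed and not delayed) has impurity 0.5, the maximum for two classes.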
  • 41. Build Model
      // print out the decision tree
      model.toDebugString
      // feature indices: 0=dofM 4=carrier 3=crsarrtime1 6=origin
      // res20: String = DecisionTreeModel classifier of depth 9 with 919 nodes
      //   If (feature 0 in {11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0,20.0,21.0,22.0,23.0,24.0,25.0,26.0,27.0,30.0})
      //    If (feature 4 in {0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,13.0})
      //     If (feature 3 <= 1603.0)
      //      If (feature 0 in {11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0})
      //       If (feature 6 in {0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,10.0,11.0,12.0,13.0...
  • 42. Get Predictions: the model takes test data without labels and predicts delay or not.
  • 43. Get Predictions
      // Get predictions: create an RDD of (test label, test prediction)
      val labelAndPreds = testData.map { point =>
        val prediction = model.predict(point.features)
        (point.label, prediction)
      }
      labelAndPreds.take(1)
      // (label, prediction)
      // Array((0.0,0.0))
  • 44. To Learn More:
      – Download example code: https://github.com/caroljmcdonald/sparkmldecisiontree
      – Read explanation of example code: https://www.mapr.com/blog/apache-spark-machine-learning-tutorial
      – Engage with us: https://www.mapr.com/blog/author/carol-mcdonald and https://community.mapr.com
  • 45. Test Model
      // get instances where label != prediction
      val wrongPrediction = labelAndPreds.filter { case (label, prediction) => label != prediction }
      val wrong = wrongPrediction.count()
      // res35: Long = 11040
      val ratioWrong = wrong.toDouble / testData.count()
      // ratioWrong: Double = 0.3157443157443157
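The error ratio on slide 45 counts every wrong prediction the same way. A slightly richer view, sketched below in plain Scala over a sequence of (label, prediction) pairs, separates false alarms from missed delays. `EvalSketch` and `Counts` are hypothetical names, and treating 1.0 as "delayed" is an assumption carried over from the label definition on the earlier slides.

```scala
object EvalSketch {
  // Confusion-matrix tallies for a binary classifier.
  final case class Counts(tp: Long, fp: Long, tn: Long, fn: Long) {
    def total: Long = tp + fp + tn + fn
    def errorRate: Double = (fp + fn).toDouble / total
    def accuracy: Double = (tp + tn).toDouble / total
  }

  // Tally (label, prediction) pairs, treating 1.0 as "delayed" and 0.0 as "not delayed".
  // Pairs with any other values are ignored.
  def counts(pairs: Seq[(Double, Double)]): Counts =
    pairs.foldLeft(Counts(0, 0, 0, 0)) {
      case (c, (1.0, 1.0)) => c.copy(tp = c.tp + 1) // delayed, predicted delayed
      case (c, (0.0, 1.0)) => c.copy(fp = c.fp + 1) // on time, predicted delayed
      case (c, (0.0, 0.0)) => c.copy(tn = c.tn + 1) // on time, predicted on time
      case (c, (1.0, 0.0)) => c.copy(fn = c.fn + 1) // delayed, predicted on time
      case (c, _)          => c
    }
}
```

Here `errorRate` corresponds to the slide's `ratioWrong`, while the separate `fp` and `fn` counts show which kind of mistake dominates.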