Presented: J. White Bear
An Online Spark Pipeline:
Semi-Supervised Learning and Online
Retraining with Spark Streaming.
IBM
Spark Technology Center
• Founded in 2015.
• Location:
– Physical: 505 Howard St., San Francisco CA
– Web: http://spark.tc Twitter: @apachespark_tc
• Mission:
– Contribute intellectual and technical capital to the Apache
Spark community.
– Make the core technology enterprise- and cloud-ready.
– Build data science skills to drive intelligence into business
applications — http://bigdatauniversity.com
• Key statistics:
– About 50 developers, co-located with 25 IBM designers.
– Major contributions to Apache Spark http://jiras.spark.tc
– Apache SystemML is now a top-level Apache project!
– Founding member of UC Berkeley AMPLab and RISE
Lab
– Member of R Consortium and Scala Center
Spark Technology Center
About Me
Education
• University of Michigan - Computer Science
• Databases, Machine Learning/Computational Biology, Cryptography
• University of California San Francisco / University of California Berkeley
• Multi-objective Optimization/Computational Biology/Bioinformatics
• McGill University
• Machine Learning/Multi-objective Optimization for Path Planning/Cryptography
Industry
• IBM
• Amazon
• TeraGrid
• Pfizer
• Research at UC Berkeley, Purdue University, and every university I ever attended. :)
Fun Facts (?)
I love research for its own sake. I like robots, helping to cure diseases, advocating for social change and reform, and breaking encryptions. Also, most activities involving the ocean. And I usually hate taking pictures. :)
Why do we need online semi-supervised learning?
Malware/Fraud Detection
Stock Prediction
Real-time Diagnosis
NLP/Speech Recognition
Real-time Visual Processing
Why online learning?
• Incremental/Sequential learning for real-time use
cases
• Predicts/learns from the newest data
• Optimized for low latency in real-time cases
• Often used in conjunction with streaming data
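The incremental update behind these bullets can be sketched in plain Python (an illustrative toy, not the framework's actual Spark code; the linear model, squared loss, learning rate, and L2 penalty are assumptions):

```python
import random

def sgd_step(w, batch, lr=0.5, lam=0.001):
    """One incremental update: average squared-loss gradient over a
    mini-batch, plus a small L2 penalty."""
    grad = [0.0] * len(w)
    for x, y in batch:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj
    m = len(batch)
    return [wi - lr * (g / m + lam * wi) for wi, g in zip(w, grad)]

# Simulated stream: the model learns y = 2x from successive mini-batches,
# always predicting with the newest weights.
random.seed(0)
w = [0.0]
for _ in range(200):
    batch = [([x], 2.0 * x) for x in (random.random() for _ in range(8))]
    w = sgd_step(w, batch)
```

Each mini-batch is seen once and then discarded, which is what keeps latency low relative to retraining over the full history.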
Semi-supervised learning
• Smaller training sets, less labeled data
• Classifying or predicting unlabeled data
• The underlying distribution P(x,y) may or may
not be known/ stable (PAC learning)
• How/Why do we bring these ideas together?
Key Challenges
• Acquiring a sufficient ratio of labeled training data
– Batch learning is much slower, and updating a model is more
difficult
• Maintaining accuracy
– Concept drift
– Catastrophic events
• Meeting latency requirements particularly in real-
time scenarios, like autonomous vehicles
• What about unlabeled data?....
The Framework: Hybrid Online and Semi-Supervised learning
Real-Time Streaming Data → Real-Time Consumption → Feature Extraction (Feature Hashing / Hashing Trick; Matrix Factorization)
• Out-of-Core Component (background, out-of-core processing): RDD of all data (not in cache)
• AllReduce Component (background, in-memory processing): mini-batches of RDDs distributed to nodes; AllReduce analysis of mini-batch data using distributed node averaging over a spanning-tree protocol (simulated in the POC); can trigger model updates; initiates the semi-supervised state; mini-batches can follow a sequential or a subsampling scheme
• Model/Retraining Cache (Model Retraining Policy): the global cache stores best performers/metrics from out-of-core; the local cache stores best performers/metrics from AllReduce
A Reliable Effective Terascale Linear Learning System
Alekh Agarwal, Olivier Chapelle, Miroslav Dudik, John Langford
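The hashing trick named in the feature-extraction stage can be sketched as follows (a minimal Python illustration using a stable hash; the signed-hash variant shown is one common way to reduce collision bias, not necessarily what the POC uses):

```python
import hashlib

def hash_features(tokens, dim=16):
    """Hashing trick: project arbitrarily many token features into a
    fixed-size vector; one extra hash bit supplies a +/-1 sign to
    reduce collision bias."""
    vec = [0.0] * dim
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        idx = h % dim                         # bucket index
        sign = 1.0 if (h >> 64) & 1 == 0 else -1.0
        vec[idx] += sign
    return vec
```

The vector length is fixed regardless of vocabulary size, which is what makes this stage friendly to unbounded streams.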
Why this framework?
• Abundance of unlabeled data requires semi-supervised learning
– Real-time learning in the IoT setting
• Semi-Supervised learning can improve predictions
– Empirically studied for online use case with Big Data
• We need to address the challenges of incremental sequential
learning without losing valuable historical data
• We need to optimize for low latency without overly sacrificing
accuracy
• You may have an online use case, but you are not guaranteed labeled
data, and certainly not in a timely fashion
• Hybrid frameworks address these challenges and allow for future
data mining to understand and correct for how your data changes
over time
Online and Semi-supervised Learning: Online Constraint
Supervised learning over a batch:

$$\min_{w \in \mathbb{R}^d} \; \sum_{i=1}^{n} \ell(w^\top x_i;\, y_i) + \lambda R(w) \quad (1)$$

In averaged form:

$$\frac{1}{n} \sum_{i=1}^{n} \ell(w^\top x_i;\, y_i) + \frac{\lambda}{n} R(w) \quad (2)$$

Online learning over a mini-batch $S_k$ of size $m$ drawn from a streaming data set:

$$\frac{1}{m} \sum_{i \in S_k} \ell(w^\top x_i;\, y_i) + \frac{\lambda}{n} R(w) \quad (3)$$

Distributed learning over a mini-batch from a streaming data set applies the same objective, computed across nodes.
Online and Semi-supervised Learning: Semi-supervised Constraint
$$\frac{1}{m} \sum_{i \in S_k} \ell(w^\top x_i;\, y_i) + \frac{\lambda}{n} R(w) \quad (3)$$

Online learning over a mini-batch from a streaming data set.

Let $x_p$ be an example, $x_p = (x_{p1}, x_{p2}, \ldots, x_{pD}, \omega)$, where $x_p$ belongs to class $\omega$ in a $D$-dimensional space; $x_{pi}$ is the value of the $i$-th feature of the $p$-th sample.

Assume a labeled set $L$ with $n$ instances of $x$, with $\omega$ known, and an unlabeled set $U$ with $m$ instances of $x$, with $\omega$ unknown. Assume the number of labeled instances, $|L|$, is less than the number of unlabeled instances, $|U|$.

$L \cup U$ represents the training set, $T$. We want to infer a hypothesis using $T$ and use this hypothesis to predict labels we have not yet seen. This is the semi-supervised learning case.
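For illustration, the mini-batch objective of Eq. (3) can be evaluated directly (a toy sketch assuming squared loss and an L2 regularizer $R(w) = \lVert w \rVert^2$; the function name is illustrative):

```python
def minibatch_objective(w, batch, lam, n_total):
    """Eq. (3) with squared loss and R(w) = ||w||^2:
    (1/m) * sum_{i in S_k} (w.x_i - y_i)^2 + (lam/n) * ||w||^2."""
    m = len(batch)
    loss = sum(
        (sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2 for x, y in batch
    ) / m
    reg = (lam / n_total) * sum(wi * wi for wi in w)
    return loss + reg
```

Note that only the loss term depends on the mini-batch $S_k$; the regularizer is still scaled by the full-stream count $n$.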
Online and Semi-supervised Learning: Semi-supervised Constraint
(Diagram) Labeled data and unlabeled data together form the initial training set; online/streaming labeled and unlabeled data then feed model retraining, which splits incoming instances back into labeled and unlabeled data.
The Framework: Batch Component
(Diagram: the pipeline above, with the batch Out-of-Core Component highlighted)
• Batch Out-of-Core Component (background, out-of-core processing): RDD of all data (not in cache), processed by learning modules M1…MN
• Customization & Policy Options: Retraining Policy, Amending Policy, Committee Policy
The Framework: Batch Component
• Features
– Out-of-Core: can run as a background process
– Ensemble Learning: multiple-model/co-training
learning algorithms
– Amending Policy
– Retrain Policy
– Active Learning Policy
– Custom Parameters
The Framework: Batch Component
Semi-Supervised Learning
Edited by Olivier Chapelle, Bernhard Schölkopf and Alexander Zien
Learning from labeled and unlabeled data
Tom Mitchell, CMU
The Framework: Batch Component
• Multiple Model Learning/ Co-training
– Expandable learning modules: multiple learning
algorithms
– Co-training for high-dimensional data, improved
confidence
– Accounts for different biases across learning
techniques to strengthen the hypothesis
– Improves confidence estimates in practice
– Yields better results when there is a significant
diversity among the models.
– Reduces variance and helps to avoid overfitting
The Framework: Batch Component
Given a model, $m \in M$:

$$y_m(x) = h(x) + \epsilon_m(x) \quad (4)$$

We can get the average error over all models, $E_{\text{average}}$:

$$E_{\text{average}} = \frac{1}{M} \sum_{m=1}^{M} \mathbb{E}_x\!\left[\epsilon_m(x)^2\right] \quad (5)$$

Regression: model averaging of the sum of squares / prediction confidence.

Multiple Model-Based Reinforcement Learning. Kenji Doya, Kazuyuki Samejima, Ken-ichi Katagiri and Mitsuo Kawato.
Vainsencher, Daniel, Shie Mannor, and Huan Xu. "Learning multiple models via regularized weighting." Advances in Neural Information Processing Systems. 2013.
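Eq. (5) can be checked numerically with a toy committee (the models, sample points, and target below are arbitrary illustrations, not framework code):

```python
def average_model_error(models, samples, target):
    """Eq. (5): mean over the M models of the mean squared deviation
    eps_m(x) = y_m(x) - h(x) across the sample points."""
    per_model = [
        sum((m(x) - target(x)) ** 2 for x in samples) / len(samples)
        for m in models
    ]
    return sum(per_model) / len(models)
```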
The Framework: Batch Component
• Amending Policy
– Instances are added sequentially and maintain order
– Only the most ‘confident’ predictions are added
– If validation is received, inaccurate predictions are
removed and retrained (within a threshold)
– Low confidence, unlabeled data are stored for future
validation or rescoring
Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of the
eleventh annual conference on Computational learning theory (COLT' 98). ACM, New York, NY, USA, 92-100.
DOI=http://dx.doi.org/10.1145/279943.279962
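A minimal sketch of the amending step described above (the function name, tuple layout, and confidence threshold are illustrative assumptions, not the framework's API):

```python
def amend_training_set(training, predictions, conf_threshold=0.9):
    """Append only the most confident predictions, preserving arrival
    order; park low-confidence instances for later validation or
    rescoring."""
    low_confidence = []
    for x, label, confidence in predictions:
        if confidence >= conf_threshold:
            training.append((x, label))
        else:
            low_confidence.append((x, label, confidence))
    return training, low_confidence
```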
The Framework: Batch Component
(Diagram) The training batch holds labeled data alongside unlabeled low-confidence data; the amending policy governs what moves between them.
Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of the
eleventh annual conference on Computational learning theory (COLT' 98). ACM, New York, NY, USA, 92-100.
DOI=http://dx.doi.org/10.1145/279943.279962
The Framework: Batch Component
• Retrain Policy
– Loss Minimization
• Increases in expected loss above threshold
– Accuracy
• Ratio/Number of incorrect predictions
– Schedule
• Regular retraining schedule
– Optimizations
• Limited to a window over the given data set
– Train over a specific time period to optimize for catastrophic events
and concept drift
• Subsampling
– Very large datasets can set a subsampling methodology to improve
batch processing speed
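The loss, accuracy, and schedule triggers above can be combined in a simple predicate (an illustrative sketch; the parameter names and threshold semantics are assumptions):

```python
def should_retrain(expected_loss, loss_threshold,
                   errors, total, accuracy_threshold,
                   batches_since_retrain, schedule_every):
    """Fire retraining on any trigger: expected loss above its
    threshold, accuracy below its threshold, or the regular
    retraining schedule coming due."""
    if expected_loss > loss_threshold:
        return True
    if total and (total - errors) / total < accuracy_threshold:
        return True
    return batches_since_retrain >= schedule_every
```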
The Framework: Batch Component
• What happens to low confidence predictions?
• Active Learning Policy
– Separate dataset holds these instances
– Human labeling and/or an oracle is queried for accurate labels
• Google has over 10,000 contractors performing this function
– https://arstechnica.com/features/2017/04/the-secret-lives-of-google-raters/
• Facebook is hiring 3000 new content monitors for a job AI cannot do
– http://www.popsci.com/Facebook-hiring-3000-content-monitors
• Netflix/Amazon are bringing in humans to improve the CTR
recommendations
– http://blog.echen.me/2014/10/07/moving-beyond-ctr-better-
recommendations-through-human-evaluation/
The Framework: Batch Component
• Custom Parameters:
– Window Size
• Specify a time or range
– Amending times
• Schedule when amending policy runs
• Default is automated at accuracy rates
– Loss threshold
• Specify maximum loss for a given loss function
– Accuracy threshold
• Specify the precision/recall rates before retraining is called on
batch
– Subsampling
• Run a random subsample to train
The Framework: All-Reduce Component
A Reliable Effective Terascale Linear Learning System
Alekh Agarwal, Olivier Chapelle, Miroslav Dudik, John Langford
(Diagram: the pipeline, with the AllReduce Component highlighted: in-memory mini-batch analysis via distributed node averaging over a spanning tree, simulated in the POC)
The Framework: All-Reduce Component
A Reliable Effective Terascale Linear Learning System
Alekh Agarwal, Olivier Chapelle, Miroslav Dudik, John Langford
• SGD / L-BFGS / regression
• In-memory
• Minimal network latency
• Mini-batches of n/m instances for
each of the m nodes
• Head node holds an
average of the weight
parameters
• Can be added to existing
implementations to
optimize for online
training
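The averaging step itself reduces to an element-wise mean of per-node weight vectors (a toy sketch that ignores the spanning-tree topology and the network layer):

```python
def allreduce_average(node_weights):
    """Element-wise average of the weight vectors held by each node,
    as if reduced up and broadcast back down the spanning tree."""
    n = len(node_weights)
    return [sum(column) / n for column in zip(*node_weights)]
```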
The Framework: Cache Component
Bringing it all together…
(Diagram: the full pipeline, with the Model/Retraining Cache highlighted)
The Framework: Cache Component
• Model Weights & Loss
• Custom Parameters
• When the retraining policy is triggered in the
monitor
– A new best performing weight vector is selected from
the cache
– The running model weights are updated and used in
real-time predictions
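Selecting the best performer from the cache can be sketched as follows (the cache entry layout here is an assumption for illustration):

```python
def best_weights(cache):
    """Return the cached weight vector with the lowest recorded loss,
    to swap in as the running model when the retraining policy fires."""
    return min(cache, key=lambda entry: entry["loss"])["weights"]
```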
The Framework: Monitor Component
(Diagram: the full pipeline, with the Monitor attached at the real-time consumption point)
The Framework: Monitor Component
• Retraining in All-Reduce happens after each online pass
• Option 1: Makes real-time prediction with only local latency when
initiated (Model Retrain Policy)
– very low latency
– stable predictions
– Significant change in weights since last pass
– Loss Minimization
• Increases in expected loss above threshold
– Accuracy
• Option 2: Can opt to use All-Reduce weights after the online pass
– Low latency
– Greater sensitivity
– Periodic queries to cache for best weights (Model Retrain Policy)
Performance: IBM’s Data Science Experience
IBM Data Science Experience is an
environment that brings together everything a
data scientist needs.
It includes the most popular open-source tools
(code in Scala/Python/R/SQL, Jupyter
Notebooks, the RStudio IDE and Shiny apps,
and Apache Spark) together with IBM's unique
value-add functionality and community and
social features, integrated as first-class citizens
to make data scientists more successful.
Performance: IBM's Data Science Experience
Performance: Accuracy Gains
Semi-Supervised Learning
Edited by Olivier Chapelle, Bernhard Schölkopf and Alexander Zien
Learning from labeled and unlabeled data
Tom Mitchell, CMU
Performance: NASDAQ
• NASDAQ Data
– Daily stock market values since 1971; ~50K instances
– Features (excerpt)
• Date, Open, High, Low, Close, Volume, Adj Close
– Fully labeled, y = Close
– k-fold cross-validation
– Regression
Performance (1): Linear Regression
(Figure) Average RMSD vs. percentage of unlabeled data (1-5%): roughly flat at ~789.54. RMSD per k-fold (1-10), comparing Supervised against 10%, 20%, 30%, and 50% Unlabeled: values range from about 788 to 791.5.
Performance (1A): Classification
(Figure) Accuracy per k-fold (1-10), comparing Supervised against 10%, 20%, 30%, and 50% Unlabeled: values range from about 0.795 to 0.83. Average accuracy vs. percentage of unlabeled data (1-5%): roughly 0.816 to 0.817.
Performance (2): Online & Semi-Supervised
• HealthStats provides key health, nutrition, and population
statistics gathered from a variety of international sources
• Global health data since 1960; ~100K instances
– Features (excerpt)
• This dataset includes 345 indicators, such as immunization rates,
malnutrition prevalence, and vitamin A supplementation rates, across
263 countries around the world. Data was collected on a yearly basis
from 1960-2016. Fully labeled.
– k-fold cross-validation
– Classification
Performance (3): Online Semi-Supervised Learning
• IBM Employee Attrition and Performance
– Uncover the factors that lead to employee attrition
– Features (excerpt)
• Education: 1 'Below College', 2 'College', 3 'Bachelor', 4 'Master', 5 'Doctor'
• EnvironmentSatisfaction: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'
• JobInvolvement: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'
• JobSatisfaction: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'
• PerformanceRating: 1 'Low', 2 'Good', 3 'Excellent', 4 'Outstanding'
• RelationshipSatisfaction: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'
• WorkLifeBalance: 1 'Bad', 2 'Good', 3 'Better', 4 'Best'
• Fully labeled
– k-fold cross-validation
– Classification
Future Work
• More complex real-time data sets
• Real-Time Streaming and Semi-
Supervised Learning for Autonomous
Vehicles
• IoT and the Autonomous Vehicle in the
Clouds: Simultaneous Localization and
Mapping (SLAM) with Kafka and Spark
Streaming (Spark Summit East 2017)
• Full Framework!!!
Future Work
• Improving Batch retraining policy to incorporate
more information about the distributions of data
• Investigating model switching vs retraining
• Adding boosting mechanisms in batch
• Adding feature extraction for high dimensionality
• Expansion of Spark Streaming ML algorithms
Further Reading
A Reliable Effective Terascale Linear Learning System
Alekh Agarwal, Olivier Chapelle, Miroslav Dudik, John Langford
Semi-Supervised Learning
Edited by Olivier Chapelle, Bernhard Schölkopf and Alexander Zien
Learning from labeled and unlabeled data
Tom Mitchell, CMU
Multiple Model-Based Reinforcement Learning
Kenji Doya, Kazuyuki Samejima, Ken-ichi Katagiri and Mitsuo Kawato
Vainsencher, Daniel, Shie Mannor, and Huan Xu. "Learning multiple models via regularized weighting."
Advances in Neural Information Processing Systems. 2013.
Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of the
eleventh annual conference on Computational learning theory (COLT '98). ACM, New York, NY, USA, 92-100.
DOI=http://dx.doi.org/10.1145/279943.279962
The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Springer
Series in Statistics). Trevor Hastie, Robert Tibshirani, and Jerome Friedman.
Thank You.
J. WhiteBear (jwhiteb@us.ibm.com)
IBM SparkTechnology Center
505 Howard St San Francisco, CA
Machine Learning for automated diagnosis of distributed ...AEMachine Learning for automated diagnosis of distributed ...AE
Machine Learning for automated diagnosis of distributed ...AE
 
Machine_Learning_Co__
Machine_Learning_Co__Machine_Learning_Co__
Machine_Learning_Co__
 
Making Netflix Machine Learning Algorithms Reliable
Making Netflix Machine Learning Algorithms ReliableMaking Netflix Machine Learning Algorithms Reliable
Making Netflix Machine Learning Algorithms Reliable
 
Lecture_1_Intro.pdf
Lecture_1_Intro.pdfLecture_1_Intro.pdf
Lecture_1_Intro.pdf
 
BsidesLVPresso2016_JZeditsv6
BsidesLVPresso2016_JZeditsv6BsidesLVPresso2016_JZeditsv6
BsidesLVPresso2016_JZeditsv6
 

Mehr von Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionDatabricks
 

Mehr von Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 

Kürzlich hochgeladen

Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsJoseMangaJr1
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...amitlee9823
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...amitlee9823
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 

Kürzlich hochgeladen (20)

Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 

An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining with Spark Streaming with J White Bear

  • 6. Semi-supervised learning
    • Smaller training sets, less labeled data
    • Classifying or predicting unlabeled data
    • The underlying distribution P(x, y) may or may not be known or stable (PAC learning)
    • How and why do we bring these ideas together?
  • 7. Key Challenges
    • Acquiring a sufficient ratio of training data
      – Batch learning is much slower, and a batch model is more difficult to update
    • Maintaining accuracy
      – Concept drift
      – Catastrophic events
    • Meeting latency requirements, particularly in real-time scenarios like autonomous vehicles
    • What about unlabeled data?
  • 8. The Framework: Hybrid Online and Semi-Supervised Learning
    • Real-time streaming data feeds real-time consumption and feature extraction: feature hashing (the hashing trick) and matrix factorization
    • Out-of-core component: an RDD of all data (not in cache), processed out of core in the background
    • AllReduce component: analyzes mini-batch data using distributed node averaging; can trigger model updates; initiates the semi-supervised state; mini-batches (RDDs distributed to nodes) can be sequential or follow a subsampling scheme; background in-memory processing with a spanning-tree protocol for AllReduce (simulated in the POC)
    • Model retraining policy and model/retraining cache: a global cache stores best performers' metrics from out-of-core processing; a local cache stores best performers' metrics from AllReduce
    (A Reliable Effective Terascale Linear Learning System. Alekh Agarwal, Olivier Chapelle, Miroslav Dudík, John Langford)
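The feature-extraction stage above uses the hashing trick, which maps raw tokens into a fixed-size vector without storing a vocabulary. A minimal, framework-independent sketch (the pipeline itself would likely use Spark MLlib's HashingTF; the signed-bucket variant here is one common collision-bias reduction, not necessarily what the talk's POC does):

```python
import hashlib

def hash_features(tokens, dim=16):
    """Map raw tokens into a fixed-size vector via the hashing trick.

    No vocabulary is stored: each token's hash selects a bucket, and a
    second bit of the hash selects a sign to reduce collision bias.
    """
    vec = [0.0] * dim
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        idx = h % dim                      # bucket index
        sign = 1.0 if (h >> 1) % 2 == 0 else -1.0
        vec[idx] += sign
    return vec

# The output dimensionality is fixed regardless of vocabulary size,
# which is what makes the trick suitable for unbounded streams.
v = hash_features(["spark", "streaming", "spark"])
```

Because hashing is stateless and deterministic, the same function can be applied independently on every node consuming the stream, with no shared dictionary to synchronize.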
  • 9. Why this framework?
    • An abundance of unlabeled data requires semi-supervised learning
      – Real-time learning in the IoT setting
    • Semi-supervised learning can improve predictions
      – Empirically studied for the online use case with big data
    • We need to address the challenges of incremental, sequential learning without losing valuable historical data
    • We need to optimize for low latency without overly sacrificing accuracy
    • You may have an online case, but you are not guaranteed labeled data, and definitely not in a timely fashion
    • Hybrid frameworks address these challenges and allow for future data mining to understand and correct for how your data changes over time
  • 10. Online and Semi-supervised Learning: Online Constraint

    min_{w ∈ R^d} Σ_{i=1}^{n} l(wᵀx_i; y_i) + λ R(w)    (1)

    Supervised learning over a batch:
      (1/n) Σ_{i=1}^{n} l(wᵀx_i; y_i) + λ_n R(w)    (2)

    Online learning over a mini-batch from a streaming data set:
      (m/n) Σ_{i ∈ S_k} l(wᵀx_i; y_i) + λ_n R(w)    (3)

    Distributed learning over a mini-batch from a streaming data set.
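The mini-batch objective in Eq. (3) can be evaluated directly: scale the mini-batch loss by m/n and add the regularizer. A small sketch, assuming squared loss and an L2 regularizer R(w) = ||w||²/2 (the slide does not fix a particular l or R, so these are illustrative choices):

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def squared_loss(pred, y):
    return 0.5 * (pred - y) ** 2

def online_objective(w, batch, n, lam):
    """Mini-batch form of the streaming objective (Eq. 3):
    (m/n) * sum_{i in S_k} l(w.x_i; y_i) + lam * R(w),
    with l = squared loss and R(w) = ||w||^2 / 2 assumed here.
    """
    data = sum(squared_loss(dot(w, x), y) for x, y in batch)
    reg = lam * sum(wi * wi for wi in w) / 2.0
    return (len(batch) / n) * data + reg
```

Each incoming mini-batch S_k contributes its m/n share of the data term, so summing the objective over all mini-batches of a pass recovers the batch objective of Eq. (2).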
  • 11. Online and Semi-supervised Learning: Semi-supervised Constraint
    Let x_p be an example, x_p = (x_{p1}, x_{p2}, x_{p3}, ..., x_{pD}, ω), where x_p belongs to class ω in a D-dimensional space; x_{pi} is the value of the i-th feature of the p-th sample.
    Assume a labeled set L with n instances of x, with ω known.
    Assume an unlabeled set U with m instances of x, with ω unknown.
    Assume that the number of labeled instances, |L|, is less than the number of unlabeled instances, |U|.
    L ∪ U represents the training set T. We want to infer a hypothesis using T and use this hypothesis to predict labels we have not yet seen: the semi-supervised learning case.
  • 12. Online and Semi-supervised Learning: Semi-supervised Constraint
    (Diagram: labeled and unlabeled data form the initial training set; online/streaming labeled and unlabeled data flow into model retraining.)
  • 13. The Framework: Batch Component
    (Architecture diagram as on slide 8, highlighting the batch out-of-core component: RDD of all data (not in cache), background out-of-core processing, learning modules M1…MN.)
    • Customization & Policy Options:
      – Retraining Policy
      – Amending Policy
      – Committee Policy
  • 14. The Framework: Batch Component
    • Features
      – Out-of-core; can run as a background process
      – Ensemble learning: multiple-model / co-training learning algorithms
      – Amending policy
      – Retrain policy
      – Active learning policy
      – Custom parameters
  • 15. The Framework: Batch Component
  • 16. The Framework: Batch Component
    (Semi-Supervised Learning, ed. Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien; Learning from Labeled and Unlabeled Data, Tom Mitchell, CMU)
  • 17. The Framework: Batch Component
    • Multiple-model learning / co-training
      – Expandable learning modules: multiple learning algorithms
      – Co-training for high dimensionality and improved confidence
      – Accounts for different biases across learning techniques to strengthen the hypothesis
      – Improves confidence estimates in practice
      – Yields better results when there is significant diversity among the models
      – Reduces variance and helps to avoid overfitting
  • 18. The Framework: Batch Component
    Given a model m ∈ M:
      y_m(x) = h(x) + ε_m(x)    (4)
    We can get the average error over all models, E_average:
      E_average = (1/M) Σ_{m=1}^{M} E_x[ε_m(x)²]    (5)
    Regression: model averaging, sum of squares / prediction confidence.
    (Multiple Model-Based Reinforcement Learning. Kenji Doya, Kazuyuki Samejima, Ken-ichi Katagiri, and Mitsuo Kawato; Vainsencher, Daniel, Shie Mannor, and Huan Xu. "Learning multiple models via regularized weighting." Advances in Neural Information Processing Systems, 2013.)
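Eq. (5) averages each model's expected squared error over the ensemble. A minimal sketch, estimating E_x[ε_m(x)²] empirically from per-model residuals on a held-out sample (the residual lists here are illustrative inputs, not part of the framework's API):

```python
def average_model_error(errors_per_model):
    """E_average = (1/M) * sum_m E_x[eps_m(x)^2]  (Eq. 5),
    where E_x[...] is estimated by the mean squared residual of model m.
    """
    M = len(errors_per_model)
    per_model = [
        sum(e * e for e in errs) / len(errs)  # empirical E_x[eps_m(x)^2]
        for errs in errors_per_model
    ]
    return sum(per_model) / M
```

In the ensemble setting of the slide, a drop in this quantity across cached candidates is one natural signal for which model set to promote.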
  • 19. The Framework: Batch Component
    • Amending Policy
      – Instances are added sequentially and maintain order
      – Only the most 'confident' predictions are added
      – If validation is received, inaccurate predictions are removed and retrained (within a threshold)
      – Low-confidence, unlabeled data are stored for future validation or rescoring
    (Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of the eleventh annual conference on Computational learning theory (COLT '98). ACM, New York, NY, USA, 92–100. DOI=http://dx.doi.org/10.1145/279943.279962)
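The amending policy above can be sketched as a single pass that appends only high-confidence self-labeled instances, in arrival order, and parks the rest for later validation or rescoring. The threshold value and the (x, label, confidence) tuple shape are illustrative assumptions:

```python
def amend(training_set, predictions, threshold=0.9):
    """Apply the amending policy to one batch of self-labeled predictions.

    predictions: iterable of (x, label, confidence) tuples, in arrival order.
    High-confidence instances are appended to training_set in order;
    the rest are returned for future validation or rescoring.
    """
    parked = []
    for x, label, conf in predictions:
        if conf >= threshold:
            training_set.append((x, label))
        else:
            parked.append((x, label, conf))
    return parked
```

Keeping the parked list around is what later feeds the active-learning policy: those are exactly the instances worth sending to a human or an oracle.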
  • 20. The Framework: Batch Component
    (Diagram: the amending policy routes a training batch into labeled data and unlabeled low-confidence data.)
    (Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. COLT '98. ACM, 92–100. DOI=http://dx.doi.org/10.1145/279943.279962)
  • 21. The Framework: Batch Component
    • Retrain Policy
      – Loss minimization: increases in expected loss above a threshold
      – Accuracy: ratio/number of incorrect predictions
      – Schedule: regular retraining schedule
      – Optimizations: limited to a window over the given data set; train over a specific time period to optimize for catastrophic events and concept shifts
      – Subsampling: very large datasets can set a subsampling methodology to improve batch processing speed
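The three triggers above (loss, accuracy, schedule) combine into a simple predicate the monitor can evaluate per batch. The default thresholds and argument names here are illustrative placeholders, not the framework's actual parameters:

```python
def should_retrain(expected_loss, errors, total,
                   batches_since_retrain,
                   loss_threshold=0.1, accuracy_floor=0.9, schedule=100):
    """Return True when any retrain trigger fires:
    expected loss above threshold, accuracy below floor,
    or the regular retraining schedule coming due.
    """
    if expected_loss > loss_threshold:
        return True
    if total > 0 and (1 - errors / total) < accuracy_floor:
        return True
    return batches_since_retrain >= schedule
```

When this predicate fires, the cache component (slide 27) supplies the replacement weights, so the running model is swapped rather than retrained inline on the hot path.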
  • 22. The Framework: Batch Component
    • What happens to low-confidence predictions?
    • Active Learning Policy
      – A separate dataset holds these instances
      – Human labeling and/or an oracle is queried for accurate labels
    • Google has over 10,000 contractors performing this function
      – https://arstechnica.com/features/2017/04/the-secret-lives-of-google-raters/
    • Facebook is hiring 3,000 new content monitors for a job AI cannot do
      – http://www.popsci.com/Facebook-hiring-3000-content-monitors
    • Netflix/Amazon are bringing in humans to improve CTR recommendations
      – http://blog.echen.me/2014/10/07/moving-beyond-ctr-better-recommendations-through-human-evaluation/
  • 23. The Framework: Batch Component
    • Custom Parameters:
      – Window size: specify a time or range
      – Amending times: schedule when the amending policy runs; the default is automated at accuracy rates
      – Loss threshold: specify the maximum loss for a given loss function
      – Accuracy threshold: specify the precision/recall rates before retraining is called on a batch
      – Subsampling: run a random subsample to train
  • 24. The Framework: All-Reduce Component
    (Architecture diagram as on slide 8, highlighting the AllReduce component.)
    (A Reliable Effective Terascale Linear Learning System. Alekh Agarwal, Olivier Chapelle, Miroslav Dudík, John Langford)
• 25. The Framework: All-Reduce Component
A Reliable Effective Terascale Linear Learning System, Alekh Agarwal, Olivier Chapelle, Miroslav Dudik, John Langford
• SGD/LBFGS/Regression
• In-memory
• Minimal network latency
• Mini-batches of n/m for each of the m nodes
• The head node holds an average of the weight parameters
• Can be added to existing implementations to optimize for online training
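The head-node averaging described above reduces each node's locally trained weight vector to an elementwise mean. A toy sketch that simulates the reduce step with plain lists (a real deployment would run this over the spanning-tree AllReduce rather than in one process):

```python
def all_reduce_average(node_weights):
    """Average the weight vectors produced by each node's mini-batch.
    Each of the m nodes trains on roughly n/m instances, then the head
    node holds the elementwise mean of the m weight vectors."""
    m = len(node_weights)
    dim = len(node_weights[0])
    return [sum(w[i] for w in node_weights) / m for i in range(dim)]
```

The averaged vector can then be broadcast back to the nodes as the starting point for the next online pass.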
• 26. The Framework: Cache Component
Bringing it all together…
[architecture diagram repeated from slide 24, now highlighting the model/retraining cache: the global cache stores best performers and metrics from the out-of-core component; the local cache stores best performers and metrics from AllReduce]
• 27. The Framework: Cache Component
• Stores model weights & loss
• Custom parameters
• When the retraining policy is triggered in the monitor:
– A new best-performing weight vector is selected from the cache
– The running model weights are updated and used in real-time predictions
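Selecting the best-performing weight vector from the cache reduces to a minimum-loss lookup. A minimal sketch; the cache layout (model id mapped to a weights/loss pair) is an illustrative assumption:

```python
def best_weights(cache):
    """cache maps model id -> (weights, loss). Return the weight vector
    with the lowest recorded loss, i.e. the cache's best performer, to
    swap into the running real-time model."""
    best_id = min(cache, key=lambda model_id: cache[model_id][1])
    return cache[best_id][0]
```

In the framework this lookup only runs when the retraining policy fires, so the real-time prediction path never blocks on cache maintenance.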
• 28. The Framework: Monitor Component
[architecture diagram repeated from slide 24, now highlighting the monitor, which sits on the real-time consumption path and applies the model retraining policy]
• 29. The Framework: Monitor Component
• Retraining in All-Reduce happens after each online pass
• Option 1: Make real-time predictions with only local latency when initiated (Model Retrain Policy)
– Very low latency
– Stable predictions
– Triggers: significant change in weights since the last pass; loss minimization (increases in expected loss above a threshold); accuracy
• Option 2: Opt to use the All-Reduce weights after the online pass
– Low latency
– Greater sensitivity
– Periodic queries to the cache for the best weights (Model Retrain Policy)
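The "significant change in weights since the last pass" trigger above can be sketched as an L2 distance check between consecutive weight vectors. The function name and threshold are illustrative assumptions:

```python
def weights_changed(old_weights, new_weights, threshold=0.1):
    """Flag a significant change in model weights since the last online
    pass, using L2 distance; the threshold is a hypothetical default."""
    dist = sum((a - b) ** 2 for a, b in zip(old_weights, new_weights)) ** 0.5
    return dist > threshold
```

When this fires, the monitor either reuses the locally cached weights (option 1) or pulls the fresh All-Reduce average (option 2), trading a little latency for sensitivity to the newest data.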
• 30. Performance: IBM's Data Science Experience IBM Data Science Experience is an environment that brings together everything a data scientist needs. It includes the most popular open-source tools (coding in Scala/Python/R/SQL, Jupyter Notebooks, the RStudio IDE and Shiny apps, and Apache Spark) plus IBM's unique value-add functionality, with community and social features integrated as first-class citizens to make data scientists more successful.
  • 31. Performance: IBM’s Data Science Experience
  • 32. Performance: IBM’s Data Science Experience
  • 33. Performance: Accuracy Gains Semi-Supervised Learning Edited by Olivier Chapelle, Bernhard Schölkopf and Alexander Zien Learning from labeled and unlabeled data Tom Mitchell, CMU
• 34. Performance: NASDAQ
• NASDAQ Data
– Daily stock market values since 1971; ~50K instances
– Features (excerpt)
• Date, Open, High, Low, Close, Volume, Adj Close
– Fully labeled, y = Close
– k-fold cross-validation
– Regression
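The deck's regression results are reported as RMSE averaged over k folds. A generic sketch of that evaluation (not the talk's actual code; fold contents are whatever the cross-validation split produced):

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error for one fold's regression predictions."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def mean_rmse(folds):
    """Average RMSE over k cross-validation folds, where each fold is a
    (y_true, y_pred) pair of held-out targets and model predictions."""
    scores = [rmse(y_true, y_pred) for y_true, y_pred in folds]
    return sum(scores) / len(scores)
```

The per-fold scores are what the next slide plots against the fold index, and the mean is what it plots against the unlabeled percentage.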
• 35. Performance (1): Linear Regression
[plots: left, average RMSD vs. percentage of unlabeled data (x-axis ticks 1–5), values near 789.54; right, RMSE per k-fold (folds 1–10), roughly 788–791.5, comparing supervised against 10%, 20%, 30% and 50% unlabeled]
• 36. Performance (1A): Classification
[plots: left, accuracy per k-fold (folds 1–10), roughly 0.795–0.83, comparing supervised against 10%, 20%, 30% and 50% unlabeled; right, average accuracy vs. % unlabeled (x-axis ticks 1–5), values near 0.816–0.817]
• 37. Performance (2): Online & Semi-Supervised
• HealthStats provides key health, nutrition and population statistics gathered from a variety of international sources
• Global health data since 1960; ~100K instances
– Features (excerpt)
• 345 indicators, such as immunization rates, malnutrition prevalence, and vitamin A supplementation rates, across 263 countries around the world; data was collected on a yearly basis from 1960–2016
– Fully labeled
– k-fold cross-validation
– Classification
• 38. Performance (3): Online Semi-Supervised Learning
• IBM Employee Attrition and Performance
– Uncover the factors that lead to employee attrition
– Features (excerpt)
• Education: 1 'Below College', 2 'College', 3 'Bachelor', 4 'Master', 5 'Doctor'
• EnvironmentSatisfaction: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'
• JobInvolvement: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'
• JobSatisfaction: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'
• PerformanceRating: 1 'Low', 2 'Good', 3 'Excellent', 4 'Outstanding'
• RelationshipSatisfaction: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'
• WorkLifeBalance: 1 'Bad', 2 'Good', 3 'Better', 4 'Best'
– Fully labeled
– k-fold cross-validation
– Classification
• 39. Future Work
• More complex real-time data sets
• Real-time streaming and semi-supervised learning for autonomous vehicles
• IoT and the Autonomous Vehicle in the Clouds: Simultaneous Localization and Mapping (SLAM) with Kafka and Spark Streaming (Spark Summit East 2017)
• Full framework!!!
• 40. Future Work
• Improving the batch retraining policy to incorporate more information about the distributions of data
• Investigating model switching vs. retraining
• Adding boosting mechanisms in batch
• Adding feature extraction for high dimensionality
• Expansion of Spark Streaming ML algorithms
• 41. Further Reading
• Agarwal, Alekh, Olivier Chapelle, Miroslav Dudik, and John Langford. "A Reliable Effective Terascale Linear Learning System."
• Chapelle, Olivier, Bernhard Schölkopf, and Alexander Zien (eds.). Semi-Supervised Learning.
• Mitchell, Tom. "Learning from Labeled and Unlabeled Data." CMU.
• Doya, Kenji, Kazuyuki Samejima, Ken-ichi Katagiri, and Mitsuo Kawato. "Multiple Model-Based Reinforcement Learning."
• Vainsencher, Daniel, Shie Mannor, and Huan Xu. "Learning Multiple Models via Regularized Weighting." Advances in Neural Information Processing Systems, 2013.
• Blum, Avrim, and Tom Mitchell. "Combining Labeled and Unlabeled Data with Co-Training." Proceedings of the Eleventh Annual Conference on Computational Learning Theory (COLT '98), ACM, 1998, 92–100. DOI=http://dx.doi.org/10.1145/279943.279962
• Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition. Springer Series in Statistics.
• 42. Thank You. J. White Bear (jwhiteb@us.ibm.com) IBM Spark Technology Center, 505 Howard St, San Francisco, CA