Webinar: Machine Learning with Spark
Everything you want to know about Machine Learning
but could not find the place and time to ask
Highlights
 Detecting the low-hanging fruit for machine learning
 Balancing business and science on your team
 Choosing the best Machine Learning tools for small or Big Data:
R, Python, or Spark (these are not mutually exclusive)
Copyright © 2016 Elephant Scale. All rights reserved.
What does a data scientist need to know
 Familiarity with Java, Scala, or Python
– Need to be comfortable programming: there are many labs
– Our platform is Spark; basic familiarity is expected
– Our labs are in Scala; basics of Scala will be helpful
 Basic understanding of Linux development environment
– Command line navigation
– Editing files (e.g. using VI or nano)
 This is a Machine Learning with Spark class
– But no previous knowledge of Machine Learning is assumed
– The class will be paced to suit the majority of the students.
Lots of Labs : Learn By Doing
Where is
the ANY
key?
After The Class…
Machine
Learning
Recommended Books
 “Advanced Analytics With Spark” by Sandy Ryza, et al.
 “Data Algorithms” by Mahmoud Parsian
 “Computational Complexity - A Modern Approach” by Sanjeev
Arora and Boaz Barak
Why machine learning?
 Build a model to detect credit card fraud
– thousands of features
– billions of transactions
 Recommend
– millions of products
– to millions of users
 Estimate financial risk
– simulations of portfolios
– with millions of instruments
 Genome data manipulation
– thousands of human genomes
– detect genetic associations with disease
 Like Hadoop MapReduce, Spark has linear scalability and
fault tolerance for large data sets
 However, it adds the following extensions
– DAG of operations, instead of Map-then-Reduce
– Rich transformations to express solutions in the natural way
– RDD – in-memory computation
 Addresses the major bottleneck:
– Not CPU
– Not disk
– Not network
– But developer productivity
Why Spark?
The story of Spark
 It reduces performance overhead
– Be certain the performance is adequate
– Scala gives you access to the latest and greatest
– Python and R bindings may come much later
 Scala helps you understand the Spark approach better
– Spark is written in Scala
– Think in Scala, think in Spark
 Just Scala, no other languages needed
– Such as R with SQL
Why Scala?
 Python
– Popular, well-known
– Many packages
– Graphing
 R
– Very popular, well-known
– Very many packages
– Graphing
Why NOT Scala?
About Machine Learning
 What is Machine Learning?
 It is an algorithm that “learns” from data
– Any algorithm which improves its performance by access to data.
 Machine Learning borrows from applied statistics
 Also considered a branch of AI (Artificial Intelligence)
 Sixties
– Commercial computers & mainframes
– Computers play chess
 Eighties
– Computational complexity theory
– Artificial intelligence (AI) gets a bad rap
 21st century
– Big Data changes it all
A glimpse of history
 Computational complexity is simple:
 P – all problems that can be solved fast
– (in polynomial time, like n^p, but not exponential)
– Example: system of linear equations
 NP – all problems that can be verified fast
– That is, just check if the solution is correct
 But folks, it does not matter!
P = NP?
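One way to see the gap between solving and verifying is the subset-sum puzzle. This is an illustrative choice, not something from the slides, and the function names are hypothetical. Checking a proposed subset only needs a sum, while the naive search may try up to 2^n subsets.

```python
from itertools import combinations


def verify(numbers, subset, target):
    """Check a proposed answer: a short pass over the subset (polynomial time)."""
    pool = list(numbers)
    for x in subset:
        if x not in pool:      # every chosen value must come from the input
            return False
        pool.remove(x)
    return sum(subset) == target


def solve(numbers, target):
    """Search for an answer by brute force: up to 2^n candidate subsets."""
    for size in range(len(numbers) + 1):
        for candidate in combinations(numbers, size):
            if sum(candidate) == target:
                return list(candidate)
    return None


numbers = [3, 34, 4, 12, 5, 2]
answer = solve(numbers, 9)              # slow in general
is_valid = verify(numbers, answer, 9)   # fast to confirm
```

Doubling the length of `numbers` barely changes the cost of `verify`, but it squares the number of subsets `solve` may have to try.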
 “Big O” notation
 Example of polynomial time: O(n^3)
 Example of exponential time: O(2^n)
– How much is that?
– Compare to the number of particles in the universe, ~10^80
– To reach that, n needs to be log2(10^80)
= 80 × log2(10) ≈ 80 × 3.32 ≈ 266
 There are also in-between cases, such as n^(log log n)
– But that is still bad enough
O(n) notation
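A quick check of the arithmetic, as a sketch: it takes the base-2 logarithm exactly and compares cubic with exponential growth at that n. The variable and function names are made up for illustration.

```python
import math

PARTICLES = 10 ** 80  # rough particle count used on the slide

# Smallest n with 2^n >= 10^80: n = log2(10^80) = 80 * log2(10)
n_exact = 80 * math.log2(10)
n_needed = math.ceil(n_exact)


def steps_cubic(n):
    """Work done by an O(n^3) algorithm, ignoring constants."""
    return n ** 3


def steps_exponential(n):
    """Work done by an O(2^n) algorithm, ignoring constants."""
    return 2 ** n


print(n_needed, steps_cubic(n_needed), steps_exponential(n_needed) >= PARTICLES)
```

At that input size the cubic algorithm needs only millions of steps, while the exponential one already needs more steps than there are particles.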
 Old reasons
– It is too theoretical, talking only about worst case scenario
– There may be new computers, such as quantum computers
 New reason
– Big Data
– Turing machine is inadequate
• Because we hit the size limitations of one computer
• And go into clusters
• And we have other problems than expected
Why P and NP do not matter
 Old thinking:
– If you can solve any problem (P = NP), you can be creative
 New thinking:
– You don’t have to solve problems in order to be creative
– Instead, you can pick up the answer from the internet
– Examples:
• Google translate
• IBM Watson (Jeopardy! winner)
• Lesson: re-use world’s data
 New thinking:
– Rely on the abundance of data
– Find an approximate solution that is good enough
– “Bad algorithms trained on lots of data can outperform good ones
trained on very little” - Deeplearning4j
How Big Data changed it all
 Turing machine might be too theoretical
 But developers often tend to “just code”
“Думать не за свое дело браться”
“To think is to take on a job that is not yours”
(Russian musicians’ slang)
The other extreme - no thinking at all
 Our approach to Machine Learning is
 The Golden Mean approach
 Avoid over-theorizing
 Avoid “just code”
– Know what to expect of the solution
– When to apply
– The limitations
– The benefits
The golden mean
Sages advocate the golden mean
Types of Machine Learning
 Supervised Machine Learning:
– A model is “trained” with human-labeled training data.
– The model is then tested on held-out labeled data to measure performance.
– The model can then be applied to unknown data.
– Classification and regression are usually supervised.
 Unsupervised Machine Learning
– Model tries to find natural patterns in the data.
– No human input except parameters of the model.
– Example: Clustering
 Semi-Supervised Learning
– Model is trained with a training set that contains a mix of labeled
and unlabeled data
Supervised Machine Learning
 Input Data is split into “training” and “test” data, both labeled.
 A Model is trained using training data
 Prediction is made using model.predict()
 Model can be tested by comparing its predictions against the test dataset
– Mean Squared Error: mean((predicted – actual)^2)
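A minimal sketch of the error metric in plain Python (not the Spark API). Mean squared error averages the squared differences; squaring keeps positive and negative errors from cancelling each other out, which the plain mean of the differences does not. The sample numbers are made up.

```python
def mean_squared_error(predicted, actual):
    """Average of squared differences between predictions and true values."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)


def mean_error(predicted, actual):
    """Plain average of differences: errors of opposite sign cancel out."""
    return sum(p - a for p, a in zip(predicted, actual)) / len(predicted)


predicted = [3.0, 5.0, 2.0]   # hypothetical model output
actual = [2.0, 6.0, 2.0]      # hypothetical test labels

print(mean_error(predicted, actual))          # looks perfect, but is misleading
print(mean_squared_error(predicted, actual))  # reveals the misses
```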
MLlib Algorithm overview
Model Validation
 Models need to be ‘verified’ / ‘validated’
 Split the data set into
– Training set : build / train model
– Test set : validate the model
 Initially 70% training, 30% validation
 Tweak the dials to decrease training and increase validation
 Training set should represent the data well enough
(Diagram: the data set is split into training and testing portions for the model)
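The 70/30 split can be sketched in a few lines of plain Python. The helper name and the seed are assumptions for illustration; in Spark the same idea is usually expressed with `randomSplit`.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows, then cut it into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps runs repeatable
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))                 # stand-in for labeled records
train, test = train_test_split(rows)    # 70% to train, 30% to validate
```

Shuffling before cutting matters: if the data is sorted (by date, by customer, by label), an unshuffled split gives a training set that does not represent the data well.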
Creating Feature Vectors: Feature Extraction
 Machine Learning only works with vectors. A feature vector
is a point in n-dimensional space.
– Select variables from data
– Turn data into numbers (doubles).
– “normalize” (scale down) high magnitude data.
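The three steps above can be sketched in plain Python. The record fields are hypothetical, and min-max scaling is just one common way to normalize, chosen here as an assumption.

```python
def to_feature_vector(record, fields):
    """Select variables from a record and turn them into doubles."""
    return [float(record[f]) for f in fields]


def min_max_scale(vectors):
    """Rescale every column to the range [0, 1]."""
    columns = list(zip(*vectors))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, lo, hi in zip(v, lows, highs)]
        for v in vectors
    ]


# Hypothetical customer records; "plan" is text and is left out of the vector.
records = [
    {"minutes": "100", "calls": 1, "plan": "basic"},
    {"minutes": "200", "calls": 3, "plan": "gold"},
    {"minutes": "300", "calls": 2, "plan": "basic"},
]
vectors = [to_feature_vector(r, ["minutes", "calls"]) for r in records]
scaled = min_max_scale(vectors)
```

Without scaling, the "minutes" column would dominate any distance-based algorithm simply because its numbers are larger.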
Vectors: Dense versus Sparse
 Dense Vectors
– Usually have a nonzero value for each variable
– The “telecom churn” dataset we use in the labs is a dense dataset.
– Use Vectors.dense
 Sparse Vectors
– Most values are zero (or nonexistent)
– Text Data yields sparse vectors
– One-Hot, factor variables lead to sparse vectors
– Use Vectors.sparse
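A plain-Python sketch of the two representations. The sparse form below stores (size, indices, values), which is also the shape of arguments that Spark's `Vectors.sparse` takes; the helper names here are hypothetical and this is not the Spark API itself.

```python
def to_sparse(dense):
    """Keep only the nonzero entries, with their positions."""
    indices = [i for i, x in enumerate(dense) if x != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_dense(size, indices, values):
    """Rebuild the full vector, filling the gaps with zeros."""
    dense = [0.0] * size
    for i, x in zip(indices, values):
        dense[i] = x
    return dense


def one_hot(category, categories):
    """Encode a factor variable: a single 1.0, everything else 0.0."""
    return [1.0 if c == category else 0.0 for c in categories]


size, indices, values = to_sparse([0.0, 3.0, 0.0, 0.0, 1.5])
encoded = one_hot("b", ["a", "b", "c"])
```

With thousands of categories or words, a one-hot or text vector is almost all zeros, so storing only the nonzero positions saves most of the memory.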
Creating Vectors From Text
 How to create vectors from text?
– TF/IDF: Term Frequency Inverse Document Frequency
• This essentially means the frequency of a term in a document, scaled down
by how common the term is across the larger group of documents (the “corpus”)
• Each word in the corpus is then a “dimension” – you would have
thousands of dimensions.
– Word2Vec
• Another vectorization algorithm
• Uses neural network
• Borders on deep learning
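A minimal TF-IDF sketch in plain Python, assuming whitespace tokenization and the unsmoothed textbook formula tf × log(N / df). Library implementations, Spark's included, typically add smoothing, so their numbers will differ. The three documents are made up.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


corpus = ["spark is fast", "spark is scalable", "python is popular"]
weights = tf_idf(corpus)
```

A word such as "is" appears in every document, so its weight drops to zero, while a word unique to one document gets the highest weight there.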
Visualizing Text using WordCloud
State of The Union Speech 2014
 What is deep learning?
– “A neural network with more than 1 hidden layer”
– Deeplearning4j
 But what is a neural network?
Deep learning
 Set of algorithms
 Modeled loosely after the human brain
 Designed to recognize patterns
 Input comes from sensory data
– machine perception
– labeling
– clustering raw input
 Recognized patterns
– Numerical
– Contained in vectors
– Translated from real-world data
• Images
• Sound
• Text
• Time series
Neural networks
 Do I have the data?
 Which outputs do I care about?
– Spam – not spam
– Fraud – not fraud
 Do I have labeled data from which to learn? (Supervised
learning)
 Nah, I just need to group things (Unsupervised learning)
– Normal – anomaly
– Group documents
Basic steps in a neural network
Neural network node
Neural network composition
Deep neural network
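The node, composition, and deep-network slides survive here only as titles. As a rough sketch, assuming the usual formulation (a node is a weighted sum plus a bias passed through an activation, a layer is several nodes, and a deep network stacks more than one hidden layer), the idea looks like this in plain Python. The weights are arbitrary placeholders, not trained values.

```python
import math


def sigmoid(z):
    """Squash any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias):
    """One node: weighted sum of the inputs, plus bias, through an activation."""
    return sigmoid(sum(x * w for x, w in zip(inputs, weights)) + bias)


def layer(inputs, weight_rows, biases):
    """A layer is several nodes that all see the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


def forward(inputs, layers):
    """Feed the output of each layer into the next one."""
    activations = inputs
    for weight_rows, biases in layers:
        activations = layer(activations, weight_rows, biases)
    return activations


# Two hidden layers and one output node (placeholder weights).
network = [
    ([[0.5, -0.25], [0.1, 0.2], [-0.3, 0.4]], [0.0, 0.1, -0.1]),
    ([[0.2, -0.1, 0.3], [0.4, 0.1, -0.2]], [0.0, 0.0]),
    ([[0.7, -0.6]], [0.05]),
]
output = forward([1.0, 2.0], network)
```

Training, which adjusts the weights from labeled data, is the hard part and is left out of this sketch.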
 Google
– ParagraphVectors (implemented as doc2vec)
– Represents the meaning of documents
– Based on word2vec and word context
 Facebook
Deep learning applications
ML in Spark
(Diagram: Spark SQL, Spark Streaming, MLlib, and GraphX run on top of Spark Core, which runs on Standalone, YARN, or Mesos)
Linear algorithms
SVM
Logistic regression
Linear regression
Practical use case for SVM
History of logistic regression
 Invented by (Sir) David Cox, UK
 Who wrote 364 books and papers
 Best known for
– Proportional hazards model
– Used in analysis of survival data
– Medical research (cancer)
Classification algorithms
Naïve Bayes
Decision Trees
K-Means (clustering)
Where Naïve Bayes fits in
 There are many classification algorithms in the world
 Naïve Bayes Classifier (NBC) is one of the simplest but most
effective
 K-means and K-nearest neighbors are for numeric data
 But for non-numeric data such as
– Names
– Symbols
– Emails
– Texts
 NBC may be the best choice
 Bayes can do multiclass (and not only binary) classification
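A small multinomial Naïve Bayes text classifier in plain Python, as a sketch rather than the MLlib implementation. It assumes whitespace tokenization and add-one (Laplace) smoothing; the training messages and the three labels are made up to show multiclass use.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (text, label) pairs. Returns the counts the model needs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary


def predict(model, text):
    """Pick the label with the highest log prior plus log likelihoods."""
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for word in text.lower().split():
            if word in vocabulary:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_naive_bayes([
    ("win cash prize now", "spam"),
    ("cheap prize offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project status report", "ham"),
    ("invoice for your order", "billing"),
])
label = predict(model, "cash prize")
```

The "naïve" part is treating every word as independent of the others given the label, which is why the per-word probabilities can simply be multiplied (added, in log space).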
A is a good candidate for Naïve Bayes
(Credit: Sebastian Raschka)
History of Bayes
 Discovered by the Reverend Thomas Bayes (1701–1761)
 Edited and read at the Royal Society by Richard Price (1763)
 Independently reproduced and extended by Laplace (1774)
 Naïve Bayes classifiers studied in the 1950s
Clustering use case
 Anomaly detection
– Find fraud
– Detect network intrusion attack
– Discover problems on servers
– Or on any machinery with sensors
 Clustering does not necessarily detect fraud
– But it points to unusual data
– And the need for further investigation
Network intrusion
 Known unknowns
– Port scanning
– Number of ports accessed per second
– Number of bytes sent/received
 But what about unknown unknowns?
– Biggest threat
– New and as yet unclassified attacks
– Connections that are not known as attacks
– But are out of the ordinary
– Anomalies that are outside clusters
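A sketch of the "outside the clusters" idea in plain Python: fit k-means centroids on traffic believed to be normal, then flag any connection that is far from every centroid. The two features (ports per second, kilobytes sent), the starting centroids, and the distance threshold are all made-up assumptions; real features would normally be scaled first.

```python
import math


def nearest(point, centroids):
    """Return (index, distance) of the closest centroid."""
    distances = [math.dist(point, c) for c in centroids]
    best = min(range(len(centroids)), key=distances.__getitem__)
    return best, distances[best]


def k_means(points, centroids, iterations=10):
    """Basic Lloyd's algorithm, starting from the given centroids."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)[0]].append(p)
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids


def anomalies(points, centroids, threshold):
    """Points farther than the threshold from every cluster center."""
    return [p for p in points if nearest(p, centroids)[1] > threshold]


# Hypothetical connections: (ports accessed per second, kilobytes sent)
normal = [(1, 10), (2, 12), (1, 11), (20, 200), (21, 210), (19, 190)]
centroids = k_means(normal, [(1, 10), (20, 200)])

incoming = [(2, 11), (20, 205), (90, 15)]
suspicious = anomalies(incoming, centroids, threshold=15)
```

The flagged connection is not proven to be an attack; it is simply unlike anything seen before and worth a closer look.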
Shortest Path
 You have a graph (map) of cities
 With distances between them
 Help the mouse find the shortest path to the cheese
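The slide does not say which algorithm is used. As one possible approach, here is Dijkstra's algorithm in plain Python (not the GraphX API), with a made-up map in which the node names and distances are hypothetical.

```python
import heapq


def shortest_path(graph, start, goal):
    """graph: {node: {neighbor: distance}}. Returns (distance, path)."""
    queue = [(0, start, [start])]   # always expand the closest node next
    visited = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == goal:
            return dist, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (dist + weight, neighbor, path + [neighbor]))
    return float("inf"), []         # goal cannot be reached


cities = {
    "Mouse": {"A": 4, "B": 1},
    "B": {"A": 2, "Cheese": 7},
    "A": {"Cheese": 3},
    "Cheese": {},
}
distance, route = shortest_path(cities, "Mouse", "Cheese")
```

The direct-looking route through A costs 7, but the detour through B and then A costs only 6, which is why the algorithm tracks total distance rather than the number of hops.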
Session 7: GraphX
Elephant Scale – Big Data Training done right

Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 

Kürzlich hochgeladen (20)

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 

Machine Learning with Spark

  • 1. Webinar: Machine Learning with Spark Everything you want to know about Machine Learning but could not find the place and time to ask
  • 2. Highlights
      ▪ Detecting the low-hanging fruit for machine learning
      ▪ Balancing business and science on your team
      ▪ Choosing the best Machine Learning tools, be it small or Big Data, R, Python or Spark (these are not mutually exclusive)
      Copyright © 2016 Elephant Scale. All rights reserved.
  • 3. What does a data scientist need to know
      ▪ Familiarity with either Java, Scala or Python
          – Need to be comfortable programming: there are many labs
          – Our platform is Spark, basic familiarity is expected
          – Our labs are in Scala, basics of Scala will be helpful
      ▪ Basic understanding of the Linux development environment
          – Command line navigation
          – Editing files (e.g. using vi or nano)
      ▪ This is a Machine Learning with Spark class
          – But no previous knowledge of Machine Learning is assumed
          – The class will be paced to suit the majority of the students
  • 4. Lots of Labs: Learn By Doing (image caption: “Where is the ANY key?”)
  • 5. After The Class… (image: Machine Learning)
  • 6. Recommended Books
      ▪ “Advanced Analytics With Spark” by Sandy Ryza, et al.
      ▪ “Data Algorithms” by Mahmoud Parsian
      ▪ “Computational Complexity - A Modern Approach” by Sanjeev Arora and Boaz Barak
  • 7. Why machine learning?
      ▪ Build a model to detect credit card fraud
          – thousands of features
          – billions of transactions
      ▪ Recommend
          – millions of products
          – to millions of users
      ▪ Estimate financial risk
          – simulations of portfolios
          – with millions of instruments
      ▪ Genome data manipulation
          – thousands of human genomes
          – detect genetic associations with disease
  • 8. Why Spark?
      ▪ Like Hadoop MapReduce, Spark has linear scalability and fault tolerance for large data sets
      ▪ However, it adds the following extensions
          – DAG of operations, instead of Map-then-Reduce
          – Rich transformations to express solutions in the natural way
          – RDD: in-memory computation
      ▪ Addresses the major bottleneck:
          – Not CPU
          – Not disk
          – Not network
          – But developer productivity
  • 9. The story of Spark
  • 10. Why Scala?
      ▪ It reduces performance overhead
          – Be certain the performance is adequate
          – Scala gives you access to the latest and greatest
          – Python and R bindings may come much later
      ▪ Scala helps you understand the Spark approach better
          – Spark is written in Scala
          – Think in Scala, think in Spark
      ▪ Just Scala, no other languages needed
          – Such as R with SQL
  • 11. Why NOT Scala?
      ▪ Python
          – Popular, well-known
          – Many packages
          – Graphing
      ▪ R
          – Very popular, well-known
          – Very many packages
          – Graphing
  • 12. About Machine Learning
      ▪ What is Machine Learning?
      ▪ It is an algorithm that “learns” from data
          – Any algorithm which improves its performance by access to data
      ▪ Machine Learning borrows from applied statistics
      ▪ Also considered a branch of AI (Artificial Intelligence)
  • 13. A glimpse of history
      ▪ Sixties
          – Commercial computers & mainframes
          – Computers play chess
      ▪ Eighties
          – Computational complexity theory
          – Artificial intelligence (AI) gets a bad rap
      ▪ 21st century
          – Big Data changes it all
  • 14. P = NP?
      ▪ Computational complexity is simple:
      ▪ P: all problems that can be solved fast
          – (in polynomial time, like n^p, but not exponential)
          – Example: system of linear equations
      ▪ NP: all problems that can be verified fast
          – That is, just check if the solution is correct
      ▪ But folks, it does not matter!
  • 15. O(n) notation
      ▪ “Big O” notation
      ▪ Example of polynomial time: O(n^3)
      ▪ Example of exponential time: O(2^n)
          – How much is that?
          – Compare to the number of particles in the universe, ~10^80
          – To reach that, our n needs to be log2(10^80) = 80 × log2(10) ≈ 80 × 3.3 ≈ 266
      ▪ There are also in-between cases, such as n^(log log n)
          – But that is still bad enough
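The exponential-versus-polynomial comparison on this slide can be checked with a few lines of plain Python. This is only an illustrative sketch, not part of the original deck; the names `PARTICLES`, `n_needed` and `cubic_cost` are made up here, and the logarithm is taken in base 2 because the growth being measured is 2^n.

```python
import math

PARTICLES = 10 ** 80  # rough particle count quoted on the slide

# Smallest n for which 2**n exceeds the particle count.
n_exact = math.log2(PARTICLES)   # a little under 266
n_needed = math.ceil(n_exact)

# A cubic (polynomial) cost at the same n stays tiny by comparison.
cubic_cost = n_needed ** 3

print(n_needed, cubic_cost, 2 ** n_needed > PARTICLES)
```

An input of only a few hundred items is already enough for an O(2^n) algorithm to need more steps than there are particles in the universe, while an O(n^3) algorithm on the same input needs a few tens of millions.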
  • 16. Why P and NP do not matter
      ▪ Old reasons
          – It is too theoretical, talking only about worst case scenario
          – There may be new computers, such as quantum computers
      ▪ New reason
          – Big Data
          – Turing machine is inadequate
              • Because we hit the size limitations of one computer
              • And go into clusters
              • And we have other problems than expected
  • 17. How Big Data changed it all
      ▪ Old thinking:
          – If you can solve any problem (P = NP), you can be creative
      ▪ New thinking:
          – You don’t have to solve problems in order to be creative
          – Instead, you can pick up the answer from the internet
          – Examples:
              • Google Translate
              • IBM Watson (Jeopardy! winner)
              • Lesson: re-use the world’s data
      ▪ New thinking:
          – Rely on the abundance of data
          – Find an approximate solution that is good enough
          – “Bad algorithms trained on lots of data can outperform good ones trained on very little” (Deeplearning4j)
  • 18. The other extreme: no thinking at all
      ▪ Turing machine might be too theoretical
      ▪ But developers often tend to “just code”
      “To think is to take on a job that is not yours” (Russian gig musicians’ slang)
  • 19. The golden mean
      ▪ Our approach to Machine Learning is the Golden Mean approach
      ▪ Avoid over-theorizing
      ▪ Avoid “just code”
          – Know what to expect of the solution
          – When to apply
          – The limitations
          – The benefits
      (Image caption: Sages advocate the golden mean)
  • 20. Types of Machine Learning
      ▪ Supervised Machine Learning
          – A model is “trained” with human-labeled training data
          – The model is then tested on held-out labeled data to see its performance
          – The model can then be applied to unknown data
          – Classification and regression are usually supervised
      ▪ Unsupervised Machine Learning
          – Model tries to find natural patterns in the data
          – No human input except parameters of the model
          – Example: clustering
      ▪ Semi-Supervised Learning
          – Model is trained with a training set which contains a mix of labeled and unlabeled data
  • 21. Supervised Machine Learning
      ▪ Input data is split into “training” and “test” data, both labeled
      ▪ A model is trained using the training data
      ▪ Prediction is made using model.predict()
      ▪ The model can be tested by comparing its predictions against the test dataset
          – Mean Squared Error: mean((predicted – actual)^2)
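The error measure named on this slide is the mean of the squared differences between predictions and labels. A minimal plain-Python sketch follows; it is not Spark code, and `mean_squared_error`, `test_labels` and `predictions` are illustrative names with made-up numbers.

```python
def mean_squared_error(predicted, actual):
    """Average of squared differences between predictions and labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical model output compared with the labels of the test set.
test_labels = [3.0, 5.0, 7.0]
predictions = [2.0, 5.0, 9.0]
print(mean_squared_error(predictions, test_labels))
```

Squaring keeps positive and negative errors from cancelling out and penalizes large misses more heavily than small ones.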
  • 23. Model Validation
      ▪ Models need to be “verified” / “validated”
      ▪ Split the data set into
          – Training set: build / train model
          – Test set: validate the model
      ▪ Initially 70% training, 30% validation
      ▪ Tweak the dials to decrease training and increase validation
      ▪ Training set should represent the data well enough
      (Diagram: training, testing, model)
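A 70% / 30% split like the one described on this slide can be sketched in plain Python as below. The function name `train_test_split`, its parameters and the fixed seed are assumptions for illustration; in a Spark lab the split would normally be done with the framework's own facilities.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of rows and cut it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # seeded, so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))  # stand-in for labeled records
train, test = train_test_split(rows)
print(len(train), len(test))
```

Shuffling before cutting is what helps the training set "represent the data well enough": without it, records sorted by date or by class would end up entirely on one side of the split.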
  • 24. Creating Feature Vectors: Feature Extraction
      ▪ Machine Learning only works with vectors. A feature vector is a point in n-dimensional space.
          – Select variables from data
          – Turn data into numbers (doubles)
          – “Normalize” (scale down) high-magnitude data
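The three steps on this slide (select variables, turn them into numbers, normalize) can be sketched in plain Python. Min-max scaling is used here as one common way to normalize; the slide does not say which method the labs use, and the records and the helper `min_max_scale` are hypothetical.

```python
def min_max_scale(column):
    """Rescale a list of numbers into the range [0.0, 1.0]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column carries no signal
    return [(x - lo) / (hi - lo) for x in column]


# Hypothetical raw records: (monthly minutes, number of support calls).
records = [(1200.0, 1.0), (300.0, 4.0), (750.0, 0.0)]

columns = list(zip(*records))                    # one tuple per variable
scaled_columns = [min_max_scale(c) for c in columns]
feature_vectors = [list(v) for v in zip(*scaled_columns)]
print(feature_vectors)
```

After scaling, the minutes column no longer dominates the calls column simply because its raw numbers are larger.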
  • 25. Vectors: Dense versus Sparse
      ▪ Dense Vectors
          – Usually have a nonzero value for each variable
          – The “telecom churn” dataset we use in the labs is a dense dataset
          – Use Vectors.dense
      ▪ Sparse Vectors
          – Most values are zero (or nonexistent)
          – Text data yields sparse vectors
          – One-hot, factor variables lead to sparse vectors
          – Use Vectors.sparse
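To make the dense/sparse distinction concrete, here is a plain-Python sketch of the idea behind a sparse vector: store only the size, the positions of the nonzero entries, and their values. It illustrates the concept and is not the Spark API itself; `to_sparse` and `to_dense` are hypothetical helpers.

```python
def to_sparse(dense):
    """Return (size, indices, values), keeping only nonzero entries."""
    indices = [i for i, x in enumerate(dense) if x != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_dense(size, indices, values):
    """Expand a (size, indices, values) triple back into a full list."""
    dense = [0.0] * size
    for i, x in zip(indices, values):
        dense[i] = x
    return dense


one_hot = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]   # e.g. a one-hot factor variable
sparse = to_sparse(one_hot)
print(sparse)
```

For text data with thousands of dimensions and only a handful of nonzero entries per document, the sparse form saves most of the memory a dense list would need.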
  • 26. Creating Vectors From Text
      ▪ How to create vectors from text?
          – TF/IDF: Term Frequency / Inverse Document Frequency
              • This essentially means the frequency of a term divided by its frequency in the larger group of documents (the “corpus”)
              • Each word in the corpus is then a “dimension”: you would have thousands of dimensions
          – Word2Vec
              • Another vectorization algorithm
              • Uses a neural network
              • Borders on deep learning
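The TF/IDF idea can be sketched in plain Python. This version uses one common textbook weighting, term frequency times log(N / document frequency), which is an assumption here; libraries differ in the exact smoothing they apply. The three-sentence corpus and the function name `tf_idf` are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Number of documents each term appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


corpus = [
    "spark makes big data simple".split(),
    "spark runs machine learning".split(),
    "machine learning needs data".split(),
]
for row in tf_idf(corpus):
    print(row)
```

Words that occur in only one document receive the highest weight, while a word that appears in every document gets a weight of zero because it does not help tell documents apart.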
  • 27. Visualizing Text using WordCloud (figure: State of the Union Speech 2014)
  • 28. Deep learning
      ▪ What is deep learning?
          – “A neural network with more than 1 hidden layer” (Deeplearning4j)
      ▪ But what is a neural network?
  • 29. Neural networks
      ▪ Set of algorithms
      ▪ Modeled loosely after the human brain
      ▪ Designed to recognize patterns
      ▪ Input comes from sensory data
          – machine perception
          – labeling
          – clustering raw input
      ▪ Recognized patterns
          – Numerical
          – Contained in vectors
          – Translated from real-world data
              • Images
              • Sound
              • Text
              • Time series
  • 30. Basic steps in a neural network
      ▪ Do I have the data?
      ▪ Which outputs do I care about?
          – Spam / not spam
          – Fraud / not fraud
      ▪ Do I have labeled data from which to learn? (Supervised learning)
      ▪ Nah, I just need to group things (Unsupervised learning)
          – Normal / anomaly
          – Group documents
  • 31. Neural network node
  • 32. Neural network composition
  • 33. Deep neural network
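The figures for these three slides did not survive in the transcript. As a stand-in, the following plain-Python sketch shows the usual textbook picture: a node computes a weighted sum of its inputs plus a bias and passes it through an activation function, and a network composes layers of such nodes. The sigmoid activation, the weights and the function names are all assumptions chosen for illustration, not taken from the slides.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def node_output(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)


def layer_output(inputs, weight_rows, biases):
    """A layer is several nodes reading the same inputs."""
    return [node_output(inputs, w, b) for w, b in zip(weight_rows, biases)]


# Hypothetical numbers, chosen only for illustration.
x = [0.5, -1.0]
hidden = layer_output(x, [[1.0, 2.0], [-1.0, 0.5]], [0.0, 1.0])
print(node_output(hidden, [1.0, -1.0], 0.0))
```

Stacking more than one hidden layer between input and output in this way gives a deep network in the sense quoted earlier in the deck.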
  • 34. Deep learning applications
      ▪ Google
          – ParagraphVectors (implemented as doc2vec)
          – Represents the meaning of documents
          – Based on word2vec and word context
      ▪ Facebook
  • 35. ML in Spark (diagram labels: Spark Core, Spark SQL, Spark Streaming, MLlib, GraphX, Standalone, YARN, Mesos)
  • 36. Linear algorithms: SVM, logistic regression, linear regression
  • 37. Practical use case for SVM
      (c) ElephantScale.com 2016. All rights reserved
  • 38. History of logistic regression
      ▪ Invented by (Sir) David Cox, UK
      ▪ Who wrote 364 books and papers
      ▪ Best known for
          – Proportional hazards model
          – Used in analysis of survival data
          – Medical research (cancer)
  • 40. Where Naïve Bayes fits in
      ▪ There are many classification algorithms in the world
      ▪ Naïve Bayes Classifier (NBC) is one of the simplest but most effective
      ▪ K-means and K-nearest neighbors are for numeric data
      ▪ But for
          – Names
          – Symbols
          – Emails
          – Texts
      ▪ NBC may be the best for that
      ▪ Bayes can do multiclass (and not only binary) classification
      (Figure caption: “A is a good candidate for Naïve Bayes”. Credit: Sebastian Raschka)
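A tiny multiclass Naïve Bayes text classifier can be sketched in plain Python to show why it suits word-like data. The sketch assumes a multinomial model with add-one (Laplace) smoothing, which is one standard variant; the training sentences, labels and the function names `train` and `classify` are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count words per class; labeled_docs is a list of (label, words)."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, words in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def classify(model, words):
    """Pick the class with the highest log-probability score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)          # class prior
        total_words = sum(word_counts[label].values())
        for w in words:
            # Add-one smoothing avoids zero probabilities for unseen words.
            p = (word_counts[label][w] + 1) / (total_words + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Tiny made-up training set with three classes (multiclass, not binary).
model = train([
    ("spam", "win money now".split()),
    ("spam", "free money offer".split()),
    ("work", "meeting agenda attached".split()),
    ("social", "party this weekend".split()),
])
print(classify(model, "free money".split()))
```

The "naïve" part is the assumption that words are independent given the class, which is why the per-word log-probabilities can simply be added.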
  • 41. History of Bayes
      ▪ Discovered by the Reverend Thomas Bayes (1701–1761)
      ▪ Edited and read at the Royal Society by Richard Price (1763)
      ▪ Independently reproduced and extended by Laplace (1774)
      ▪ Naïve Bayes classifiers studied in the 1950s
  • 42. Clustering use case
      ▪ Anomaly detection
          – Find fraud
          – Detect network intrusion attack
          – Discover problems on servers
          – Or on any machinery with sensors
      ▪ Clustering does not necessarily detect fraud
          – But it points to unusual data
          – And the need for further investigation
  • 43. Network intrusion
      ▪ Known unknowns
          – Port scanning
          – Number of ports accessed per second
          – Number of bytes sent/received
      ▪ But what about unknown unknowns?
          – Biggest threat
          – New and as yet unclassified attacks
          – Connections that are not known as attacks
          – But are out of the ordinary
          – Anomalies that are outside clusters
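The idea of "anomalies that are outside clusters" can be sketched in plain Python: once cluster centers are known, a connection that lies far from every center is flagged for investigation. The centroids below are assumed to come from an earlier clustering step such as k-means; the feature values, the threshold and the function names are hypothetical.

```python
import math


def nearest_centroid_distance(point, centroids):
    """Euclidean distance from a point to its closest cluster center."""
    return min(math.dist(point, c) for c in centroids)


def find_anomalies(points, centroids, threshold):
    """Return points lying farther than threshold from every centroid."""
    return [p for p in points
            if nearest_centroid_distance(p, centroids) > threshold]


# Hypothetical features per connection: (ports per second, kilobytes sent).
centroids = [(1.0, 10.0), (5.0, 200.0)]
connections = [(1.2, 11.0), (4.8, 205.0), (90.0, 3.0), (0.9, 9.5)]
print(find_anomalies(connections, centroids, threshold=20.0))
```

A flagged point is not proof of an attack; it only marks a connection that is out of the ordinary and worth a closer look. In practice the features would be normalized first so that one large-valued column does not dominate the distance.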
  • 44. Shortest Path (Session 7: GraphX)
      ▪ You have a graph (map) of cities
      ▪ With distances between them
      ▪ Help the mouse find the shortest path to the cheese
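The mouse-and-cheese problem is a classic shortest-path search. The sketch below solves it with Dijkstra's algorithm in plain Python rather than with GraphX, simply to show the logic on a single machine; the map, its distances and the function name `shortest_path` are made up.

```python
import heapq


def shortest_path(graph, start, goal):
    """Dijkstra's algorithm; returns (total distance, list of nodes)."""
    queue = [(0, start, [start])]          # (distance so far, node, path)
    visited = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == goal:
            return dist, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (dist + weight, neighbor, path + [neighbor]))
    return float("inf"), []                # goal is unreachable


# Hypothetical map: each city lists its neighbors and the distance to them.
cities = {
    "Mouse": {"A": 4, "B": 1},
    "A": {"Cheese": 3},
    "B": {"A": 2, "Cheese": 7},
    "Cheese": {},
}
print(shortest_path(cities, "Mouse", "Cheese"))
```

The priority queue always expands the closest unexplored city first, so the first time the cheese is popped its distance is the minimum. This relies on all distances being non-negative.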
  • 45. Elephant Scale – Big Data Training done right

Editor's notes

  6. Image credit : http://sparkinaction.com/ Image credit : http://shop.oreilly.com/
  9. Best is the enemy of the good. Most machine learning solutions are approximate and rely on the abundance of data.