8. Like Hadoop MapReduce, Spark has linear scalability and
fault tolerance for large data sets.
However, it adds the following extensions:
– A DAG of operations, instead of Map-then-Reduce
– Rich transformations to express solutions in a natural way
– RDDs – in-memory computation
Spark addresses the major bottleneck:
– Not CPU
– Not disk
– Not network
– But developer productivity
Why Spark?
11. Python
– Popular, well-known
– Many packages
– Graphing
R
– Very popular, well-known
– Very many packages
– Graphing
Why NOT Scala?
12. About Machine Learning
What is Machine Learning?
It is an algorithm that “learns” from data
– Any algorithm that improves its performance through access to data.
Machine Learning borrows from applied statistics
Also considered a branch of AI (Artificial Intelligence)
13. Sixties
– Commercial computers & mainframes
– Computers play chess
Eighties
– Computational complexity theory
– Artificial intelligence (AI) gets a bad rap
21st century
– Big Data changes it all
A glimpse of history
14. Computational complexity is simple:
P – all problems that can be solved fast
– (in polynomial time, like n^p, but not exponential)
– Example: system of linear equations
NP – all problems whose solutions can be verified fast
– That is, just check whether a proposed solution is correct
But folks, it does not matter!
P = NP?
15. “Big O” notation
Example of polynomial time: O(n^3)
Example of exponential time: O(2^n)
– How much is that?
– Compare to the number of particles in the universe ~ 10^80
– To reach that, our n only needs to be log2(10^80)
= 80 log2(10) ~ 80 * 3.3 ~ 266
There are also in-between cases, such as n^(log log n)
– But that is still bad enough
O(n) notation
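The arithmetic above can be checked in a few lines of Python (a quick sketch using exact integers; only the standard library):

```python
import math

# For exponential O(2^n), find the smallest n where 2^n exceeds
# the ~10^80 particles in the observable universe.
n = math.ceil(80 * math.log2(10))   # 80 * 3.32... -> 266
print(n)                            # 266
print(2**n > 10**80)                # True: 2^266 passes 10^80
print(2**(n - 1) > 10**80)          # False: 265 is not quite enough

# A polynomial O(n^3) at the same n is still a modest number:
print(n**3)                         # 18,821,096 operations
```

So an exponential algorithm outgrows the size of the universe before n reaches 300, while a cubic one is still trivially cheap there.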
17. Old thinking:
– If you can solve any problem (P = NP), you can be creative
New thinking:
– You don’t have to solve problems in order to be creative
– Instead, you can pick up the answer from the internet
– Examples:
• Google translate
• IBM Dr. Watson (Jeopardy winner)
• Lesson: re-use world’s data
New thinking:
– Rely on the abundance of data
– Find an approximate solution that is good enough
– “Bad algorithms trained on lots of data can outperform good ones
trained on very little.”
How Big Data changed it all
20. Types of Machine Learning
Supervised Machine Learning:
– A model is “trained” with human-labeled training data.
– The model is then tested on held-out labeled data to measure performance.
– The model can then be applied to unknown data.
– Classification & regression are usually supervised.
Unsupervised Machine Learning:
– The model tries to find natural patterns in the data.
– No human input except the parameters of the model.
– Example: Clustering
Semi-Supervised Learning:
– The model is trained with a training set that contains a mix of labeled
and unlabeled data.
21. Supervised Machine Learning
Input data is split into “training” and “test” data, both labeled.
A model is trained using the training data.
Predictions are made using model.predict().
The model can be tested by comparing predictions against the test dataset.
– Mean Squared Error: mean((predicted – actual)^2)
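A minimal sketch of this train/predict/score loop in plain Python (the data and the simple least-squares "model" are made up for illustration; in the labs this would be an MLlib model):

```python
# Minimal sketch of the supervised-learning loop: train on labeled
# data, predict, then score with Mean Squared Error.
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]   # (feature, label) pairs
test = [(4.0, 8.1), (5.0, 9.8)]

# "Train": fit label = slope * feature by least squares through the origin
slope = sum(x * y for x, y in train) / sum(x * x for x, _ in train)

def predict(x):
    return slope * x

# Mean Squared Error: the mean of the *squared* differences
errors = [(predict(x) - y) ** 2 for x, y in test]
mse = sum(errors) / len(errors)
print(round(slope, 3))   # 2.036
print(round(mse, 3))     # 0.073
```

Note the square in the error term: without it, positive and negative errors would cancel out and a badly wrong model could score near zero.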
23. Model Validation
Models need to be ‘verified’ / ‘validated’
Split the data set into
– Training set : build / train model
– Test set : validate the model
Start with 70% training, 30% validation
Then tweak the dials to decrease training and increase validation
Training set should represent data well-enough
[Diagram: data split into training and test sets; the training set builds the model]
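The 70/30 split can be sketched in a few lines of plain Python (MLlib DataFrames offer randomSplit for the same job; the records here are stand-ins):

```python
import random

# Sketch of a 70/30 train/test split: shuffle, then cut.
random.seed(42)                   # fixed seed for reproducibility

data = list(range(100))           # stand-in for 100 labeled records
random.shuffle(data)

cut = int(len(data) * 0.70)
training, test = data[:cut], data[cut:]

print(len(training), len(test))   # 70 30
```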
24. Creating Feature Vectors: Feature Extraction
Machine Learning only works with vectors. A feature vector
is an n-dimensional point in space.
– Select variables from the data
– Turn the data into numbers (doubles).
– “Normalize” (scale down) high-magnitude data.
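The normalize step can be sketched as min-max scaling into [0, 1] (one common choice among several; the values below are made up):

```python
# Sketch of min-max normalization: scale each value into [0, 1]
# so high-magnitude features don't dominate the distance metrics.
raw = [120000.0, 250000.0, 180000.0, 95000.0]   # e.g. account balances

lo, hi = min(raw), max(raw)
scaled = [(x - lo) / (hi - lo) for x in raw]
print(scaled)   # smallest value -> 0.0, largest -> 1.0
```

An alternative is standardization (subtract the mean, divide by the standard deviation), which MLlib also supports.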
25. Vectors: Dense versus Sparse
Dense Vectors
– Usually have a nonzero value for each variable
– The “telecom churn” dataset we use in the labs is a dense dataset.
– Use Vectors.dense
Sparse Vectors
– Most values are zero (or nonexistent)
– Text data yields sparse vectors
– One-hot encoded factor variables also lead to sparse vectors
– Use Vectors.sparse
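The two representations can be sketched in plain Python; MLlib's Vectors.sparse(size, indices, values) follows the same layout, storing only the size and the nonzero entries:

```python
# Dense: store every value, including the zeros.
dense = [1.0, 0.0, 0.0, 3.0, 0.0]

# Sparse: store the size plus only the nonzero (index, value) pairs.
size = 5
indices = [0, 3]
values = [1.0, 3.0]

def to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple back to dense form."""
    out = [0.0] * size
    for i, v in zip(indices, values):
        out[i] = v
    return out

print(to_dense(size, indices, values) == dense)   # True
```

With thousands of mostly-zero text dimensions, the sparse form saves both memory and compute.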
26. Creating Vectors From Text
How to create vectors from text?
– TF/IDF: Term Frequency Inverse Document Frequency
• Essentially, a term’s frequency within a document, weighted down
by how common the term is across the larger group of documents
(the “corpus”)
• Each word in the corpus then becomes a “dimension” – you could have
thousands of dimensions.
– Word2Vec
• Another vectorization algorithm
• Uses neural network
• Borders on deep learning
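TF-IDF can be computed by hand in a few lines (a sketch on a tiny made-up corpus; MLlib provides HashingTF and IDF for doing this at scale):

```python
import math

# Sketch of TF-IDF: each distinct word in the corpus becomes one
# dimension of the resulting vector.
corpus = [
    "spark makes big data simple",
    "big data needs big tools",
    "spark runs in memory",
]
docs = [doc.split() for doc in corpus]
vocab = sorted({w for doc in docs for w in doc})

def tf_idf(word, doc):
    tf = doc.count(word) / len(doc)           # term frequency in this doc
    df = sum(1 for d in docs if word in d)    # how many docs contain it
    idf = math.log(len(docs) / df)            # rarer word -> bigger weight
    return tf * idf

# Vector for the first document: one weight per vocabulary word
vec = [tf_idf(w, docs[0]) for w in vocab]
print(len(vec))   # one dimension per distinct word in the corpus
```

Words that appear in every document get idf = log(1) = 0, so ubiquitous filler words contribute nothing to the vector.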
37. Practical use case for SVM
(c) ElephantScale.com 2016. All rights reserved
38. History of logistic regression
Invented by (Sir) David Cox, UK
Who wrote 364 books and papers
Best known for
– Proportional hazards model
– Used in analysis of survival data
– Medical research (cancer)
40. Where Naïve Bayes fits in
There are many classification algorithms in the world
The Naïve Bayes Classifier (NBC) is one of the simplest, yet most
effective
K-means and K-nearest neighbors are for numeric data
But for
– Names
– Symbols
– Emails
– Texts
NBC may be the best choice
Naïve Bayes can do multiclass (and not only binary) classification
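A toy Naïve Bayes text classifier can be sketched in plain Python (the spam/ham data is made up; MLlib ships its own NaiveBayes for real use):

```python
import math
from collections import Counter

# Sketch of a Naive Bayes text classifier with Laplace smoothing.
training = [
    ("spam", "win money now"),
    ("spam", "win a free prize now"),
    ("ham", "meeting at noon"),
    ("ham", "lunch meeting tomorrow"),
]

# Count words per class and documents per class
word_counts = {"spam": Counter(), "ham": Counter()}
doc_counts = Counter()
for label, text in training:
    doc_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for c in word_counts.values() for w in c}

def classify(text):
    def log_score(label):
        # log P(label) + sum of log P(word | label), Laplace-smoothed
        total = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / len(training))
        for w in text.split():
            score += math.log(
                (word_counts[label][w] + 1) / (total + len(vocab)))
        return score
    return max(word_counts, key=log_score)

print(classify("free money"))     # spam
print(classify("noon meeting"))   # ham
```

The "naïve" part is the independence assumption: each word's probability is multiplied in as if the words had nothing to do with each other, which is wrong for language yet works surprisingly well in practice.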
[Figure: a dataset that is a good candidate for Naïve Bayes
(Credit: Sebastian Raschka)]
41. History of Bayes
Discovered by the Reverend Thomas Bayes (1701–1761)
Edited and read at the Royal Society by Richard Price (1763)
Independently reproduced and extended by Laplace (1774)
Naïve Bayes classifiers studied in the 1950s
42. Clustering use case
Anomaly detection
– Find fraud
– Detect network intrusion attack
– Discover problems on servers
– Or on any machinery with sensors
Clustering does not necessarily detect fraud
– But it points to unusual data
– And to the need for further investigation
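The clustering-based anomaly idea can be sketched in a few lines of plain Python (the cluster centers, points, and distance threshold below are all made up; in the labs the centers would come from MLlib's KMeans):

```python
import math

# Sketch of clustering-based anomaly detection: points far from
# every cluster center get flagged for further investigation.
centers = [(0.0, 0.0), (10.0, 10.0)]            # pretend k-means output
points = [(0.5, 0.2), (9.8, 10.1), (5.0, 5.0)]  # incoming observations
threshold = 3.0                                  # "unusually far" cutoff

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

anomalies = [p for p in points
             if min(dist(p, c) for c in centers) > threshold]
print(anomalies)   # [(5.0, 5.0)] -- far from both clusters
```

As the slide says, a flagged point is not proof of fraud or intrusion, only a candidate worth a human look.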
43. Network intrusion
Known unknowns
– Port scanning
– Number of ports accessed per second
– Number of bytes sent/received
But what about unknown unknowns?
– The biggest threat
– New and as-yet-unclassified attacks
– Connections that are not known as attacks
– But are out of the ordinary
– Anomalies that fall outside the clusters