This presentation provides an overview of the past year's advancements in the HPCC Systems Machine Learning modules, including Clustering, Natural Language Processing, Deep Learning, and the expanded Model Evaluation Metrics.
Clustering Methods of the HPCC Systems Machine Learning Library
Clustering is an important part of unsupervised learning. To add unsupervised learning capability, two widely used clustering methods, KMeans and DBSCAN, have been added to the current Machine Learning library. This presentation introduces the newly developed clustering algorithms and their evaluation methods.
Advancements in HPCC Systems Machine Learning
1. 2019 HPCC Systems® Community Day
Challenge Yourself – Challenge the Status Quo
Lili Xu, Software Engineer III, HPCC Systems, LexisNexis Risk
Roger Dev, Sr. Architect, Machine Learning Library
3. Overview
• Theme: Expand the ML library to handle multimedia and unsupervised learning
• Extended set of model evaluation metrics
• Text Vectors – Machine Learning for textual data
• Generalized Neural Networks (GNN) – ECL Deep Learning for Image, Video, and more
• Unsupervised Clustering
• K-Means – Centroid-based clustering
• DBSCAN – Density-based clustering
4. HPCC Systems Machine Learning Library
• BASE: ML_Core, PBblas
• SUPERVISED LEARNING: LinearRegression, LogisticRegression, SVM, GLM, LearningTrees, TextVectors, Deep Learning (GNN)
• UNSUPERVISED LEARNING: K-Means, DBSCAN
8. Evaluation Metrics
• Extensions to ML_Core to better evaluate ML models or compare alternative models.
• Developed by our intern: A. Suryanarayanan (“Surya”)
• Enhanced ML_Core Accuracy module:
• Regression Accuracy
• Standard Error
• ANOVA (Analysis of Variance)
• T-Statistic
• P-value
• Confidence Interval
• R-Squared
• Root Mean Squared Error
• Akaike Information Criterion (AIC)
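To make a few of the regression metrics concrete, here is an illustrative Python sketch (not the ECL API) of how RMSE, R-Squared, and AIC can be computed from predictions; the AIC shown is the common least-squares form, and the ML_Core implementation may differ in details.

```python
import math

def regression_metrics(y_true, y_pred, n_params):
    """A few of the listed regression metrics, computed from predictions."""
    n = len(y_true)
    residuals = [yt - yp for yt, yp in zip(y_true, y_pred)]
    sse = sum(r * r for r in residuals)              # sum of squared errors
    mean_y = sum(y_true) / n
    sst = sum((yt - mean_y) ** 2 for yt in y_true)   # total sum of squares
    rmse = math.sqrt(sse / n)                        # Root Mean Squared Error
    r2 = 1 - sse / sst                               # R-Squared
    aic = n * math.log(sse / n) + 2 * n_params       # least-squares AIC variant
    return rmse, r2, aic

rmse, r2, aic = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8], n_params=2)
```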
9. Evaluation Metrics (cont’d)
• Enhanced ML_Core Accuracy module (cont’d)
• Classification Accuracy
• Raw Accuracy
• Power-of-Discrimination (PoD)
• Extended Power-of-Discrimination (PoDE)
• Confusion Matrix
• Precision, Recall, False-Positive-Rate
• Balanced F-Score – Combines Precision and Recall into a score for each class
• Hamming Loss
• Area Under Curve (AUC)
• Clustering Accuracy
• Silhouette Coefficient
• Adjusted Rand Index
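As an illustration of how confusion-matrix counts yield the per-class scores above, here is a small Python sketch (not the ML_Core API); `tp`, `fp`, and `fn` are hypothetical true-positive, false-positive, and false-negative counts for one class.

```python
def prf_from_confusion(tp, fp, fn):
    """Per-class Precision, Recall, and balanced F-score (F1) from
    confusion-matrix counts for a single class."""
    precision = tp / (tp + fp)        # of predicted positives, how many were right
    recall = tp / (tp + fn)           # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return precision, recall, f1

p, r, f1 = prf_from_confusion(tp=8, fp=2, fn=4)
```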
10. Other
• New Feature Selection Module
• Chi Squared Feature Selection Test
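To illustrate the idea behind chi-squared feature selection, here is a minimal Python sketch (not the ML_Core API) that computes the chi-squared statistic for a feature-vs-class contingency table; features with a larger statistic are more strongly associated with the class label.

```python
def chi2_statistic(observed):
    """Chi-squared statistic of a contingency table (rows: feature values,
    columns: class labels). Larger values suggest a stronger association."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat

# 2x2 table: feature present/absent (rows) vs. class A/B (columns)
stat = chi2_statistic([[20, 10], [10, 20]])
```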
11. For more details
• Module Documentation:
ML_Core.Accuracy
ML_Core.FeatureSelection
• Research Publication:
Design and Implementation of Machine Learning Evaluation Metrics on HPCC Systems
A. Suryanarayanan, Arjuna Chala, Lili Xu, Shobha G, Jyothi Shetty, Roger Dev
12. Text Vectors – Machine Learning for Free-form Text
13. Introduction to Text Vectors
• Fully unsupervised learning – give it a Corpus (a large body of text) and it will learn on its own.
• Converts free-form text into numeric vectors to allow mathematical treatment of text.
• Word Vectors
• Sentence Vectors
• Vector: an ordered list of numbers – a coordinate in N-dimensional space
• (11.3, -2.5) – a two-dimensional vector
• (-.0138, .5247, .9831) – a three-dimensional vector
• (.1, -.3, -.1, … , .5) – an N-dimensional vector
• Text Vectors typically have between 20 and 1000 dimensions.
• Vectors that are close in space are also close in meaning.
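As an illustration of "close in space means close in meaning": cosine similarity is a common way to measure how close two vectors are. The sketch below uses made-up toy vectors, not real Text Vectors output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up toy 3-dimensional "word vectors" (purely illustrative values)
king, queen, apple = [0.9, 0.8, 0.1], [0.85, 0.82, 0.15], [0.1, 0.2, 0.9]
assert cosine_similarity(king, queen) > cosine_similarity(king, apple)
```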
14. Text Vectorization – The theory
"You shall know a word by the company it keeps."
– Linguist John Rupert Firth, 1957
Or more rigorously:
"The meaning of a word is closely associated with the distribution of the words that surround it in coherent text."
16. Applications
• Analysis of Free-form Text
• Turn text into features for any ML algorithm
• Classification of text (e.g. Positive, Negative, Neutral)
• Find closest sentence to a new sentence
• Free-form Search
• Translation
• Textual Mining
• Many more undiscovered uses
17. For more details
• Theory and Tutorial:
Text Vectors – Machine Learning for Textual Data
Link: https://hpccsystems.com/blog/textvectors
19. Introduction to GNN
• Flexible ECL interface to Keras / Tensorflow.
• Google’s Tensorflow is the most widely used Deep Learning framework.
• Keras is a high-level interface to Tensorflow (and other frameworks) and is the most widely used DL interface. It is included as a standard component of Tensorflow.
• Parallelized training using Batch Synchronous Network Optimization.
• Provides full access to Keras Sequential Model capabilities.
• Can handle nearly any style of Neural Network:
• Classical (densely connected)
• Convolutional (commonly used for image processing)
• Recurrent (used for video / time-series)
• Auto-encoders (unsupervised training of weight vectors)
• ECL Tensor module allows N-dimensional datasets.
20. Tensors
• Think of Tensors as N-dimensional arrays or matrices.
• A single number is a 0-dimensional Tensor.
• A vector is a 1-dimensional Tensor.
• A matrix is a 2-dimensional Tensor.
• For traditional ML, 2 dimensions work well.
• Example: nObservations X nFeatures
• For multimedia ML (e.g. images, video, time-series) more dimensions are required. E.g.:
• For color images: nObservations X pixel-width X pixel-height X 3 (i.e. Red/Green/Blue)
• For video: nObservations X pixel-width X pixel-height X 3 X time-steps
• The GNN Tensor module provides efficient storage and distribution for Tensor-based data of any dimension.
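One common way to store an N-dimensional Tensor efficiently is as a flat 1-D buffer with row-major indexing. The sketch below is a generic illustration of that idea, not the actual GNN Tensor layout.

```python
def flat_index(shape, idx):
    """Row-major offset of an N-dimensional index into a flat 1-D buffer."""
    offset = 0
    for dim, i in zip(shape, idx):
        assert 0 <= i < dim            # index must lie inside the dimension
        offset = offset * dim + i
    return offset

# A color-image Tensor: nObservations X pixel-width X pixel-height X 3
shape = (100, 28, 28, 3)
offset = flat_index(shape, (0, 0, 0, 2))   # Blue channel of the first pixel
```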
21. GNNI
• The GNNI module provides an easy-to-use interface for defining, training, and utilizing Neural Networks.
• It handles all of the parallelization and distribution of data transparently.
• Under the hood, a separate Keras / Tensorflow network is trained on each node, and the resulting weights are combined periodically.
• Neural Networks and their training mechanism are defined using the same Python syntax as native Keras.
• In native (Python) Keras, you create a Sequential Model and add layers one at a time.
• In ECL, you create a list of layers (as Python text) and call DefineModel(…) with that list.
• Input to training and prediction is via Tensors. Tensors are also used to get and set weights.
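A simple way to picture the periodic combining of per-node weights is averaging the corresponding weight values, as in this Python sketch; GNN's actual Batch Synchronous combining strategy may be more sophisticated than a plain mean.

```python
def combine_weights(per_node_weights):
    """Average corresponding weight values trained independently on each node
    (a simple sketch of periodically combining per-node model weights)."""
    n_nodes = len(per_node_weights)
    return [sum(ws) / n_nodes for ws in zip(*per_node_weights)]

# Three nodes, each holding a 2-element weight vector after a local training step
combined = combine_weights([[0.1, 0.9], [0.3, 0.7], [0.2, 0.8]])
```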
22. Future Directions
• Non-sequential Models (complex hybrid deep learning)
• Support for textual data
• Generative Adversarial Networks (GANs) and their derivatives
23. Applications
• Machine Learning for Images, Video, or Time-series
• Scoring (i.e. Regression)
• Classification
• Multivariate Optimization
• Auto-encoders
• Vectorization
• Many others TBD
24. For more details
• The Bundle will be released soon.
• Look for the blog article announcing the release on hpccsystems.com >> Community >> Blog
26. Clustering Methods in HPCC Systems: KMeans & DBSCAN
• Unsupervised Machine Learning (ML) algorithms
• Automatically find the clusters/groups in the data without prior knowledge
• Highly scalable, parallelized for Big Data machine learning challenges
27. Applications
• Customer segmentation
• Claims
• Image segmentation
• Clustering gene expressions (Eisen et al., PNAS 1998)
28. KMeans vs. DBSCAN: KMEANS (K = 3, Tolerance = 0.0)
• Most popular clustering method
• Highly scalable, parallelized
• Parameters: K, Tolerance
• Sensitive to initialization
• Spherical clusters
• Sensitive to outliers
• Curse of dimensionality
29. KMeans vs. DBSCAN: DBSCAN
• Density-based clustering method
• Highly scalable, parallelized
• Parameters: epsilon, minPoints
• Sensitive to initialization
• Arbitrarily shaped clusters
• Outlier detection
• Sensitive to density variance
• Curse of dimensionality
30. KMeans vs. DBSCAN – comparison criteria:
• Cluster shape
• Cluster size
• Model parameters
• Number of clusters (fixed vs. variable)
• Outlier detection
• Curse of dimensionality
31. Application Domains
• Recommendation systems
• Clustering demographic/geospatial data
32. Easy to use
Step 1 (Required): Import the K-Means bundle
IMPORT KMeans AS KM;
Step 2 (Required): Train the K-Means model
Model := KM.KMeans(Max_iterations, Tolerance).Fit(Samples, InitialCentroids);
Step 3 (Optional): Predict the cluster index of the new samples
Labels := KM.KMeans().Predict(Model, NewSamples);
33. Easy to use – Cont.
Step 4 (Optional): Visualization
ECL Cloud IDE: KMeans Visualization
34. For more details
• Tutorial:
Automatically Cluster your Data with Massively Scalable K-Means
Link: https://hpccsystems.com/blog/kmeans
• Research Publication:
Massively Scalable Parallel KMeans on the HPCC Systems Platform
Lili Xu, Amy Apon, Flavio Villanustre, Roger Dev, Arjuna Chala
35. Reference
ECL-ML module: https://hpccsystems.com/ml
Download: https://hpccsystems.com/download/free-modules/machine-learning-library
Source code: https://github.com/hpcc-systems
Forum: http://hpccsystems.com/bb/viewforum.php?f=23
Contact us:
Lili Xu, Software Engineer III, HPCC Systems – Lili.xu@lexisnexisrisk.com
Roger Dev, Sr. Architect, Machine Learning Library – roger.dev@lexisnexisrisk.com
View this presentation on YouTube:
https://www.youtube.com/watch?v=Z1A3nOuhv3A&list=PL-8MJMUpp8IKH5-d56az56t52YccleX5h&index=11&t=43s (12:32)
Editor's notes
The last step is optional: if you want to predict other samples' group membership, you can simply take the model you already built in the previous step and feed it, together with the new sample set, into the Predict function. The result gives the group label of each new sample.
Now that you understand how KMeans works, let's take a look at how to use this simple but powerful tool to cluster your data on HPCC Systems.
It's very easy – only three steps.
The steps below assume that you have already downloaded and installed the KMeans bundle on your machine. You can go to our website for more details if you have questions.
With the KMeans bundle installed:
Step 1: simply import the KMeans bundle in your ECL code.
Step 2: train your model. Copy this code into your ECL code and then change the content in the parentheses, which we call the model parameters. Let me explain each parameter. Parameters such as the sample set and the centroid set are easy to understand: the sample set holds the data points you want to find groups in, and the centroid set holds the centroids you want to initialize with. Another parameter is max_iterations, which defines the maximum number of iterations the model can run. The last one is the tolerance, which defines the