Visualization of Enhanced Spark Induced Naive Bayes Classifier with Barry Becker


This document discusses visualizing a naive Bayes classifier in Spark. It begins with an overview of naive Bayes classifiers, which are fast and easy to visualize but assume independent features. It then uses a mushroom dataset to classify mushrooms as edible or poisonous, and explains and demonstrates Bayes' rule. It notes the limitations of Spark's naive Bayes and shows how to handle other data types such as strings, nulls, and continuous values, using techniques like MDLP discretization. It concludes with a demo of these techniques and discusses how visualization gives insight into the model.


Barry Becker, ESI-Group
VISUALIZATION OF ENHANCED SPARK INDUCED NAIVE BAYES CLASSIFIER

Overview
• What the Naïve Bayes Classifier is
• How it can be visualized, and why you would want to
• Limitations of the version in Spark, and how to get around them
• Demo and use cases

What is a Naïve Bayes Classifier?
• A probabilistic model that can be used to predict a categorical value (the target)
• Induced using training data and evaluated with test data
[Diagram: Training Data → Inducer → Classifier; New Data → Classifier → Predictions]

What is a Naïve Bayes Classifier?
• Advantages
– Fast and simple
– Easy to visualize
• Disadvantage
– Assumes features are independent of one another given the class

Use Case: Detecting Poisonous Mushrooms
An expert determines edibility based on cap shape, odor, gill spacing, stalk root, veil type, veil color, ring type, spore print color, habitat, gill size, stalk shape, and many more, for 8,124 cases
https://www.pinterest.com/pin/239253798926772284

Bayes’ Rule
P(Edible | X) = P(X | Edible) P(Edible) / P(X)
Where X is
Odor = none and
Spore print color = chocolate and
Ring type = evanescent
Bayes Rule Demo
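
To make the rule concrete for the X on this slide: under the naive assumption that features are conditionally independent given the class, the likelihood factors into one term per feature, and the denominator is the sum over both classes. This is the standard naive Bayes expansion, written out here for the three features named above.

```latex
\begin{align*}
P(X \mid \text{Edible}) &= P(\text{Odor}=\text{none} \mid \text{Edible})
  \, P(\text{Spore print color}=\text{chocolate} \mid \text{Edible})
  \, P(\text{Ring type}=\text{evanescent} \mid \text{Edible}) \\
P(X) &= P(X \mid \text{Edible})\,P(\text{Edible}) + P(X \mid \text{Poisonous})\,P(\text{Poisonous})
\end{align*}
```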

Understand model with Visualization
Everything you see is based on counts

How the Model Makes Predictions
Multiplies conditional probabilities to come up with a posterior probability
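
A minimal sketch of how counts turn into a posterior, in plain Python rather than Spark. The six mushroom rows, the feature names, and the Laplace smoothing constant `alpha` are illustrative assumptions, not data or settings from the talk.

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Count class frequencies and per-class value frequencies for each feature."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature, class) -> Counter of values
    for row, label in zip(rows, labels):
        for feature, value in row.items():
            value_counts[(feature, label)][value] += 1
    return class_counts, value_counts

def posterior(row, class_counts, value_counts, n_values, alpha=1.0):
    """Multiply the prior by one smoothed conditional per feature, then normalize."""
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior P(class)
        for feature, value in row.items():
            count = value_counts[(feature, c)][value]
            # Laplace smoothing (an assumption here) avoids zero probabilities.
            score *= (count + alpha) / (n_c + alpha * n_values[feature])
        scores[c] = score
    evidence = sum(scores.values())  # P(X), summed over all classes
    return {c: s / evidence for c, s in scores.items()}

# Hypothetical toy data, not the real 8,124-case mushroom dataset.
rows = [
    {"odor": "none", "ring_type": "pendant"},
    {"odor": "none", "ring_type": "pendant"},
    {"odor": "none", "ring_type": "evanescent"},
    {"odor": "foul", "ring_type": "evanescent"},
    {"odor": "foul", "ring_type": "pendant"},
    {"odor": "none", "ring_type": "evanescent"},
]
labels = ["edible", "edible", "edible", "poisonous", "poisonous", "poisonous"]

n_values = {f: len({r[f] for r in rows}) for f in rows[0]}
class_counts, value_counts = train(rows, labels)
post = posterior({"odor": "foul", "ring_type": "evanescent"},
                 class_counts, value_counts, n_values)
print(post)
```

With many features the product of probabilities can underflow, so a real implementation would typically add log probabilities instead of multiplying.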

Limitations of Spark Naïve Bayes
Out-of-the-box version is only for document classification!
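Document Classification Approach
• Each feature represents a word
• For each feature, determine frequency for each class
More General Approach
• Each feature is a column
• For each value of each feature, determine class distribution
[Figure: document classification example. The class distributions (spam / not spam) for the words "loan", "viagra", and "zebra" are multiplied together; prediction: Not Spam.]
[Figure: general example. The class distributions (edible / poisonous) for Ring Type = evanescent, Odor = none, and SP Color = chocolate are multiplied together; prediction: Poisonous.]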

Limitations of Spark Naïve Bayes
All columns must be non-negative integers
– But what if there are nulls?
– But what if there are strings?
– But what if there are floating point values?
– But what if there are dates?
– What if target is continuous?

How to Handle String Columns?
• First, replace nulls with a special null value
• Use StringIndexers to map string values to integer indices
• Lastly, use the IndexToString transformer to restore the predicted value back to a string
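
The three steps above can be sketched in plain Python. This imitates the idea behind StringIndexer and IndexToString and does not call the Spark API; the `__NULL__` token, the helper names, and the sample values are hypothetical. Giving the most frequent label index 0 mirrors the frequency-descending order that StringIndexer uses by default.

```python
from collections import Counter

NULL_TOKEN = "__NULL__"  # hypothetical placeholder for missing values

def fill_nulls(values, token=NULL_TOKEN):
    """Step 1: replace None with a special null value."""
    return [token if v is None else v for v in values]

def fit_string_index(values):
    """Step 2: map each label to an integer, most frequent first.

    Ties are broken alphabetically so the mapping is deterministic.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: i for i, label in enumerate(ordered)}

def invert_index(index):
    """Step 3: build the reverse mapping to turn indices back into strings."""
    return {i: label for label, i in index.items()}

# Hypothetical odor column with missing entries.
odor = ["none", None, "foul", "none", "almond", None, "none"]

filled = fill_nulls(odor)
index = fit_string_index(filled)
encoded = [index[v] for v in filled]
reverse = invert_index(index)
decoded = [reverse[i] for i in encoded]
print(index)
print(encoded)
```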

How to Handle Continuous Columns?
• Use MDLP discretization to intelligently bin continuous (number or date) columns with respect to the target
– Entropy-based binning makes the class distributions of adjacent bins as different as possible
• Replace nulls with NaN so they can be placed in a separate bin

Example: Job Type from Application

MDLP applied to a continuous column
MDLP Binning
30 Uniform Bins
HistogramView

MDLP applied to a continuous column
MDLP Binning
30 Uniform Bins
SpinogramView

MDLP Binning of Continuous Features
Splits seek to maximize information gain, as measured by entropy
Recursively split bins until
– Information gain is too small
– Bins are getting too small (avoids overfitting)
Open Source
https://github.com/sramirez/spark-MDLP-discretization
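
A minimal single-machine sketch of the recursive splitting described above, assuming the Fayyad and Irani MDL test as the "information gain is too small" rule and a hypothetical `min_bin_size` parameter for the "bins are getting too small" rule. It is not the code from the linked library, and NaN values would need to be set aside into their own bin before calling it.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(pairs, min_bin_size):
    """Return (gain, i) for the boundary with the highest information gain.

    pairs is sorted by value; a split at i puts pairs[:i] on the left.
    Returns None if no boundary respects min_bin_size.
    """
    labels = [lab for _, lab in pairs]
    n = len(pairs)
    base = entropy(labels)
    best = None
    for i in range(min_bin_size, n - min_bin_size + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between two equal values
        left, right = labels[:i], labels[i:]
        gain = base - (i / n) * entropy(left) - ((n - i) / n) * entropy(right)
        if best is None or gain > best[0]:
            best = (gain, i)
    return best

def mdlp_accepts(pairs, gain, i):
    """MDL stopping test: accept the split only if the gain pays for its cost."""
    labels = [lab for _, lab in pairs]
    left, right = labels[:i], labels[i:]
    n = len(labels)
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (
        k * entropy(labels) - k1 * entropy(left) - k2 * entropy(right))
    return gain > (math.log2(n - 1) + delta) / n

def mdlp_cut_points(pairs, min_bin_size=2):
    """Recursively split (value, label) pairs and return sorted cut points."""
    pairs = sorted(pairs)
    found = best_split(pairs, min_bin_size)
    if found is None:
        return []
    gain, i = found
    if not mdlp_accepts(pairs, gain, i):
        return []
    cut = (pairs[i - 1][0] + pairs[i][0]) / 2
    return (mdlp_cut_points(pairs[:i], min_bin_size) + [cut]
            + mdlp_cut_points(pairs[i:], min_bin_size))

# Hypothetical data: the label changes cleanly between 10 and 11.
separable = [(v, "a" if v <= 10 else "b") for v in range(1, 21)]
# Hypothetical data: the label alternates, so no cut is informative.
alternating = [(v, "a" if v % 2 else "b") for v in range(1, 9)]

print(mdlp_cut_points(separable))
print(mdlp_cut_points(alternating))
```

On the separable data the sketch finds one cut between the two classes and then stops, because neither side can be improved; on the alternating data it produces no cuts at all.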

What if the Target Is Continuous?
Automatically bin a continuous target into 3 equal-weight bins
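
One way to read "equal weight" is equal frequency: each bin receives roughly the same number of rows. The sketch below takes that interpretation as an assumption; the function names and the sample target values are hypothetical.

```python
from bisect import bisect_right
from collections import Counter

def equal_weight_cuts(values, n_bins=3):
    """Pick cut points so each bin holds about len(values) / n_bins rows."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * k) // n_bins] for k in range(1, n_bins)]

def assign_bin(value, cuts):
    """Return the 0-based bin index; a value equal to a cut goes to the upper bin."""
    return bisect_right(cuts, value)

# Hypothetical continuous target values.
target = [5, 1, 9, 3, 7, 2, 8, 4, 6]

cuts = equal_weight_cuts(target)
bins = [assign_bin(v, cuts) for v in target]
print(cuts, Counter(bins))
```

Heavily repeated values can make the bins unequal, since all rows with the same value land in the same bin.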

Mineset Demo

Pixel-Perfect Rendering
Anti-aliased rendering
Pixel-perfect rendering

Conclusion
• The Spark Naïve Bayes classifier can be applied to any data, as long as some modifications are made
• MDLP is useful for binning continuous features
• Visualization of the model provides insight, trust, and the ability to answer what-if questions

https://cloud.esi-group.com/analytics
Barry Becker - bbe@esi-group.com

