Cluster Analysis
Using RapidMiner and SAS 9.3
Agenda
• The Data
• Some preliminary treatments
• Checking for outliers
• Manual outlier checking for a given confidence level
• Filtering outliers
• Data without outliers
• Selecting attributes for clusters
• Setting up clusters
• Reading the clusters
• Using SAS for clustering
• Dendrogram
• Depicting Tree using SAS
• Conclusion
The Data
• Number of observations: 97
• 3 numeric variables:
o Birth rate per thousand
o Death rate per thousand
o Infant mortality rate per thousand
• 1 polynominal (nominal) variable: Country
• Data obtained from the UN Demographic Yearbook 1990
Some preliminary treatments
• Checking for outliers using RapidMiner
• Manual checking for outliers at a given confidence level
• For Birth (95%, i.e. roughly mean ± 2 standard deviations):
o mu - 2(sigma) = 27.384 - 2(12.978) = 1.428
o mu + 2(sigma) = 27.384 + 2(12.978) = 53.34
• Hence, no outliers in Birth at this level
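The manual check above can be reproduced with a few lines of Python. This is only a sketch: the function names `two_sigma_bounds` and `flag_outliers` are illustrative, and the sample values passed to `flag_outliers` are made up, not taken from the data set. Only the mean 27.384 and standard deviation 12.978 come from the slide.

```python
def two_sigma_bounds(mean, sd, k=2.0):
    """Return (lower, upper) = (mean - k*sd, mean + k*sd)."""
    return mean - k * sd, mean + k * sd


def flag_outliers(values, mean, sd, k=2.0):
    """Return the values that fall outside the mean +/- k*sd interval."""
    lower, upper = two_sigma_bounds(mean, sd, k)
    return [v for v in values if v < lower or v > upper]


# Mean and standard deviation of Birth as quoted on the slide.
lower, upper = two_sigma_bounds(27.384, 12.978)
print(round(lower, 3), round(upper, 3))

# Hypothetical values, only to show how the rule is applied.
print(flag_outliers([10.0, 60.0, 1.0], 27.384, 12.978))
```

A value counts as an outlier only if it lies below the lower bound or above the upper bound, so a variable whose minimum and maximum both sit inside the interval has none.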
• Filtering outliers
o 10 outliers recorded
• Data without outliers
o Filter examples
o Parameter string: outlier=true
o Invert filter
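The flag-then-filter step above can be sketched in plain Python. The slides do not say which RapidMiner outlier operator was used, so the distance-based rule below (flag the rows farthest from their k-th nearest neighbour) is an assumption, and the rows are made-up toy values rather than the real data. `drop_outliers` mirrors the inverted `outlier=true` filter: it keeps the rows that are not flagged.

```python
import math


def knn_outlier_flags(rows, k=2, n_outliers=1):
    """Flag the n_outliers rows with the largest distance to their k-th nearest neighbour."""
    scores = []
    for i, a in enumerate(rows):
        dists = sorted(math.dist(a, b) for j, b in enumerate(rows) if j != i)
        scores.append(dists[k - 1])
    ranked = sorted(range(len(rows)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:n_outliers])
    return [i in flagged for i in range(len(rows))]


def drop_outliers(rows, flags):
    """Keep only the rows whose outlier flag is False (the inverted filter)."""
    return [row for row, is_outlier in zip(rows, flags) if not is_outlier]


# Hypothetical (birth, death, infant death) rows; the last one is far from the rest.
toy_rows = [
    (10.0, 8.0, 9.0),
    (11.0, 9.0, 10.0),
    (12.0, 8.5, 9.5),
    (10.5, 9.5, 10.5),
    (48.0, 25.0, 180.0),
]
flags = knn_outlier_flags(toy_rows, k=2, n_outliers=1)
clean_rows = drop_outliers(toy_rows, flags)
print(flags, len(clean_rows))
```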
• Selecting attributes for clusters
o Clustering on polynominal (nominal) variables makes no sense, because the distance between cluster members is computed on numeric values
o Remove Country from attribute list
• Setting up clusters
o K=3
o Join both nodes to get cluster model information
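The two steps above (drop Country, then cluster with K=3) can be sketched as follows. The slides only state K=3, so treating the operator as k-means is an assumption; the country names and values are invented toy data, and the evenly spaced starting centroids are chosen just to keep the sketch deterministic.

```python
import math


def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    if len(points) < k:
        raise ValueError("need at least k points")
    # Deterministic start: take k points spread evenly through the input order.
    step = len(points) // k
    centroids = [tuple(float(x) for x in points[i * step]) for i in range(k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical records: (country, birth, death, infant death).
records = [
    ("A", 11, 9, 8), ("B", 12, 10, 9),
    ("C", 28, 12, 40), ("D", 30, 13, 45),
    ("E", 47, 20, 120), ("F", 45, 18, 110),
]
# Remove the nominal Country attribute before clustering.
points = [tuple(r[1:]) for r in records]
labels, centroids = kmeans(points, k=3)
print(labels)
```

Cluster numbers produced this way are arbitrary labels, which is why the ordering 1, 2, 0 on the next slide carries no meaning by itself.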
Reading the Clusters
• Cluster 1: Low values of each numeric variable
• Cluster 2: High values of each numeric variable
• Cluster 0: Moderate values of each numeric variable
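One way to arrive at a low / moderate / high reading like the one above is to compare the per-cluster means of each variable. The sketch below assumes cluster labels are already available; the points and labels are toy values arranged to mimic the pattern on the slide, not the actual results, and ranking by the sum of the centroid coordinates is a simplification.

```python
from collections import defaultdict


def cluster_means(points, labels):
    """Return {label: tuple of per-variable means} for each cluster."""
    groups = defaultdict(list)
    for p, lab in zip(points, labels):
        groups[lab].append(p)
    return {
        lab: tuple(sum(col) / len(rows) for col in zip(*rows))
        for lab, rows in groups.items()
    }


def rank_clusters(means):
    """Order cluster labels from lowest to highest overall centroid level."""
    return sorted(means, key=lambda lab: sum(means[lab]))


# Hypothetical (birth, death, infant death) rows and their cluster labels.
points = [(11, 9, 8), (12, 10, 9), (28, 12, 40), (30, 13, 45), (47, 20, 120), (45, 18, 110)]
labels = [1, 1, 0, 0, 2, 2]

means = cluster_means(points, labels)
print(rank_clusters(means))  # low, moderate, high
```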
Reading the Clusters
• Scatter plot of Birth and Death against Infant Death Rate
• Point size: Infant Death Rate
Using SAS for clustering
• Using canonical variables for standardization of variables to mean 0 and standard deviation 1
• Spherical within-cluster covariance matrix

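/* Canonical variables with a spherical within-cluster covariance estimate */
proc aceclus data=Poverty out=Ace p=.03 noprint;
   var Birth Death InfantDeath;
run;

/* Ward's method on the canonical variables; the tree is saved for plotting */
proc cluster data=Ace outtree=Tree method=ward
             ccc pseudo print=15;
   var can1 can2 can3;
   id Country;
run;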
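To show what method=ward does conceptually, here is a minimal Python sketch of Ward's agglomerative rule: at each step, merge the two clusters whose union increases the within-cluster sum of squares the least. It is not a reimplementation of PROC CLUSTER (no ACECLUS transform, no CCC or pseudo statistics), the function names are illustrative, and the points are toy values.

```python
def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b are merged."""
    na, nb = len(a["members"]), len(b["members"])
    sq = sum((x - y) ** 2 for x, y in zip(a["centroid"], b["centroid"]))
    return na * nb / (na + nb) * sq


def ward_clusters(points, n_clusters):
    """Merge clusters bottom-up with Ward's criterion until n_clusters remain."""
    clusters = [
        {"members": [i], "centroid": tuple(float(x) for x in p)}
        for i, p in enumerate(points)
    ]
    while len(clusters) > n_clusters:
        # Pick the pair with the smallest merge cost.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        a, b = clusters[i], clusters[j]
        na, nb = len(a["members"]), len(b["members"])
        merged = {
            "members": a["members"] + b["members"],
            # Size-weighted mean of the two centroids.
            "centroid": tuple(
                (na * x + nb * y) / (na + nb)
                for x, y in zip(a["centroid"], b["centroid"])
            ),
        }
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return sorted(sorted(c["members"]) for c in clusters)


# Hypothetical rows standing in for (can1, can2, can3).
toy_points = [(11, 9, 8), (12, 10, 9), (28, 12, 40), (30, 13, 45), (47, 20, 120), (45, 18, 110)]
groups = ward_clusters(toy_points, n_clusters=3)
print(groups)
```

Stopping the merges at three clusters corresponds to cutting the dendrogram at the three-cluster level.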
Using SAS for clustering
• The first 2 canonical variables account for about 93% of the total variation
Dendrogram
Tree depiction
• Plot can1 and can2 against cluster
• The plot is similar to the RapidMiner output
Conclusion
• Cluster 1: Mostly developed European nations, USA, UK, Singapore, USSR, etc.
o Efficient allocation of public goods
o Lower crime rates
o Abortion legalized
• Cluster 2: Afghanistan, Pakistan, Iran, mostly underprivileged African nations
o Low GDP
o Abortion not legal
o High crime rates, prevalent wars and terrorist activities
o Poor health standards, high poverty levels
• Cluster 0: India, Mexico, South Africa, Saudi Arabia, etc.
o Emerging nations
o Increasing growth rates
o Controlled negative externalities
o Focus on literacy and employment