2. Agenda
• The Data
• Some preliminary treatments
• Checking for outliers
• Manual outlier checking for a given confidence level
• Filtering outliers
• Data without outliers
• Selecting attributes for clusters
• Setting up clusters
• Reading the clusters
• Using SAS for clustering
• Dendrogram
• Depicting Tree using SAS
• Conclusion
3. The Data
• Number of observations: 97
• 3 numeric variables:
Birth rate per thousand
Death rate per thousand
Infant mortality rate per thousand
• 1 polynominal (nominal) variable: Country
• Data obtained from the UN Demographic Yearbook 1990
5. Some preliminary treatments
• Manual checking for outliers at a given confidence level
• For Birth (95%, approximated by mu ± 2 sigma):
mu - 2(sigma) = 27.384 - 2(12.978) = 1.428
mu + 2(sigma) = 27.384 + 2(12.978) = 53.34
• No Birth value lies outside [1.428, 53.34]; hence, no outliers
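The same check can be sketched in SAS, the language used later in these slides. This is a minimal sketch, assuming the data sits in a data set named Poverty with a numeric variable Birth (as in the SAS slides); the names BirthStats, BirthCheck, lower, upper and outlier are hypothetical and chosen only for illustration.

```sas
/* Mean and standard deviation of Birth, written to a one-row data set */
proc means data=Poverty noprint;
   var Birth;
   output out=BirthStats mean=mu std=sigma;
run;

/* Attach mu and sigma to every observation and flag values */
/* outside mu +/- 2*sigma (roughly a 95% interval)          */
data BirthCheck;
   if _n_ = 1 then set BirthStats(keep=mu sigma);
   set Poverty;
   lower = mu - 2*sigma;
   upper = mu + 2*sigma;
   outlier = (Birth < lower or Birth > upper);
run;

/* Count flagged (1) and unflagged (0) observations */
proc freq data=BirthCheck;
   tables outlier;
run;
```

If the frequency table shows only outlier = 0, the check agrees with the manual result. Repeating the steps for Death and InfantDeath only requires changing the variable name.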
7. • Data without outliers
o Filter examples
o Parameter string: outlier=true
o Invert filter
8. • Selecting attributes for clusters
o Clustering on polynominal (nominal) variables is not meaningful: the distance calculation needs numeric attributes
o Remove Country from the attribute list
9. • Setting up clusters
o K=3
o Join both nodes to get cluster model information
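For comparison, a three-cluster partition of the same variables could be sketched in SAS with PROC FASTCLUS, which performs a k-means-style nearest-centroid clustering. This is not the tool used on this slide, so the results need not match exactly. The sketch assumes a data set Poverty with the variables Birth, Death, InfantDeath and Country; the output name Clus is hypothetical.

```sas
/* k-means-style clustering with K = 3 on the numeric variables only */
proc fastclus data=Poverty maxclusters=3 out=Clus;
   var Birth Death InfantDeath;   /* Country is not used for distances */
   id Country;                    /* only identifies observations      */
run;
```

The OUT= data set adds the variables CLUSTER and DISTANCE to each observation. SAS numbers clusters from 1, whereas the slides use labels starting at 0.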
10. Reading the Clusters
• Cluster 1: Low values of each numeric variable
• Cluster 2: High values of each numeric variable
• Cluster 0: Moderate values of each numeric variable
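One way to arrive at such a reading is to compare the per-cluster means of each variable. A minimal sketch in SAS, assuming a data set named Clus (hypothetical) that holds the three rates plus a cluster-membership variable named CLUSTER, for example the OUT= data set of PROC FASTCLUS:

```sas
/* Mean of each numeric variable within each cluster */
proc means data=Clus mean maxdec=2;
   class cluster;
   var Birth Death InfantDeath;
run;
```

A cluster whose three means are all the smallest would be read as "low", and so on. Cluster numbers are arbitrary labels and may differ between tools and runs.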
12. Using SAS for clustering
• Using canonical variables for standardization of variables to mean 0 and standard deviation 1
• Spherical within-cluster covariance matrix
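/* Transform the variables so that the within-cluster covariance is approximately spherical */
proc aceclus data=Poverty out=Ace p=.03 noprint;
   var Birth Death InfantDeath;
run;

/* Ward's minimum-variance clustering on the canonical variables */
proc cluster data=Ace outtree=Tree method=ward
             ccc pseudo print=15;
   var can1 can2 can3;
   id Country;
run;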
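The Tree data set written by PROC CLUSTER can then be passed to PROC TREE to draw the dendrogram and cut it into a chosen number of clusters. A minimal sketch, assuming the Tree data set produced above; the choice of three clusters mirrors K=3 from the earlier slides, and the output name New is hypothetical.

```sas
/* Draw the dendrogram and cut the tree into 3 clusters */
proc tree data=Tree out=New nclusters=3 horizontal;
   height _rsq_;      /* use R-squared as the height axis          */
   copy can1 can2;    /* carry canonical variables into OUT=       */
   id Country;        /* label the leaves with country names       */
run;
```

The OUT= data set contains one row per country with its assigned cluster, which can be used for plotting can2 against can1 by cluster.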
13. Using SAS for clustering
• First 2 canonical variables account for about 93% of the total variation
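A figure like this is typically read from the eigenvalue table that PROC ACECLUS prints, which lists a proportion and a cumulative proportion for each canonical variable. A minimal sketch, assuming the same Poverty data set and options as on the previous slide, with NOPRINT dropped so that the table is displayed:

```sas
/* Same transformation as before, but with printed output enabled */
proc aceclus data=Poverty out=Ace p=.03;
   var Birth Death InfantDeath;
run;
```

The cumulative value in the second row of the eigenvalue table is the share attributed to can1 and can2 together.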