The document provides an overview of training on SAS Enterprise Guide and Enterprise Miner for analytical capabilities. It discusses the process flow involving data compilation in EG, analysis, and presentation. Advanced analytical techniques in EM like cluster analysis, decision trees, and regressions are also covered. Practical exercises on credit scoring using EG and EM are demonstrated involving steps of data acquisition, understanding data, selecting important variables, and modeling.
2. Contents
• Exploration of Analytical Possibilities in SAS
• Process Flow
• Techniques and Concepts
• Getting Started
• Practical Exercises
3. Contents
• Exploration of Analytical Possibilities in SAS
• Process Flow
• Getting Started
• Credit Scoring Project – An Example of SAS Usage
• Practical Observations
4. Enterprise Guide (EG) & Enterprise Miner (EMiner)
• EG is an interactive interface for data compilation, transformation, analysis and presentation. Version used: 4
• EMiner is the advanced analytical interface for data mining. Version used: 5.1
5. EG offers:
Data Compilation & Transformation
• Datasets in different formats accepted: Excel, csv, txt files, Microsoft Access
• Data size is not a constraint (up to 9 lakh, i.e. 900,000, rows downloaded)
• Functions like APPEND, JOIN, SORT, RANDOM SAMPLE, RANK etc.
Data Analysis
• Host of statistical techniques like
  • Regressions: Linear, Non-Linear, Logistic
  • Time-Series Forecasting
  • Correlations, Principal Components, Factor Analysis
Data Export and Presentation
• Final files to be used for periodic reporting, as in SBR
• Graphs, data summary tables etc.
• Final SAS datasets to be used in EMiner
9. Process Flow For SAS Applications
Inputs
• Database servers: 8563, 8561
• Excel, CSV, MS Access and txt files
Enterprise Guide
• Data Compilation
• Data Analysis
• Presentation
Outputs from Enterprise Guide
• Excel files for reporting purposes
• HTML files as graphs, tables etc.
• SAS data files to be used for advanced analytics in SAS EMiner
Enterprise Miner
• Cluster Analysis
• Decision Trees
• Regressions
• Neural Networks
Output from Enterprise Miner
• Excel files for reporting purposes
• HTML files as graphs, tables etc.
11. ‘What’ & ‘How’ of ANN
Artificial Neural Networks
• “Non-linear” statistical data modeling tool
• Models complex relationships between inputs and outputs
• Consists of an interconnected group of ‘artificial neurons’
• Uses a ‘connectionist’ approach to computation; the Multi-layer Perceptron (MLP) is the most common approach
Example:
Classification of good and bad credit risks based on the most relevant variables out of occupation, financials, age, past banking record etc., by training a neural network on historic data
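To make the MLP idea concrete, here is a minimal sketch of the forward pass of a very small network with one hidden layer. The weights, the three scaled inputs and the 0.5 cut-off are hypothetical values chosen only for illustration; in a real model they would come from training on historic data, which is not shown here, and this is not a description of how EMiner implements its neural network node.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def mlp_forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass: inputs -> one hidden layer -> a single output neuron."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


# Hypothetical weights; in practice they are learned from historic data.
HIDDEN_WEIGHTS = [[0.8, -0.5, 0.3], [-0.4, 0.9, -0.7]]
HIDDEN_BIASES = [0.1, -0.2]
OUTPUT_WEIGHTS = [1.2, -1.5]
OUTPUT_BIAS = 0.05

# Hypothetical applicant: scaled financials, age and past banking record.
applicant = [0.6, 0.2, 0.9]
score = mlp_forward(applicant, HIDDEN_WEIGHTS, HIDDEN_BIASES, OUTPUT_WEIGHTS, OUTPUT_BIAS)
label = "good" if score >= 0.5 else "bad"
print(f"score={score:.3f} -> {label} credit risk")
```

Each hidden neuron is a weighted sum passed through a non-linear function, which is what lets the network model relationships that a purely linear model cannot.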
12. ‘What’ & ‘How’ of Regression
Linear and Non-linear Regressions
• Consist of a dependent variable, independent variables, parameters and a random error term
• Rely heavily on assumptions about the probability distribution of the error term
• Used for modeling of causal relationships, hypothesis testing and prediction (as with time-series data)
• Non-linear models (NLMs) use logarithmic, exponential functions etc.
Example:
Examining the relationship between the performance of a channel partner (dealer) and its market share, vintage, geo-distribution etc.
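As a rough illustration of the linear case, the sketch below fits a one-variable ordinary least squares line, performance = intercept + slope × market share. The five data points are made up for the example, and a real dealer model would use several independent variables rather than one.

```python
def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: dealer market share (%) and a performance index.
market_share = [5.0, 10.0, 15.0, 20.0, 25.0]
performance = [12.0, 19.0, 33.0, 38.0, 53.0]

intercept, slope = fit_simple_ols(market_share, performance)

# The residuals are the estimates of the random error term.
residuals = [
    yi - (intercept + slope * xi) for xi, yi in zip(market_share, performance)
]
print(f"performance = {intercept:.2f} + {slope:.2f} * market_share")
```

The residuals are where the distributional assumptions on the error term would be checked before trusting any hypothesis test on the slope.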
13. ‘What’ & ‘How’ of DT
Decision Trees
• Predictive model with ‘leaves’ and ‘branches’
• Leaves are the cuts or classifications; branches are the criteria for those cuts
• Maps observations to conclusions about the target value
Example:
A two-level tree showing best performance in ACL & SAL for West Zone and business profile services
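A two-level tree of this kind can be written down as nested branches that end in leaves. The sketch below is only an illustration: the field names `zone` and `business_profile`, the branch values and the leaf labels are hypothetical, and the tree is written by hand rather than learned from data as a tree node in EMiner would do.

```python
# Hand-written two-level tree: split on zone first, then on business profile.
# All field names, branch values and leaf labels are hypothetical.
TREE = {
    "split_on": "zone",
    "branches": {
        "West": {
            "split_on": "business_profile",
            "branches": {"Services": "best", "Trading": "average"},
            "default": "average",
        },
    },
    "default": "average",
}


def classify(node, record):
    """Follow branches until a leaf (a plain string) is reached."""
    while isinstance(node, dict):
        value = record.get(node["split_on"])
        node = node["branches"].get(value, node["default"])
    return node


print(classify(TREE, {"zone": "West", "business_profile": "Services"}))
print(classify(TREE, {"zone": "East", "business_profile": "Services"}))
```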
14. ‘What’ & ‘How’ of Cluster Analysis
K-Means Clustering
• Partitioning of data into K clusters
• Each data point is assigned to the cluster whose ‘centre’ or ‘centroid’ is nearest to it
• “Iterative” refinement of the centroid of each cluster
• Convergence when intra-cluster distance is minimized and inter-cluster distance maximized
Example:
The dataset has two dimensions: churn and limit utilization in the first MoB. Suppose there are three clusters to start with: the 1st cluster has a centroid (mean of vector points) of 5% churn and 70% limit utilization, the 2nd has 10% and 85%, and the 3rd has 15% and 99%. A new data point with 17% churn and 95% limit utilization will then most likely fall in Cluster 3. (The distance criterion can be selected.)
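The assignment step in this example can be checked with a few lines of code. The sketch below uses the three centroids and the new data point from the slide and assumes Euclidean distance, which is only one of the distance criteria that could be selected. The `update_centroid` helper shows the refinement step on a hypothetical pair of points.

```python
import math

# Centroids from the example: (churn %, limit utilization %).
CENTROIDS = {1: (5.0, 70.0), 2: (10.0, 85.0), 3: (15.0, 99.0)}


def nearest_cluster(point, centroids):
    """Return the id of the centroid closest to the point (Euclidean distance)."""
    return min(centroids, key=lambda cid: math.dist(point, centroids[cid]))


def update_centroid(points):
    """Refinement step: the new centroid is the mean of the cluster's points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


new_point = (17.0, 95.0)
assigned = nearest_cluster(new_point, CENTROIDS)
print(f"new point {new_point} falls in cluster {assigned}")

# Hypothetical refinement: cluster 3 holds its old centroid and the new point.
refined = update_centroid([CENTROIDS[3], new_point])
print(f"refined centroid of cluster 3: {refined}")
```

K-means repeats these two steps, assignment and centroid update, until the assignments stop changing.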
15. BIU Concepts Un-coded
Loss Forecasting
• Prediction of delinquencies for a future time period
• Use of roll rates and flow rates data
• Application of time series tools like ARIMA modeling
• Best results with a greater number of data points
Vintage Analysis
• Analysis of long-term portfolio delinquency trends
• Grouping of data points based on their age in the portfolio
• Tracking of bad rate over time for each vintage
• Estimation of losses over a period of time
Credit Scoring
• Statistical expression of creditworthiness
• Use of client credit files
• Use of tools like logistic regression, which give the probability of default
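Tracking the bad rate for each vintage comes down to grouping accounts by vintage and age and dividing bad accounts by total accounts. The sketch below assumes a hypothetical record layout of (vintage, months on book, bad flag) and made-up observations; it is not the layout of any actual portfolio file.

```python
from collections import defaultdict

# Hypothetical observations: (vintage, months_on_book, is_bad).
OBSERVATIONS = [
    ("V1", 3, 0), ("V1", 3, 1), ("V1", 6, 1), ("V1", 6, 1),
    ("V2", 3, 0), ("V2", 3, 0), ("V2", 6, 0), ("V2", 6, 1),
]


def bad_rate_by_vintage(observations):
    """Return {(vintage, months_on_book): bad accounts / total accounts}."""
    totals = defaultdict(int)
    bads = defaultdict(int)
    for vintage, months_on_book, is_bad in observations:
        totals[(vintage, months_on_book)] += 1
        bads[(vintage, months_on_book)] += is_bad
    return {key: bads[key] / totals[key] for key in totals}


rates = bad_rate_by_vintage(OBSERVATIONS)
for (vintage, months_on_book), rate in sorted(rates.items()):
    print(f"vintage {vintage}, MoB {months_on_book}: bad rate {rate:.0%}")
```

Reading the output along months on book for one vintage gives the delinquency trend for that vintage; comparing vintages at the same age shows whether newer bookings perform better or worse.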
18. Basic Functions in SAS EG
• Importing data
• Exporting final datasets
The Open command can also be used for opening SAS datasets and projects, or where no change in the format of variables is required.
Very big datasets can be imported directly to the servers.
20. Filter Query Page
• Joining datasets
• Adding tables
• Creating derived variables
• Filtering data
• Changing the name of the output
• Grouping data
The Join command needs to be executed with the option “select distinct rows only”, or be followed by “Sort” in the Data segment of the main toolbar, to avoid duplication of entries.
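Why a join can duplicate entries, and what keeping distinct rows does about it, can be seen in a small sketch outside EG. The tables, column names and values below are hypothetical; the code only mimics the effect of a join followed by “select distinct rows only” and is not how EG performs the query.

```python
def inner_join(left, right, key):
    """Join two lists of dict rows on a shared key column."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for left_row in left:
        for right_row in index.get(left_row[key], []):
            joined.append({**left_row, **right_row})
    return joined


def distinct_rows(rows):
    """Drop exact duplicate rows, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for row in rows:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


# Hypothetical tables; the loans table holds one repeated entry.
customers = [{"cust_id": 1, "zone": "West"}, {"cust_id": 2, "zone": "East"}]
loans = [
    {"cust_id": 1, "product": "ACL"},
    {"cust_id": 1, "product": "ACL"},
    {"cust_id": 2, "product": "SAL"},
]

joined = inner_join(customers, loans, "cust_id")
deduplicated = distinct_rows(joined)
print(len(joined), "rows after join,", len(deduplicated), "after keeping distinct rows")
```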
21. Most common Functionalities in EG
• File: Open Project/Data/Code, Import Data
• Data: Append, Sort, Random Sample
• Describe: Summary Statistics, Characterize Data, Frequency Tables
• Graph: Pie Chart, Bar Chart, Line Chart
• Analyze: ANOVA, Regression (Linear/Logistic), Multivariate Analysis, Time Series Analysis
23. Getting Started in EMiner - Import Data
• Source: Server Eminer/FTPLIB, exported from the 8561 Server of EG
• Data import is critical
• Column headings should not contain special characters, exceed 32 characters, or start with a number
• Creation of diagrams
• Adjustment of the role and the level of variables
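The column heading rules above are easy to check before a file is exported. The sketch below is a hypothetical pre-check written for illustration; it treats anything other than letters, digits and the underscore as a special character, which is an assumption about what the rule is meant to cover.

```python
import re

MAX_HEADING_LENGTH = 32
# Assumption: only letters, digits and underscores count as non-special.
SPECIAL_CHARACTER = re.compile(r"[^A-Za-z0-9_]")


def heading_problems(name):
    """Return a list of rule violations for one column heading."""
    problems = []
    if not name:
        problems.append("is empty")
    if len(name) > MAX_HEADING_LENGTH:
        problems.append("longer than 32 characters")
    if name[:1].isdigit():
        problems.append("starts with a number")
    if SPECIAL_CHARACTER.search(name):
        problems.append("contains special characters")
    return problems


# Hypothetical column headings.
for heading in ["limit_utilization", "1st_mob", "churn %"]:
    issues = heading_problems(heading)
    print(heading, "->", issues if issues else "ok")
```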
27. Credit Scoring Project
A 4-step exercise
• Acquiring the data: consolidation (includes addition of variables), rolling up, etc.
• Knowing the data: critical step
• Segregating important variables
• Modelling
Data Acquisition
• Choice of performance indicator
• Choice of independent variables
• Cleaning of data
EG tools: Append, Join, Group, Random Sample
28. Credit Scoring Project - Knowing the Data
Tools: EG (Characterize Data); EM (Stat Explore, Explore)
• Removal of outliers (based on summary statistics and domain knowledge)
• Missing values
  • Imputation: mean, mode, or percentage-wise distribution for categorical variables, through the “Impute” function
  • Full case analysis: the trade-off is loss of data
• Detection of outliers and errors
• Data issues and solutions (from MFI experience)
  • Need for oversampling: adjustments to be made later, or use an alternative performance indicator
  • Need to tackle undercoverage: reject data
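The two ways of handling missing values can be compared on a toy example. The sketch below imputes a numeric column with its mean and a categorical column with its mode, and shows full case analysis as the alternative. The column names and values are hypothetical, missing values are represented as `None`, and the code does not reproduce the “Impute” function itself.

```python
from collections import Counter
from statistics import mean


def impute_mean(values):
    """Replace missing numeric values with the mean of the observed ones."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def impute_mode(values):
    """Replace missing categorical values with the most frequent category."""
    observed = [v for v in values if v is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


def complete_cases(rows):
    """Full case analysis: keep only the rows with no missing value."""
    return [row for row in rows if None not in row.values()]


# Hypothetical columns with gaps.
income = [10.0, None, 30.0]
occupation = ["salaried", None, "salaried", "self-employed"]
rows = [{"income": 10.0, "occupation": "salaried"}, {"income": None, "occupation": "salaried"}]

print(impute_mean(income))
print(impute_mode(occupation))
print(len(complete_cases(rows)), "of", len(rows), "rows survive full case analysis")
```

Imputation keeps every row at the cost of inventing values; full case analysis keeps only real values at the cost of losing rows.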
29. Credit Scoring Project - Segregating Important Variables
Techniques: Multi Plot, Variable Selection, Cluster Analysis, Decision Tree
• Why use all techniques?
  • Important variables are not left out; need to create derived variables
  • Common variables from all techniques give validation to results
• Factor Analysis and Principal Components Analysis can be used to remove redundant variables
• All techniques except Cluster Analysis require a ‘target’ variable
• In Cluster Analysis,
  • Standardization of data is a must: use the “Internal Standardization” option
  • Technique is more biased towards categorical variables
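Standardization matters for clustering because distances are otherwise dominated by whichever variable has the largest scale. The sketch below applies z-score standardization (subtract the mean, divide by the standard deviation) to two hypothetical columns. Whether the “Internal Standardization” option uses exactly this formula is an assumption, not something the slides state.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score standardization: mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information for clustering.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical columns on very different scales.
credit_limit = [20000.0, 40000.0, 60000.0]
churn_percent = [5.0, 10.0, 15.0]

z_limit = standardize(credit_limit)
z_churn = standardize(churn_percent)
print(z_limit)
print(z_churn)
```

After standardization both columns contribute on the same scale, so a distance-based method such as K-means no longer clusters on credit limit alone.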
30. Practical Observations in EG & EMiner
Running a Query and Running the whole Branch
Refreshing files with the same name and location in the Project
Refreshing the file with the same name but different location
Refreshing the file with a different name and a different location
Creating a code and linking it with adjacent files
“E:/biu” location to be specified while making new Eminer project
Data standardization required in Cluster Analysis (option present in EMiner)
33. Location of Projects and Data
8563 Server
• SAS Main:Files/BI_RA Folder
• SAS Main:Libraries/rmagtrg
8561 Server
• SAS Main:Files/ftpdir/sasbilogs