A Statistical Framework for Cluster
Health Assessment and Its Application in
Anti-Money-Laundering Systems
By using cluster analysis to continuously assess the health of peer
groups used by anti-money-laundering systems, banks can better
understand the reasons for cluster deterioration over time.
Executive Summary
The art and science of clustering is used in many
fields, from ubiquitous customer segmentation
for gauging marketing effectiveness, to predicting
default patterns among credit card holders, to
its recent application by financial institutions
to segment investment customers to enhance
liquidity management. In the area of anti-mon-
ey-laundering, the use of peer grouping, or
segmenting, is even more prevalent as it provides
a tailor-made solution for detecting unusual
transactional activities.
The premise: Customers are expected to exhibit
the transactional behavior of the peer group in
which they fall; any deviation is deemed unusual.
Increased sophistication in statistical methodologies
and advancement in IT solutions have ensured
that peer grouping becomes the foundation of
various anti-money-laundering (AML) solutions.
Despite this, however, there hasn’t been much
development around approaches that could
ensure that predefined peer groups, or clusters,
remain healthy (i.e., reflective of precise transactional
behaviors); peer group validation remains
overlooked. A peer group is called healthy when a
majority of its constituents exhibit characteristics
that are similar to one another and dissimilar to
those of constituents of other peer groups.
This white paper describes the ways in which
a cluster or peer group may become unhealthy,
due either to a poor choice of
segmentation variables or to the movement of
entities between clusters over time. Importantly,
it proposes a generic statistical methodology to
provide an objective assessment of peer group
health. In other words, this paper provides a
methodology to create a quantitative indicator of
the extent to which a peer grouping system under
consideration conforms to the fundamental traits
of a good peer group.
Healthy Clusters: A Definitional
Foundation
Clustering is increasingly used across various
fields in innovative ways and has proved to
be extremely helpful in predicting customer
behavior and identifying outlier patterns. Never-
theless, most techniques employ primitive meth-
odologies to update or maintain clusters. This
paper proposes a generic framework that can
• Cognizant 20-20 Insights
cognizant 20-20 insights | april 2014
be used to assess cluster health and improve the
predictive capabilities of such a clustering system
by locating the probable causes of cluster health
deterioration.
There are several reasons why clusters can be
said to have deteriorated over time. The most
notable factors include:
•	Clusters are often created on the basis of
expert judgment, which is liable to go awry
when markets turn dynamic.
•	Segmentation variables, which were selected to
create clusters, may not be the most appropri-
ate ones and do not differentiate constituents
of clusters enough to produce clear segments.
•	Legitimate changes in clusters due to actual
behavioral change of customer groups over
time.
•	Poor-quality data or lack of data while forming
clusters, necessitating a relook given the avail-
ability of new data.
•	Additional information gained over time about
cluster constituents may demand a relook at
existing clusters.
We begin by explaining the traits of a healthy
peer group, followed by the methodology to
assess health and identify the reasons for health
deterioration — notably concerning the top two
bullet points above. The assessment method-
ology adopted, though largely generic, can be
used only if the problem conforms to a particular
structure. We explain the methodology based on
our analysis on a set of real customer data. We
also briefly describe various statistical measures
of cluster health and when they can be used.
Traits of Healthy Clustering
From a business perspective, the degree to which
each of the factors explained further down (e.g.,
identifiability, compactness, etc.) is present will
define how good or bad the cluster system is. Note
that there are statistical measures that correspond
to particular parameters: entropy, for example,
measures both the homogeneity and separation of clusters.
There are important caveats to all of this. The
parameters mentioned below are not entirely
independent of each other. A high degree of
homogeneity and compactness is likely to be
observed together. But this is not necessarily true
in cases where clusters are highly dispersed and
highly separated, i.e., where behaviors within one
customer segment are not too similar (big, dispersed
clusters), but there are reasonable differences
from other groups (large differences among
clusters).
•	Identifiability/homogeneity (entropy, purity):
>> Can we see clear differences between seg-
ments?
>> Is the transaction behavior of one peer
group sufficiently different from that of oth-
er peer groups?
•	Compactness (variance ratio, additive
margin):
>> Are the data points of each cluster as close
to each other as possible? A common mea-
sure of compactness is variance, which
should be minimized.
•	Separation (L-separatability, entropy):
>> The clusters should themselves be widely
spaced.
>> Measured by distance between two clusters:
single linkage, complete linkage, comparison
of centroids.
•	Substantiality:
>> Are the segments large enough to warrant
separate groups and expected transactional
differences?
>> Does the peer grouping need to change if
there are very few data points in one peer
group?
•	Stability:
>> Do the peer groups remain stable over a cer-
tain period of time?
>> Can we implement dynamic profiling if over a
period of time customer behavior changes?
•	Scalability:
>> The peer groups should be able to accommo-
date and/or transform in the case of a huge
number of varied data points.
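The compactness, separation and substantiality traits above can be made concrete with a small sketch. It assumes that each customer is reduced to a (transaction value, transaction volume) point and that peer-group labels are already assigned. The function names and the sample data are hypothetical, and mean squared distance and centroid distance are only one possible choice of measures.

```python
import math
from collections import defaultdict

def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def group_by_label(points, labels):
    """Collect points into {peer_group_label: [points]}."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    return dict(groups)

def compactness(points):
    """Mean squared distance to the centroid (lower means more compact)."""
    c = centroid(points)
    return sum(math.dist(p, c) ** 2 for p in points) / len(points)

def separation(groups):
    """Smallest distance between two peer-group centroids (higher is better)."""
    cents = [centroid(pts) for pts in groups.values()]
    return min(math.dist(a, b) for i, a in enumerate(cents) for b in cents[i + 1:])

def too_small(groups, min_size):
    """Substantiality check: labels of peer groups with fewer than min_size members."""
    return [label for label, pts in groups.items() if len(pts) < min_size]

# Hypothetical (transaction value in '000 $, transaction volume) observations.
points = [(10, 2), (12, 2), (11, 4), (100, 40), (104, 40), (102, 43), (300, 90)]
labels = ["A", "A", "A", "B", "B", "B", "C"]
groups = group_by_label(points, labels)

for label, pts in groups.items():
    print(label, "compactness:", round(compactness(pts), 2))
print("separation:", round(separation(groups), 2))
print("too small:", too_small(groups, min_size=2))
```

In this toy data, group C holds a single customer, so the substantiality check flags it even though it is trivially compact.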
A Brief Methodology
The following methodology can be used to assess
the health of customer groups formed on the
basis of a set of business variables. The terms
“clusters” and “peer groups” may be used inter-
changeably for all practical purposes.
•	Variables definition: At the outset, let us
define some terms that are going to be used
later in the paper.
A set of variables that the organization uses
to create/form clusters is referred to as initial/
input/system variables. Demographic variables
are a typical example. These are essentially
different from the variables used to validate the
clustering actually formed, which are referred
to as observed/output variables. Observable
transactional behavior variables such as value
or volume of transaction are typical examples
of such kinds of variables.
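One way to keep the two kinds of variables apart in an implementation is to model them as separate record types. The sketch below is illustrative only: the class names are hypothetical, and the field names follow the anti-money-laundering examples used in this paper (demographic inputs on one side, transactional behavior on the other).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class InitialVariables:
    """Inputs known when peer groups are formed (hypothetical field set)."""
    annual_income: float
    age: int
    living_area: str      # e.g. "city", "village" or "town"
    product_type: str

@dataclass(frozen=True)
class ObservedVariables:
    """Transactional behavior used later to validate the peer groups."""
    transaction_value: float
    transaction_volume: int
    transaction_type: str

@dataclass
class Customer:
    customer_id: str
    initial: InitialVariables
    peer_group: Optional[str] = None               # assigned from initial variables
    observed: Optional[ObservedVariables] = None   # filled in once behavior is seen

# A customer at configuration time: no observed behavior is available yet.
customer = Customer("C001", InitialVariables(85000.0, 41, "city", "brokerage"))
customer.peer_group = "PG-3"
```

Keeping `observed` optional mirrors the point that validation data typically arrives only after the clusters have been created.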
The methodology contains the following steps:
1.	 The initial assumption is that the input/initial
variables used to create customer peer groups
by the organization will correctly predict the
customer behaviors in terms of the observed
variables.
2.	Using various tools, the organizations create
clusters based on the initial variables.
>> Clustering on initial variables: Generally,
organizations create clusters based on a set
of initial variables (different from observed
variables) that are available while forming
clusters. In anti-money-laundering, for ex-
ample, the initial set of variables is annual
income, age, living area type (city/village/
town) and product types. This choice is driven
by the nonavailability of data on output or observed
variables at that time. In the context of AML solutions,
output variables depict transactional pro-
files such as value and volume.
The clusters thus formed are expected to
correctly predict the customer behaviors in
terms of observed variables in the next step
(Step 3).
3.	For quality assessment, the health of the
clusters formed in Step 2 is checked by
analyzing observed output variables.
>> Analysis of observed variables: For any general
business problem, the health of clustering can be judged
by looking at the groups of observable variables that
define the constituents of that cluster. For example,
in the case of an anti-money-laundering peer group,
the clusters of customers formed should be “good”
based on their transaction profiles. That means transaction
profiles of all constituents of a peer
group or cluster should be somehow similar.
Transaction profiles are represented by the
transaction volume, transaction value and
transaction types. Specifically, this means
that the clusters, which were formed at the
time of system configuration and used for
detection of unusual transactions, should be
“good” when assessed using the observed
variables — transaction values, volumes and
types.
Quick Take: Statistical Measures
The mathematical details of each of the measures
mentioned here are explained in the glossary.
•	Entropy: Entropy is a measure of the homoge-
neity of objects with a single class label (here,
types of products). If the resulting clusters
are not healthy based on entropy or purity,
it is assumed that the clustering is bad and
needs to be redone. If the entropy calculations
yield satisfactory results within a predefined
confidence interval, then further means of
cluster analysis like variance ratio, additive
margin and L-separatability can be applied.
>> L-separatability: This can be simply de-
scribed as the ratio of the distance of each
point in the entire population with the
population centroid to the distance of the
combined two clusters from their average
centroid. A lower value indicates a better
separatability from the adjoining cluster.
>> Additive margin: Simply put, this is the ratio
of two averages: the average difference between
a point's distance to the centroid of the nearest
other cluster and its distance to the centroid of
its own cluster, divided by the average within-cluster
distance. A higher value indicates
better quality.
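As a rough illustration of the additive margin, the sketch below follows the verbal description given here: for every point it takes the distance to the nearest other centroid minus the distance to its own centroid, averages those differences, and divides by the average distance between pairs of points that share a cluster. The function names and the one-dimensional sample data are hypothetical, and centroids are assumed to be simple means.

```python
import math
from itertools import combinations

def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def additive_margin(clusters):
    """clusters: list of at least two lists of points.

    Assumes at least one cluster has two or more points; otherwise the
    average within-cluster distance is undefined.
    """
    centers = [centroid(pts) for pts in clusters]
    margins, within = [], []
    for i, pts in enumerate(clusters):
        for p in pts:
            own = math.dist(p, centers[i])
            nearest_other = min(
                math.dist(p, c) for j, c in enumerate(centers) if j != i
            )
            margins.append(nearest_other - own)
        # Distances between all pairs of points in the same cluster.
        within.extend(math.dist(a, b) for a, b in combinations(pts, 2))
    return (sum(margins) / len(margins)) / (sum(within) / len(within))

# Hypothetical one-dimensional transaction values for two peer groups.
tight = [[(0,), (2,)], [(10,), (12,)]]
loose = [[(0,), (6,)], [(8,), (12,)]]
print(additive_margin(tight), additive_margin(loose))
```

Because peer groups here are formed on initial variables rather than by a center-based algorithm, a point can lie closer to another group's centroid than to its own; in this sketch such points contribute a negative margin and pull the score down.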
Figure 1: Assumed Transactional Behavior of Peer Groups Per Premise
[Scatter plot, "Assumptions for peer groups" (higher health index): transaction value (in ‘000 $, 0 to 500) for peer groups 1 through 5.]
Figure 2: Actual Transactional Behavior of Peer Groups Over Time
[Scatter plot, "Peer groups before analysis" (low health index): transaction value (in ‘000 $, 0 to 500) for peer groups 1 through 5.]
The methodology discussed in Steps 2 and 3
is illustrated in Figures 1 and
2. The figures plot the transactional behaviors
(transactional value, transactional volume) of
a set of ~100,000 records of customer data for
anti-money-laundering systems of a leading U.S.
brokerage firm.
Figure 1 represents the initial expectations of
the customer transactional behaviors in terms
of observed variables during the peer group
configuration phase. Figure 2 represents the
actual observed transactional behaviors, showing
clusters corrupted due to one or both of the
following reasons over a period of time:
•	In anti-money-laundering specifically, customers that
were grouped together on the basis of some
parameters may have moved to other groups over
time due to a legitimate change
in their characteristics. Organizations typically
do not check the peer groups they have formed regularly,
hence the discrepancy.
•	It is possible that the initial variables expected
to correctly predict the customer behaviors
were wrongly chosen. For example, the initial
set of variables used to create clusters might
have included “gender,” which does not nec-
essarily reflect customer behavior in the long
term. It is also quite possible to have missed
important input variables such as “income”
in the initial variable set, resulting in poor
grouping.
These reasons, among other scenario-specific
reasons, provide an insight into the deterioration
of the health of clusters over time.
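The first cause, customers drifting between groups, can be quantified with a simple migration check. The sketch below assumes that each peer group has an expected (value, volume) centroid from the configuration phase and flags customers whose observed behavior now lies closest to a different group's centroid. All names and figures are hypothetical.

```python
import math

def nearest_group(point, centroids):
    """Label of the peer group whose centroid is closest to the point."""
    return min(centroids, key=lambda g: math.dist(point, centroids[g]))

def migration_rate(assigned, observed, centroids):
    """Share of customers whose observed profile sits nearest another group.

    assigned:  {customer_id: peer_group_label}
    observed:  {customer_id: (transaction_volume, transaction_value)}
    centroids: {peer_group_label: (transaction_volume, transaction_value)}
    """
    moved = [
        cid for cid, group in assigned.items()
        if nearest_group(observed[cid], centroids) != group
    ]
    return len(moved) / len(assigned), moved

# Hypothetical expected centroids and currently observed behavior.
centroids = {"1": (20, 50), "2": (80, 300)}
assigned = {"c1": "1", "c2": "1", "c3": "2", "c4": "2"}
observed = {"c1": (22, 60), "c2": (75, 280), "c3": (85, 310), "c4": (78, 290)}

rate, moved = migration_rate(assigned, observed, centroids)
print(rate, moved)
```

A migration rate that rises from one review period to the next points to drift, whereas a rate that is high from the first period onward is more consistent with poorly chosen initial variables.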
Figure 3: A System Architecture to Analyze Cluster Health
[Flowchart: Gather peer group data and transaction data. Measure whether customers with similar transaction types are clustered together; if not, declare the measure for bad clustering. If so, measure whether customers with similar transaction value and volume are together; if value/volume-based clustering is not good, declare the measure for bad clustering. A dashboard displays the clustering measures: entropy, additive margin, variance ratio and cluster quantity.]
4.	If the health of the clusters is “bad,” the organi-
zation should take steps to:
>> Reconsider and redefine the initial variables
taken.
>> Check if the clustering has deteriorated, not
because of wrong initial variables chosen
but due to time-dependency.
5.	The organization should remedy the problems
identified and repeat the process for
validation.
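The decision in Step 4 can be sketched as a small rule. This is only one possible reading, under the assumption that a single health score (higher is healthier) is available both for the earliest period with observed data and for the current period; the function name, labels and threshold are hypothetical.

```python
def diagnose(earliest_health, current_health, threshold):
    """Suggest a likely cause of poor peer-group health.

    earliest_health: score on the first period with observed data
    current_health:  score on the most recent period
    threshold:       minimum acceptable score (a business decision)
    """
    if current_health >= threshold:
        return "healthy"
    if earliest_health < threshold:
        # Never healthy on observed data: the inputs are the likely cause.
        return "reconsider initial variables"
    # Healthy at first, unhealthy now: behavior has probably shifted.
    return "time-dependent deterioration"

print(diagnose(0.9, 0.8, threshold=0.7))
print(diagnose(0.5, 0.4, threshold=0.7))
print(diagnose(0.9, 0.4, threshold=0.7))
```

After the remedy is applied, the same check would be rerun on fresh data, in line with Step 5.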
Calculating Cluster Health
A perfect clustering would mean that groups
assessed on the basis of observed variables are
healthy and thus the clusters formed using initial
variables remain good in terms of observed
variables as well. Different business require-
ments may lead to a different selection of sta-
tistical measures (e.g., entropy, L-separatability,
additive margins, etc.). In the beginning, business
decisions must be made to determine the allowed
value and variation of the measure being used.
However, calculation of clusters’ health using
a complete set of observed parameters may
not be possible. Not all parameters and their
effects can be quantified. For example, a credit-
issuing company developing parameters to form
customer segments may focus on salary segments
(in dollars) and types of products (loans, credit
cards, etc.), among other criteria. While it may
be easy to plot and cluster the customers using
salary figures and to analyze clusters, it is difficult
to visualize the type of product, which is a non-
ordinal variable that can’t be plotted.
This difficulty can be eliminated by using the
statistical measure of entropy to see if clusters
formed are homogeneous in nature.
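A minimal sketch of this idea follows. It computes per-cluster entropy, the size-weighted total entropy and purity for a categorical variable such as product type. The base-2 logarithm, the function names and the sample data are assumptions made for illustration.

```python
import math
from collections import Counter

def cluster_entropy(class_labels):
    """Entropy (in bits) of the class distribution inside one cluster."""
    n = len(class_labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(class_labels).values())

def total_entropy(clusters):
    """Size-weighted sum of the per-cluster entropies (lower is more homogeneous)."""
    n = sum(len(labels) for labels in clusters.values())
    return sum(len(labels) / n * cluster_entropy(labels) for labels in clusters.values())

def purity(clusters):
    """Share of members that belong to the majority class of their cluster."""
    n = sum(len(labels) for labels in clusters.values())
    return sum(max(Counter(labels).values()) for labels in clusters.values()) / n

# Hypothetical product types held by the members of two peer groups.
clusters = {
    "pg1": ["loan", "loan", "loan", "loan"],
    "pg2": ["loan", "card", "loan", "card"],
}
print(cluster_entropy(clusters["pg1"]), cluster_entropy(clusters["pg2"]))
print(total_entropy(clusters), purity(clusters))
```

Here the first peer group is perfectly homogeneous (entropy 0), while the second is an even mix of two product types (entropy 1 bit), which is the kind of group the measure is meant to expose.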
Step Sequence for Calculating the
Health Index for Typical AML Systems
This section depicts the complete sequence of
steps or the framework used to calculate the
health of peer groups using techniques identified
in earlier sections of this paper. This generic
framework can be used, with relevant scenario-specific
modifications, to approach the cluster health problem.
Figure 3 represents the sequence of steps leading to
the final statistical measure of cluster (peer group) health. It
is assumed that the “health” of clusters on the basis of
non-ordinal measures such as transaction types can be
calculated with the help of entropy.
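The sequence of checks can be sketched as a two-stage gate. The version below is an assumption-laden outline rather than the system itself: the measures are passed in as callables so that the value/volume measure is evaluated only when the transaction-type check passes, and the limits, names and verdict strings are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Limits:
    """Hypothetical business limits agreed before the assessment."""
    max_type_entropy: float
    min_additive_margin: float

def assess_peer_groups(type_entropy, value_volume_margin, limits):
    """Run the two checks in order and return the figures for a dashboard.

    type_entropy, value_volume_margin: zero-argument callables returning floats.
    """
    dashboard = {"entropy": type_entropy(), "additive_margin": None}
    if dashboard["entropy"] > limits.max_type_entropy:
        dashboard["verdict"] = "bad clustering: transaction types are mixed"
        return dashboard
    dashboard["additive_margin"] = value_volume_margin()
    if dashboard["additive_margin"] < limits.min_additive_margin:
        dashboard["verdict"] = "bad clustering: value/volume profiles overlap"
        return dashboard
    dashboard["verdict"] = "healthy"
    return dashboard

limits = Limits(max_type_entropy=0.6, min_additive_margin=1.0)
result = assess_peer_groups(lambda: 0.5, lambda: 4.5, limits)
print(result)
```

Other measures, such as variance ratio, could be added as further stages in the same way.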
Applying the Framework to Different Clustering/
Peer Group Systems
To apply our framework to calculate peer group
health, the peer group or clustering system should
conform to some basic constructs. For instance:
•	Clusters should have been formed to put con-
stituents having similar profiles together.
•	These profiles have to be represented by two
or more dimensions. At least two of these
dimensions should be representable either
quantitatively or in an ordinal manner. All of
these dimensions should be equally important
in business decision-making; moreover, these
dimensions should be orthogonal to each other.
•	While creating these clusters, data on these
dimensions should not have been available. Hence,
these clusters should have been formed using
some other “predictor” variables, referred to
as initial variables in this paper.
•	These profiles, as mentioned in the first bullet
point above, should be an important consideration
while making business decisions. For
example, in the case of AML systems, the transaction
profile of a customer determines whether
that customer should be declared suspicious.
Although this construct seems very specific, it
is found in most scenarios where clustering is
used. However, careful consideration is required
to fit a given problem in the above construct, so
that a peer group health index framework can be
applied in the most appropriate way.
Looking Forward: Additional
Applications
While this paper demonstrates the use of this
framework in the operational risk area of finance,
the generic nature of this framework makes it
extremely versatile and pliable for applications in
a wide range of subject areas that span financial
services, consumer marketing and behavioral
analytics. As such, all that is needed for such a
peer group health assessment is proper under-
standing of the subject area and an intelligent
analysis of initial and observed variables.
The benefits of this approach have already been
seen in the transaction monitoring area for anti-
money-laundering systems. This methodology
helped a brokerage firm identify issues with its
peer groups, which led to corrective measures
and eventually to a reduction in the number of
false alerts.
The recent emergence of big data technologies
supporting high density data, velocity and other
parameters can enable faster and easier imple-
mentation of this framework. We mention some
common fields where the cluster health index can
be applied:
•	Transaction analysis in anti-money-laundering
systems for customer groups.
•	Customer segmentation for credit-card-issuing
organizations.
•	Marketing effort validation for marketing
campaigns for targeted customers.
•	Mutual fund rebalancing for segments of
stocks grouped by stock price movement char-
acteristics.
Glossary
The mathematical details of the statistical measures used in this white paper include the following:
•	Entropy: To calculate the entropy of a set of peer groups, we first compute the class distribution of
the objects in each peer group, i.e., for each cluster j we compute $p_{ij}$, the probability that a member
of cluster j belongs to class i. Given this class distribution, the entropy of cluster j is calculated as
$$E_j = -\sum_{i} p_{ij} \log(p_{ij}) \qquad (1)$$
where the sum is taken over all classes. The total entropy for a set of clusters is computed as
the weighted sum of the entropies of all clusters, as shown in (2), where $n_j$ is the size of cluster j, k is the
number of clusters, and n is the total number of data points.
$$E = \sum_{j=1}^{k} \frac{n_j}{n} E_j \qquad (2)$$
•	L-separatability: Measures like L-separatability help normalize the loss functions to obtain scale
invariance.
$$L\text{-}Sep_{\max}(C, X, d) = \frac{L(C, X, d)}{\max_{i \neq j} \{ L(C_{ij}, X, d) \}}$$
is sensitive to maximal separation between clusters.
Here, $C = \{C_1, C_2, \ldots, C_k\}$ is some k-clustering of the data set X under the distance function d, L is the loss function, and $C_{ij}$
is a clustering identical to C except with clusters $C_i$, $C_j$ merged.
•	Additive margin: If instead of looking at ratios we want to evaluate quality using differences, we
use additive margin. The additive margin of a point x is
$$C\text{-}AM_{X,d}(x) = d(x, c_j) - d(x, c_i)$$
where $c_i$ is the closest center to x, $c_j$ is the second closest center to x, and C is a center-based clustering over (X, d).
$$AM_{X,d}(C) = \frac{\frac{1}{|X|} \sum_{x \in X} C\text{-}AM_{X,d}(x)}{\frac{1}{|\{\{x, y\} \subseteq X : x \sim_C y\}|} \sum_{x \sim_C y} d(x, y)}$$
Here, $x \sim_C y$ means that x and y belong to the same cluster of C.
The range is [0,∞).
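As a worked illustration of equations (1) and (2), consider a hypothetical set of eight customers in two peer groups: the first has six members split evenly between two classes, the second has two members of a single class. Taking logarithms to base 2 (an assumption; the base only rescales the result):

```latex
\begin{aligned}
E_1 &= -\left(\tfrac{3}{6}\log_2\tfrac{3}{6} + \tfrac{3}{6}\log_2\tfrac{3}{6}\right) = 1, \\
E_2 &= -\left(1 \cdot \log_2 1\right) = 0, \\
E   &= \tfrac{6}{8}\,E_1 + \tfrac{2}{8}\,E_2 = 0.75 .
\end{aligned}
```

The larger, evenly mixed group dominates the weighted sum, so the total stays close to the worst-case value of 1 for two classes.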
References
•	D. Barbara, J. Couto and Y Li, “COOLCAT: an entropy-based algorithm for categorical clustering,”
Proceedings of the 11th ACM CIKM Conference, pp. 582–589, 2002.
•	Wallace, R. S., “Finding natural clusters through entropy minimization,” Technical Report CMU-CS-89-
183, Carnegie Mellon University, 1989.
•	Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information
Retrieval, Cambridge University Press, 2008.
•	R. Ostrovsky, Y. Rabani, L.J. Schulman and C. Swamy, “The Effectiveness of Lloyd-Type Methods for the
k-Means Problem,” Foundations of Computer Science, 2006, FOCS ’05, 47th Annual IEEE Symposium,
Berkeley, CA, October 2006, pp. 165-176.
About Cognizant
Cognizant (NASDAQ: CTSH) is a leading provider of information technology, consulting, and business process out-
sourcing services, dedicated to helping the world’s leading companies build stronger businesses. Headquartered in
Teaneck, New Jersey (U.S.), Cognizant combines a passion for client satisfaction, technology innovation, deep industry
and business process expertise, and a global, collaborative workforce that embodies the future of work. With over 50
delivery centers worldwide and approximately 171,400 employees as of December 31, 2013, Cognizant is a member of
the NASDAQ-100, the S&P 500, the Forbes Global 2000, and the Fortune 500 and is ranked among the top performing
and fastest growing companies in the world. Visit us online at www.cognizant.com or follow us on Twitter: Cognizant.
­­© Copyright 2014, Cognizant. All rights reserved. No part of this document may be reproduced, stored in a retrieval system, transmitted in any form or by any
means, electronic, mechanical, photocopying, recording, or otherwise, without the express written permission from Cognizant. The information contained herein is
subject to change without notice. All other trademarks mentioned herein are the property of their respective owners.
About the Authors
Anshuman Sharma is a Financial Risk Consultant in the Governance Regulatory Compliance Practice
within Cognizant Business Consulting. He has three-plus years of experience in the finance sector, focused
on credit and market risk implementation, governance changes for Basel III/Dodd-Frank and regulatory
reporting. He previously worked for the hedge fund D.E. Shaw & Co. and holds an M.B.A. in finance from
XLRI Jamshedpur, India and an engineering degree from Indian Institute of Information Technology
Allahabad, India. Anshuman can be reached at Anshuman.Sharma2@cognizant.com.
Raghvendra Kushwah is a Consulting Manager within Cognizant Business Consulting, heading the Oper-
ational Risk Division, and has deep domain experience in fraud and anti-money-laundering processes
and implementation. His areas of expertise include analytics for behavioral finance, where he has led
numerous operational risk and analytics projects across several geographies. He has eight years of
experience in the IT and finance industries and holds an engineering degree from Indian Institute of
Technology, Delhi and an M.B.A. from Indian Institute of Management Lucknow. He can be reached at
Raghvendra.Kushwah@cognizant.com.
Anshuman Choudhary is a Director and heads the Governance Regulatory Compliance Practice within
Cognizant Business Consulting. His areas of expertise include consulting in risk management and
regulatory reporting. He has 14 years of business technology consulting and domain experience and
is a qualified GARP financial risk manager. Anshuman has an M.B.A. in finance from Indian Institute of
Social Welfare and Business Management and a bachelor’s degree in metallurgical engineering from REC
Durgapur, India. He can be reached at Anshuman.Choudhary@cognizant.

 
AI in Media & Entertainment: Starting the Journey to Value
AI in Media & Entertainment: Starting the Journey to ValueAI in Media & Entertainment: Starting the Journey to Value
AI in Media & Entertainment: Starting the Journey to Value
 
Operations Workforce Management: A Data-Informed, Digital-First Approach
Operations Workforce Management: A Data-Informed, Digital-First ApproachOperations Workforce Management: A Data-Informed, Digital-First Approach
Operations Workforce Management: A Data-Informed, Digital-First Approach
 
Five Priorities for Quality Engineering When Taking Banking to the Cloud
Five Priorities for Quality Engineering When Taking Banking to the CloudFive Priorities for Quality Engineering When Taking Banking to the Cloud
Five Priorities for Quality Engineering When Taking Banking to the Cloud
 
Getting Ahead With AI: How APAC Companies Replicate Success by Remaining Focused
Getting Ahead With AI: How APAC Companies Replicate Success by Remaining FocusedGetting Ahead With AI: How APAC Companies Replicate Success by Remaining Focused
Getting Ahead With AI: How APAC Companies Replicate Success by Remaining Focused
 
Crafting the Utility of the Future
Crafting the Utility of the FutureCrafting the Utility of the Future
Crafting the Utility of the Future
 
Utilities Can Ramp Up CX with a Customer Data Platform
Utilities Can Ramp Up CX with a Customer Data PlatformUtilities Can Ramp Up CX with a Customer Data Platform
Utilities Can Ramp Up CX with a Customer Data Platform
 
The Work Ahead in Intelligent Automation: Coping with Complexity in a Post-Pa...
The Work Ahead in Intelligent Automation: Coping with Complexity in a Post-Pa...The Work Ahead in Intelligent Automation: Coping with Complexity in a Post-Pa...
The Work Ahead in Intelligent Automation: Coping with Complexity in a Post-Pa...
 

Kürzlich hochgeladen

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Kürzlich hochgeladen (20)

GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

A Statistical Framework for Cluster Health Assessment and Its Application in Anti-Money-Laundering Systems

  • 1. A Statistical Framework for Cluster Health Assessment and Its Application in Anti-Money-Laundering Systems

Cognizant 20-20 Insights | April 2014

By using cluster analysis to continuously assess the health of peer groups used by anti-money-laundering systems, banks can better understand the reasons for cluster deterioration over time.

Executive Summary

The art and science of clustering is used in many fields, from ubiquitous customer segmentation for gauging marketing effectiveness, to predicting default patterns among credit card holders, to its recent application by financial institutions to segment investment customers to enhance liquidity management. In the area of anti-money-laundering, the use of peer grouping, or segmenting, is even more prevalent as it provides a tailor-made solution for detecting unusual transactional activities.

The premise: Customers are expected to exhibit the transactional behavior of the peer group in which they fall; any deviation is deemed unusual. Increased sophistication in statistical methodologies and advancement in IT solutions have ensured that peer grouping becomes the foundation of various anti-money-laundering (AML) solutions.

Despite this, there has not been much development around approaches that could ensure that predefined peer groups, or clusters, remain healthy (i.e., reflective of precise transactional behaviors); peer group validation remains overlooked. A peer group is called healthy when a majority of its constituents exhibit characteristics that are similar to one another and dissimilar to those of constituents of other peer groups.

This white paper describes the ways in which the health of a cluster or peer group may be poor, due either to a poor choice of segmentation variables or to the movement of entities between clusters over time. Importantly, it proposes a generic statistical methodology to provide an objective assessment of peer group health. In other words, this paper provides a methodology to create a quantitative indicator of the extent to which a peer grouping system under consideration conforms to the fundamental traits of a good peer group.

Healthy Clusters: A Definitional Foundation

Clustering is increasingly used across various fields in innovative ways and has proved to be extremely helpful in predicting customer behavior and identifying outlier patterns. Nevertheless, most techniques employ primitive methodologies to update or maintain clusters. This paper proposes a generic framework that can
  • 2. be used to assess cluster health and improve the predictive capabilities of such a clustering system by locating the probable causes of cluster health deterioration.

There are several reasons why clusters can be said to have deteriorated over time. The most notable factors include:

• Clusters are often created on the basis of expert judgment, which is liable to go awry when markets turn dynamic.
• Segmentation variables selected to create clusters may not be the most appropriate ones and may not differentiate constituents of clusters enough to produce clear segments.
• Legitimate changes in clusters due to actual behavioral change of customer groups over time.
• Poor-quality data or lack of data while forming clusters, necessitating a relook given the availability of new data.
• Additional information gained over time about cluster constituents may demand a relook at existing clusters.

We begin by explaining the traits of a healthy peer group, followed by the methodology to assess health and identify the reasons for health deterioration, notably concerning the first two bullet points above. The assessment methodology adopted, though largely generic, can be used only if the problem conforms to a particular structure. We explain the methodology based on our analysis of a set of real customer data. We also briefly describe various statistical measures of cluster health and when they can be used.

Traits of Healthy Clustering

From a business perspective, the degree to which each of the factors explained below (e.g., identifiability, compactness) is present defines how good or bad the cluster system is. It is worth noting that there are statistical measures that correspond to particular parameters: entropy, for example, measures both the homogeneity and the separation of clusters.

There are important caveats to all of this. The parameters mentioned below are not entirely independent of each other. High homogeneity and high compactness are likely to be observed together. This does not necessarily hold when clusters are both highly dispersed and highly separated: within one customer segment, behaviors are not very similar (big, dispersed clusters), yet there are reasonable differences from other groups (large differences among clusters).

• Identifiability/homogeneity (entropy, purity):
>> Can we see clear differences between segments?
>> Is the transaction behavior of one peer group sufficiently different from that of other peer groups?
• Compactness (variance ratio, additive margin):
>> Are the data points of each cluster as close to each other as possible? A common measure of compactness is variance, which should be minimized.
• Separation (L-separatability, entropy):
>> The clusters themselves should be widely spaced.
>> Measured by the distance between two clusters: single linkage, complete linkage, comparison of centroids.
• Substantiality:
>> Are the segments large enough to warrant separate groups and expected transactional differences?
>> Does the peer grouping need to change if there are very few data points in one peer group?
• Stability:
>> Do the peer groups remain stable over a certain period of time?
>> Can we implement dynamic profiling if customer behavior changes over a period of time?
• Scalability:
>> The peer groups should be able to accommodate and/or transform in the case of a huge number of varied data points.

A Brief Methodology

The following methodology can be used to assess the health of customer groups formed on the basis of a set of business variables. The terms "clusters" and "peer groups" may be used interchangeably for all practical purposes.

• Variables definition: At the outset, let us define some terms that are used later in the paper. A set of variables that the organization uses to create/form clusters is referred to as initial/
  • 3. input/system variables. Demographic variables are a typical example. These are essentially different from the variables used to validate the clustering actually formed, which are referred to as observed/output variables. Observable transactional behavior variables, such as value or volume of transactions, are typical examples of this kind of variable.

The methodology contains the following steps:

1. The initial assumption is that the input/initial variables used by the organization to create customer peer groups will correctly predict customer behavior in terms of the observed variables.

2. Using various tools, the organization creates clusters based on the initial variables.
>> Clustering on initial variables: Generally, organizations create clusters based on a set of initial variables (different from observed variables) that are available while forming clusters. In anti-money-laundering, for example, the initial set of variables is annual income, age, living area type (city/village/town) and product types. The driver is that data on output (observed) variables is not available at that point. In the context of AML solutions, output variables depict transactional profiles such as value and volume. The clusters thus formed are expected to correctly predict customer behavior in terms of observed variables in the next step (Step 3).

3. For quality assessment, the health of the clusters formed in Step 2 is checked by analyzing observed output variables.
>> Analysis of observed variables: For any general business problem, the health of clustering can be judged by looking at the groups of observable variables that define the constituents of that cluster. For example, in the case of an anti-money-laundering peer group, the clusters of customers formed should be "good" based on their transaction profiles. That means the transaction profiles of all constituents of a peer group or cluster should be similar. Transaction profiles are represented by the transaction volume, transaction value and transaction types. Specifically, this means that the clusters, which were formed at the time of system configuration and are used for detection of unusual transactions, should be "good" when assessed using the observed variables: transaction values, volumes and types.

Quick Take: Statistical Measures

The mathematical details of each of the measures mentioned here are explained in the glossary.

• Entropy: Entropy is a measure of the homogeneity of objects with a single class label (here, types of products). If the resulting clusters are not healthy based on entropy or purity, it is assumed that the clustering is bad and needs to be redone. If the entropy calculations yield satisfactory results within a predefined confidence interval, then further means of cluster analysis, such as variance ratio, additive margin and L-separatability, can be applied.
>> L-separatability: This can be simply described as the ratio of the distance of each point in the entire population from the population centroid to the distance of the points of two combined clusters from their average centroid. A lower value indicates better separatability from the adjoining cluster.
>> Additive margin: Simply put, this is the ratio of the average difference between each point's distance to the centroid of the nearest other cluster and its distance to the centroid of its own cluster, to the average within-cluster distance. A higher value indicates better quality.
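The entropy measure described in the Quick Take can be sketched in a few lines. The example below is a minimal illustration, not the authors' implementation: it follows the per-cluster entropy and the size-weighted total entropy given as equations (1) and (2) in the glossary, and it assumes a base-2 logarithm (the paper does not state the base). The function name `cluster_entropy` and the toy peer groups and transaction types are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cluster_entropy(cluster_ids, class_labels):
    """Return (entropy per cluster, size-weighted total entropy).

    cluster_ids[i] is the peer group of record i; class_labels[i] is its
    categorical class (for example, a transaction or product type).
    """
    members = defaultdict(list)
    for cid, label in zip(cluster_ids, class_labels):
        members[cid].append(label)

    n = len(cluster_ids)
    per_cluster = {}
    total = 0.0
    for cid, labels in members.items():
        n_j = len(labels)
        # p_ij is the share of cluster j's members that belong to class i.
        # Base-2 logarithm is an assumption made for this sketch.
        e_j = sum(-(c / n_j) * math.log2(c / n_j)
                  for c in Counter(labels).values())
        per_cluster[cid] = e_j
        total += (n_j / n) * e_j  # weight each cluster by its relative size
    return per_cluster, total


# Hypothetical toy data: group "A" holds one transaction type only,
# group "B" is an even mix of two types.
clusters = ["A", "A", "A", "A", "B", "B", "B", "B"]
tx_types = ["wire", "wire", "wire", "wire", "wire", "cash", "wire", "cash"]
per_cluster, total = cluster_entropy(clusters, tx_types)
print(per_cluster, total)
```

A perfectly homogeneous group scores zero, and the score grows as transaction types mix within a group, so lower values point to healthier peer groups on this dimension.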
  • 4. [Figure 1: Assumed Transactional Behavior of Peer Groups Per Premise. Panel label: "Assumptions for peer groups, higher health index." Axis label: Transaction Value (in '000 $).]

[Figure 2: Actual Transactional Behavior of Peer Groups Over Time. Panel label: "Peer groups before analysis, low health index." Axis label: Transaction Value (in '000 $).]

Figures 1 and 2 illustrate the methodology discussed in Steps 2 and 3. The figures plot the transactional behaviors (transactional value, transactional volume) of a set of ~100,000 records of customer data for the anti-money-laundering systems of a leading U.S. brokerage firm.

Figure 1 represents the initial expectations of customer transactional behavior in terms of observed variables during the peer group configuration phase.

Figure 2 represents the actual observed transactional behaviors, showing clusters corrupted over a period of time for one or both of the following reasons:

• Specifically in anti-money-laundering, it may be argued that customers who were grouped together on the basis of some parameters may have moved over time to other groups due to a legitimate change in their characteristics. Typically, organizations do not regularly check the peer groups formed, hence the discrepancy.
• It is possible that the initial variables expected to correctly predict customer behavior were wrongly chosen. For example, the initial set of variables used to create clusters might have included "gender," which does not necessarily reflect customer behavior in the long term. It is also quite possible that important input variables such as "income" were missed in the initial variable set, resulting in poor grouping.

These reasons, among other scenario-specific ones, provide insight into the deterioration of the health of clusters over time.
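The contrast between the assumed and the observed grouping can be expressed as a number. The sketch below is one hedged way to do so with the additive margin described in this paper: the average gap between each point's second-closest and closest group centroid, divided by the average distance between pairs of points in the same group. It assumes two-dimensional (value, volume) points, Euclidean distance, at least two groups, and group centroids used as cluster centers; the function names and toy data are hypothetical.

```python
import math
from itertools import combinations


def centroid(points):
    """Mean of a list of 2-D points."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def additive_margin(groups):
    """groups maps a peer-group id to a list of (value, volume) points."""
    centers = [centroid(pts) for pts in groups.values()]
    all_points = [p for pts in groups.values() for p in pts]

    # Numerator: average of (distance to second-closest center
    # minus distance to closest center) over all points.
    margins = []
    for p in all_points:
        d = sorted(math.dist(p, c) for c in centers)
        margins.append(d[1] - d[0])
    avg_margin = sum(margins) / len(margins)

    # Denominator: average distance between pairs in the same peer group.
    within = [math.dist(a, b)
              for pts in groups.values()
              for a, b in combinations(pts, 2)]
    avg_within = sum(within) / len(within)
    return avg_margin / avg_within


# Hypothetical data: groups as assumed at configuration time (tight) and
# the same points after customers have drifted between groups (mixed).
tight = {1: [(10, 10), (12, 10), (10, 12)],
         2: [(100, 100), (102, 100), (100, 102)]}
mixed = {1: [(10, 10), (100, 100), (12, 10)],
         2: [(102, 100), (10, 12), (100, 102)]}
print(additive_margin(tight), additive_margin(mixed))
```

The tight grouping yields a much larger value than the mixed one, in line with the statement that a higher additive margin indicates better quality.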
  • 5. [Figure 3: A System Architecture to Analyze Cluster Health. Flow: gather peer group data and transaction data; measure whether customers with similar transaction types are together. If similar transaction types are not clustered together, declare the measure for bad clustering. Otherwise, measure whether customers with similar transaction value and volume are together. If the value/volume-based clustering is not good, declare the measure for bad clustering. Otherwise, show the clustering measures on a dashboard. Measures listed: entropy, additive margin, variance ratio, cluster quantity.]

4. If the health of the clusters is "bad," the organization should take steps to:
>> Reconsider and redefine the initial variables taken.
>> Check whether the clustering has deteriorated not because the wrong initial variables were chosen but because of time-dependency.

5. The organization should remedy the problems identified and repeat the process for validation.

Calculating Cluster Health

A perfect clustering would mean that groups assessed on the basis of observed variables are healthy and thus the clusters formed using initial variables remain good in terms of observed variables as well. Different business requirements may lead to a different selection of statistical measures (e.g., entropy, L-separatability, additive margin). At the outset, business decisions must be made to determine the allowed value and variation of the measure being used.

However, calculating cluster health using a complete set of observed parameters may not be possible. Not all parameters and their effects can be quantified. For example, a credit-issuing company developing parameters to form customer segments may focus on salary segments (in dollars) and types of products (loans, credit cards, etc.), among other criteria. While it may be easy to plot and cluster the customers using salary figures and to analyze clusters, it is difficult to visualize the type of product, which is a non-ordinal variable that cannot be plotted. This difficulty can be eliminated by using the statistical measure of entropy to see whether the clusters formed are homogeneous in nature.

Step Sequence for Calculating the Health Index for Typical AML Systems

This section depicts the complete sequence of steps, or the framework, used to calculate the health of peer groups using the techniques identified in earlier sections of this paper. This generic framework can be used, with relevant scenario-specific modifications, to approach the cluster health problem.

Figure 3 represents the sequence of steps leading to the final statistical measure of cluster (peer group) health. It is assumed that the "health" of clusters on the basis of non-ordinal measures such as transaction types can be calculated with the help of entropy.
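The two decision points in the step sequence of Figure 3 can be read as a pair of threshold checks. The sketch below is a hypothetical illustration of that reading, not the system described in the paper: it takes an already computed transaction-type entropy and an already computed value/volume measure (additive margin is used here as one possible choice), and compares them with thresholds that stand in for the business decisions on allowed values. All names and threshold values are assumptions.

```python
def assess_peer_groups(type_entropy, value_volume_margin,
                       max_entropy, min_margin):
    """Apply the two checks in sequence and return a dashboard-style dict.

    max_entropy and min_margin are hypothetical business thresholds.
    """
    report = {"entropy": type_entropy,
              "additive_margin": value_volume_margin}
    if type_entropy > max_entropy:
        # First check: similar transaction types are not clustered together.
        report["verdict"] = "bad clustering: transaction types are mixed"
    elif value_volume_margin < min_margin:
        # Second check: value/volume-based clustering is not good.
        report["verdict"] = "bad clustering: value/volume groups overlap"
    else:
        report["verdict"] = "healthy"
    return report


# Hypothetical readings for three peer group systems.
print(assess_peer_groups(0.9, 5.0, max_entropy=0.5, min_margin=1.0))
print(assess_peer_groups(0.2, 0.4, max_entropy=0.5, min_margin=1.0))
print(assess_peer_groups(0.2, 5.0, max_entropy=0.5, min_margin=1.0))
```

Running the entropy check first mirrors the order in the figure: the value/volume measure only decides the verdict once the transaction-type check has passed.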
  • 6. Applying the Framework to Different Clustering/Peer Group Systems

To apply our framework to calculate peer group health, the peer group or clustering system should conform to some basic constructs. For instance:

• Clusters should have been formed to put constituents having similar profiles together.
• These profiles have to be represented by two or more dimensions. At least two of these dimensions should be representable either quantitatively or in an ordinal manner. All of these dimensions should be equally important in business decision-making; moreover, these dimensions should be orthogonal to each other.
• While creating these clusters, data on these dimensions should not have been available. Hence, these clusters should have been formed using some other "predictor" variables, referred to as initial variables in this paper.
• These profiles, as mentioned in the first bullet point above, should be an important consideration while making business decisions. For example, in the case of AML systems, the transaction profile of a customer determines whether that customer should be declared suspicious.

Although this construct seems very specific, it is found in most scenarios where clustering is used. However, careful consideration is required to fit a given problem into the above construct, so that a peer group health index framework can be applied in the most appropriate way.

Looking Forward: Additional Applications

While this paper demonstrates the use of this framework in the operational risk area of finance, the generic nature of the framework makes it versatile enough for applications in a wide range of subject areas that span financial services, consumer marketing and behavioral analytics. All that is needed for such a peer group health assessment is a proper understanding of the subject area and an intelligent analysis of initial and observed variables.

The benefits of this approach have already been seen in the transaction monitoring area for anti-money-laundering systems. This methodology helped a brokerage firm identify issues with its peer groups, which led to corrective measures and eventually to a reduction in the number of false alerts.

The recent emergence of big data technologies supporting high density data, velocity and other parameters can enable faster and easier implementation of this framework. Some common fields where the cluster health index can be applied are:

• Transaction analysis in anti-money-laundering systems for customer groups.
• Customer segmentation for credit-card-issuing organizations.
• Marketing effort validation for marketing campaigns for targeted customers.
• Mutual fund rebalancing for segments of stocks grouped by stock price movement characteristics.
  • 7. Glossary

The mathematical details of the statistical measures used in this white paper include the following:

• Entropy: To calculate the entropy of a set of peer groups, we first compute the class distribution of the objects in each peer group; i.e., for each cluster $j$ we compute $p_{ij}$, the probability that a member of cluster $j$ belongs to class $i$. Given this class distribution, the entropy of cluster $j$ is calculated as

$$E_j = -\sum_{i} p_{ij} \log(p_{ij}) \tag{1}$$

where the sum is taken over all classes. The total entropy for a set of clusters is computed as the weighted sum of the entropies of all clusters, as shown in (2), where $n_j$ is the size of cluster $j$, $k$ is the number of clusters, and $n$ is the total number of data points.

$$E = \sum_{j=1}^{k} \frac{n_j}{n} E_j \tag{2}$$

• L-separatability: Measures like L-separatability help normalize the loss functions to obtain scale invariance.

$$L\text{-}Sep_{\max}(C, X, d) = \frac{L(C, X, d)}{\max_{i \neq j} L(C_{ij}, X, d)}$$

is sensitive to maximal separation between clusters. Here, $L$ is the loss function, $C = \{C_1, C_2, \ldots, C_k\}$ is some k-clustering, and $C_{ij}$ is a clustering identical to $C$ except with clusters $C_i$, $C_j$ merged.

• Additive margin: If, instead of looking at ratios, we want to evaluate quality using differences, we use additive margin. The additive margin of a point $x$ is

$$C\text{-}AM_{X,d}(x) = d(x, c_j) - d(x, c_i)$$

where $c_i$ is the closest center to $x$, $c_j$ is the second closest center to $x$, and $C$ is a center-based clustering over $(X, d)$. The additive margin of the clustering is

$$AM_{X,d}(C) = \frac{\frac{1}{|X|} \sum_{x \in X} C\text{-}AM_{X,d}(x)}{\frac{1}{|\{\{x,y\} \subseteq X \,:\, x \sim_C y\}|} \sum_{x \sim_C y} d(x, y)}$$

where $x \sim_C y$ means that $x$ and $y$ belong to the same cluster of $C$. The range is $[0, \infty)$.

References

• D. Barbara, J. Couto and Y. Li, "COOLCAT: An Entropy-Based Algorithm for Categorical Clustering," Proceedings of the 11th ACM CIKM Conference, pp. 582–589, 2002.
• R. S. Wallace, "Finding Natural Clusters Through Entropy Minimization," Technical Report CMU-CS-89-183, Carnegie Mellon University, 1989.
• Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.
• R. Ostrovsky, Y. Rabani, L. J. Schulman and C. Swamy, "The Effectiveness of Lloyd-Type Methods for the k-Means Problem," 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS '06), Berkeley, CA, October 2006, pp. 165–176.
  • 8. About Cognizant

Cognizant (NASDAQ: CTSH) is a leading provider of information technology, consulting, and business process outsourcing services, dedicated to helping the world's leading companies build stronger businesses. Headquartered in Teaneck, New Jersey (U.S.), Cognizant combines a passion for client satisfaction, technology innovation, deep industry and business process expertise, and a global, collaborative workforce that embodies the future of work. With over 50 delivery centers worldwide and approximately 171,400 employees as of December 31, 2013, Cognizant is a member of the NASDAQ-100, the S&P 500, the Forbes Global 2000, and the Fortune 500 and is ranked among the top performing and fastest growing companies in the world. Visit us online at www.cognizant.com or follow us on Twitter: Cognizant.

World Headquarters
500 Frank W. Burr Blvd., Teaneck, NJ 07666 USA
Phone: +1 201 801 0233
Fax: +1 201 801 0243
Toll Free: +1 888 937 3277
Email: inquiry@cognizant.com

European Headquarters
1 Kingdom Street, Paddington Central, London W2 6BD
Phone: +44 (0) 20 7297 7600
Fax: +44 (0) 20 7121 0102
Email: infouk@cognizant.com

India Operations Headquarters
#5/535, Old Mahabalipuram Road, Okkiyam Pettai, Thoraipakkam, Chennai, 600 096 India
Phone: +91 (0) 44 4209 6000
Fax: +91 (0) 44 4209 6060
Email: inquiryindia@cognizant.com

© Copyright 2014, Cognizant. All rights reserved. No part of this document may be reproduced, stored in a retrieval system, transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the express written permission from Cognizant. The information contained herein is subject to change without notice. All other trademarks mentioned herein are the property of their respective owners.

About the Authors

Anshuman Sharma is a Financial Risk Consultant in the Governance Regulatory Compliance Practice within Cognizant Business Consulting. He has three-plus years of experience in the finance sector, focused on credit and market risk implementation, governance changes for Basel III/Dodd-Frank and regulatory reporting. He previously worked for the hedge fund D.E. Shaw & Co. and holds an M.B.A. in finance from XLRI Jamshedpur, India and an engineering degree from Indian Institute of Information Technology Allahabad, India. Anshuman can be reached at Anshuman.Sharma2@cognizant.com.

Raghvendra Kushwah is a Consulting Manager within Cognizant Business Consulting, heading the Operational Risk Division, and has deep domain experience in fraud and anti-money-laundering processes and implementation. His areas of expertise include analytics for behavioral finance, where he has led numerous operational risk and analytics projects across several geographies. He has eight years of experience in the IT and finance industries and holds an engineering degree from Indian Institute of Technology, Delhi and an M.B.A. from Indian Institute of Management Lucknow. He can be reached at Raghvendra.Kushwah@cognizant.com.

Anshuman Choudhary is a Director and heads the Governance Regulatory Compliance Practice within Cognizant Business Consulting. His areas of expertise include consulting in risk management and regulatory reporting. He has 14 years of business technology consulting and domain experience and is a qualified GARP financial risk manager. Anshuman has an M.B.A. in finance from Indian Institute of Social Welfare and Business Management and a bachelor's degree in metallurgical engineering from REC Durgapur, India. He can be reached at Anshuman.Choudhary@cognizant.