Data Management & Warehousing




                                                              WHITE PAPER


   Process Neutral Data Modelling
                                                      DAVID M WALKER
                                                                           Version: 1.0
                                                                      Date: 10/02/2009




                      Data Management & Warehousing

   138 Finchampstead Road, Wokingham, Berkshire, RG41 2NU, United Kingdom

                          http://www.datamgmt.com




Table of Contents
Table of Contents
Synopsis
Intended Audience
About Data Management & Warehousing
Introduction
The Problem
   The Example Company
   The Real World
The Customer Paradigm
Requirements of a Data Warehouse Data Model
   Assumptions
   Requirements
The Data Model
   Major Entities
   Type Tables
   Band Tables
   Property Tables
   Event Tables
   Link Tables
   Segment Tables
The Sub-Model
   History Tables
   Occurrences and Transactions
Implementation Issues
   The ‘Party’ Special Case
   Partitioning
   Data Cleansing
   Null Values
   Indexing Strategy
   Enforcing Referential Integrity
   Data Insert versus Data Update
   Row versus Set Based Loading in ETL
   Disk Space Utilisation
   Implementation Effort
Data Commutativity
Data Model Explosion and Compression
   How big does the data model get?
   Can the data model be compressed?
Which Results to Store?
The Holistic Approach
Summary
Appendix 1 – Data Modelling Standards
   General Conventions
   Table Conventions
   Column Conventions
   Index Conventions
   Standard Table Constructs
   Sequence Numbers For Primary Keys
Appendix 2 – Understanding Hierarchies
   Sales Regions
   Internal Organisation Structure
Appendix 3 – Industry Standard Data Models
Appendix 4 – Information Sparsity
Appendix 5 – Set Processing Techniques
Appendix 6 – Standing on the shoulders of giants





Further Reading
   Overview Architecture for Enterprise Data Warehouses
   Data Warehouse Governance
   Data Warehouse Project Management
   Data Warehouse Documentation Roadmap
   How Data Works
List of Figures
Copyright








Synopsis
This paper describes in detail the process for creating an enterprise data warehouse physical
data model that is less susceptible to change. Change is one of the largest on-going costs in
a data warehouse and therefore reducing change reduces the total cost of ownership of the
system. This is achieved by removing business process specific data and concentrating on
core business information.

The white paper examines why data-modelling style is important and how issues arise when
using a data model for reporting. It discusses a number of techniques and proposes a specific
solution. The techniques should be considered when building a data warehouse solution even
when an organisation decides against using the specific solution.

This paper is intended for a technical audience and project managers involved with the
technical aspects of a data warehouse project.



Intended Audience
Reader                                              Recommended Reading
Executive                                           Synopsis
Business Users                                      Synopsis
IT Management                                       Synopsis
IT Strategy                                         Entire Document
IT Project Management                               Entire Document
IT Developers                                       Entire Document




About Data Management & Warehousing
Data Management & Warehousing is a specialist consultancy in data warehousing, based in
Wokingham, Berkshire in the United Kingdom. Founded in 1995 by David M Walker, our
consultants have worked for major corporations around the world, including in the US,
Europe, Africa and the Middle East. Our clients are invariably large organisations with a
pressing need for business intelligence. We have worked in many industry sectors and have
specialists in telcos, manufacturing, retail, financial services and transport, as well as
technical expertise in many of the leading technologies.

For further information visit our website at: http://www.datamgmt.com

Crossword Clue: Expert Gives Us Real Understanding (4 letters)








Introduction
Commissioning a data warehouse system is a major undertaking. Organisations will invest
significant capital in the development of the system. The data model is always a major
consideration, and many projects will spend a significant part of the budget on developing
and re-working the initial data model.

Unfortunately projects often fail to look at the maintenance costs of the data model that they
develop. A data model that is fit for purpose when developed will rapidly become an
expensive overhead if it needs to change whenever the source systems change. The cost
involved is not only in the change to the data model but also in the changes to the ETL
processes that feed it.

This problem is exacerbated by the fact that changes to the data model may be made in a
way that is inconsistent with the original design approach. The data model loses
transparency and becomes even more difficult to maintain.

For many large data warehouse solutions it is not uncommon to have a resource permanently
assigned to maintaining the data model and several more resources assigned to managing
the change in the associated ETL within a short time of going live.

By understanding the problem and using techniques imported from other areas of systems
and software development, as well as change management techniques, it is possible to
define a method that will greatly reduce this overhead.

This white paper sets out an example of the issues from which to develop a statement of
requirements for the data model and then demonstrates a number of techniques which, when
used together, can address those requirements in a sustainable way.








The Problem
Data modelling is the process of defining the database structures in which to hold information.
To understand the Process Neutral Data Modelling approach, this paper first looks at why
these database structures have such an impact on the data warehouse.

In order to demonstrate the issues with creating a data model for a data warehouse, more
experienced readers are asked to bear with the necessarily simplistic examples that follow.

       The Example Company
       A company supplies and installs widgets. There are a number of different widget types,
       each having a name and specific colour. Each individual widget has a unique serial
       number and can have a number of red lamps and a number of green lamps plugged
       into it. The widgets are installed into cabinets at customer sites and from time to time
       engineers come in and change the relative numbers of red and green lamps.
       Cabinets are identified by the customer name and a customer cabinet number. For
       operational systems the data model might look something like this¹:




Figure 1 - Initial Operational System Data Model²

        This simple data model describes both the widget and the cabinet and provides the
        current combinations. It does not provide any historical context: “What was the
        previous configuration and when was it changed?”

        Historical data can be recorded by simply adding start date and end date to each of
        the main tables. This provides the ability to report on the historical configuration³. In
        order to facilitate this a separate reporting environment would be set up, because
        retaining history in the operational system would unacceptably reduce the operational
        system's performance. There are three consequences of doing this:

             •   Queries are now more complex. In order to report the information for a given
                 date the query has to allow for the required date being between the start date
                 and the end date of the record in each of the tables. The extra complexity
                 slows the execution of the query (see the example query below).







             •   The volume of data stored has also increased. The storage of dates has a
                 minor impact on the size of each row but this is small when compared to the
                 number of additional rows that need to be stored⁴.

             •   Data has to be moved from the operational system to the reporting system
                 via an extract, transform and load (ETL) process. This process has to extract
                 the data from the operational system, compare the records to the current
                 records in the reporting system to determine if there are any changes and, if
                 so, make the required adjustments to the existing record (e.g. updating the
                 end date) and insert the new record. Already the process is more complex
                 and time consuming than simply copying the data across⁵.

¹ Data models in this document are illustrative and should therefore be viewed as suitable for
making specific points rather than as complete production quality solutions. Some errors exist
to explicitly demonstrate certain issues.
² There are several conventions for data modelling. In this and subsequent diagrams the link
with a 1 and ∞ represents a one-to-many relationship, where the ‘1’ record is a primary key
field and the ‘∞’ represents the foreign key field.
³ Note that the ‘WIDGET_LOCATIONS’ table requires an additional field called
‘INSTALL_SEQUENCE’ to allow for the case where a widget is re-installed in a cabinet.
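
         To make the added complexity concrete, the query below sketches a point-in-time
         report against the reporting model shown in Figure 2: “Which widgets were installed
         in which cabinets on 1 June 2008?”. This is a minimal sketch only: the table and
         column names (WIDGETS, WIDGET_LOCATIONS, CABINETS, START_DATE,
         END_DATE, etc.) are assumed from the example rather than taken from a production
         schema, and a current record is assumed to carry a null end date.

             -- Point-in-time query: every table that carries history needs its
             -- own start/end date predicate, which is the extra complexity
             -- (and run-time cost) described in the first bullet above.
             SELECT w.SERIAL_NUMBER,
                    c.CUSTOMER_NAME,
                    c.CABINET_NUMBER
             FROM   WIDGETS          w,
                    WIDGET_LOCATIONS wl,
                    CABINETS         c
             WHERE  wl.WIDGET_ID  = w.WIDGET_ID
             AND    wl.CABINET_ID = c.CABINET_ID
             AND    wl.START_DATE <= DATE '2008-06-01'
             AND   (wl.END_DATE   >= DATE '2008-06-01' OR wl.END_DATE IS NULL);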




Figure 2 - Initial Reporting System Data Model

         When the reporting system is built, it accurately reflects the current business
         processes and operational systems, and it provides historical data. From a systems
         management perspective there is now an additional database, and a series of ETL or
         interface scripts that have to be run reliably every day.

         The systems architecture may be further enhanced so that the reporting system
         becomes a data warehouse and the users make their queries on data marts, or sets
         of tables where the data has been re-structured in order to simplify the users’ query
         environment. The ‘data marts’ typically use star-schema or snowflake-schema data
         modelling techniques or tool specific storage strategies⁶. This adds an additional layer
         of ETL to move between the data warehouse and the data mart.

         However the company doesn’t stop here. The product development team create a
         new type of widget. This new widget allows amber lamps and can optionally be
         mounted in a rack that is in turn mounted in a cabinet. The IT director also insists that
         the new OLTP application be more flexible for other future developments.


⁴ If everything remains the same except that widgets are moved around (i.e. there are no new
widgets and no new cabinet/customer combinations) then the WIDGET_LOCATIONS table
grows in direct proportion to the number of changes. If each widget were modified in some
way once a month then the reporting system table would be twelve times bigger than the
operational system’s after one year, and this is before any other change is handled.
⁵ Additional functionality such as data cleansing will also impact the complexity of the ETL and
affect performance.
⁶ This is accepted good practice; the design and implementation of data marts is outside the
scope of this paper.






          These business process changes result in a new data model for the operational
          system.




Figure 3 - Second Version Operational System Data Model

          The reporting system is also now a live system with a large amount of historical
          information. It too can be re-designed. The operational system will be implemented to
          meet the business requirements and timescales regardless of whether the reporting
          system is ready. It also may not be possible to create the history required for the new
          data model when it is changed⁷.

          If a data mart is built from the data warehouse there are two impacts: firstly, the data
          mart model will need to be changed to exploit the new data; secondly, the change to
          the data warehouse model will require the data mart ETL to be modified regardless of
          any changes to the data mart data model.

          The example company does not stop here, however, as senior management decide
          to acquire a smaller competitor. The new subsidiary has its own systems that reflect
          its own business processes. The data warehouse was built with a promise of
          providing integrated management reporting, so there is an expectation that the data
          from the new source system will be quickly and seamlessly integrated into the data
          warehouse. From a technical perspective this could present issues around mapping
          the new source system data model to the existing data warehouse data model,
          critical information data types⁸, duplication of keys⁹, etc., all of which cause problems
          with the integration of data and therefore slow down the processing.
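
          The duplicate key problem can be sketched in isolation. The remedy anticipated by
          the conventions introduced later in this paper is a warehouse-generated surrogate
          key together with a column recording the originating system; the table, column
          names and data types below are illustrative assumptions, not part of the final model.

              -- Both companies have a customer '1'; the surrogate key keeps the
              -- rows distinct while ORIGIN plus the source key preserves
              -- traceability back to each source. A character source key also
              -- copes with one company using alphanumeric identifiers.
              CREATE TABLE CUSTOMERS (
                  CUSTMR_DWK    INTEGER      NOT NULL PRIMARY KEY,
                  ORIGIN        VARCHAR(30)  NOT NULL,
                  SOURCE_ID     VARCHAR(30)  NOT NULL,
                  CUSTOMER_NAME VARCHAR(100) NOT NULL,
                  UNIQUE (ORIGIN, SOURCE_ID)
              );

              INSERT INTO CUSTOMERS VALUES (1001, 'COMPANY_A_CRM', '1', 'Acme Ltd');
              INSERT INTO CUSTOMERS VALUES (1002, 'COMPANY_B_CRM', '1', 'Bloggs & Co');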

          Within a few short iterations of change it is possible to see the dramatic impact on
          the data warehouse, and that the system is likely to run into issues.




⁷ A common example of this is an organisation that captures the fact that an individual is
married or not. Later the organisation decides to capture the name of the partner if someone
is married. It is not possible to create the historical information systemically, so for a period of
time the system has to support the continued use of the marital status and then possibly run
other activities, such as outbound calling, to complete the missing historical data.
⁸ The example database assumed that the serial number was numeric and used it as a
primary key, but what happens if the acquired company uses alphanumeric serial numbers?
⁹ If both companies use numbers starting from 1 for their customer IDs then there will be two
customers who have the same ‘unique’ ID, and customers that have two ‘unique’ IDs.







      The Real World
      The example above is designed to illustrate some of the issues that affect data
      warehouse data modelling. In reality business and technical analysts will handle some
      of these issues in the design phase but how big is the data-modelling problem in the
      real world?

             o   A UK transport industry organisation has three mainframes, each of which is
                 only allowed to perform one release a quarter. Each system also feeds the
                 data warehouse. As a consequence the mainframe feeds require validation
                 and change every month. Whilst the main data comes from these three
                 systems, there are sixty-five other Unix based operational systems that feed
                 the data warehouse, and several hundred desktop based applications also
                 provide data. Most of these source systems do not have good change control
                 or governance procedures to assist in impact analysis. Change for this
                 organisation is business as usual.

            o   A global ERP vendor supplies a system with over five thousand database
                objects and typically makes a major release every two years, a ‘dot’ release
                every six months and has numerous patches and fixes in between each
                major release. This type of ERP system is in use in nearly every major
                company and the data is a critical source to most data warehouses.

            o   A global food and drink manufacturer that came into existence as a result of
                numerous mergers and acquisitions and also divested some assets found
                itself with one hundred and thirty-seven general ledger instances in ten
                countries with seventeen different ERP packages. Even where the ERP
                packages were the same they were not necessarily using the same version of
                the package. The business intelligence requirement was for a single data
                warehouse and a single data model.

             o   A European Telco purchased a three-hundred-table ‘industry standard’
                 enterprise data model from a major business intelligence vendor and then
                 spent two years analysing it before they started the implementation. Within
                 six months of implementation they had changed some sixty percent of the
                 tables as a result of analysis omissions.

             o   A UK based banking and insurance business outsources all of its product
                 management to business partners and only maintains the unified customer
                 management systems (website, call centres and marketing). As a result
                 nearly all of the ‘source systems’ are external to the organisation and, whilst
                 there are contractual agreements about the format and data remaining fixed,
                 in practice there is significant regular change in the format and information
                 provided to both operational and reporting systems.

         Obviously these issues cannot be fixed just by creating the correct data model for the
         data warehouse¹⁰, but the objective of the data model design should be twofold:

            o   To ensure that all the required data can be stored effectively in the data
                warehouse.

            o   To ensure that the design of the data model does not impose cost and where
                possible actively reduces the cost of change on the system.

¹⁰ Data Management & Warehousing has published a number of other white papers, available
at http://www.datamgmt.com, that look at other aspects of data warehousing and address
some of these issues. See Further Reading at the end of this document for more details.







The Customer Paradigm
Data warehouse development often starts with a requirements gathering exercise. This may
take the form of interviews or workshops where people try to define what the customer is. If a
number of different parts of the business are involved then the definition of customer soon
becomes confused and controversial, and this negatively impacts the project. Most
organisations have a sales funnel that describes the process of capturing, qualifying,
converting and retaining customers.

Marketing say that the customer is anyone and everyone that they communicate with.

The sales teams view the customer as those organisations in their qualified lead database or
for whom they have account management responsibility post-sales.

The customer services team are clear that the customer is only those organisations who have
purchased a product and, where appropriate, have purchased a support agreement as well.

Other questions are asked in the workshops, such as “What about customers who are also
suppliers or partners?” and “How do we deal with customers who have gone away and then
come back after a long period of time?”

Figure 4 - The Sales Funnel

The most common solutions that are created as a result either add ‘flag’ or ‘indicator’ columns
to the customer table to represent each category, or create multiple tables for the different
categories required and repeat the data in each of the tables.

This example clearly demonstrates that the business process is being embedded into the
data model: the current business process definition(s) of customer are defining how the data
model is created. What has been forgotten is that these ‘customers’ exist outside the
organisation and it is their interaction with different parts of the organisation that defines their
status of being a customer, supplier, etc. In legal documents there is the concept of a ‘party’,
where a party is a person or group of persons that compose a single entity that can be
identified as one for the purposes of the law¹¹. This definition is one that should be borrowed
and used in the data model.

If users query a data mart that is loaded with data extracted from the transaction repository,
and data marts are built for a specific team or function that only requires one definition of the
data¹², then the current definition can be used to build that data mart and different definitions
used for other departments.




¹¹ http://en.wikipedia.org/wiki/Party_(law)
¹² This also allows flexibility as, when business processes change, it is possible (at a cost) to
change the rules by which data is extracted. The cost of such a change is much lower than
trying to rebuild the data warehouse and data mart with a new definition.






As a result of this approach two questions are common:

    •    Isn’t one of the purposes of building a data warehouse to have a single version of the
         truth?
         Yes. There is a single version of the truth in the data warehouse and this single
         version is perpetuated into the data marts; the difference is that the information in the
         data mart is qualified. Asking the question “How many customers do we have?”
         should get the answer “Customer Services have X active service contract customers”
         and not the answer “X” without any further qualification.

    •    What happens if different teams or departments have different data?
         People within the organisation work within different processes and with the same
         terminology but often different definitions. It is unlikely and impractical in the short
         term to change this, although it is possible that in the long term the data warehouse
         project will help with the standardisation process. In the meantime it is an education
         process to ensure that answers are qualified. It is important to recognise that different
         departments legitimately have different definitions, and therefore to recognise and
         understand the differences rather than fighting about who is right.

It might be argued that there are too many differences to put all individuals and organisations
in a single table; this and other issues will be discussed later in the paper.








Requirements of a Data Warehouse Data Model
Having looked at the problems that can affect a data warehouse data model, it is possible to
describe the requirements that should be placed on any data model design.


       Assumptions
            1. The data model is for use in the architectural component called the
               transaction repository or data warehouse¹³.

            2. As the data model is used in the data warehouse, it will not be a place where
               users go to query the data; instead users will query separate dependent data
               marts.

            3. As the data model is used in the data warehouse, data will be extracted from
               it to populate the data marts by ETL tools.

            4. As the data model is used in the data warehouse, the data will be loaded into
               it from the source systems by ETL tools.

           5. Direct updates (i.e. not through formally released ETL processes) will be
              prohibited; instead a separate application or applications will exist as a
              surrogate source.

            6. The data model will not be used in a ‘mixed mode’ where some parts use one
               data modelling convention and other parts use another. (This is generally bad
               practice with any modelling technique but is often the outcome where
               responsibility for data modelling is distributed or re-assigned over time.)

       Requirements
            1. The data model will work on any standard business intelligence relational
               database¹⁴. This is to ensure that it can be deployed on any current platform
               and, if necessary, re-deployed on a future platform.

           2. The data model will be process neutral i.e. it will not reflect current business
              processes, practices or dependencies but instead will store the data items and
              relationships as defined by their use at the point in time when the information is
              acquired.
            3. The data model will use a design pattern¹⁵, i.e. a general reusable solution to
               a commonly occurring problem. A design pattern is not a finished design but a
               description or template for how to solve a problem that can be used in many
               different situations.




¹³ For further information on Transaction Repositories see the Data Management &
Warehousing white paper “An Overview Architecture For Enterprise Data Warehouses”.
¹⁴ A typical list would (at the time of writing) include IBM DB2, Microsoft SQL Server, Netezza,
Oracle, Sybase, Sybase IQ, and Teradata. For the purposes of this document it implies
compliance with at least the SQL92 standard.
¹⁵ http://en.wikipedia.org/wiki/Software_design_pattern






            4. Convention over configuration¹⁶: This is a software design paradigm which
               seeks to decrease the number of decisions that developers need to make,
               gaining simplicity but not necessarily losing flexibility. It can be applied
               successfully to data modelling, reducing the number of decisions the data
               modeller has to make by ensuring that tables and columns use a standard
               naming convention and are populated and queried in a consistent fashion.
               This also has a significant impact on the efforts of an ETL developer.

            5. The design should also follow the DRY (Don’t Repeat Yourself) principle. This
               is a process philosophy aimed at reducing duplication. The philosophy
               emphasizes that information should not be duplicated, because duplication
               increases the difficulty of change, may decrease clarity, and leads to
               opportunities for inconsistency¹⁷.

            6. The data model should be significantly static over a long period of time, i.e.
               there should not be a need to add or modify tables on a regular basis. In this
               case there is a difference between designed and implemented: it is possible to
               have designed a table but not to implement it until it is actually required. This
               does not affect the static nature of the data model, as the placeholder already
               exists.
            7. The data model should store data at the lowest possible level¹⁸ and avoid the
               storage of aggregates.

            8. The data model should support the best use of platform specific features
               whilst not compromising the design¹⁹.

            9. The data model should be completely time-variant, i.e. it should be possible to
               reconstruct the information at any available point in time²⁰.

           10. The data model should act as a communication tool to aid the refinement of
               requirements and an explanation of possibilities.




¹⁶ For further information see http://en.wikipedia.org/wiki/Convention_over_Configuration and
http://softwareengineering.vazexqi.com/files/pattern.html. The Ruby on Rails framework
(http://www.rubyonrails.org/) makes extensive use of this principle.
¹⁷ DRY is a core principle of Andy Hunt and Dave Thomas's book The Pragmatic Programmer.
They apply it quite broadly to include "database schemas, test plans, the build system, even
documentation." When the DRY principle is applied successfully, a modification of any single
element of a system does not change other logically unrelated elements. Additionally,
elements that are logically related all change predictably and uniformly, and are thus kept in
sync (http://en.wikipedia.org/wiki/DRY). This does not automatically imply database
normalisation, but database normalisation is one method of ensuring ‘dryness’.
¹⁸ This is the origin of the term ‘Transaction Repository’ rather than ‘Data Warehouse’ in Data
Management & Warehousing documentation. The transaction repository stores the lowest
level of data that is practical and/or available. (See “An Overview Architecture for Enterprise
Data Warehouses”.)
¹⁹ This turns out to be both simple and very effective. For Oracle the most common features
that need support include partitioning and materialized views. For Sybase IQ and Netezza
there is a preference for inserts over updates due to their internal storage mechanisms. For all
databases there is variation in indexing strategies. These and other features should be easily
accommodated.
²⁰ Also known as temporal. Most data warehouses are not linearly time-variant but quantum
time-variant. If a status field is updated three times in a day and the data warehouse reflects
all changes then it is linearly time-variant. If a data warehouse holds only the first and last
values, because a batch process loads it once a day, then it is quantum time-variant where
the quantum is, in this case, one day. Quantum time-variant solutions can only resolve data to
the level of the quantum unit of measure.







The Data Model
As this white paper has now defined the requirements for the data model, it is possible to
start looking at what is needed to design it. This is done by breaking down the tables that will
be created into different groups depending on how they are used. The section below
discusses the main elements of the data model. Some basics, such as naming conventions,
standard short names and the keys used in the data model, are not described here; a
complete set of data modelling rules and example models can be found in the appendices.

       Major Entities
       Party is, as described in the customer paradigm section above, an example of a type of
       table within the Process Neutral Data Modelling method known as a ‘Major Entity’.
       These are tables that deliver the placeholders for all major subject areas of the data
       model and around which other information is grouped. Each business transaction will
       relate to a number of major entities. Some major entities are global i.e. they apply to all
       types of organisation (e.g. Calendar) and there are a number of major entities that are
       industry specific (e.g. for Telco, Manufacturing, Retail, Banking, etc.). It would be very
       unusual for an organisation to need a major entity that was not industry wide. Below is
       a list of some of the most common:

           •    Calendar
                Every data warehouse will need a calendar. It should always contain data to
                the day level and never to parts of the day. In some cases there is a need to
                 support sub-types of calendar for non-Gregorian calendars²¹.

           •    Party
                 Every organisation will have dealings between parties. This will normally
                 include three major sub-types: individuals, organisations (any formal
                 organisation such as a company, charity, trust, partnership, etc.) and
                 organisational units (the components within an organisation, including the
                 system owner’s organisation).

           •    Geography
                 The information about where. This is normally sub-typed into two
                 components, address and location. Address information is often limited to
                 postal addresses²², whilst location is normally described by longitude and
                 latitude via GPS co-ordinates. Other specialist geographic models exist that
                 may need to be taken into account²³.

           •    Product_Service (also known as Product or as Service)
                This is the catalogue of the products and/or services that an organisation
                supplies.

           •    Account
                Every customer will have at least one account if financial transactions are
                involved (even those organisations that do not think they currently use the
                concept of account will do so as accounting systems always have the concept
                of a customer with one or more accounts).


²¹ See http://www.qppstudio.net/footnotes/non-gregorian.htm for various calendars; notably
2008 is the Muslim year 1429 and the Jewish year 5768.
²² Some countries, such as the UK, have validated lists of all addresses (see the UK Post
Office Postcode Address File at http://www.royalmail.com/portal/rm/jump2?mediaId=400085&catId=400084).
²³ Network Rail in the UK use an Engineers Line Reference, which is based on a linear
reference model and refers to a known distance from a fixed point on a track. In Switzerland
they have an entire national co-ordinate system (http://en.wikipedia.org/wiki/Swiss_coordinate_system).






    •   Electronic_Address
        Any electronic address such as a telephone number, email address, web
        address, IP address etc. This is normally sub-typed by the categories used.

    •   Asset (also known as Equipment)
         A physical object that can be uniquely identified (normally by a serial number or
         similar). This may be used or incorporated in a PRODUCT_SERVICE, or sold
         to a customer, etc. In the example company, Cabinet, Rack and Widget were all
         examples of Asset, whilst Widget Type was an example of PRODUCT_SERVICE.

    •   Component
         A physical object that cannot be uniquely identified by a serial number but has
         a part number and is used in the make-up of either an asset or a product
         service. In the example company there was no particular record of the serial
         numbers of the lamps; however, they would all have had a part number that
         described the type of lamp to be used.

    •   Channel
        A conceptual route to market (e.g. direct, indirect, web-based, call-centre, etc.).

    •   Campaign
        A marketing exercise that is designed to promote the organisation, e.g. the
        running of a series of adverts on the television.

    •   Campaign Activities
        The running of a specific advert as part of a larger campaign.

    •   Contract
        Depending on the type of business the relationship between the organisation
        and its supplier or its customer may require the concept of a contract as well as
        that of an account.

    •   Tariff (also known as Price_List)
         A set of charges and discounts that can be applied to product services at a
         point in time.

This list is not comprehensive, but if an organisation can effectively describe its major
entities and combine this information with the interactions between them (the
occurrences or transactions) then it has the basis of a very successful data
warehouse.

Major Entities can have any meaningful name provided it is not a reserved word in the
database or (as will be seen below) a reserved word within the design pattern of
Process Neutral Data Modelling.

Some readers who are familiar with the concepts of star schemas and data marts will
also be aware that these major entities are very close to the basic dimensions that
most data marts use. This should come as no surprise, as these are the major data
items of any business regardless of their business processes or specific industry
sector, and a data mart is only a simplification of the data presented for the user. This
effect is called “natural star schemas” and will be explored in more detail later.








              Lifetime Value
               The next decision is which columns (attributes) should be included in the table.
               Much like the processes involved in normalising a database²⁴, the objective is
               to minimise duplication of data; there is also a requirement to minimise
               updates. To this end the attributes that are included should have ‘lifetime
               value’, i.e. they should remain constant once they have been inserted into the
               database. This means that variable data needs to be handled elsewhere.

              Using some of the major entities above as examples:

               Calendar:
                  Lifetime Value Attributes:         Date, Public Holiday Flag

               Geography:
                  Lifetime Value Attributes:         Address Line 1, Address Line 2, City,
                                                     Postcode²⁵, County, Country
                  Non-Lifetime Value Attributes:     Population

               Party (Individuals):
                  Lifetime Value Attributes:         Forename, Surname²⁶, Date of Birth,
                                                     Date of Death, Gender²⁷, State ID Number
                  Non-Lifetime Value Attributes:     Marital Status, Number of Children, Income

               Party (Organisations):
                  Lifetime Value Attributes:         Name, Start Date, End Date,
                                                     State ID Number
                  Non-Lifetime Value Attributes:     Number of Employees, Turnover,
                                                     Shares Issued

               Account:
                  Lifetime Value Attributes:         Account Number, Start Date, End Date
                  Non-Lifetime Value Attributes:     Balance

        Other than this lifetime value requirement for columns, every table must comply with
        the general rules for any table. For example every table will have a key column that
        uses the table short name made up of six characters and the suffix _DWK²⁸, a
        TIMESTAMP column and an ORIGIN column.
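
        As an illustration of these conventions, a possible physical definition of a Party major
        entity is sketched below. The six-character short name (PARTYS), the data types and
        the exact column list are assumptions made for illustration; the actual rules are set
        out in Appendix 1, and only lifetime value attributes from the examples above are
        included.

            CREATE TABLE PARTIES (
                PARTYS_DWK      INTEGER      NOT NULL PRIMARY KEY, -- short name + _DWK
                PARTYP_DWK      INTEGER      NOT NULL,             -- reference to PARTY_TYPES
                FORENAME        VARCHAR(50),                       -- individuals only
                SURNAME         VARCHAR(50),
                DATE_OF_BIRTH   DATE,
                DATE_OF_DEATH   DATE,
                GENDER          CHAR(1),
                ORG_NAME        VARCHAR(100),                      -- organisations only
                ORG_START_DATE  DATE,
                ORG_END_DATE    DATE,
                STATE_ID_NUMBER VARCHAR(30),
                "TIMESTAMP"     TIMESTAMP    NOT NULL,             -- load time of the row
                ORIGIN          VARCHAR(30)  NOT NULL              -- source system of the row
            );

        Marital status, number of children, income, turnover and the like are deliberately
        absent: they lack lifetime value and are handled by the property tables described
        later.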




²⁴ http://en.wikipedia.org/wiki/Database_normalization: Database normalization is a technique
for designing relational database tables to minimize duplication of information and, in so
doing, to safeguard the database against certain types of logical or structural problems,
namely data anomalies.
²⁵ This may occasionally be a special case, as postal services do, from time to time, change
postal codes that are normally static.
²⁶ There is a specific special case that deals with the change of name for married women; this
is dealt with in the section ‘The Party Special Case’ later.
²⁷ One insurance company had to deal with updatable genders due to the fact that
underwriting rules require assessment based on birth gender and not gender as a result of
re-assignment surgery. Therefore for marketing it had to handle ‘current’ gender and for
underwriting it had to deal with ‘birth’ gender.
²⁸ See the data modelling rules appendix for how this name is created.







       Type Tables
       There is often a need to categorise information into discrete sets of values. The valid
       set of categories will probably change over time and therefore each category record
        also needs to have lifetime value. Examples of this categorisation have already
        occurred with some of the major entities:

           •   Party:                         Individual, Organisation, Organisation Unit
           •   Geography:                     Postal Address, Location
           •   Electronic Address:            Telephone, E-Mail

       To support this and to comply with the requirement for convention over configuration all
       _TYPES tables of this format have a standard data model as follows:

           •   The table will have the same name as the major entity but with the suffix
               _TYPES (e.g. PARTY_TYPES, GEOGRAPHY_TYPES, etc.).
           •   The table will always have a key column that uses the six character short code
               and the _DWK suffix.
           •   The table will have a _TYPE column that is the type name.
           •   The table will have a _DESC column that is a description of the type.
           •   The table will have a _GROUP column that groups certain types together.
           •   The table will have a _START_DATE column and a _END_DATE column.

       This is a type table in its entirety. If a table needs more information (i.e. columns) then
       this is not a _TYPES table and must not have the _TYPES extension, as it does not
       comply with the rules for a _TYPES table.
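
        Putting these rules together, a _TYPES table can be written down almost
        mechanically. The sketch below shows PARTY_TYPES; the data types are
        assumptions made for illustration, and any audit columns required by the general
        conventions in Appendix 1 are omitted for brevity.

            CREATE TABLE PARTY_TYPES (
                PARTYP_DWK            INTEGER      NOT NULL PRIMARY KEY, -- short name + _DWK
                PARTY_TYPE            VARCHAR(50)  NOT NULL,             -- the type name
                PARTY_TYPE_DESC       VARCHAR(255),                      -- description of the type
                PARTY_TYPE_GROUP      VARCHAR(50)  NOT NULL,             -- groups types together
                PARTY_TYPE_START_DATE DATE         NOT NULL,             -- mandatory start date
                PARTY_TYPE_END_DATE   DATE                               -- null while current
            );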

       Examples of data in _TYPES tables might include:

       PARTY_TYPES

      Column                        Example Rows
      PARTYP_DWK                    1                 2                    3                    4
      PARTY_TYPE                    INDIVIDUAL        LTD COMPANY          PARTNERSHIP          DIVISION
      PARTY_TYPE_DESC               An Individual     A company in         This is a business   A division of a
                                                      which the liability  owned by two or      larger
                                                      of the members in    more people who      organisation
                                                      respect of the       are personally
                                                      company’s debts      liable for all
                                                      is limited           business debts
      PARTY_TYPE_GROUP              INDIVIDUAL        ORGANISATION         ORGANISATION         UNIT
      PARTY_TYPE_START_DATE         01-JAN-1900       01-JAN-1900          01-JAN-1900          01-JAN-1900
      PARTY_TYPE_END_DATE
  Figure 5 - Example data for PARTY_TYPES

        The start date has little initial value in this context, although it is a mandatory field²⁹
        and therefore has to be completed with a date before the earliest party in this
        example. Legal types of organisation do change over time and so it is possible that
        the start and end dates of these will become significant.

        These types do not describe the role that the party is performing (i.e. Customer,
        Supplier, etc.); they describe the type of the party (e.g. Individual, etc.). Describing
        the role comes later. The type and group columns are repeated for INDIVIDUAL, as
        there is no hierarchy of information for this value but the field is mandatory.


²⁹ Start dates in _TYPES tables are mandatory as, with only a few exceptions, they are
required information. In order to be consistent they therefore have to be mandatory for all
_TYPES tables.







       GEOGRAPHY_TYPES

      Column                         Example Rows
      GEOTYP_DWK                     1                            2
      GEOGRAPHY_TYPE                 POSTAL                       LOCATION
      GEOGRAPHY_TYPE_DESC            An address as supported by   A point on the surface of the
                                     the postal service           earth defined by its longitude
                                                                  and latitude
      GEOGRAPHY_TYPE_GROUP           POSTAL                       LOCATION
      GEOGRAPHY_TYPE_START_DATE      01-JAN-1900                  01-JAN-1900
      GEOGRAPHY_TYPE_END_DATE
  Figure 6 - Example Data for GEOGRAPHY_TYPES

        The start date in this context has little initial value, although it is a mandatory field and
        therefore has to be completed with a date.

        These types do not describe the role that the geography is performing (i.e. home
        address, work address, etc.); they describe the type of the geography (postal
        address, point location, etc.).

        The type and group columns are repeated for both values, as there is no hierarchy of
        information for them.

       CALENDAR_TYPES

        The convention over configuration design aspect allows for this table; however, it is
        rarely needed and can therefore be omitted. This is an example where a table can be
        described as designed (i.e. it is known exactly what it looks like) but not implemented.

       _TYPES tables will appear in other parts of the data model but they will always have
       the same function and format.
The consequence of this design re-use is that implementing an application [30] to manage the source of _TYPE data is easy. The system that manages the type data needs only a single table with the same columns as a standard _TYPES table and an additional column called, for example, DOMAIN. This DOMAIN column holds the target system table name (e.g. PARTY_TYPES). The ETL then simply maps the data from the source system to the target system where the DOMAIN equals the target table name. This is an example of re-use generating a significant saving in the implementation.
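
For example, the ETL for the PARTY_TYPES table might be written as follows. This is a minimal sketch: the source table SRC_TYPES, its column names and the PARTYP_DWK key name are all assumptions for illustration.

    INSERT INTO PARTY_TYPES
        (PARTYP_DWK, PARTY_TYPE, PARTY_TYPE_DESC, PARTY_TYPE_GROUP,
         PARTY_TYPE_START_DATE, PARTY_TYPE_END_DATE)
    SELECT TYPE_DWK, TYPE_NAME, TYPE_DESC, TYPE_GROUP,
           TYPE_START_DATE, TYPE_END_DATE
    FROM   SRC_TYPES
    WHERE  DOMAIN = 'PARTY_TYPES';  -- the DOMAIN column routes each row to its target table

The same statement, with only the target table name and the DOMAIN value changed, loads every other _TYPES table.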




[30] This is a good use of a Warehouse Support Application as defined in "An Overview Architecture for Enterprise Data Warehouses".







         Band Tables
Whilst _TYPES tables classify information into discrete values it is sometimes necessary to classify information into ranges or bands, i.e. between one value and another. The classic example of this is telephone calls, which are classified as 'Off-Peak Rate' if they are between 00:00 and 07:59 or between 18:00 and 23:59. Calls between 08:00 and 17:59 are classified as 'Peak Rate' and charged at a premium.

         _BANDS is a special case of the _TYPES table and would store the data as follows:

Column                      Example Rows
TIMBAN_DWK                  1                2             3
TIME_BAND                   Early Off Peak   Peak          Late Off Peak
TIME_BAND_START_VALUE [31]  0                480           1080
TIME_BAND_END_VALUE         479              1079          1439
TIME_BAND_DESC              Early Off Peak   Peak          Late Off Peak
TIME_BAND_GROUP             Off Peak         Peak          Off Peak
TIME_BAND_START_DATE        01-JAN-1900      01-JAN-1900   01-JAN-1900
TIME_BAND_END_DATE
Figure 7 - Example data for TIME_BANDS

Once again the _BANDS table has a standard format, as follows:

     •   The table will have the same name as the major entity but with the suffix
         _BANDS (e.g. TIME_BANDS, etc.).
     •   The table will always have a key column that uses the six character short code
         and the _DWK suffix.
     •   The table will have a _BAND column that is the band name.
     •   The table will have a _START_VALUE and an _END_VALUE column that represent the
         starting and finishing values of the band.
     •   The table will have a _DESC column that is a description of the band.
     •   The table will have a _GROUP column that groups certain bands together.
     •   The table will have a _START_DATE column and an _END_DATE column.

         The table has to comply with this convention in order to be given the _BANDS suffix.
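
A query against a _BANDS table is a simple range lookup. The sketch below classifies calls using the TIME_BANDS data above; the CALLS table and its CALL_START_MINUTES column (minutes since midnight) are assumptions for illustration.

    SELECT c.CALL_ID,
           b.TIME_BAND_GROUP                   -- 'Peak' or 'Off Peak'
    FROM   CALLS c
    JOIN   TIME_BANDS b
      ON   c.CALL_START_MINUTES BETWEEN b.TIME_BAND_START_VALUE
                                    AND b.TIME_BAND_END_VALUE
    WHERE  b.TIME_BAND_END_DATE IS NULL;       -- only bands currently in force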




[31] Note that values are stored as a number of minutes since midnight.







      Property Tables
In the discussion of major entities and lifetime value, the data that failed to meet the lifetime value principle was omitted from the major entity tables; however it still needs to be stored. This is handled via a property table. Property tables also help to support the extensibility aspects of the data model.

If we use PARTY as an example then, as already identified, the marital status does not possess lifetime value and therefore is not included in the major entity. Everyone starts as single, some marry, some divorce and some are widowed; these 'status changes' occur through the lifetime of the individual.

      To deal with this problem the property table can be modelled as follows:




      Figure 8 - Party Properties Example

As can be seen from the example above, in order to handle the properties two new tables are created. The first is the PARTY_PROPERTIES table itself and the second a supporting PARTY_PROPERTY_TYPES table.

      In order to store the marital status of an individual a set of data needs to be entered in
      the PARTY_PROPERTY_TYPES table:

                                    TYPE            GROUP
                                    Single          Marital Status
                                    Married         Marital Status
                                    Divorced        Marital Status
                                    Co-Habiting     Marital Status
                                  Figure 9 - Example Party Property Data

The description, start and end date would be filled in appropriately. Note that the start and end date here represent the start and end date of the type and not that of the individuals' use of that type. [32]

      It is now possible to insert a row in the PARTY_PROPERTIES table that references the
      individual in the PARTY table and the appropriate PARTY_PROPERTY_TYPES (e.g.
      ‘Married’). The PARTY_PROPERTIES table can also hold the start date and end date
      of this status and optionally where appropriate a text or numeric value that relates to
      that property.
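
The two tables might be declared as follows. This is a sketch only; the PARPRT_DWK key name and the column sizes are assumptions based on the conventions described so far.

    CREATE TABLE PARTY_PROPERTY_TYPES (
        PARPRT_DWK                     INTEGER      NOT NULL,  -- assumed key name
        PARTY_PROPERTY_TYPE            VARCHAR(100) NOT NULL,  -- e.g. 'Married'
        PARTY_PROPERTY_TYPE_DESC       VARCHAR(255),
        PARTY_PROPERTY_TYPE_GROUP      VARCHAR(100),           -- e.g. 'Marital Status'
        PARTY_PROPERTY_TYPE_START_DATE DATE         NOT NULL,
        PARTY_PROPERTY_TYPE_END_DATE   DATE
    );

    CREATE TABLE PARTY_PROPERTIES (
        PARTY_DWK          INTEGER NOT NULL,   -- references the PARTIES major entity
        PARTY_PROPERTY_DWK INTEGER NOT NULL,   -- references PARTY_PROPERTY_TYPES
        START_DATE         DATE    NOT NULL,
        END_DATE           DATE,
        TEXT_VALUE         VARCHAR(255),       -- optional text value
        NUMERIC_VALUE      NUMERIC             -- optional numeric value
    );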




[32] The need for start and end dates on such items is often questioned; however experience shows that legislation changes supposedly static values in most countries over the lifetime of the data warehouse. For example in December 2005 the UK permitted a new type of relationship called a civil partnership. http://en.wikipedia.org/wiki/Civil_partnerships_in_the_United_Kingdom






       This means that not only the current marital status can be stored but also historical
       information.
PARTY_DWK [33]   PARTY_PROPERTY_DWK   START_DATE    END_DATE
John Smith       Single               01-Jan-1970   02-Feb-1990
John Smith       Married              03-Feb-1990   04-Mar-2000
John Smith       Divorced             05-Mar-2000   06-Apr-2005
John Smith       Co-Habiting          07-Apr-2005
Figure 10 - Example data for PARTY_PROPERTIES

The data shown here describes the complete history of an individual, with the last row showing the current state as the START_DATE is before 'today' and the END_DATE is null. There is also nothing to prevent future information from being held. If John Smith announces that he is going to get married on a specific date in the future then the current record can have its end date set appropriately and a new record added.
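
The current value of any property can then be retrieved with a simple predicate on the dates, for example (a sketch; the literal party key is illustrative):

    SELECT pp.PARTY_PROPERTY_DWK, pp.START_DATE, pp.END_DATE
    FROM   PARTY_PROPERTIES pp
    WHERE  pp.PARTY_DWK   = 42                 -- the party of interest
    AND    pp.START_DATE <= CURRENT_DATE
    AND    (pp.END_DATE IS NULL OR pp.END_DATE >= CURRENT_DATE);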

       If another property is required (e.g. Number of Children) then no change is required to
       the data model. New rows are entered into the PARTY_PROPERTY_TYPES table:

                                      TYPE      GROUP
                                      Male      Number of Children
                                      Female    Number of Children
                                     Figure 11 - Example Data for PARTY_PROPERTY_TYPES

       This allows data to be added to the PARTY_PROPERTIES as follows:

        PARTY_DWK        PARTY_PROPERTY_DWK              START_DATE            END_DATE      VALUE
        John Smith       Single                          01-Jan-1970           02-Feb-1990
        John Smith       Married                         03-Feb-1990           04-Mar-2000
        John Smith       Divorced                        05-Mar-2000           06-Apr-2005
        John Smith       Co-Habiting                     07-Apr-2005
        John Smith       Male                            09-Jun-2001                         1
        John Smith       Female                          10-Jul-2002                         1
       Figure 12 - Example Data for PARTY_PROPERTIES

       In fact any number of new properties can be added to the tables as business processes
       and source systems change and new data requirements come about.

The effect of this method when compared to other methods of modelling this information is to create very narrow (i.e. not many columns), long (i.e. many rows) tables instead of much wider, shorter tables. However the properties table is very effective. Firstly, unlike the example, the two _DWK columns are integers [34], as are the start and end dates. Many of the _VALUE fields will be NULL, and those that are not will be predominately numeric rather than text values.

       The PARTY_PROPERTY_TYPE acts as a natural partitioning key in those databases
       that support table partitions. This method is very effective in terms of performance and
       storage of data in databases that use column or vector type storage.
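
As an illustration only, the properties table might be list-partitioned on the property type key. The sketch below uses Oracle-style syntax; the partition names and key values are assumptions.

    CREATE TABLE PARTY_PROPERTIES (
        PARTY_DWK          INTEGER NOT NULL,
        PARTY_PROPERTY_DWK INTEGER NOT NULL,
        START_DATE         DATE    NOT NULL,
        END_DATE           DATE,
        TEXT_VALUE         VARCHAR(255),
        NUMERIC_VALUE      NUMERIC
    )
    PARTITION BY LIST (PARTY_PROPERTY_DWK) (
        PARTITION p_marital_status VALUES (1, 2, 3, 4),  -- Single .. Co-Habiting keys
        PARTITION p_children       VALUES (5, 6)         -- Male / Female child counts
    );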



[33] Text from the related table is used in the _DWK column rather than the numeric key for clarity in these examples.
[34] Integers are better than text strings for a number of reasons: they usually require less storage and there is less temptation to mix the requirements of identification and description (a problem clearly illustrated by car registration numbers in the UK). Keys are more reliable when implemented as integers because databases often have key generation mechanisms that deliver unique values. Integers do not suffer from upper/lower case ambiguities and can never contain special characters or ambiguities caused by different padding conventions (trailing spaces or leading zeros).






The real saving in the number of rows is normally less than expected when compared
to more conventional data model techniques that store duplicated rows for changed
data. The example above has seven rows of data. The alternate approach of repeated
sets of data requires six rows of data and considerably more storage because of the
duplicated data:

PARTY_DWK    START_DATE    END_DATE      MARITAL_STATUS   UNKNOWN CHILD   MALE CHILD   FEMALE CHILD
John Smith   01-Jan-1970   02-Feb-1990   Single           0               0            0
John Smith   03-Feb-1990   04-Mar-2000   Married          0               0            0
John Smith   05-Mar-2000   08-Jun-2001   Divorced         0               0            0
John Smith   09-Jun-2001   09-Jul-2002   Divorced         0               1            0
John Smith   10-Jul-2002   06-Apr-2005   Divorced         0               1            1
John Smith   07-Apr-2005                 Co-Habiting      0               1            1
Figure 13 - Example Data for the Alternate Approach

The other main objection to this technique is often described as the cost of matrix transformation of the data: that is, changing the data from rows into columns in the ETL that loads the data warehouse and then changing the columns back to rows in the ETL that loads the data mart(s). This objection is normally due to a lack of knowledge of the appropriate ETL techniques that can make this very efficient, such as the SQL set operations 'UNION', 'MINUS' and 'INTERSECT'.
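
The rows-to-columns direction can be sketched with conditional aggregation, with the set operators then used to compare the result against the previous load (the PARPRT_DWK key name is an assumption, as before):

    SELECT pp.PARTY_DWK,
           MAX(CASE WHEN pt.PARTY_PROPERTY_TYPE_GROUP = 'Marital Status'
                    THEN pt.PARTY_PROPERTY_TYPE END) AS MARITAL_STATUS,
           MAX(CASE WHEN pt.PARTY_PROPERTY_TYPE = 'Male'
                    THEN pp.NUMERIC_VALUE END)       AS MALE_CHILDREN,
           MAX(CASE WHEN pt.PARTY_PROPERTY_TYPE = 'Female'
                    THEN pp.NUMERIC_VALUE END)       AS FEMALE_CHILDREN
    FROM   PARTY_PROPERTIES pp
    JOIN   PARTY_PROPERTY_TYPES pt
      ON   pt.PARPRT_DWK = pp.PARTY_PROPERTY_DWK
    WHERE  pp.END_DATE IS NULL                       -- current rows only
    GROUP  BY pp.PARTY_DWK;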

Event Tables
An event table is almost identical to a property table except that instead of having
_START_DATE and _END_DATE columns it has a single column _EVENT_DATE. It
also has the appropriate _EVENT_TYPES table. The table name has a suffix of
_EVENTS. For example a wedding is an event (happens at a single point in time), but
‘being married’ is a property (happens over a period of time). Events can be stored in
property tables simply by storing the same value in both the start date and end date
columns and this is a more common solution than creating a separate table. The use of
_EVENTS tables is usually limited to places where events form a significant part of the
data and the cost of storing the extra field becomes significant.

It should be noted that this is only required where the event may occur many times
(e.g. a wedding date) rather than information that can only happen once (e.g. first
wedding date) which would be stored in the appropriate major entity as, once set, it
would have lifetime value.




Figure 14 - Party Events Example

_EVENTS tables are a special case of _PROPERTIES tables.








       Link Tables
Up to this point major entity attributes within a single record have been examined. It is also possible that records within a major entity will relate to other records in the same major entity (e.g. John Smith is married to Jane Smith, both of whom are records within the PARTIES table). This is called a peer-to-peer relationship and is stored in a table with the suffix _LINKS and the appropriate _LINK_TYPES table.




       Figure 15 - Party Links Example

        The significant difference in a _LINK table is that there are two relationships from the
        major entity (in this case PARTIES).

        This also allows hierarchies to be stored so that:

                 John Smith (Individual) works in Sales (Organisational Unit)
                 Sales (Organisation Unit) is a division of ACME Enterprises (Organisation)

        where ‘works in’ and ‘is a division of’ are examples of the _LINK_TYPE.

It should also be noted that there is a priority to the relationship because one of the linking fields is the main key (in this case PARTIE_DWK) and the other is the linked key (in this case LINKED_PARTIE_DWK). There are two options. One is to store the relationship in both directions (e.g. John Smith is married to Jane Smith and Jane Smith is married to John Smith); this can be made complete with a reversing view [35] but defeats both the 'Convention over Configuration' principle and the 'DRY (Don't Repeat Yourself)' principle. The second method is to have a convention and only store the relationship in one direction (e.g. John Smith is married to Jane Smith, where the convention could be that the male is stored in the main key and the female in the linked key).
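
A reversing view of this kind might look as follows (a sketch; the link type and date columns are assumed):

    CREATE VIEW PARTY_LINKS_REVERSED AS
    SELECT LINKED_PARTIE_DWK AS PARTIE_DWK,         -- the two keys swap roles
           PARTIE_DWK        AS LINKED_PARTIE_DWK,
           PARTY_LINK_DWK,                          -- assumed link type key column
           START_DATE,
           END_DATE
    FROM   PARTY_LINKS;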




[35] A reversing view is one that has all the same columns as the underlying table except that the two key columns are swapped around. In this example PARTIE_DWK would be swapped with LINKED_PARTIE_DWK.







 Segment Tables
The final type of information that might be required about a major entity is the segment. This is a collection of records from the major entity that share something in common but about which more detail is not known. The most common business example of this would be the market segmentations performed on customers. These segments are normally the result of detailed statistical analysis, with the results then stored.

 In our example John Smith and Jane Smith could both be part of a segment of
 married people along with any number of other individuals for whom it is known that
 they are married but there is no information about when or to whom they are married.

Where the _LINKS table provides the peer-to-peer relationship, the segment provides the peer-group relationship.




 Figure 16 - Party Segments Example








The Sub-Model
The major entities and the six supporting data structures (_TYPES, _BANDS,
_PROPERTIES, _EVENTS, _LINKS, and _SEGMENTS) provide sufficient design pattern
structure to hold a large part of the information in the data warehouse. This is known as a
Major Entity Sub-Model. Significantly the information that has been stored for a single major
entity sub-model is very close to the typical dimensions of a data mart. This design pattern
provides complete temporal support and the ability to re-construct a dimension or dimensions
based on a given set of business rules.

The set of a major entity and the supporting structures is known as a sub-model. For example
the designed PARTY sub-model consists of:

    •    PARTIES

    •    PARTY_TYPES
    •    PARTY_BANDS

    •    PARTY_PROPERTIES
    •    PARTY_PROPERTY_TYPES

    •    PARTY_EVENTS
    •    PARTY_EVENT_TYPES

    •    PARTY_LINKS
    •    PARTY_LINK_TYPES

    •    PARTY_SEGMENTS
    •    PARTY_SEGMENT_TYPES

Those tables shown in bold italics might represent the implemented PARTY sub-model.

Importantly what has not been provided is the relationships between major entities and the
business transactions that occur as a result of the interaction between major entities.








History Tables
Extending the example above it is noticeable that the party does not contain any
address information; this is held in the geography major entity. This is also another
example where current business processes and requirements may change. At the
outset the source system may provide a contract address and a billing address. A
change in process may require the capture of additional information e.g. contact
addresses and installation addresses.

In practice the only difference between this type of relationship between major entities
and the _LINKS relationship is that instead of two references to the same major entity
there is one relationship to each of two major entities.

The data model is therefore relatively simple to construct:




Figure 17 – Party Geography History Example

There is one minor semantic difference between links and histories. _LINKS tables join
back on to the major entity and therefore one half of the relationship has to be given
priority. In a _HISTORY table there is no need for priority as each of the two attributes
is associated with a different major entity.
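
Such a table might be sketched as follows; the GEOGRA_DWK key name and the type column are assumptions following the conventions used so far.

    CREATE TABLE PARTY_GEOGRAPHY_HISTORY (
        PARTIE_DWK               INTEGER NOT NULL,  -- references PARTIES
        GEOGRA_DWK               INTEGER NOT NULL,  -- references the geography major entity
        PARTY_GEOGRAPHY_TYPE_DWK INTEGER NOT NULL,  -- e.g. 'billing address'
        START_DATE               DATE    NOT NULL,
        END_DATE                 DATE
    );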

Finally note that in this example the major entity is shown without the rest of the sub-model, which can be assumed.








Occurrences and Transactions
The final part of the data model is to build up all the occurrence or transaction tables. In the data mart these are most akin to the fact tables, although as this is a relational model they may occur outside a pure star relationship. Like the major entities there is no standard suffix or prefix, just a meaningful name.

To demonstrate what is required an example from a retail bank is described. The example is not nearly as complex as a real bank, but is necessarily longer and more complex than most examples in order to demonstrate a number of features. Furthermore banking has been chosen as an example because the concepts will be familiar to most readers. The example only looks at some core banking functions and not at activities such as marketing or specialist products such as insurance.


      The Example

      The bank has a number of regions and a central ‘premium’ account function that
      caters for some business customers. Each region has a number of branches.
      Branches have a manager and a number of staff. Each branch manager reports
      to a regional manager.

      If a customer has a personal account then the account manager is a branch
      personal account manager, however if the individual has a net worth in excess of
      USD1M the branch manager acts as the account manager. Personal accounts
      have contact and statement addresses and a range of telephone numbers, e-
      mails, addresses, etc.

      If the account belongs to a business with less than USD1M turnover then the
      account manager is a business account manager at the branch who reports to
      the branch manager. If the account belongs to a business with a turnover of
      between USD1M and USD10M then the account manager is an individual at the
      regional office who reports to the regional manager. If the account belongs to a
      business with a turnover more than USD10M then the account managers at the
      central office are responsible for the account. Businesses have contact and
      statement addresses as well as a number of approved individuals who can use
      the company account and contact details for them.

      Branch and account managers periodically review the banding of accounts by
      income for individuals and turnover for companies and if they are likely to move
      band in the coming year then they are added to the appropriate (future) category.
      Note that this is only partially fact based, the rest being based on subjective input
      from account managers.

      The bank offers a range of services including current, loan and deposit accounts,
      credit and debit cards, EPOS (for business accounts only), foreign exchange,
      etc.

      The bank has a number of channels including branches, a call centre service, a
      web service and the ability to use ATMs for certain transactions.

      The bank offers a range of transaction types including cash, cheque, standing
      order, direct debit, interest, service charges, etc.







            After the close of business on the last working day of each month the starting
            and ending balances, the average daily balance and any interest is calculated for
            each account.

            On a daily basis the exposure (i.e. sum of all account balances) is calculated for
            each customer along with a risk factor that is a number between 0 and 100 that
            is influenced by a number of factors that are reviewed from time to time by the
            risk management department. Risk factors might include sudden large deposits
            or withdrawals, closure of a number of accounts, long-term non-use of an
            account, etc. that might influence account managers’ decisions.

            Every transaction that is made is recorded every day and has three associated
            dates, the date of the transaction, the date it appeared on the system and the
            cleared date.

            De-constructing the example

            The bank has a number of regions and a central ‘premium’ account function that
            caters for some business customers. Each region has a number of branches.
            Branches have a manager. Each branch manager reports to a regional manager.

                 •   The bank itself must be held as an organisation.
     •   The regions and central 'premium' account function are held as Organisation Units. [36]
                 •   The bank and the regions have links.
                 •   The branches are held as organisational units.
                 •   The regions and the branches have links.
                 •   The branches have addresses via a history table.
                 •   The branches have electronic addresses via a history table.
     •   There are a number of roles stored as organisation units.
     •   These roles and the individuals have links.
                 •   The roles may have addresses via a history table.
                 •   The roles may have electronic addresses via a history table.
                 •   The individuals may have addresses via a history table.
                 •   The individuals have electronic addresses via a history table.

            At this point only existing major entities and history tables have been used. Also
            this information would be re-usable in many places just like the conformed
            dimensions concept of star schemas but with more flexibility.

            If a customer has a personal account then the account manager is a branch
            personal account manager, however if the individual has a net worth in excess of
            USD1M the branch manager acts as the account manager. Personal accounts
            have contact and statement addresses and a range of telephone numbers, e-
            mails, etc.

                •    Customers are held as Parties, either Individuals or Organisations.
                •    Customers have addresses via a history table.
                •    Customers have electronic addresses via a history table.
                •    Accounts are held in the Accounts major entity.
                •    Customers are related to accounts via a history table.
                •    Branches are related to accounts via a history table.
                •    Accounts are associated with a role via a history table.
                •    An individual’s net worth is generated elsewhere and stored as a property
                     of the party.

[36] See Appendix 2 - Understanding Hierarchies for an explanation as to why the regions are organisational units and not geography.






          •    A high net worth individual is a member of a similarly named segment.
          •    The accounts may have addresses via a history table.
          •    The accounts may have electronic addresses via a history table.

      If the account belongs to a business with less than USD1M turnover then the
      account manager is a business account manager at the branch who reports to
      the branch manager. If the account belongs to a business with a turnover of
      between USD1M and USD10M then the account manager is an individual at the
      regional office who reports to the regional manager. If the account belongs to a
      business with a turnover over USD10M then the account managers at the central
      office are responsible for the account. Businesses have contact and statement
      addresses as well as a number of approved individuals who can use the
      company account, and contact details for them.

           •   Businesses are held as parties.
           •   The business turnover is held as a party property.
           •   The category membership based on turnover is held as a segment.
           •   The businesses may have addresses via a history table.
           •   The businesses may have electronic addresses via a history table.

Branch and account managers periodically review the banding of accounts by income for individuals and turnover for companies and if they are likely to move band in the coming year then they are added to the appropriate (future) category. Note that this is only partially fact based, the rest being based on subjective input from account managers.

           •   There is a need to allow manual input via a warehouse support
               application for the party segments.

      At this point only the PARTY, ADDRESS, ELECTRONIC ADDRESS sub-models
      and associated _HISTORY tables have been used.

The bank offers a range of services including current, loan and deposit accounts, credit and debit cards, EPOS (for business accounts only), foreign exchange, etc.

           •   The product services are held in the product service major entity.
           •   The product services are associated with an account via a history.

      The bank has a number of channels including branches, a call centre service, a
      web service and the ability to use ATMs for certain transactions.

           •   The channels are held in the channels major entity.
           •   The ability to use a channel for a specific product service is held in the
               history that relates the two major entities.

      This adds the PRODUCT_SERVICE and CHANNEL major entities into the
      model.

      The bank offers a range of transaction types including cash, cheque, standing
      order, direct debit, interest, service charges, etc.

           •   This requires a TRANSACTION_TYPE table that will be added to the
               transaction table, which has not yet been defined.

      After the close of business on the last working day of each month the starting
      and ending balances, the average daily balance and any interest is calculated for
      each account.

           •   This is stored as an account property (it may be an event).






      On a daily basis the exposure (i.e. sum of all account balances) is calculated for
      each customer along with a risk factor that is a number between 0 and 100 that
      is influenced by a number of factors that are reviewed from time to time by the
      risk management department. Risk factors might include sudden large deposits
      or withdrawals, closure of a number of accounts, long-term non-use of an
      account, etc. that might influence account managers’ decisions.

           •   The exposure is stored as a party property (or event).
           •   The party risk factor is stored as a party property.

      Everything that is required to describe the transaction table is now available.

      Every transaction that is made is recorded every day and has three associated
      dates, the date of the transaction, the date it appeared on the system and the
      cleared date.

     •   The Transaction Table will have the following columns (a minimal DDL sketch
         follows this list):
            o Transaction Date
            o Transaction System Date
            o Transaction Cleared Date
            o From Account
            o To Account
            o Transaction Type
            o Amount
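
Expressed as a sketch (the table and column names are assumptions based on the description above):

    CREATE TABLE RETAIL_BANKING_TRANSACTIONS (
        TRANSACTION_DATE         DATE    NOT NULL,      -- when the customer made it
        TRANSACTION_SYSTEM_DATE  DATE    NOT NULL,      -- when it appeared on the system
        TRANSACTION_CLEARED_DATE DATE,                  -- null until cleared
        FROM_ACCOUNT_DWK         INTEGER NOT NULL,      -- references ACCOUNTS
        TO_ACCOUNT_DWK           INTEGER NOT NULL,      -- references ACCOUNTS
        TRANSACTION_TYPE_DWK     INTEGER NOT NULL,      -- references TRANSACTION_TYPES
        AMOUNT                   NUMERIC(18,2) NOT NULL -- always positive, see below
    );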

      This would complete the model for the example. There are some interesting
      features to examine. The first is that all amounts would be positive. This is
      because for a credit to an account the ‘from account’ would be the sending party
      and the ‘to account’ would be the customer’s account. For a debit the ‘to account’
      would be the recipient and the ‘from account’ would be the customer’s account.

      This has a number of effects. Firstly it complies with the DRY (Don’t Repeat
      Yourself) principle and means that extra data is not stored for the transaction. It
      also means that a collection of account information not related to any current
      party (e.g. a customer at another bank) is built up. This information is useful in
      the analysis of fraud, churn, market share, competitive analysis, etc.

For a customer analysis data mart the data can be extracted and converted into the positive credit/negative debit arrangement required by the users.
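
A sketch of that extraction, assuming the transaction table above and a hypothetical CUSTOMER_ACCOUNTS list of the accounts of interest:

    SELECT TO_ACCOUNT_DWK AS ACCOUNT_DWK,
           TRANSACTION_DATE,
           AMOUNT         AS SIGNED_AMOUNT   -- credits stay positive
    FROM   RETAIL_BANKING_TRANSACTIONS
    WHERE  TO_ACCOUNT_DWK IN (SELECT ACCOUNT_DWK FROM CUSTOMER_ACCOUNTS)
    UNION ALL
    SELECT FROM_ACCOUNT_DWK,
           TRANSACTION_DATE,
           -AMOUNT                           -- debits become negative
    FROM   RETAIL_BANKING_TRANSACTIONS
    WHERE  FROM_ACCOUNT_DWK IN (SELECT ACCOUNT_DWK FROM CUSTOMER_ACCOUNTS);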

The payment of bank charges and interest would also have accounts, and this information in a different data mart could be used to look at profitability, exposure, etc.

The process has used seven major entity sub-models, an additional type table and an occurrence or transaction table. Storing this information should accommodate and absorb almost any change in business process or source system without the need to change the data warehouse model and will allow multiple data marts to be built from a single data warehouse quickly and easily. In effect the type tables act as metadata for how to use and extend the data model rather than defining the business process explicitly in the data model, hence the name process neutral data modelling.

      It also demonstrates the ability of the data model to support the requirements
      process. By knowing the major entities and using a storyboard approach similar
      to the example above, and familiar as an approach to agile developers, it is
      possible to quickly and easily identify business, data and query requirements.








[Diagram: The Party sub-model (including Individuals, Organisations, Organisation Units and Roles) is connected by History tables to the Addresses sub-model (Postal Address, Point Location) and to the Electronic Addresses sub-model (Telephone Numbers, E-Mail Addresses, Telex). Further History tables connect these to the Accounts sub-model, which is in turn connected by History tables to the Channel sub-model and the Product Service sub-model. The Retail Banking Transactions table sits at the centre, supported by the Calendar sub-model and the Transaction Types table.]

Figure 18 - The Example Bank Data Model






The model above has been almost fully described by this document: the self-similar modelling of all the sub-model components has been covered, along with the history tables, most of the retail banking transactions and some of the lifetime attributes of the major entities. Completing the model simply requires the remaining attributes to be added.

Two other effects that will influence the creation of data marts from this model can also be seen. Firstly the creation of dimensions will revolve around the de-normalisation of the attributes required from each of the major entities into one of the two dimensions associated with account, as these have the hierarchies for the customer, account manager, etc. associated with them.

The second effect is that of the natural star schema. It is clear from this diagram that the fact
tables will be based around the ‘Retail Banking Transactions’ table. As has already been
stated there are several data marts that can be built from this fact table, probably at different
levels of aggregation and with different dimensions.

The occurrence or transaction table above is one of perhaps twenty that a large enterprise would require, along with approximately thirty _HISTORY tables. This would be combined with around twenty major entity sub-models to create an enterprise data warehouse data model.

Readers who are familiar with the Data Management & Warehousing white paper 'How Data Works' [37], which describes natural star schemas in more detail along with a technique called left-to-right entity diagrams, will see a correlation as follows:

Level   Description
1       _TYPE and _BAND tables; simple, small-volume reference data.
2       Major entities; complex, low-volume data.
3       Some major entities that are dependent on others, along with _PROPERTIES and _SEGMENTS
        tables; less complex but with greater volume.
4       _HISTORY tables and some occurrence or transaction tables.
5       Occurrence or transaction tables; significant volume but low-complexity data.
Figure 19 - Volume & Complexity Correlations




[37] Available for download from http://www.datamgmt.com/whitepapers







Implementation Issues
The use of a process neutral data model and a design pattern is meant to ease the design of a system, but there will always be exceptions and things that need further explanation in order to fit them into the solution. Much of this section refers to ETL issues that can only be briefly described in this context. [38]

       The ‘Party’ Special Case
The examples throughout this document have used the PARTY table as a major entity but in practice this is one of the more difficult tables to deal with. The first issue is that in many cases the name does not have lifetime value, for example when a woman gets married or divorced and changes her name, or when a company renames itself. [39] Also individual names often have multiple parts (title, forename, surname).

There is also a requirement to track some form of state identity number. In the United Kingdom an individual has their National Insurance number and in the United States their Social Security number; other numbers (e.g. passport, ID card, etc.) are simply stored as properties. Organisations have other numbers (companies have registration numbers, charities and trusts have different registration numbers, but VAT numbers are properties as they can and do change).

Another minor issue is that people have a date of birth and a date of death. This is simply resolved: date of birth is the Individual Start Date and date of death is the Individual End Date; however this terminology can sometimes prove controversial.

The solution to the PARTY special case depends on the database technology being used. If the database supports the creation of views and the 'UNION ALL' SQL operator [40] then the preferred solution is as follows:

Create the INDIVIDUALS table as follows (a sketch of the combining view is shown after the list):

            •   PARTY_DWK
            •   PARTY_TYPE_DWK
            •   TITLE
            •   FORENAME
            •   CURRENT_SURNAME [41]
            •   PREVIOUS_SURNAME
            •   MAIDEN_SURNAME
            •   DATE_OF_BIRTH
            •   DATE_OF_DEATH
            •   STATE_ID_NUMBER
            •   Other lifetime attributes as required
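
The combining view might then be sketched as follows, assuming a corresponding ORGANISATIONS table; the organisation column names and the common PARTY column names are assumptions for illustration.

    CREATE VIEW PARTIES AS
    SELECT PARTY_DWK,
           PARTY_TYPE_DWK,
           TITLE || ' ' || FORENAME || ' ' || CURRENT_SURNAME AS PARTY_NAME,
           DATE_OF_BIRTH   AS PARTY_START_DATE,
           DATE_OF_DEATH   AS PARTY_END_DATE,
           STATE_ID_NUMBER
    FROM   INDIVIDUALS
    UNION ALL
    SELECT PARTY_DWK,
           PARTY_TYPE_DWK,
           ORGANISATION_NAME,                 -- assumed column
           ORGANISATION_START_DATE,           -- assumed column
           ORGANISATION_END_DATE,             -- assumed column
           REGISTRATION_NUMBER                -- assumed column
    FROM   ORGANISATIONS;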




[38] Data Management & Warehousing provide consultancy on ETL design and techniques to ensure that data warehouses can be loaded effectively regardless of the data modelling approach used.
[39] Interestingly, in Scotland, which has different regulations from England & Wales, birth, marriage and death certificates (also known as vital records) have, since 1855, recognised the importance of knowing the birth names of everyone on the certificate. For example a wedding certificate gives the groom's mother's maiden name, and a married woman's death certificate will also feature her maiden name. Effectively the birth name has lifetime value and all other names are additional information. http://www.scotlandspeople.gov.uk/content/help/index.aspx?r=554&628
[40] Nearly all business intelligence databases support this functionality.
[41] CURRENT_ and PREVIOUS_ are reserved prefixes; see Appendix 1 - Data Modelling Standards.




 
Connections a life in the day of - david walker
Connections   a life in the day of - david walkerConnections   a life in the day of - david walker
Connections a life in the day of - david walker
 
Conspectus data warehousing appliances – fad or future
Conspectus   data warehousing appliances – fad or futureConspectus   data warehousing appliances – fad or future
Conspectus data warehousing appliances – fad or future
 
An introduction to social network data
An introduction to social network dataAn introduction to social network data
An introduction to social network data
 
Using the right data model in a data mart
Using the right data model in a data martUsing the right data model in a data mart
Using the right data model in a data mart
 
Implementing Netezza Spatial
Implementing Netezza SpatialImplementing Netezza Spatial
Implementing Netezza Spatial
 
Storage Characteristics Of Call Data Records In Column Store Databases
Storage Characteristics Of Call Data Records In Column Store DatabasesStorage Characteristics Of Call Data Records In Column Store Databases
Storage Characteristics Of Call Data Records In Column Store Databases
 
UKOUG06 - An Introduction To Process Neutral Data Modelling - Presentation
UKOUG06 - An Introduction To Process Neutral Data Modelling - PresentationUKOUG06 - An Introduction To Process Neutral Data Modelling - Presentation
UKOUG06 - An Introduction To Process Neutral Data Modelling - Presentation
 
Oracle BI06 From Volume To Value - Presentation
Oracle BI06   From Volume To Value - PresentationOracle BI06   From Volume To Value - Presentation
Oracle BI06 From Volume To Value - Presentation
 
Openworld04 - Information Delivery - The Change In Data Management At Network...
Openworld04 - Information Delivery - The Change In Data Management At Network...Openworld04 - Information Delivery - The Change In Data Management At Network...
Openworld04 - Information Delivery - The Change In Data Management At Network...
 

Kürzlich hochgeladen

The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 

Kürzlich hochgeladen (20)

The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 

White Paper - Process Neutral Data Modelling

Synopsis

This paper describes in detail the process for creating an enterprise data warehouse physical data model that is less susceptible to change. Change is one of the largest ongoing costs in a data warehouse, so reducing change reduces the total cost of ownership of the system. This is achieved by removing business-process-specific data and concentrating on core business information.

The white paper examines why data-modelling style is important and how issues arise when using a data model for reporting. It discusses a number of techniques and proposes a specific solution. The techniques are worth considering when building a data warehouse solution even where an organisation decides against using the specific solution.

This paper is intended for a technical audience and for project managers involved with the technical aspects of a data warehouse project.

Intended Audience

Reader                   Recommended Reading
Executive                Synopsis
Business Users           Synopsis
IT Management            Synopsis
IT Strategy              Entire Document
IT Project Management    Entire Document
IT Developers            Entire Document

About Data Management & Warehousing

Data Management & Warehousing is a specialist consultancy in data warehousing, based in Wokingham, Berkshire in the United Kingdom. Founded in 1995 by David M Walker, our consultants have worked for major corporations around the world, including in the US, Europe, Africa and the Middle East. Our clients are invariably large organisations with a pressing need for business intelligence. We have worked in many industry sectors and have specialists in telcos, manufacturing, retail, financial services and transport, as well as technical expertise in many of the leading technologies.

For further information visit our website at: http://www.datamgmt.com

Crossword Clue: Expert Gives Us Real Understanding (4 letters)

Introduction

Commissioning a data warehouse system is a major undertaking, and organisations will invest significant capital in the development of the system. The data model is always a major consideration, and many projects spend a significant part of the budget developing and re-working the initial data model. Unfortunately, projects also often fail to look at the maintenance costs of the data model that they develop. A data model that is fit for purpose when developed will rapidly become an expensive overhead if it needs to change whenever the source systems change. The cost involved is not only in the change to the data model but also in the changes to the ETL that feeds the data model.

This problem is exacerbated by the fact that changes to the data model may be made in a way that is inconsistent with the original design approach. The data model loses transparency and becomes even more difficult to maintain. For many large data warehouse solutions it is not uncommon, within a short time of going live, to have one resource permanently assigned to maintaining the data model and several more assigned to managing the change in the associated ETL.

By understanding the problem and using techniques imported from other areas of systems and software development, as well as change management techniques, it is possible to define a method that greatly reduces this overhead. This white paper sets out an example of the issues from which to develop a statement of requirements for the data model, and then demonstrates a number of techniques which, when used together, can address those requirements in a sustainable way.

The Problem

Data modelling is the process of defining the database structures in which to hold information. To understand the Process Neutral Data Modelling approach, this paper first looks at why these database structures have such an impact on the data warehouse. In order to demonstrate the issues with creating a data model for a data warehouse, more experienced readers are asked to bear with the necessarily simplistic examples that follow.

The Example Company

A company supplies and installs widgets. There are a number of different widget types, each having a name and a specific colour. Each individual widget has a unique serial number and can have a number of red lamps and a number of green lamps plugged into it. The widgets are installed into cabinets at customer sites, and from time to time engineers come in and change the relative numbers of red and green lamps. Cabinets are identified by the customer name and a customer cabinet number. For operational systems [1] the data model might look something like this [2]:

Figure 1 - Initial Operational System Data Model

This simple data model describes both the widget and the cabinet and provides the current combinations. It does not provide any historical context: "What was the previous configuration and when was it changed?"

Notes:
1. Data models in this document are illustrative and should therefore be viewed as suitable for making specific points rather than as complete production-quality solutions. Some errors exist to explicitly demonstrate certain issues.
2. There are several conventions for data modelling. In this and subsequent diagrams the link with a 1 and ∞ represents a one-to-many relationship, where the '1' record is a primary key field and the '∞' represents the foreign key field.

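Figure 1 itself is not reproduced in this transcription. The following is a minimal DDL sketch of how such an operational model might look; WIDGET_LOCATIONS is named in the text (see note 3 below), but the other table and column names are assumptions made for illustration, not the contents of the original figure.

    -- Widget catalogue: each type has a name and a specific colour
    CREATE TABLE WIDGET_TYPES (
        WIDGET_TYPE_ID  INTEGER      NOT NULL PRIMARY KEY,
        NAME            VARCHAR(50)  NOT NULL,
        COLOUR          VARCHAR(20)  NOT NULL
    );

    -- Individual widgets: a unique serial number and current lamp counts
    CREATE TABLE WIDGETS (
        SERIAL_NUMBER   INTEGER      NOT NULL PRIMARY KEY,
        WIDGET_TYPE_ID  INTEGER      NOT NULL REFERENCES WIDGET_TYPES,
        RED_LAMPS       INTEGER      NOT NULL,
        GREEN_LAMPS     INTEGER      NOT NULL
    );

    -- Cabinets are identified by customer name and a customer cabinet number
    CREATE TABLE CABINETS (
        CABINET_ID      INTEGER      NOT NULL PRIMARY KEY,
        CUSTOMER_NAME   VARCHAR(100) NOT NULL,
        CABINET_NUMBER  INTEGER      NOT NULL
    );

    -- Which widget currently sits in which cabinet (current state only, no history)
    CREATE TABLE WIDGET_LOCATIONS (
        SERIAL_NUMBER   INTEGER      NOT NULL PRIMARY KEY REFERENCES WIDGETS,
        CABINET_ID      INTEGER      NOT NULL REFERENCES CABINETS
    );
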
Historical data can be recorded by simply adding a start date and an end date to each of the main tables [3]. This provides the ability to report on the historical configuration. To facilitate this, a separate reporting environment would be set up, because retaining history in the operational system would unacceptably reduce the operational system's performance. There are three consequences of doing this:

• Queries are now more complex. In order to report the information for a given date, the query has to allow for the required date falling between the start date and the end date of the record in each of the tables. The extra complexity slows the execution of the query.

• The volume of data stored has increased. The storage of dates has a minor impact on the size of each row, but this is small when compared to the number of additional rows that need to be stored [4].

• Data has to be moved from the operational system to the reporting system via an extract, transform and load (ETL) process. This process has to extract the data from the operational system, compare the records to the current records in the reporting system to determine if there are any changes and, if so, make the required adjustments to the existing record (e.g. updating the end date) and insert the new record. The process is already more complex and time-consuming than simply copying the data across [5].

Figure 2 - Initial Reporting System Data Model

When the reporting system is built, it accurately reflects the current business processes and operational systems, and it provides historical data. From a systems management perspective there is now an additional database and a series of ETL or interface scripts that have to be run reliably every day. An example of the kind of point-in-time query the first consequence describes is sketched below.

Notes:
3. Note that the WIDGET_LOCATIONS table requires an additional field called INSTALL_SEQUENCE to allow for the case where a widget is re-installed in a cabinet.
4. Assume that everything remains the same except that widgets are moved around (i.e. there are no new widgets and no new cabinet/customer combinations); then the WIDGET_LOCATIONS table grows in direct proportion to the number of changes. If each widget were modified in some way once a month, the reporting system table would be twelve times bigger than the operational system table after one year, and this before any other change is handled.
5. Additional functionality such as data cleansing will also add to the complexity of the ETL and affect performance.

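Continuing the sketch above, and assuming START_DATE and END_DATE columns have been added to each table (with an open END_DATE marking the current row), an as-of query might look like the following. The same predicate has to be repeated for every date-ranged table, which is exactly the extra complexity described above.

    -- Reconstruct the widget/cabinet configuration as it was on 30 June 2008
    SELECT w.SERIAL_NUMBER, w.RED_LAMPS, w.GREEN_LAMPS,
           c.CUSTOMER_NAME, c.CABINET_NUMBER
    FROM   WIDGETS w
    JOIN   WIDGET_LOCATIONS wl ON wl.SERIAL_NUMBER = w.SERIAL_NUMBER
    JOIN   CABINETS c          ON c.CABINET_ID     = wl.CABINET_ID
    -- the as-of predicate, once per date-ranged table:
    WHERE  w.START_DATE  <= DATE '2008-06-30'
    AND   (w.END_DATE    IS NULL OR w.END_DATE  > DATE '2008-06-30')
    AND    wl.START_DATE <= DATE '2008-06-30'
    AND   (wl.END_DATE   IS NULL OR wl.END_DATE > DATE '2008-06-30');
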
The systems architecture may be further enhanced so that the reporting system becomes a data warehouse and the users make their queries against data marts: sets of tables where the data has been re-structured in order to simplify the users' query environment. The data marts typically use star-schema or snowflake-schema data modelling techniques, or tool-specific storage strategies [6]. This adds an additional layer of ETL to move between the data warehouse and the data mart.

However, the company does not stop here. The product development team create a new type of widget. This new widget allows amber lamps and can optionally be mounted in a rack that is in turn mounted in a cabinet. The IT director also insists that the new OLTP application be more flexible for other future developments. These business process changes result in a new data model for the operational system.

Figure 3 - Second Version Operational System Data Model

The reporting system is by now also a live system with a large amount of historical information, and it too has to be re-designed. The operational system will be implemented to meet the business requirements and timescales regardless of whether the reporting system is ready, and it may not be possible to create the history required by the new data model when it is changed [7]. If a data mart is built from the data warehouse there are two impacts: firstly, the data mart model will need to be changed to exploit the new data; secondly, the change to the data warehouse model will require the data mart ETL to be modified regardless of any changes to the data mart data model.

The example company does not stop here, however, as senior management decide to acquire a smaller competitor. The new subsidiary has its own systems that reflect its own business processes. The data warehouse was built with a promise of providing integrated management reporting, so there is an expectation that the data from the new source system will be quickly and seamlessly integrated into the data warehouse. From a technical perspective this presents issues around mapping the new source system data model to the existing data warehouse data model, critical information data types [8], duplication of keys [9], etc., all of which cause problems with the integration of data and therefore slow down the processing.

Within a few short iterations of change it is possible to see the dramatic impact on the data warehouse, and that the system is likely to run into issues.

Notes:
6. This is accepted good practice; the design and implementation of data marts is outside the scope of this paper.
7. A common example of this is an organisation that captures whether an individual is married or not. Later the organisation decides to capture the name of the partner if someone is married. It is not possible to create the historical information systemically, so for a period of time the system has to support the continued use of the marital status and then possibly run other activities, such as outbound calling, to complete the missing historical data.
8. The example database assumed that the serial number was numeric and used it as a primary key, but what happens if the acquired company uses alphanumeric serial numbers?
9. If both companies use numbers starting from 1 for their customer IDs then there will be two customers who have the same 'unique' ID, and customers that have two 'unique' IDs.

The Real World

The example above is designed to illustrate some of the issues that affect data warehouse data modelling. In reality business and technical analysts will handle some of these issues in the design phase, but how big is the data-modelling problem in the real world?

• A UK transport industry organisation has three mainframes, each of which is only allowed to perform one release a quarter. Each system also feeds the data warehouse. As a consequence the mainframe feeds require validation and change every month. Whilst the main data comes from these three systems, there are sixty-five other Unix-based operational systems that feed the data warehouse, and several hundred desktop-based applications that also provide data. Most of these source systems do not have good change control or governance procedures to assist in impact analysis. Change for this organisation is business as usual.

• A global ERP vendor supplies a system with over five thousand database objects and typically makes a major release every two years and a 'dot' release every six months, with numerous patches and fixes in between each major release. This type of ERP system is in use in nearly every major company, and the data is a critical source for most data warehouses.

• A global food and drink manufacturer that came into existence as a result of numerous mergers and acquisitions, and that also divested some assets, found itself with one hundred and thirty-seven general ledger instances in ten countries with seventeen different ERP packages. Even where the ERP packages were the same, they were not necessarily at the same version. The business intelligence requirement was for a single data warehouse and a single data model.

• A European telco purchased a three-hundred-table 'industry standard' enterprise data model from a major business intelligence vendor and then spent two years analysing it before starting the implementation. Within six months of implementation they had changed some sixty percent of the tables as a result of analysis omissions.

• A UK-based banking and insurance business outsources all of its product management to business partners and only maintains the unified customer management systems (website, call centres and marketing). As a result nearly all of the 'source systems' are external to the organisation, and whilst there are contractual agreements that the format and data remain fixed, in practice there is significant regular change in the format and information provided to both operational and reporting systems.

Obviously these issues cannot be fixed just by creating the correct data model for the data warehouse [10], but the objective of the data model design should be twofold:

• To ensure that all the required data can be stored effectively in the data warehouse.
• To ensure that the design of the data model does not impose cost and, where possible, actively reduces the cost of change on the system.

Notes:
10. Data Management & Warehousing have published a number of other white papers, available at http://www.datamgmt.com, that look at other aspects of data warehousing and address some of these issues. See Further Reading at the end of this document for more details.

The Customer Paradigm

Data warehouse developments often start with a requirements gathering exercise. This may take the form of interviews or workshops where people try to define what the customer is. If a number of different parts of the business are involved, the definition of customer soon becomes confused and controversial, and this negatively impacts the project.

Most organisations have a sales funnel that describes the process of capturing, qualifying, converting and retaining customers. Marketing say that the customer is anyone and everyone that they communicate with. The sales teams view the customer as those organisations in their qualified lead database, or for whom they have account management responsibility post-sales. The customer services team are clear that the customer is only those organisations who have purchased a product and, where appropriate, have purchased a support agreement as well. Other questions are asked in the workshops, such as "What about customers who are also suppliers or partners?" and "How do we deal with customers who have gone away and then come back after a long period of time?"

Figure 4 - The Sales Funnel

The most common solutions that result either add 'flag' or 'indicator' columns to the customer table to represent each category, or create multiple tables for the different categories required and repeat the data in each of the tables. This clearly demonstrates the business process being embedded into the data model: the current business process definition(s) of customer are dictating how the data model is created.

What has been forgotten is that these 'customers' exist outside the organisation, and it is their interaction with different parts of the organisation that defines their status as customer, supplier, etc. In legal documents there is the concept of a 'party', where a party is a person or group of persons that compose a single entity that can be identified as one for the purposes of the law [11]. This definition is one that should be borrowed and used in the data model.

If users query a data mart that is loaded with data extracted from the transaction repository, and data marts are built for a specific team or function that only requires one definition of the data, then the current definition can be used to build that data mart, and different definitions can be used for other departments [12]. The sketch below illustrates the difference between the two approaches.

Notes:
11. http://en.wikipedia.org/wiki/Party_(law)
12. This also allows flexibility as business processes change: it is possible, at a cost, to change the rules by which data is extracted. The cost of this change is much lower than trying to rebuild the data warehouse and data mart with a new definition.

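As a hedged illustration (all table, column and view names here are hypothetical, not taken from the paper's own model): the first block bakes the current process definitions into the table as flags; the second keeps a neutral PARTY table and derives one department's definition of 'customer' in the data mart layer.

    -- Process-embedded approach: role flags multiply as processes change
    CREATE TABLE CUSTOMERS (
        CUSTOMER_ID  INTEGER      NOT NULL PRIMARY KEY,
        NAME         VARCHAR(100) NOT NULL,
        IS_PROSPECT  CHAR(1),
        IS_CUSTOMER  CHAR(1),
        IS_SUPPLIER  CHAR(1)
    );

    -- Process-neutral approach: a party exists independently of any role
    CREATE TABLE PARTIES (
        PARTY_DWK    INTEGER      NOT NULL PRIMARY KEY,
        PARTY_NAME   VARCHAR(100) NOT NULL
    );

    CREATE TABLE CONTRACTS (
        CONTRACT_DWK  INTEGER     NOT NULL PRIMARY KEY,
        PARTY_DWK     INTEGER     NOT NULL REFERENCES PARTIES,
        CONTRACT_TYPE VARCHAR(20) NOT NULL,
        END_DATE      DATE                 -- null while the contract is active
    );

    -- Customer Services' qualified definition, derived for their data mart:
    -- "parties with an active service contract"
    CREATE VIEW CS_ACTIVE_CUSTOMERS AS
    SELECT p.PARTY_DWK, p.PARTY_NAME
    FROM   PARTIES p
    JOIN   CONTRACTS c ON c.PARTY_DWK = p.PARTY_DWK
    WHERE  c.CONTRACT_TYPE = 'SERVICE'
    AND    c.END_DATE IS NULL;

Other departments would derive their own views from the same PARTIES table, so the warehouse itself never has to change when a definition of 'customer' does.
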
As a result of this approach two questions are common:

• Isn't one of the purposes of building a data warehouse to have a single version of the truth?
Yes. There is a single version of the truth in the data warehouse, and this single version is perpetuated into the data marts; the difference is that the information in the data mart is qualified. Asking the question "How many customers do we have?" should get the answer "Customer Services have X active service contract customers" and not the answer "X" without any further qualification.

• What happens if different teams or departments have different data?
People within the organisation work within different processes and with the same terminology but often different definitions. It is unlikely and impractical in the short term to change this, although it is possible that in the long term the data warehouse project will help with the standardisation process. In the meantime it is an education process to ensure that answers are qualified. It is important to recognise that different departments legitimately have different definitions, and therefore to recognise and understand the differences rather than fighting about who is right.

It might be argued that there are too many differences to put all individuals and organisations in a single table; this and other issues are discussed later in the paper.

Requirements of a Data Warehouse Data Model

Having looked at the problems that can affect a data warehouse data model, it is possible to describe the requirements that should be made of any data model design.

Assumptions

1. The data model is for use in the architectural component called the transaction repository [13] or data warehouse.
2. As the data model is used in the data warehouse, it will not be a place where users go to query the data; instead users will query separate dependent data marts.
3. As the data model is used in the data warehouse, data will be extracted from it by ETL tools to populate the data marts.
4. As the data model is used in the data warehouse, data will be loaded into it from the source systems by ETL tools.
5. Direct updates (i.e. not through formally released ETL processes) will be prohibited; instead a separate application or applications will exist as a surrogate source.
6. The data model will not be used in a 'mixed mode' where some parts use one data modelling convention and other parts use another. (This is generally bad practice with any modelling technique, but it is often the outcome where the responsibility for data modelling is distributed or re-assigned over time.)

Requirements

1. The data model will work on any standard business intelligence relational database [14]. This is to ensure that it can be deployed on any current platform and, if necessary, re-deployed on a future platform.
2. The data model will be process neutral, i.e. it will not reflect current business processes, practices or dependencies, but will instead store the data items and relationships as defined by their use at the point in time when the information is acquired.
3. The data model will use a design pattern [15], i.e. a general reusable solution to a commonly occurring problem. A design pattern is not a finished design but a description or template for how to solve a problem that can be used in many different situations.

Notes:
13. For further information on transaction repositories see the Data Management & Warehousing white paper "An Overview Architecture For Enterprise Data Warehouses".
14. A typical list would (at the time of writing) include IBM DB2, Microsoft SQL Server, Netezza, Oracle, Sybase, Sybase IQ and Teradata. For the purposes of this document it implies compliance with at least the SQL92 standard.
15. http://en.wikipedia.org/wiki/Software_design_pattern

4. Convention over configuration [16]: this is a software design paradigm that seeks to decrease the number of decisions developers need to make, gaining simplicity without necessarily losing flexibility. It can be applied successfully to data modelling, reducing the number of decisions the data modeller makes by ensuring that tables and columns use a standard naming convention and are populated and queried in a consistent fashion. This also has a significant impact on the effort required of an ETL developer.
5. The design should also follow the DRY (Don't Repeat Yourself) principle. This is a process philosophy aimed at reducing duplication. The philosophy emphasises that information should not be duplicated, because duplication increases the difficulty of change, may decrease clarity, and leads to opportunities for inconsistency [17].
6. The data model should be significantly static over a long period of time, i.e. there should not be a need to add or modify tables on a regular basis. Here there is a difference between designed and implemented: it is possible to have designed a table but not to implement it until it is actually required. This does not affect the static nature of the data model, as the placeholder already exists.
7. The data model should store data at the lowest possible level [18] and avoid the storage of aggregates.
8. The data model should support the best use of platform-specific features whilst not compromising the design [19].
9. The data model should be completely time-variant, i.e. it should be possible to reconstruct the information at any available point in time [20].
10. The data model should act as a communication tool to aid the refinement of requirements and the explanation of possibilities.

Notes:
16. For further information see http://en.wikipedia.org/wiki/Convention_over_Configuration and http://softwareengineering.vazexqi.com/files/pattern.html. The Ruby on Rails framework (http://www.rubyonrails.org/) makes extensive use of this principle.
17. DRY is a core principle of Andy Hunt and Dave Thomas's book The Pragmatic Programmer. They apply it quite broadly to include "database schemas, test plans, the build system, even documentation." When the DRY principle is applied successfully, a modification of any single element of a system does not change other logically unrelated elements. Additionally, elements that are logically related all change predictably and uniformly, and are thus kept in sync (http://en.wikipedia.org/wiki/DRY). This does not automatically imply database normalisation, but database normalisation is one method for ensuring 'dryness'.
18. This is the origin of the term 'transaction repository' rather than 'data warehouse' in Data Management & Warehousing documentation: the transaction repository stores the lowest level of data that is practical and/or available. (See "An Overview Architecture for Enterprise Data Warehouses".)
19. This turns out to be both simple and very effective. For Oracle the most common features that need support include partitioning and materialized views. For Sybase IQ and Netezza there is a preference for inserts over updates due to their internal storage mechanisms. For all databases there is variation in indexing strategies. These and other features should be easily accommodated.
20. Also known as temporal. Most data warehouses are not linearly time-variant but quantum time-variant. If a status field is updated three times in a day and the data warehouse reflects all changes, then it is linearly time-variant. If a data warehouse holds only the first and last values because a batch process loads it once a day, then it is quantum time-variant where the quantum is, in this case, one day. Quantum time-variant solutions can only resolve data to the level of the quantum unit of measure.

The Data Model

Having defined the requirements for the data model, it is now possible to start designing it. This is done by breaking down the tables that will be created into different groups depending on how they are used. The section below discusses the main elements of the data model. Some basics, such as naming conventions, standard short names and the keys used in the data model, are not described here; a complete set of data modelling rules and example models can be found in the appendices.

Major Entities

Party is, as described in the customer paradigm section above, an example of a type of table within the Process Neutral Data Modelling method known as a 'major entity'. These are tables that provide the placeholders for all major subject areas of the data model and around which other information is grouped. Each business transaction will relate to a number of major entities. Some major entities are global, i.e. they apply to all types of organisation (e.g. Calendar), and a number of major entities are industry specific (e.g. for telcos, manufacturing, retail, banking, etc.). It would be very unusual for an organisation to need a major entity that was not industry-wide. Below is a list of some of the most common:

• Calendar
Every data warehouse will need a calendar. It should always contain data to the day level and never to parts of the day. In some cases there is a need to support sub-types of calendar for non-Gregorian calendars [21].

• Party
Every organisation will have dealings between parties. This will normally include three major sub-types: individuals, organisations (any formal organisation such as a company, charity, trust, partnership, etc.) and organisational units (the components within an organisation, including the system owner's organisation).

• Geography
The information about where. This is normally sub-typed into two components: address and location. Address information is often limited to postal addresses [22], whilst location is normally described by longitude and latitude via GPS co-ordinates. Other specialist geographic models exist that may need to be taken into account [23].

• Product_Service (also known as Product or as Service)
This is the catalogue of the products and/or services that an organisation supplies.

• Account
Every customer will have at least one account if financial transactions are involved (even organisations that do not think they currently use the concept of an account will do so, as accounting systems always have the concept of a customer with one or more accounts).

Notes:
21. See http://www.qppstudio.net/footnotes/non-gregorian.htm for various calendars; notably 2008 is the Muslim year 1429 and the Jewish year 5768.
22. Some countries, such as the UK, have validated lists of all addresses (see the UK Post Office Postcode Address File at http://www.royalmail.com/portal/rm/jump2?mediaId=400085&catId=400084).
23. Network Rail in the UK use an Engineer's Line Reference, which is based on a linear reference model and refers to a known distance from a fixed point on a track. Switzerland has an entire national co-ordinate system (http://en.wikipedia.org/wiki/Swiss_coordinate_system).

• Electronic_Address
Any electronic address, such as a telephone number, email address, web address, IP address, etc. This is normally sub-typed by the categories used.

• Asset (also known as Equipment)
A physical object that can be uniquely identified (normally by a serial number or similar). This may be used or incorporated in a PRODUCT_SERVICE, or sold to a customer, etc. In the example company, Cabinet, Rack and Widget were all examples of Asset, whilst Widget Type was an example of PRODUCT_SERVICE.

• Component
A physical object that cannot be uniquely identified by a serial number but has a part number, and is used in the make-up of either an asset or a product service. In the example company there was no particular record of the serial numbers of the lamps; however, they would all have had a part number that described the type of lamp to be used.

• Channel
A conceptual route to market (e.g. direct, indirect, web-based, call-centre, etc.).

• Campaign
A marketing exercise that is designed to promote the organisation, e.g. the running of a series of adverts on television.

• Campaign Activities
The running of a specific advert as part of a larger campaign.

• Contract
Depending on the type of business, the relationship between the organisation and its supplier or its customer may require the concept of a contract as well as that of an account.

• Tariff (also known as Price_List)
A set of charges and discounts that can be applied to product services at a point in time.

This list is not comprehensive, but if an organisation can effectively describe its major entities and combine this information with the interactions between them (the occurrences or transactions) then it has the basis of a very successful data warehouse. Major entities can have any meaningful name provided it is not a reserved word in the database or (as will be seen below) a reserved word within the design pattern of Process Neutral Data Modelling.

Readers who are familiar with the concepts of star schemas and data marts will notice that these major entities are very close to the basic dimensions that most data marts use. This should come as no surprise, as these are the major data items of any business regardless of its business processes or specific industry sector, and a data mart is only a simplification of the data presented to the user. This effect is called "natural star schemas" and will be explored in more detail later; a query sketch after this list illustrates the idea.

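The occurrence and transaction tables that record the interactions between major entities are described later in the paper, but a hypothetical query shows why the major entities behave as natural dimensions. SALES_EVENTS and the short names CALEND and PROSER are assumptions made for illustration only.

    -- Each _DWK on the hypothetical event table is a foreign key to a major
    -- entity, so the major entities fall out as the dimensions of a star query
    SELECT cal.CALENDAR_DATE,
           pty.PARTY_NAME,
           prd.PRODUCT_SERVICE_NAME,
           SUM(evt.QUANTITY) AS UNITS
    FROM   SALES_EVENTS     evt
    JOIN   CALENDAR         cal ON evt.CALEND_DWK = cal.CALEND_DWK
    JOIN   PARTIES          pty ON evt.PARTY_DWK  = pty.PARTY_DWK
    JOIN   PRODUCT_SERVICES prd ON evt.PROSER_DWK = prd.PROSER_DWK
    GROUP BY cal.CALENDAR_DATE, pty.PARTY_NAME, prd.PRODUCT_SERVICE_NAME;
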
Lifetime Value

The next decision is which columns (attributes) should be included in the table. Much as in the process of normalising a database [24], the objective is to minimise duplication of data; there is also a requirement to minimise updates. To this end, the attributes that are included should have 'lifetime value', i.e. they should remain constant once they have been inserted into the database. This means that variable data needs to be handled elsewhere. Using some of the major entities above as examples:

Calendar:
  Lifetime value attributes: Date, Public Holiday Flag

Geography:
  Lifetime value attributes: Address Line 1, Address Line 2, City, Postcode [25], County, Country
  Non-lifetime value attributes: Population

Party (Individuals):
  Lifetime value attributes: Forename, Surname [26], Date of Birth, Date of Death, Gender [27], State ID Number
  Non-lifetime value attributes: Marital Status, Number of Children, Income

Party (Organisations):
  Lifetime value attributes: Name, Start Date, End Date, State ID Number
  Non-lifetime value attributes: Number of Employees, Turnover, Shares Issued

Account:
  Lifetime value attributes: Account Number, Start Date, End Date
  Non-lifetime value attributes: Balance

Other than this lifetime value requirement for columns, every table must comply with the general rules for any table. For example, every table will have a key column that uses the table short name made up of six characters and the suffix _DWK [28], a TIMESTAMP column and an ORIGIN column. A sketch of a major entity built to these rules follows.

Notes:
24. http://en.wikipedia.org/wiki/Database_normalization: database normalisation is a technique for designing relational database tables to minimise duplication of information and, in so doing, to safeguard the database against certain types of logical or structural problems, namely data anomalies.
25. This may occasionally be a special case, as postal services do, from time to time, change postal codes that are normally static.
26. There is a specific special case that deals with the change of name for married women; this is dealt with in the section 'The Party Special Case' later.
27. One insurance company had to deal with updatable genders because underwriting rules require assessment based on birth gender and not gender as a result of re-assignment surgery. For marketing it therefore had to handle 'current' gender and for underwriting 'birth' gender.
28. See the data modelling rules appendix for how this name is created.

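A minimal sketch of a major entity built to these rules, using the individual attributes listed above. The exact six-character short name for PARTY and the column sizes are assumptions; the authoritative conventions are in Appendix 1.

    CREATE TABLE PARTIES (
        PARTY_DWK      INTEGER     NOT NULL PRIMARY KEY,  -- short name + _DWK (assumed here)
        PARTYP_DWK     INTEGER     NOT NULL,              -- type: individual, organisation, ...
        FORENAME       VARCHAR(50),                       -- lifetime value attributes only;
        SURNAME        VARCHAR(50),                       -- marital status, income, etc. are
        DATE_OF_BIRTH  DATE,                              -- variable and so held elsewhere
        DATE_OF_DEATH  DATE,
        GENDER         CHAR(1),
        STATE_ID_NUM   VARCHAR(20),
        TIMESTAMP      TIMESTAMP   NOT NULL,              -- when the row was loaded
        ORIGIN         VARCHAR(30) NOT NULL               -- which source system supplied it
    );
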
Type Tables

There is often a need to categorise information into discrete sets of values. The valid set of categories will probably change over time and therefore each category record also needs to have lifetime value. Examples of this categorisation have already occurred with some of the major entities:

• Party: Individual, Organisation, Organisation Unit
• Geography: Postal Address, Location
• Electronic Address: Telephone, E-Mail

To support this, and to comply with the requirement for convention over configuration, all _TYPES tables of this format have a standard data model as follows:

• The table will have the same name as the major entity but with the suffix _TYPES (e.g. PARTY_TYPES, GEOGRAPHY_TYPES, etc.).
• The table will always have a key column that uses the six character short code and the _DWK suffix.
• The table will have a _TYPE column that is the type name.
• The table will have a _DESC column that is a description of the type.
• The table will have a _GROUP column that groups certain types together.
• The table will have a _START_DATE column and an _END_DATE column.

This is a type table in its entirety. If a table needs more information (i.e. columns) then it is not a _TYPES table and must not have the _TYPES suffix, as it does not comply with the rules for a _TYPES table. Examples of data in _TYPES tables might include:

PARTY_TYPES
Row 1: PARTYP_DWK = 1; PARTY_TYPE = INDIVIDUAL; PARTY_TYPE_DESC = "An Individual"; PARTY_TYPE_GROUP = INDIVIDUAL; PARTY_TYPE_START_DATE = 01-JAN-1900; PARTY_TYPE_END_DATE = (null)
Row 2: PARTYP_DWK = 2; PARTY_TYPE = LTD COMPANY; PARTY_TYPE_DESC = "A company in which the liability of the members in respect of the company's debts is limited"; PARTY_TYPE_GROUP = ORGANISATION; PARTY_TYPE_START_DATE = 01-JAN-1900; PARTY_TYPE_END_DATE = (null)
Row 3: PARTYP_DWK = 3; PARTY_TYPE = PARTNERSHIP; PARTY_TYPE_DESC = "A business owned by two or more people who are personally liable for all business debts"; PARTY_TYPE_GROUP = ORGANISATION; PARTY_TYPE_START_DATE = 01-JAN-1900; PARTY_TYPE_END_DATE = (null)
Row 4: PARTYP_DWK = 4; PARTY_TYPE = DIVISION; PARTY_TYPE_DESC = "A division of a larger organisation"; PARTY_TYPE_GROUP = ORGANISATION UNIT; PARTY_TYPE_START_DATE = 01-JAN-1900; PARTY_TYPE_END_DATE = (null)
Figure 5 - Example data for PARTY_TYPES

The start date has little initial value in this context, although it is a mandatory field [29] and therefore has to be completed with a date before the earliest party in this example. Legal types of organisation do change over time, so it is possible that the start and end dates of these will become significant. These types do not describe the type of role that the party is performing (i.e. Customer, Supplier, etc.); they describe the type of the party (e.g. Individual, etc.). Describing the role comes later. The type and group columns are repeated for INDIVIDUAL, as there is no hierarchy of information for this value but the field is mandatory.

[29] Start Dates in _TYPES tables are mandatory as, with only a few exceptions, they are required information. In order to be consistent they therefore have to be mandatory for all _TYPES tables.
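Because the convention fixes every column, a _TYPES table can be written down mechanically. The sketch below assumes the six-character short code PARTYP for PARTY_TYPES; the data types are illustrative only.

-- Hypothetical _TYPES table following the standard pattern.
-- Every _TYPES table has exactly these columns and no more.
CREATE TABLE PARTY_TYPES (
    PARTYP_DWK             INTEGER      NOT NULL,  -- six-character short code + _DWK
    PARTY_TYPE             VARCHAR(30)  NOT NULL,  -- the type name
    PARTY_TYPE_DESC        VARCHAR(255),           -- description of the type
    PARTY_TYPE_GROUP       VARCHAR(30),            -- groups related types together
    PARTY_TYPE_START_DATE  DATE         NOT NULL,  -- mandatory (see footnote above)
    PARTY_TYPE_END_DATE    DATE,                   -- null while the type is current
    PRIMARY KEY (PARTYP_DWK)
);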
GEOGRAPHY_TYPES
Row 1: GEOTYP_DWK = 1; GEOGRAPHY_TYPE = POSTAL; GEOGRAPHY_TYPE_DESC = "An address as supported by the postal service"; GEOGRAPHY_TYPE_GROUP = POSTAL; GEOGRAPHY_TYPE_START_DATE = 01-JAN-1900; GEOGRAPHY_TYPE_END_DATE = (null)
Row 2: GEOTYP_DWK = 2; GEOGRAPHY_TYPE = LOCATION; GEOGRAPHY_TYPE_DESC = "A point on the surface of the earth defined by its longitude and latitude"; GEOGRAPHY_TYPE_GROUP = LOCATION; GEOGRAPHY_TYPE_START_DATE = 01-JAN-1900; GEOGRAPHY_TYPE_END_DATE = (null)
Figure 6 - Example Data for GEOGRAPHY_TYPES

The start date in this context has little initial value, although it is a mandatory field and therefore has to be completed with a date. These types do not describe the type of role that the geography is performing (i.e. home address, work address, etc.); they describe the type of the geography (postal address, point location, etc.). The type and group columns are repeated for both values, as there is no hierarchy of information for them.

CALENDAR_TYPES
The convention over configuration design aspect allows for this table; however it is rarely needed and can therefore be omitted. This is an example where a table can be described as designed (i.e. it is known exactly what it looks like) but not implemented.

_TYPES tables will appear in other parts of the data model but they will always have the same function and format. The consequence of this design re-use is that implementing an application [30] to manage the source of _TYPES data is easy. The system that manages the type data needs to have a single table with the same columns as a standard _TYPES table and an additional column called, for example, DOMAIN. This DOMAIN column holds the target system table name (e.g. PARTY_TYPES). The ETL then simply maps the data from the source system to the target system where the DOMAIN equals the target table name. This is an example of re-use generating a significant saving in the implementation.

[30] This is a good use of a Warehouse Support Application as defined in "An Overview Architecture for Enterprise Data Warehouses".
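This mapping can be illustrated with a short sketch. It assumes a single source table SRC_TYPES maintained by the warehouse support application; the source table name and its column names are assumptions for the example.

-- Hypothetical ETL mapping: one generic source table feeds every
-- _TYPES table, with the DOMAIN column selecting the target.
INSERT INTO PARTY_TYPES
    (PARTYP_DWK, PARTY_TYPE, PARTY_TYPE_DESC,
     PARTY_TYPE_GROUP, PARTY_TYPE_START_DATE, PARTY_TYPE_END_DATE)
SELECT type_dwk, type_name, type_desc,
       type_group, type_start_date, type_end_date
FROM   SRC_TYPES
WHERE  DOMAIN = 'PARTY_TYPES';   -- the target table name selects the rows

The same statement, with only the target table and the DOMAIN literal changed, loads every other _TYPES table, which is where the saving comes from.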
Band Tables

Whilst _TYPES tables classify information into discrete values, it is sometimes necessary to classify information into ranges or bands, i.e. between one value and another. The classic example of this is telephone calls, which are classified as 'Off-Peak Rate' if they are between 00:00 and 07:59 or between 18:00 and 23:59. Calls between 08:00 and 17:59 are classified as 'Peak Rate' and charged at a premium. _BANDS is a special case of the _TYPES table and would store the data as follows:

TIME_BANDS
Row 1: TIMBAN_DWK = 1; TIME_BAND = Early Off Peak; TIME_BAND_START_VALUE = 0 [31]; TIME_BAND_END_VALUE = 479; TIME_BAND_DESC = Early Off Peak; TIME_BAND_GROUP = Off Peak; TIME_BAND_START_DATE = 01-JAN-1900; TIME_BAND_END_DATE = (null)
Row 2: TIMBAN_DWK = 2; TIME_BAND = Peak; TIME_BAND_START_VALUE = 480; TIME_BAND_END_VALUE = 1079; TIME_BAND_DESC = Peak; TIME_BAND_GROUP = Peak; TIME_BAND_START_DATE = 01-JAN-1900; TIME_BAND_END_DATE = (null)
Row 3: TIMBAN_DWK = 3; TIME_BAND = Late Off Peak; TIME_BAND_START_VALUE = 1080; TIME_BAND_END_VALUE = 1439; TIME_BAND_DESC = Late Off Peak; TIME_BAND_GROUP = Off Peak; TIME_BAND_START_DATE = 01-JAN-1900; TIME_BAND_END_DATE = (null)
Figure 7 - Example data for TIME_BANDS

Once again the _BANDS table has a standard format as follows:

• The table will have the same name as the major entity but with the suffix _BANDS (e.g. TIME_BANDS, etc.).
• The table will always have a key column that uses the six character short code and the _DWK suffix.
• The table will have a _BAND column that is the band name.
• The table will have a _START_VALUE column and an _END_VALUE column that represent the starting and finishing values of the band.
• The table will have a _DESC column that is a description of the band.
• The table will have a _GROUP column that groups certain bands together.
• The table will have a _START_DATE column and an _END_DATE column.

The table has to comply with this convention in order to be given the _BANDS suffix.

[31] Note that values are stored as a number of minutes since midnight.
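A band is resolved with a range join rather than an equality join. A minimal sketch follows, classifying call records by time of day; the CALLS table and its columns are assumptions for the example, with call_minute holding minutes since midnight as in the footnote above.

-- Hypothetical range join: classify each call into its time band.
SELECT c.call_id,
       b.TIME_BAND,
       b.TIME_BAND_GROUP                      -- 'Peak' or 'Off Peak'
FROM   CALLS c
JOIN   TIME_BANDS b
  ON   c.call_minute BETWEEN b.TIME_BAND_START_VALUE
                         AND b.TIME_BAND_END_VALUE
WHERE  b.TIME_BAND_END_DATE IS NULL;          -- only bands current today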
Property Tables

In the discussion of major entities and lifetime value, the data that failed to meet the lifetime value principle was omitted from the major entity tables; however it still needs to be stored. This is handled via a property table. Property tables also help to support the extensibility aspects of the data model.

If we use PARTY as an example then, as already identified, marital status does not possess lifetime value and therefore is not included in the major entity. Everyone starts as single; some marry, some divorce and some are widowed. These 'status changes' occur throughout the lifetime of the individual. To deal with this problem the property table can be modelled as follows:

[Figure 8 - Party Properties Example]

As can be seen from the example above, two new tables are created in order to handle the properties. The first is the PARTY_PROPERTIES table itself and the second a supporting PARTY_PROPERTY_TYPES table. In order to store the marital status of an individual a set of data needs to be entered in the PARTY_PROPERTY_TYPES table:

TYPE = Single; GROUP = Marital Status
TYPE = Married; GROUP = Marital Status
TYPE = Divorced; GROUP = Marital Status
TYPE = Co-Habiting; GROUP = Marital Status
Figure 9 - Example Party Property Data

The description, start and end date would be filled in appropriately. Note that the start and end date here represent the start and end date of the type and not that of an individual's use of that type [32]. It is now possible to insert a row in the PARTY_PROPERTIES table that references the individual in the PARTY table and the appropriate PARTY_PROPERTY_TYPES row (e.g. 'Married'). The PARTY_PROPERTIES table can also hold the start date and end date of this status and, optionally, where appropriate, a text or numeric value that relates to that property.

[32] The need for start and end dates on such items is often questioned; however experience shows that legislation changes supposedly static values in most countries over the lifetime of the data warehouse. For example in December 2005 the UK permitted a new type of relationship called a civil partnership. http://en.wikipedia.org/wiki/Civil_partnerships_in_the_United_Kingdom.
This means that not only the current marital status can be stored but also historical information [33].

PARTY_PROPERTIES
PARTY_DWK = John Smith; PARTY_PROPERTY_DWK = Single; START_DATE = 01-Jan-1970; END_DATE = 02-Feb-1990
PARTY_DWK = John Smith; PARTY_PROPERTY_DWK = Married; START_DATE = 03-Feb-1990; END_DATE = 04-Mar-2000
PARTY_DWK = John Smith; PARTY_PROPERTY_DWK = Divorced; START_DATE = 05-Mar-2000; END_DATE = 06-Apr-2005
PARTY_DWK = John Smith; PARTY_PROPERTY_DWK = Co-Habiting; START_DATE = 07-Apr-2005; END_DATE = (null)
Figure 10 - Example data for PARTY_PROPERTIES

The data shown here describes the complete history of an individual, with the last row showing the current state as the START_DATE is before 'today' and the END_DATE is null. There is also nothing to prevent future information from being held. If John Smith announces that he is going to get married on a specific date in the future then the current record can have its end date set appropriately and a new record added.

If another property is required (e.g. Number of Children) then no change is required to the data model. New rows are entered into the PARTY_PROPERTY_TYPES table:

TYPE = Male; GROUP = Number of Children
TYPE = Female; GROUP = Number of Children
Figure 11 - Example Data for PARTY_PROPERTY_TYPES

This allows data to be added to PARTY_PROPERTIES as follows:

PARTY_DWK = John Smith; PARTY_PROPERTY_DWK = Single; START_DATE = 01-Jan-1970; END_DATE = 02-Feb-1990; VALUE = (null)
PARTY_DWK = John Smith; PARTY_PROPERTY_DWK = Married; START_DATE = 03-Feb-1990; END_DATE = 04-Mar-2000; VALUE = (null)
PARTY_DWK = John Smith; PARTY_PROPERTY_DWK = Divorced; START_DATE = 05-Mar-2000; END_DATE = 06-Apr-2005; VALUE = (null)
PARTY_DWK = John Smith; PARTY_PROPERTY_DWK = Co-Habiting; START_DATE = 07-Apr-2005; END_DATE = (null); VALUE = (null)
PARTY_DWK = John Smith; PARTY_PROPERTY_DWK = Male; START_DATE = 09-Jun-2001; END_DATE = (null); VALUE = 1
PARTY_DWK = John Smith; PARTY_PROPERTY_DWK = Female; START_DATE = 10-Jul-2002; END_DATE = (null); VALUE = 1
Figure 12 - Example Data for PARTY_PROPERTIES

In fact any number of new properties can be added to the tables as business processes and source systems change and new data requirements come about. The effect of this method, when compared to other methods of modelling this information, is to create very narrow (i.e. not many columns), long (i.e. many rows) tables instead of much wider, shorter tables. However the properties table is very efficient. Firstly, unlike the example, the two _DWK columns are integers [34], as are the start and end dates. Many of the _VALUE fields will be NULL, and those that are not will be predominantly numeric rather than text values. The PARTY_PROPERTY_TYPE acts as a natural partitioning key in those databases that support table partitions. This method is also very effective in terms of performance and storage of data in databases that use column or vector type storage.

[33] Text from the related table is used in the _DWK columns rather than the numeric key for clarity in these examples.
[34] Integers are better than text strings for a number of reasons: they usually require less storage and there is less temptation to mix the requirements of identification and description (a problem clearly illustrated by car registration numbers in the UK). Keys are more reliable when implemented as integers because databases often have key generation mechanisms that deliver unique values. Integers do not suffer from upper/lower case ambiguities and can never contain special characters or ambiguities caused by different padding conventions (trailing spaces or leading zeros).
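Retrieving the current value of a property from this structure is a simple predicate on the dates. A minimal sketch follows; the six-character short code PARPRT and the full column names of PARTY_PROPERTY_TYPES follow the _TYPES conventions described earlier but are assumptions here.

-- Hypothetical query: the current marital status of each party.
SELECT pp.PARTY_DWK,
       ppt.PARTY_PROPERTY_TYPE                          -- e.g. 'Married'
FROM   PARTY_PROPERTIES pp
JOIN   PARTY_PROPERTY_TYPES ppt
  ON   ppt.PARPRT_DWK = pp.PARTY_PROPERTY_DWK
WHERE  ppt.PARTY_PROPERTY_TYPE_GROUP = 'Marital Status'
  AND  pp.START_DATE <= CURRENT_DATE                    -- ignore future-dated rows
  AND (pp.END_DATE IS NULL OR pp.END_DATE >= CURRENT_DATE);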
When compared to more conventional data model techniques that store duplicated rows for changed data, the saving in the number of rows offered by those techniques is normally less than expected. The example above has seven rows of data. The alternate approach of repeated sets of data requires six rows of data and considerably more storage because of the duplicated data:

PARTY_DWK = John Smith; START_DATE = 01-Jan-1970; END_DATE = 02-Feb-1990; MARITAL_STATUS = Single; UNKNOWN CHILD = 0; MALE CHILD = 0; FEMALE CHILD = 0
PARTY_DWK = John Smith; START_DATE = 03-Feb-1990; END_DATE = 04-Mar-2000; MARITAL_STATUS = Married; UNKNOWN CHILD = 0; MALE CHILD = 0; FEMALE CHILD = 0
PARTY_DWK = John Smith; START_DATE = 05-Mar-2000; END_DATE = 08-Jun-2001; MARITAL_STATUS = Divorced; UNKNOWN CHILD = 0; MALE CHILD = 0; FEMALE CHILD = 0
PARTY_DWK = John Smith; START_DATE = 09-Jun-2001; END_DATE = 09-Jul-2002; MARITAL_STATUS = Divorced; UNKNOWN CHILD = 0; MALE CHILD = 1; FEMALE CHILD = 0
PARTY_DWK = John Smith; START_DATE = 10-Jul-2002; END_DATE = 06-Apr-2005; MARITAL_STATUS = Divorced; UNKNOWN CHILD = 0; MALE CHILD = 1; FEMALE CHILD = 1
PARTY_DWK = John Smith; START_DATE = 07-Apr-2005; END_DATE = (null); MARITAL_STATUS = Co-Habiting; UNKNOWN CHILD = 0; MALE CHILD = 1; FEMALE CHILD = 1
Figure 13 - Example Data for the repeated-rows alternative to PARTY_PROPERTIES

The other main objection to this technique is often described as the cost of matrix transformation of the data, that is, the changing of the data from columns into rows in the ETL to load the data warehouse and then changing the rows back into columns in the ETL to load the data mart(s). This objection is normally due to a lack of knowledge of appropriate ETL techniques that can make this very efficient, such as using SQL set operations like 'UNION', 'MINUS' and 'INTERSECT' (a sketch follows at the end of this section).

Event Tables

An event table is almost identical to a property table except that instead of having _START_DATE and _END_DATE columns it has a single _EVENT_DATE column. It also has the appropriate _EVENT_TYPES table. The table name has a suffix of _EVENTS. For example a wedding is an event (it happens at a single point in time), but 'being married' is a property (it happens over a period of time).

Events can be stored in property tables simply by storing the same value in both the start date and end date columns, and this is a more common solution than creating a separate table. The use of _EVENTS tables is usually limited to places where events form a significant part of the data and the cost of storing the extra field becomes significant. It should be noted that this is only required where the event may occur many times (e.g. a wedding date) rather than for information that can only happen once (e.g. first wedding date), which would be stored in the appropriate major entity as, once set, it would have lifetime value.

[Figure 14 - Party Events Example]

_EVENTS tables are a special case of _PROPERTIES tables.
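Returning to the matrix transformation: the columns-to-rows direction can be sketched with plain set operations. The wide source table SRC_PARTY and all of its column names below are assumptions for the example.

-- Hypothetical columns-to-rows transformation using UNION ALL:
-- each property column of the wide source becomes a candidate row
-- for PARTY_PROPERTIES.
SELECT party_id,
       'Marital Status'                        AS prop_group,
       marital_status                          AS prop_value,
       effective_date
FROM   SRC_PARTY
UNION ALL
SELECT party_id,
       'Number of Children',
       CAST(number_of_children AS VARCHAR(10)),
       effective_date
FROM   SRC_PARTY;
-- Taking the MINUS of this result against the rows already held in
-- the warehouse then leaves only the changes that need to be applied.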
Link Tables

Up to this point, attributes of a single record within a major entity have been examined. It is also possible that records within a major entity will relate to other records in the same major entity (e.g. John Smith is married to Jane Smith, both of whom are records within the PARTIES table). This is called a peer-to-peer relationship and is stored in a table with the suffix _LINKS, supported by the appropriate _LINK_TYPES table.

[Figure 15 - Party Links Example]

The significant difference in a _LINKS table is that there are two relationships from the major entity (in this case PARTIES). This also allows hierarchies to be stored, so that:

John Smith (Individual) works in Sales (Organisation Unit)
Sales (Organisation Unit) is a division of ACME Enterprises (Organisation)

where 'works in' and 'is a division of' are examples of the _LINK_TYPE. It should also be noted that there is a priority to the relationship, because one of the linking fields is the main key (in this case PARTIE_DWK) and the other is the linked key (in this case LINKED_PARTIE_DWK). There are two options. One is to store the relationship in both directions (e.g. John Smith is married to Jane Smith and Jane Smith is married to John Smith); this can be made complete with a reversing view [35] but defeats both the 'Convention over Configuration' principle and the 'DRY (Don't Repeat Yourself)' principle. The second method is to have a convention and only store the relationship in one direction (e.g. John Smith is married to Jane Smith, where the convention could be that the male is stored in the main key and the female in the linked key).

[35] A reversing view is one that has all the same columns as the underlying table except that the two key columns are swapped around. In this example PARTIE_DWK would be swapped with LINKED_PARTIE_DWK.
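For completeness, the reversing view mentioned in the footnote can be sketched as follows. It assumes a PARTY_LINKS table whose remaining columns (link type key and dates) follow the conventions already described.

-- Hypothetical reversing view: the same columns as PARTY_LINKS but
-- with the two key columns swapped, so a relationship stored once
-- can be read from either side.
CREATE VIEW PARTY_LINKS_REVERSED AS
SELECT LINKED_PARTIE_DWK AS PARTIE_DWK,
       PARTIE_DWK        AS LINKED_PARTIE_DWK,
       PARTY_LINK_TYPE_DWK,
       START_DATE,
       END_DATE
FROM   PARTY_LINKS;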
Segment Tables

The final type of information that might be required about a major entity is the segment. This is a collection of records from the major entity that share something in common, where more detail is not known. The most common business example of this would be the market segmentation performed on customers. These segments are normally the result of detailed statistical analysis, with the results then stored. In our example John Smith and Jane Smith could both be part of a segment of married people, along with any number of other individuals who are known to be married but for whom there is no information about when or to whom they are married. Where the _LINKS table provides the peer-to-peer relationship, the segment provides the peer-group relationship.

[Figure 16 - Party Segments Example]
The Sub-Model

The major entities and the six supporting data structures (_TYPES, _BANDS, _PROPERTIES, _EVENTS, _LINKS and _SEGMENTS) provide sufficient design pattern structure to hold a large part of the information in the data warehouse. The set of a major entity and its supporting structures is known as a Major Entity Sub-Model. Significantly, the information stored for a single major entity sub-model is very close to the typical dimensions of a data mart. This design pattern provides complete temporal support and the ability to re-construct a dimension or dimensions based on a given set of business rules.

For example the designed PARTY sub-model consists of:

• PARTIES
• PARTY_TYPES
• PARTY_BANDS
• PARTY_PROPERTIES
• PARTY_PROPERTY_TYPES
• PARTY_EVENTS
• PARTY_EVENT_TYPES
• PARTY_LINKS
• PARTY_LINK_TYPES
• PARTY_SEGMENTS
• PARTY_SEGMENT_TYPES

Only a subset of these tables might represent the implemented PARTY sub-model; the remainder stay designed but not implemented.

Importantly, what has not yet been provided is the relationships between major entities and the business transactions that occur as a result of the interaction between major entities.
History Tables

Extending the example above, it is noticeable that the party does not contain any address information; this is held in the geography major entity. This is also another example where current business processes and requirements may change. At the outset the source system may provide a contract address and a billing address; a change in process may then require the capture of additional information, e.g. contact addresses and installation addresses.

In practice the only difference between this type of relationship between major entities and the _LINKS relationship is that instead of two references to the same major entity there is one reference to each of two major entities. The data model is therefore relatively simple to construct:

[Figure 17 - Party Geography History Example]

There is one minor semantic difference between links and histories. _LINKS tables join back on to the major entity and therefore one half of the relationship has to be given priority. In a _HISTORY table there is no need for priority, as each of the two attributes is associated with a different major entity.

Finally, note that in this example the major entity is shown without the rest of the sub-model, which can be assumed.
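A minimal sketch of such a history table follows, assuming the six-character short codes PARTIE, GEOGRA and PARGEO and illustrative data types; the exact column names are assumptions consistent with the conventions described earlier.

-- Hypothetical _HISTORY table relating two major entities.
-- Unlike a _LINKS table, neither key takes priority: each
-- references a different major entity.
CREATE TABLE PARTY_GEOGRAPHY_HISTORY (
    PARGEO_DWK                       INTEGER NOT NULL,
    PARTIE_DWK                       INTEGER NOT NULL,  -- the party
    GEOGRA_DWK                       INTEGER NOT NULL,  -- the geography
    PARTY_GEOGRAPHY_HISTORY_TYPE_DWK INTEGER NOT NULL,  -- e.g. billing address
    START_DATE                       DATE    NOT NULL,
    END_DATE                         DATE,               -- null while current
    PRIMARY KEY (PARGEO_DWK)
);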
Occurrences and Transactions

The final part of the data model is to build up all the occurrence or transaction tables. In the data mart these are most akin to the fact tables, although as this is a relational model they may occur outside a pure star relationship. Like the major entities they have no standard suffix or prefix, just a meaningful name. To demonstrate what is required, an example from a retail bank is described. The example is not nearly as complex as a real bank, but is necessarily longer and more complex than most examples in order to demonstrate a number of features. Banking has been chosen as the example because the concepts will be familiar to most readers. The example only looks at some core banking functions and not at activities such as marketing or specialist products such as insurance.

The Example

The bank has a number of regions and a central 'premium' account function that caters for some business customers. Each region has a number of branches. Branches have a manager and a number of staff. Each branch manager reports to a regional manager.

If a customer has a personal account then the account manager is a branch personal account manager; however if the individual has a net worth in excess of USD1M the branch manager acts as the account manager. Personal accounts have contact and statement addresses and a range of telephone numbers, e-mails, addresses, etc.

If the account belongs to a business with less than USD1M turnover then the account manager is a business account manager at the branch who reports to the branch manager. If the account belongs to a business with a turnover of between USD1M and USD10M then the account manager is an individual at the regional office who reports to the regional manager. If the account belongs to a business with a turnover of more than USD10M then the account managers at the central office are responsible for the account. Businesses have contact and statement addresses as well as a number of approved individuals who can use the company account, and contact details for them.

Branch and account managers periodically review the banding of accounts by income for individuals and turnover for companies, and if an account is likely to move band in the coming year then it is added to the appropriate (future) category. Note that this is only partially fact based, the rest being based on subjective input from account managers.

The bank offers a range of services including current, loan and deposit accounts, credit and debit cards, EPOS (for business accounts only), foreign exchange, etc. The bank has a number of channels including branches, a call centre service, a web service and the ability to use ATMs for certain transactions. The bank offers a range of transaction types including cash, cheque, standing order, direct debit, interest, service charges, etc.
After the close of business on the last working day of each month the starting and ending balances, the average daily balance and any interest are calculated for each account.

On a daily basis the exposure (i.e. the sum of all account balances) is calculated for each customer along with a risk factor, a number between 0 and 100 that is influenced by a number of factors that are reviewed from time to time by the risk management department. Risk factors might include sudden large deposits or withdrawals, closure of a number of accounts, long-term non-use of an account, etc. that might influence account managers' decisions.

Every transaction that is made is recorded every day and has three associated dates: the date of the transaction, the date it appeared on the system and the cleared date.

De-constructing the example

The bank has a number of regions and a central 'premium' account function that caters for some business customers. Each region has a number of branches. Branches have a manager. Each branch manager reports to a regional manager.

• The bank itself must be held as an organisation.
• The regions and the central 'premium' account function are held as organisation units [36].
• The bank and the regions have links.
• The branches are held as organisation units.
• The regions and the branches have links.
• The branches have addresses via a history table.
• The branches have electronic addresses via a history table.
• There are a number of roles stored as organisation units.
• These roles and the individuals have links.
• The roles may have addresses via a history table.
• The roles may have electronic addresses via a history table.
• The individuals may have addresses via a history table.
• The individuals have electronic addresses via a history table.

At this point only existing major entities and history tables have been used. This information would also be re-usable in many places, just like the conformed dimensions concept of star schemas but with more flexibility.

If a customer has a personal account then the account manager is a branch personal account manager; however if the individual has a net worth in excess of USD1M the branch manager acts as the account manager. Personal accounts have contact and statement addresses and a range of telephone numbers, e-mails, etc.

• Customers are held as parties, either individuals or organisations.
• Customers have addresses via a history table.
• Customers have electronic addresses via a history table.
• Accounts are held in the Accounts major entity.
• Customers are related to accounts via a history table.
• Branches are related to accounts via a history table.
• Accounts are associated with a role via a history table.
• An individual's net worth is generated elsewhere and stored as a property of the party.

[36] See Appendix 2 - Understanding Hierarchies for an explanation as to why the regions are organisation units and not geography.
• A high net worth individual is a member of a similarly named segment.
• The accounts may have addresses via a history table.
• The accounts may have electronic addresses via a history table.

If the account belongs to a business with less than USD1M turnover then the account manager is a business account manager at the branch who reports to the branch manager. If the account belongs to a business with a turnover of between USD1M and USD10M then the account manager is an individual at the regional office who reports to the regional manager. If the account belongs to a business with a turnover of more than USD10M then the account managers at the central office are responsible for the account. Businesses have contact and statement addresses as well as a number of approved individuals who can use the company account, and contact details for them.

• Businesses are held as parties.
• The business turnover is held as a party property.
• The category membership based on turnover is held as a segment.
• The businesses may have addresses via a history table.
• The businesses may have electronic addresses via a history table.

Branch and account managers periodically review the banding of accounts by income for individuals and turnover for companies, and if an account is likely to move band in the coming year then it is added to the appropriate (future) category. Note that this is only partially fact based, the rest being based on subjective input from account managers.

• There is a need to allow manual input via a warehouse support application for the party segments.

At this point only the PARTY, ADDRESS and ELECTRONIC ADDRESS sub-models and associated _HISTORY tables have been used.

The bank offers a range of services including current, loan and deposit accounts, credit and debit cards, EPOS (for business accounts only), foreign exchange, etc.

• The product services are held in the product service major entity.
• The product services are associated with an account via a history table.

The bank has a number of channels including branches, a call centre service, a web service and the ability to use ATMs for certain transactions.

• The channels are held in the channels major entity.
• The ability to use a channel for a specific product service is held in the history table that relates the two major entities.

This adds the PRODUCT_SERVICE and CHANNEL major entities into the model.

The bank offers a range of transaction types including cash, cheque, standing order, direct debit, interest, service charges, etc.

• This requires a TRANSACTION_TYPES table that will be related to the transaction table, which has not yet been defined.

After the close of business on the last working day of each month the starting and ending balances, the average daily balance and any interest are calculated for each account.

• This is stored as an account property (it may be an event).
On a daily basis the exposure (i.e. the sum of all account balances) is calculated for each customer along with a risk factor, a number between 0 and 100 that is influenced by a number of factors that are reviewed from time to time by the risk management department. Risk factors might include sudden large deposits or withdrawals, closure of a number of accounts, long-term non-use of an account, etc. that might influence account managers' decisions.

• The exposure is stored as a party property (or event).
• The party risk factor is stored as a party property.

Everything that is required to describe the transaction table is now available.

Every transaction that is made is recorded every day and has three associated dates: the date of the transaction, the date it appeared on the system and the cleared date.

• The transaction table will have the following columns:
  o Transaction Date
  o Transaction System Date
  o Transaction Cleared Date
  o From Account
  o To Account
  o Transaction Type
  o Amount

This would complete the model for the example. There are some interesting features to examine. The first is that all amounts would be positive. This is because for a credit to an account the 'from account' would be the sending party and the 'to account' would be the customer's account, while for a debit the 'to account' would be the recipient and the 'from account' would be the customer's account.

This has a number of effects. Firstly it complies with the DRY (Don't Repeat Yourself) principle and means that extra data is not stored for the transaction. It also means that a collection of account information not related to any current party (e.g. a customer at another bank) is built up. This information is useful in the analysis of fraud, churn, market share, competitive analysis, etc. For a customer analysis data mart the data can be extracted and converted into the positive credit/negative debit arrangement required by the users, as sketched below. The payment of bank charges and interest would also have accounts, and this information in a different data mart could be used to look at profitability, exposure, etc.

The process has used seven major entities' sub-models, an additional type table and an occurrence or transaction table. Storing this information should accommodate and absorb almost any change in business process or source system without the need to change the data warehouse model, and will allow multiple data marts to be built from a single data warehouse quickly and easily. In effect the type tables act as metadata for how to use and extend the data model rather than defining the business process explicitly in the data model, hence the name process neutral data modelling.

It also demonstrates the ability of the data model to support the requirements process. By knowing the major entities and using a storyboard approach similar to the example above, an approach familiar to agile developers, it is possible to quickly and easily identify business, data and query requirements.
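The positive-amount convention can be unwound for the customer analysis data mart with a simple transformation. A minimal sketch follows, assuming the transaction table is named RETAIL_BANKING_TRANSACTIONS, that the account short code is ACCOUN, and that CUSTOMER_ACCOUNTS lists the accounts belonging to the customers of interest; all of these names are assumptions for the example.

-- Hypothetical extraction into the credit/debit convention required
-- by data mart users: money arriving is positive, money leaving is
-- negative, from the customer account's point of view.
SELECT t.TRANSACTION_DATE,
       a.ACCOUN_DWK                          AS account_key,
       CASE WHEN t.TO_ACCOUN_DWK = a.ACCOUN_DWK
            THEN  t.AMOUNT                   -- credit to the customer account
            ELSE -t.AMOUNT                   -- debit from the customer account
       END                                   AS signed_amount,
       t.TRANSACTION_TYPE_DWK
FROM   RETAIL_BANKING_TRANSACTIONS t
JOIN   CUSTOMER_ACCOUNTS a
  ON   a.ACCOUN_DWK IN (t.FROM_ACCOUN_DWK, t.TO_ACCOUN_DWK);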
[Figure 18 - The Example Bank Data Model: a diagram showing the Party sub-model (Individuals, Organisations, Organisation Units, Roles), the Addresses sub-model (Postal Address, Point Location), the Electronic Addresses sub-model (Telephone Numbers, E-Mail Addresses, Telex), and the Accounts, Channel, Product Service and Calendar sub-models, all connected via history tables to the Retail Banking Transactions table and its Transaction Types.]
The model above has been almost fully described in detail by this document, since the self-similar modelling for all the sub-model components has been described along with the history tables, most of the retail banking transactions and some of the lifetime attributes of the major entities. Completing the model just requires these additional attributes to be added.

Two other effects that will influence the creation of data marts from this model can also be seen. Firstly, the creation of dimensions will revolve around the de-normalisation of the attributes that are required from each of the major entities into one of the two dimensions associated with account, as these have the hierarchies for the customer, account manager, etc. associated with them. The second effect is that of the natural star schema. It is clear from this diagram that the fact tables will be based around the 'Retail Banking Transactions' table. As has already been stated, there are several data marts that can be built from this fact table, probably at different levels of aggregation and with different dimensions.

The occurrence or transaction table above is one of perhaps twenty that a large enterprise would require, along with approximately thirty _HISTORY tables. These would be combined with around twenty major entity sub-models to create an enterprise data warehouse data model.

Readers familiar with the Data Management & Warehousing white paper 'How Data Works' [37], which describes natural star schemas in more detail along with a technique called left-to-right entity diagrams, will see a correlation as follows:

Level 1: _TYPE and _BAND tables; simple, small volume reference data.
Level 2: Major entities; complex, low volume data.
Level 3: Some major entities that are dependent on others, along with _PROPERTIES and _SEGMENTS tables; less complex but with greater volume.
Level 4: _HISTORY tables and some occurrence or transaction tables.
Level 5: Occurrence or transaction tables; significant volume but low complexity data.
Figure 19 - Volume & Complexity Correlations

[37] Available for download from http://www.datamgmt.com/whitepapers
Implementation Issues

The use of a process neutral data model and a design pattern is meant to ease the design of a system, but there will always be exceptions and things that need further explanation in order to fit them into the solution. Much of this section refers to ETL issues that can only be briefly described in this context [38].

The 'Party' Special Case

The examples throughout this document have used the PARTY table as a major entity, but in practice this is one of the more difficult tables to deal with. The first issue is that in many cases name does not have lifetime value, for example when a woman gets married or divorced and changes her name, or when a company renames itself [39]. Also, individual names often have multiple parts (title, forename, surname).

There is also a requirement to track some form of state identity number. In the United Kingdom an individual has a National Insurance number and in the United States a Social Security number; other numbers (e.g. passport, ID card, etc.) are simply stored as properties. Organisations have other numbers (companies have registration numbers, charities and trusts have different registration numbers, but VAT numbers are properties as they can and do change).

Another minor issue is that people have a date of birth and a date of death. This is simply resolved: date of birth is the Individual Start Date and date of death is the Individual End Date, although this terminology can sometimes prove controversial.

The solution to the PARTY special case depends on the database technology being used. If the database supports the creation of views and the 'UNION ALL' SQL operator [40] then the preferred solution is as follows:

Create the INDIVIDUALS table as follows:

• PARTY_DWK
• PARTY_TYPE_DWK
• TITLE
• FORENAME
• CURRENT_SURNAME [41]
• PREVIOUS_SURNAME
• MAIDEN_SURNAME
• DATE_OF_BIRTH
• DATE_OF_DEATH
• STATE_ID_NUMBER
• Other lifetime attributes as required

[38] Data Management & Warehousing provide consultancy on ETL design and techniques to ensure that data warehouses can be loaded effectively regardless of the data modelling approach used.
[39] Interestingly, in Scotland, which has different regulations from England & Wales, birth, marriage and death certificates (also known as vital records) have, since 1855, understood the importance of knowing the birth names of everyone on the certificate. For example a wedding certificate will give the groom's mother's maiden name, and a married woman's death certificate will also feature her maiden name. Effectively the birth name has lifetime value and all other names are additional information. http://www.scotlandspeople.gov.uk/content/help/index.aspx?r=554&628
[40] Nearly all business intelligence databases support this functionality.
[41] CURRENT_ and PREVIOUS_ are reserved prefixes; see Appendix 1 Data Modelling Standards.
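A minimal sketch of how such a view might then present the separate tables as a single PARTIES major entity is shown below. It assumes a parallel ORGANISATIONS table holding the organisation lifetime attributes; the table name, its columns and the column mappings here are assumptions for illustration, not the definitive implementation.

-- Hypothetical PARTIES view built with UNION ALL over the separate
-- INDIVIDUALS and ORGANISATIONS tables, presenting one major entity
-- to the rest of the model while each side keeps its own columns.
CREATE VIEW PARTIES AS
SELECT PARTY_DWK,
       PARTY_TYPE_DWK,
       FORENAME || ' ' || CURRENT_SURNAME AS PARTY_NAME,
       DATE_OF_BIRTH                      AS PARTY_START_DATE,
       DATE_OF_DEATH                      AS PARTY_END_DATE,
       STATE_ID_NUMBER
FROM   INDIVIDUALS
UNION ALL
SELECT PARTY_DWK,
       PARTY_TYPE_DWK,
       ORGANISATION_NAME,
       ORGANISATION_START_DATE,
       ORGANISATION_END_DATE,
       STATE_ID_NUMBER
FROM   ORGANISATIONS;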