Data Management & Warehousing




                                                              WHITE PAPER


   Process Neutral Data Modelling
                                                      DAVID M WALKER
                                                                           Version: 1.0
                                                                      Date: 10/02/2009




                      Data Management & Warehousing

   138 Finchampstead Road, Wokingham, Berkshire, RG41 2NU, United Kingdom

                          http://www.datamgmt.com




Table of Contents
Table of Contents
Synopsis
Intended Audience
About Data Management & Warehousing
Introduction
The Problem
   The Example Company
   The Real World
The Customer Paradigm
Requirements of a Data Warehouse Data Model
   Assumptions
   Requirements
The Data Model
   Major Entities
   Type Tables
   Band Tables
   Property Tables
   Event Tables
   Link Tables
   Segment Tables
The Sub-Model
   History Tables
   Occurrences and Transactions
Implementation Issues
   The ‘Party’ Special Case
   Partitioning
   Data Cleansing
   Null Values
   Indexing Strategy
   Enforcing Referential Integrity
   Data Insert versus Data Update
   Row versus Set Based Loading in ETL
   Disk Space Utilisation
   Implementation Effort
Data Commutativity
Data Model Explosion and Compression
   How big does the data model get?
   Can the data model be compressed?
Which Results to Store?
The Holistic Approach
Summary
Appendix 1 – Data Modelling Standards
   General Conventions
   Table Conventions
   Column Conventions
   Index Conventions
   Standard Table Constructs
   Sequence Numbers For Primary Keys
Appendix 2 – Understanding Hierarchies
   Sales Regions
   Internal Organisation Structure
Appendix 3 – Industry Standard Data Models
Appendix 4 – Information Sparsity
Appendix 5 – Set Processing Techniques
Appendix 6 – Standing on the shoulders of giants





Further Reading
   Overview Architecture for Enterprise Data Warehouses
   Data Warehouse Governance
   Data Warehouse Project Management
   Data Warehouse Documentation Roadmap
   How Data Works
List of Figures
Copyright








Synopsis
This paper describes in detail the process for creating an enterprise data warehouse physical
data model that is less susceptible to change. Change is one of the largest on-going costs in
a data warehouse and therefore reducing change reduces the total cost of ownership of the
system. This is achieved by removing business process specific data and concentrating on
core business information.

The white paper examines why data-modelling style is important and how issues arise when
using a data model for reporting. It discusses a number of techniques and proposes a specific
solution. The techniques should be considered when building a data warehouse solution even
when an organisation decides against using the specific solution.

This paper is intended for a technical audience and project managers involved with the
technical aspects of a data warehouse project.



Intended Audience
Reader                                              Recommended Reading
Executive                                           Synopsis
Business Users                                      Synopsis
IT Management                                       Synopsis
IT Strategy                                         Entire Document
IT Project Management                               Entire Document
IT Developers                                       Entire Document




About Data Management & Warehousing
Data Management & Warehousing is a specialist consultancy in data warehousing, based in
Wokingham, Berkshire in the United Kingdom. Founded in 1995 by David M Walker, our
consultants have worked for major corporations around the world, including in the US,
Europe, Africa and the Middle East. Our clients are invariably large organisations with a
pressing need for business intelligence. We have worked in many industry sectors and have
specialists in telcos, manufacturing, retail, financial services and transport, as well as
technical expertise in many of the leading technologies.

For further information visit our website at: http://www.datamgmt.com

Crossword Clue: Expert Gives Us Real Understanding (4 letters)








Introduction
Commissioning a data warehouse system is a major undertaking. Organisations will invest
significant capital in the development of the system. The data model is always a major
consideration, and many projects will spend a significant part of the budget on developing
and re-working the initial data model.

Unfortunately projects often fail to look at the maintenance costs of the data model that they
develop. A data model that is fit for purpose when developed will rapidly become an
expensive overhead if it needs to change whenever the source systems change. The cost
involved is not only in the change to the data model but also in the changes to the ETL
processes that feed it.

This problem is exacerbated by the fact that changes to the data model may be made in a
way that is inconsistent with the original design approach. The data model loses
transparency and becomes even more difficult to maintain.

For many large data warehouse solutions it is not uncommon to have a resource permanently
assigned to maintaining the data model and several more resources assigned to managing
the change in the associated ETL within a short time of going live.

By understanding the problem and using techniques imported from other areas of systems
and software development, as well as change management techniques, it is possible to
define a method that will greatly reduce this overhead.

This white paper sets out an example of the issues from which to develop a statement of
requirements for the data model and then demonstrates a number of techniques which, when
used together, can address those requirements in a sustainable way.








The Problem
Data modelling is the process of defining the database structures in which to hold information.
To understand the Process Neutral Data Modelling approach, this paper first looks at why
these database structures have such an impact on the data warehouse.

In order to demonstrate the issues with creating a data model for a data warehouse, more
experienced readers are asked to bear with the necessarily simplistic examples that follow.

       The Example Company
       A company supplies and installs widgets. There are a number of different widget types,
       each having a name and specific colour. Each individual widget has a unique serial
       number and can have a number of red lamps and a number of green lamps plugged
       into it. The widgets are installed into cabinets at customer sites and from time to time
       engineers come in and change the relative numbers of red and green lamps.
       Cabinets are identified by the customer name and a customer cabinet number. For
       operational systems the data model might look something like this¹:




Figure 1 - Initial Operational System Data Model²

        This simple data model describes both the widget and the cabinet and provides the
        current combinations. It does not provide any historical context: “What was the
        previous configuration and when was it changed?”

        Historical data can be recorded by simply adding start date and end date to each of
        the main tables. This provides the ability to report on the historical configuration³. In
        order to facilitate this a separate reporting environment would be set up, because
        retaining history in the operational system would unacceptably reduce the operational
        system's performance. There are three consequences of doing this:

             •   Queries are now more complex. In order to report the information for a given
                 date the query has to allow for the required date being between the start date
                 and the end date of the record in each of the tables. The extra complexity
                 slows the execution of the query (see the example query below).







             •   The volume of data stored has also increased. The storage of dates has a
                 minor impact on the size of each row but this is small when compared to the
                 number of additional rows that need to be stored⁴.

             •   Data has to be moved from the operational system to the reporting system
                 via an extract, transform and load (ETL) process. This process has to extract
                 the data from the operational system, compare the records to the current
                 records in the reporting system to determine if there are any changes and, if
                 so, make the required adjustments to the existing record (e.g. updating the
                 end date) and insert the new record. Already the process is more complex
                 and time consuming than simply copying the data across⁵.

¹ Data models in this document are illustrative and should therefore be viewed as suitable for
making specific points rather than as complete production quality solutions. Some errors exist
to explicitly demonstrate certain issues.
² There are several conventions for data modelling. In this and subsequent diagrams the link
with a 1 and ∞ represents a one-to-many relationship, where the ‘1’ record is a primary key
field and the ‘∞’ represents the foreign key field.
³ Note that the ‘WIDGET_LOCATIONS’ table requires an additional field called
‘INSTALL_SEQUENCE’ to allow for the case where a widget is re-installed in a cabinet.
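
         To make the added complexity concrete, the query below sketches a point-in-time
         report against the reporting model shown in Figure 2: “Which widgets were installed
         in which cabinets on 1 June 2008?”. This is a minimal sketch only: the table and
         column names (WIDGETS, WIDGET_LOCATIONS, CABINETS, START_DATE,
         END_DATE, etc.) are assumed from the example rather than taken from a production
         schema, and a current record is assumed to carry a null end date.

             -- Point-in-time query: every table that carries history needs its
             -- own start/end date predicate, which is the extra complexity
             -- (and run-time cost) described in the first bullet above.
             SELECT w.SERIAL_NUMBER,
                    c.CUSTOMER_NAME,
                    c.CABINET_NUMBER
             FROM   WIDGETS          w,
                    WIDGET_LOCATIONS wl,
                    CABINETS         c
             WHERE  wl.WIDGET_ID  = w.WIDGET_ID
             AND    wl.CABINET_ID = c.CABINET_ID
             AND    wl.START_DATE <= DATE '2008-06-01'
             AND   (wl.END_DATE   >= DATE '2008-06-01' OR wl.END_DATE IS NULL);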




Figure 2 - Initial Reporting System Data Model

         When the reporting system is built, it accurately reflects the current business
         processes and operational systems, and it provides historical data. From a systems
         management perspective there is now an additional database, and a series of ETL or
         interface scripts that have to be run reliably every day.

         The systems architecture may be further enhanced so that the reporting system
         becomes a data warehouse and the users make their queries on data marts, or sets
         of tables where the data has been re-structured in order to simplify the users’ query
         environment. The ‘data marts’ typically use star-schema or snowflake-schema data
         modelling techniques or tool specific storage strategies⁶. This adds an additional layer
         of ETL to move between the data warehouse and the data mart.

         However the company doesn’t stop here. The product development team create a
         new type of widget. This new widget allows amber lamps and can optionally be
         mounted in a rack that is in turn mounted in a cabinet. The IT director also insists that
         the new OLTP application be more flexible for other future developments.


⁴ If everything remains the same except that widgets are moved around (i.e. there are no new
widgets and no new cabinet/customer combinations) then the WIDGET_LOCATIONS table
grows in direct proportion to the number of changes. If each widget were modified in some
way once a month then the reporting system table would be twelve times bigger than the
operational system’s after one year, and this is before any other change is handled.
⁵ Additional functionality such as data cleansing will also impact the complexity of the ETL and
affect performance.
⁶ This is accepted good practice; the design and implementation of data marts is outside the
scope of this paper.






          These business process changes result in a new data model for the operational
          system.




Figure 3 - Second Version Operational System Data Model

          The reporting system is also now a live system with a large amount of historical
          information. It too can be re-designed. The operational system will be implemented to
          meet the business requirements and timescales regardless of whether the reporting
          system is ready. It also may not be possible to create the history required for the new
          data model when it is changed⁷.

          If a data mart is built from the data warehouse there are two impacts: firstly, the data
          mart model will need to be changed to exploit the new data; secondly, the change to
          the data warehouse model will require the data mart ETL to be modified regardless of
          any changes to the data mart data model.

          The example company does not stop here, however, as senior management decide
          to acquire a smaller competitor. The new subsidiary has its own systems that reflect
          its own business processes. The data warehouse was built with a promise of
          providing integrated management reporting, so there is an expectation that the data
          from the new source system will be quickly and seamlessly integrated into the data
          warehouse. From a technical perspective this could present issues around mapping
          the new source system data model to the existing data warehouse data model,
          critical information data types⁸, duplication of keys⁹, etc., all of which cause problems
          with the integration of data and therefore slow down the processing.
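
          The duplicate key problem can be sketched in isolation. The remedy anticipated by
          the conventions introduced later in this paper is a warehouse-generated surrogate
          key together with a column recording the originating system; the table, column
          names and data types below are illustrative assumptions, not part of the final model.

              -- Both companies have a customer '1'; the surrogate key keeps the
              -- rows distinct while ORIGIN plus the source key preserves
              -- traceability back to each source. A character source key also
              -- copes with one company using alphanumeric identifiers.
              CREATE TABLE CUSTOMERS (
                  CUSTMR_DWK    INTEGER      NOT NULL PRIMARY KEY,
                  ORIGIN        VARCHAR(30)  NOT NULL,
                  SOURCE_ID     VARCHAR(30)  NOT NULL,
                  CUSTOMER_NAME VARCHAR(100) NOT NULL,
                  UNIQUE (ORIGIN, SOURCE_ID)
              );

              INSERT INTO CUSTOMERS VALUES (1001, 'COMPANY_A_CRM', '1', 'Acme Ltd');
              INSERT INTO CUSTOMERS VALUES (1002, 'COMPANY_B_CRM', '1', 'Bloggs & Co');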

          Within a few short iterations of change it is possible to see the dramatic impact on
          the data warehouse, and that the system is likely to run into issues.




⁷ A common example of this is an organisation that captures the fact that an individual is
married or not. Later the organisation decides to capture the name of the partner if someone
is married. It is not possible to create the historical information systemically, so for a period of
time the system has to support the continued use of the marital status and then possibly run
other activities, such as outbound calling, to complete the missing historical data.
⁸ The example database assumed that the serial number was numeric and used it as a
primary key, but what happens if the acquired company uses alphanumeric serial numbers?
⁹ If both companies use numbers starting from 1 for their customer IDs then there will be two
customers who have the same ‘unique’ ID, and customers that have two ‘unique’ IDs.







      The Real World
      The example above is designed to illustrate some of the issues that affect data
      warehouse data modelling. In reality business and technical analysts will handle some
      of these issues in the design phase but how big is the data-modelling problem in the
      real world?

             o   A UK transport industry organisation has three mainframes, each of which is
                 only allowed to perform one release a quarter. Each system also feeds the
                 data warehouse. As a consequence the mainframe feeds require validation
                 and change every month. Whilst the main data comes from these three
                 systems, there are sixty-five other Unix based operational systems that feed
                 the data warehouse, and several hundred desktop based applications also
                 provide data. Most of these source systems do not have good change control
                 or governance procedures to assist in impact analysis. Change for this
                 organisation is business as usual.

            o   A global ERP vendor supplies a system with over five thousand database
                objects and typically makes a major release every two years, a ‘dot’ release
                every six months and has numerous patches and fixes in between each
                major release. This type of ERP system is in use in nearly every major
                company and the data is a critical source to most data warehouses.

            o   A global food and drink manufacturer that came into existence as a result of
                numerous mergers and acquisitions and also divested some assets found
                itself with one hundred and thirty-seven general ledger instances in ten
                countries with seventeen different ERP packages. Even where the ERP
                packages were the same they were not necessarily using the same version of
                the package. The business intelligence requirement was for a single data
                warehouse and a single data model.

             o   A European Telco purchased a three-hundred-table ‘industry standard’
                 enterprise data model from a major business intelligence vendor and then
                 spent two years analysing it before they started the implementation. Within
                 six months of implementation they had changed some sixty percent of the
                 tables as a result of analysis omissions.

             o   A UK based banking and insurance business outsources all of its product
                 management to business partners and only maintains the unified customer
                 management systems (website, call centres and marketing). As a result
                 nearly all of the ‘source systems’ are external to the organisation and, whilst
                 there are contractual agreements about the format and data remaining fixed,
                 in practice there is significant regular change in the format and information
                 provided to both operational and reporting systems.

         Obviously these issues cannot be fixed just by creating the correct data model for the
         data warehouse¹⁰, but the objective of the data model design should be twofold:

            o   To ensure that all the required data can be stored effectively in the data
                warehouse.

            o   To ensure that the design of the data model does not impose cost and where
                possible actively reduces the cost of change on the system.

¹⁰ Data Management & Warehousing has published a number of other white papers, available
at http://www.datamgmt.com, that look at other aspects of data warehousing and address
some of these issues. See Further Reading at the end of this document for more details.







The Customer Paradigm
Data warehouse development often starts with a requirements gathering exercise. This may
take the form of interviews or workshops where people try to define what the customer is. If a
number of different parts of the business are involved then the definition of customer soon
becomes confused and controversial, and this negatively impacts the project. Most
organisations have a sales funnel that describes the process of capturing, qualifying,
converting and retaining customers.

Marketing say that the customer is anyone and everyone that they communicate with.

The sales teams view the customer as those organisations in their qualified lead database or
for whom they have account management responsibility post-sales.

The customer services team are clear that the customer is only those organisations who have
purchased a product and, where appropriate, have purchased a support agreement as well.

Other questions are asked in the workshops, such as “What about customers who are also
suppliers or partners?” and “How do we deal with customers who have gone away and then
come back after a long period of time?”

Figure 4 - The Sales Funnel

The most common solutions that are created as a result either add ‘flag’ or ‘indicator’ columns
to the customer table to represent each category, or create multiple tables for the different
categories required and repeat the data in each of the tables.

This example clearly demonstrates that the business process is being embedded into the
data model: the current business process definition(s) of customer are defining how the data
model is created. What has been forgotten is that these ‘customers’ exist outside the
organisation and it is their interaction with different parts of the organisation that defines their
status of being a customer, supplier, etc. In legal documents there is the concept of a ‘party’,
where a party is a person or group of persons that compose a single entity that can be
identified as one for the purposes of the law¹¹. This definition is one that should be borrowed
and used in the data model.

If users query a data mart that is loaded with data extracted from the transaction repository,
and data marts are built for a specific team or function that only requires one definition of the
data¹², then the current definition can be used to build that data mart and different definitions
used for other departments.




¹¹ http://en.wikipedia.org/wiki/Party_(law)
¹² This also allows flexibility as, when business processes change, it is possible (at a cost) to
change the rules by which data is extracted. The cost of such a change is much lower than
trying to rebuild the data warehouse and data mart with a new definition.






As a result of this approach two questions are common:

    •    Isn’t one of the purposes of building a data warehouse to have a single version of the
         truth?
         Yes. There is a single version of the truth in the data warehouse and this single
         version is perpetuated into the data marts; the difference is that the information in the
         data mart is qualified. Asking the question “How many customers do we have?”
         should get the answer “Customer Services have X active service contract customers”
         and not the answer “X” without any further qualification.

    •    What happens if different teams or departments have different data?
         People within the organisation work within different processes and with the same
         terminology but often different definitions. It is unlikely and impractical in the short
         term to change this, although it is possible that in the long term the data warehouse
         project will help with the standardisation process. In the meantime it is an education
         process to ensure that answers are qualified. It is important to recognise that different
         departments legitimately have different definitions, and therefore to recognise and
         understand the differences rather than fighting about who is right.

It might be argued that there are too many differences to put all individuals and organisations
in a single table; this and other issues will be discussed later in the paper.








Requirements of a Data Warehouse Data Model
Having looked at the problems that can affect a data warehouse data model, it is possible to
describe the requirements that should be placed on any data model design.


       Assumptions
            1. The data model is for use in the architectural component called the
               transaction repository or data warehouse¹³.

            2. As the data model is used in the data warehouse, it will not be a place where
               users go to query the data; instead users will query separate dependent data
               marts.

            3. As the data model is used in the data warehouse, data will be extracted from
               it to populate the data marts by ETL tools.

            4. As the data model is used in the data warehouse, the data will be loaded into
               it from the source systems by ETL tools.

           5. Direct updates (i.e. not through formally released ETL processes) will be
              prohibited; instead a separate application or applications will exist as a
              surrogate source.

            6. The data model will not be used in a ‘mixed mode’ where some parts use one
               data modelling convention and other parts use another. (This is generally bad
               practice with any modelling technique but is often the outcome where
               responsibility for data modelling is distributed or re-assigned over time.)

       Requirements
            1. The data model will work on any standard business intelligence relational
               database¹⁴. This is to ensure that it can be deployed on any current platform
               and, if necessary, re-deployed on a future platform.

           2. The data model will be process neutral i.e. it will not reflect current business
              processes, practices or dependencies but instead will store the data items and
              relationships as defined by their use at the point in time when the information is
              acquired.
            3. The data model will use a design pattern¹⁵, i.e. a general reusable solution to
               a commonly occurring problem. A design pattern is not a finished design but a
               description or template for how to solve a problem that can be used in many
               different situations.




¹³ For further information on Transaction Repositories see the Data Management &
Warehousing white paper “An Overview Architecture For Enterprise Data Warehouses”.
¹⁴ A typical list would (at the time of writing) include IBM DB2, Microsoft SQL Server, Netezza,
Oracle, Sybase, Sybase IQ, and Teradata. For the purposes of this document it implies
compliance with at least the SQL92 standard.
¹⁵ http://en.wikipedia.org/wiki/Software_design_pattern






            4. Convention over configuration¹⁶: This is a software design paradigm which
               seeks to decrease the number of decisions that developers need to make,
               gaining simplicity but not necessarily losing flexibility. It can be applied
               successfully to data modelling, reducing the number of decisions the data
               modeller has to make by ensuring that tables and columns use a standard
               naming convention and are populated and queried in a consistent fashion.
               This also has a significant impact on the efforts of an ETL developer.

            5. The design should also follow the DRY (Don’t Repeat Yourself) principle. This
               is a process philosophy aimed at reducing duplication. The philosophy
               emphasizes that information should not be duplicated, because duplication
               increases the difficulty of change, may decrease clarity, and leads to
               opportunities for inconsistency¹⁷.

            6. The data model should be significantly static over a long period of time, i.e.
               there should not be a need to add or modify tables on a regular basis. In this
               case there is a difference between designed and implemented: it is possible to
               have designed a table but not to implement it until it is actually required. This
               does not affect the static nature of the data model, as the placeholder already
               exists.
            7. The data model should store data at the lowest possible level¹⁸ and avoid the
               storage of aggregates.

            8. The data model should support the best use of platform specific features
               whilst not compromising the design¹⁹.

            9. The data model should be completely time-variant, i.e. it should be possible to
               reconstruct the information at any available point in time²⁰.

           10. The data model should act as a communication tool to aid the refinement of
               requirements and an explanation of possibilities.




¹⁶ For further information see http://en.wikipedia.org/wiki/Convention_over_Configuration and
http://softwareengineering.vazexqi.com/files/pattern.html. The Ruby on Rails framework
(http://www.rubyonrails.org/) makes extensive use of this principle.
¹⁷ DRY is a core principle of Andy Hunt and Dave Thomas's book The Pragmatic Programmer.
They apply it quite broadly to include "database schemas, test plans, the build system, even
documentation." When the DRY principle is applied successfully, a modification of any single
element of a system does not change other logically unrelated elements. Additionally,
elements that are logically related all change predictably and uniformly, and are thus kept in
sync (http://en.wikipedia.org/wiki/DRY). This does not automatically imply database
normalisation, but database normalisation is one method of ensuring ‘dryness’.
¹⁸ This is the origin of the term ‘Transaction Repository’ rather than ‘Data Warehouse’ in Data
Management & Warehousing documentation. The transaction repository stores the lowest
level of data that is practical and/or available. (See “An Overview Architecture for Enterprise
Data Warehouses”.)
¹⁹ This turns out to be both simple and very effective. For Oracle the most common features
that need support include partitioning and materialized views. For Sybase IQ and Netezza
there is a preference for inserts over updates due to their internal storage mechanisms. For all
databases there is variation in indexing strategies. These and other features should be easily
accommodated.
²⁰ Also known as temporal. Most data warehouses are not linearly time-variant but quantum
time-variant. If a status field is updated three times in a day and the data warehouse reflects
all changes then it is linearly time-variant. If a data warehouse holds only the first and last
values, because a batch process loads it once a day, then it is quantum time-variant where
the quantum is, in this case, one day. Quantum time-variant solutions can only resolve data to
the level of the quantum unit of measure.







The Data Model
As this white paper has now defined the requirements for the data model, it is possible to
start looking at what is needed to design it. This is done by breaking down the tables that will
be created into different groups depending on how they are used. The section below
discusses the main elements of the data model. Some basics, such as naming conventions,
standard short names and the keys used in the data model, are not described here; a
complete set of data modelling rules and example models can be found in the appendices.

       Major Entities
       Party is, as described in the customer paradigm section above, an example of a type of
       table within the Process Neutral Data Modelling method known as a ‘Major Entity’.
       These are tables that deliver the placeholders for all major subject areas of the data
       model and around which other information is grouped. Each business transaction will
       relate to a number of major entities. Some major entities are global i.e. they apply to all
       types of organisation (e.g. Calendar) and there are a number of major entities that are
       industry specific (e.g. for Telco, Manufacturing, Retail, Banking, etc.). It would be very
       unusual for an organisation to need a major entity that was not industry wide. Below is
       a list of some of the most common:

           •    Calendar
                Every data warehouse will need a calendar. It should always contain data to
                the day level and never to parts of the day. In some cases there is a need to
                 support sub-types of calendar for non-Gregorian calendars²¹.

           •    Party
                 Every organisation will have dealings between parties. This will normally
                 include three major sub-types: individuals, organisations (any formal
                 organisation such as a company, charity, trust, partnership, etc.) and
                 organisational units (the components within an organisation, including the
                 system owner’s organisation).

           •    Geography
                 The information about where. This is normally sub-typed into two
                 components, address and location. Address information is often limited to
                 postal addresses²², whilst location is normally described by longitude and
                 latitude via GPS co-ordinates. Other specialist geographic models exist that
                 may need to be taken into account²³.

           •    Product_Service (also known as Product or as Service)
                This is the catalogue of the products and/or services that an organisation
                supplies.

           •    Account
                Every customer will have at least one account if financial transactions are
                involved (even those organisations that do not think they currently use the
                concept of account will do so as accounting systems always have the concept
                of a customer with one or more accounts).


²¹ See http://www.qppstudio.net/footnotes/non-gregorian.htm for various calendars; notably
2008 is the Muslim year 1429 and the Jewish year 5768.
²² Some countries, such as the UK, have validated lists of all addresses (see the UK Post
Office Postcode Address File at http://www.royalmail.com/portal/rm/jump2?mediaId=400085&catId=400084).
²³ Network Rail in the UK use an Engineers Line Reference, which is based on a linear
reference model and refers to a known distance from a fixed point on a track. In Switzerland
they have an entire national co-ordinate system (http://en.wikipedia.org/wiki/Swiss_coordinate_system).






    •   Electronic_Address
        Any electronic address such as a telephone number, email address, web
        address, IP address etc. This is normally sub-typed by the categories used.

    •   Asset (also known as Equipment)
         A physical object that can be uniquely identified (normally by a serial number or
         similar). This may be used or incorporated in a PRODUCT_SERVICE, or sold
         to a customer, etc. In the example company, Cabinet, Rack and Widget were all
         examples of Asset, whilst Widget Type was an example of PRODUCT_SERVICE.

    •   Component
         A physical object that cannot be uniquely identified by a serial number but has
         a part number and is used in the make-up of either an asset or a product
         service. In the example company there was no particular record of the serial
         numbers of the lamps; however, they would all have had a part number that
         described the type of lamp to be used.

    •   Channel
        A conceptual route to market (e.g. direct, indirect, web-based, call-centre, etc.).

    •   Campaign
        A marketing exercise that is designed to promote the organisation, e.g. the
        running of a series of adverts on the television.

    •   Campaign Activities
        The running of a specific advert as part of a larger campaign.

    •   Contract
        Depending on the type of business the relationship between the organisation
        and its supplier or its customer may require the concept of a contract as well as
        that of an account.

    •   Tariff (also known as Price_List)
         A set of charges and discounts that can be applied to product services at a
         point in time.

This list is not comprehensive, but if an organisation can effectively describe its major
entities and combine this information with the interactions between them (the
occurrences or transactions) then it has the basis of a very successful data
warehouse.

Major Entities can have any meaningful name provided it is not a reserved word in the
database or (as will be seen below) a reserved word within the design pattern of
Process Neutral Data Modelling.

Some readers who are familiar with the concepts of star schemas and data marts will
also be aware that these major entities are very close to the basic dimensions that
most data marts use. This should come as no surprise, as these are the major data
items of any business regardless of their business processes or specific industry
sector, and a data mart is only a simplification of the data presented for the user. This
effect is called “natural star schemas” and will be explored in more detail later.








              Lifetime Value
               The next decision is which columns (attributes) should be included in the table.
               Much like the processes involved in normalising a database²⁴, the objective is
               to minimise duplication of data; there is also a requirement to minimise
               updates. To this end the attributes that are included should have ‘lifetime
               value’, i.e. they should remain constant once they have been inserted into the
               database. This means that variable data needs to be handled elsewhere.

              Using some of the major entities above as examples:

               Calendar:
                  Lifetime Value Attributes:         Date, Public Holiday Flag

               Geography:
                  Lifetime Value Attributes:         Address Line 1, Address Line 2, City,
                                                     Postcode²⁵, County, Country
                  Non-Lifetime Value Attributes:     Population

               Party (Individuals):
                  Lifetime Value Attributes:         Forename, Surname²⁶, Date of Birth,
                                                     Date of Death, Gender²⁷, State ID Number
                  Non-Lifetime Value Attributes:     Marital Status, Number of Children, Income

               Party (Organisations):
                  Lifetime Value Attributes:         Name, Start Date, End Date,
                                                     State ID Number
                  Non-Lifetime Value Attributes:     Number of Employees, Turnover,
                                                     Shares Issued

               Account:
                  Lifetime Value Attributes:         Account Number, Start Date, End Date
                  Non-Lifetime Value Attributes:     Balance

        Other than this lifetime value requirement for columns, every table must comply with
        the general rules for any table. For example every table will have a key column that
        uses the table short name made up of six characters and the suffix _DWK²⁸, a
        TIMESTAMP column and an ORIGIN column.
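
        As an illustration of these conventions, a possible physical definition of a Party major
        entity is sketched below. The six-character short name (PARTYS), the data types and
        the exact column list are assumptions made for illustration; the actual rules are set
        out in Appendix 1, and only lifetime value attributes from the examples above are
        included.

            CREATE TABLE PARTIES (
                PARTYS_DWK      INTEGER      NOT NULL PRIMARY KEY, -- short name + _DWK
                PARTYP_DWK      INTEGER      NOT NULL,             -- reference to PARTY_TYPES
                FORENAME        VARCHAR(50),                       -- individuals only
                SURNAME         VARCHAR(50),
                DATE_OF_BIRTH   DATE,
                DATE_OF_DEATH   DATE,
                GENDER          CHAR(1),
                ORG_NAME        VARCHAR(100),                      -- organisations only
                ORG_START_DATE  DATE,
                ORG_END_DATE    DATE,
                STATE_ID_NUMBER VARCHAR(30),
                "TIMESTAMP"     TIMESTAMP    NOT NULL,             -- load time of the row
                ORIGIN          VARCHAR(30)  NOT NULL              -- source system of the row
            );

        Marital status, number of children, income, turnover and the like are deliberately
        absent: they lack lifetime value and are handled by the property tables described
        later.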




²⁴ http://en.wikipedia.org/wiki/Database_normalization: Database normalization is a technique
for designing relational database tables to minimize duplication of information and, in so
doing, to safeguard the database against certain types of logical or structural problems,
namely data anomalies.
²⁵ This may occasionally be a special case, as postal services do, from time to time, change
postal codes that are normally static.
²⁶ There is a specific special case that deals with the change of name for married women; this
is dealt with in the section ‘The Party Special Case’ later.
²⁷ One insurance company had to deal with updatable genders due to the fact that
underwriting rules require assessment based on birth gender and not gender as a result of
re-assignment surgery. Therefore for marketing it had to handle ‘current’ gender and for
underwriting it had to deal with ‘birth’ gender.
²⁸ See the data modelling rules appendix for how this name is created.







       Type Tables
       There is often a need to categorise information into discrete sets of values. The valid
       set of categories will probably change over time and therefore each category record
        also needs to have lifetime value. Examples of this categorisation have already
        occurred with some of the major entities:

           •   Party:                         Individual, Organisation, Organisation Unit
           •   Geography:                     Postal Address, Location
           •   Electronic Address:            Telephone, E-Mail

       To support this and to comply with the requirement for convention over configuration all
       _TYPES tables of this format have a standard data model as follows:

           •   The table will have the same name as the major entity but with the suffix
               _TYPES (e.g. PARTY_TYPES, GEOGRAPHY_TYPES, etc.).
           •   The table will always have a key column that uses the six character short code
               and the _DWK suffix.
           •   The table will have a _TYPE column that is the type name.
           •   The table will have a _DESC column that is a description of the type.
           •   The table will have a _GROUP column that groups certain types together.
           •   The table will have a _START_DATE column and a _END_DATE column.

       This is a type table in its entirety. If a table needs more information (i.e. columns) then
       this is not a _TYPES table and must not have the _TYPES extension, as it does not
       comply with the rules for a _TYPES table.
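
        Putting these rules together, a _TYPES table can be written down almost
        mechanically. The sketch below shows PARTY_TYPES; the data types are
        assumptions made for illustration, and any audit columns required by the general
        conventions in Appendix 1 are omitted for brevity.

            CREATE TABLE PARTY_TYPES (
                PARTYP_DWK            INTEGER      NOT NULL PRIMARY KEY, -- short name + _DWK
                PARTY_TYPE            VARCHAR(50)  NOT NULL,             -- the type name
                PARTY_TYPE_DESC       VARCHAR(255),                      -- description of the type
                PARTY_TYPE_GROUP      VARCHAR(50)  NOT NULL,             -- groups types together
                PARTY_TYPE_START_DATE DATE         NOT NULL,             -- mandatory start date
                PARTY_TYPE_END_DATE   DATE                               -- null while current
            );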

       Examples of data in _TYPES tables might include:

       PARTY_TYPES

      Column                        Example Rows
      PARTYP_DWK                    1                 2                    3                    4
      PARTY_TYPE                    INDIVIDUAL        LTD COMPANY          PARTNERSHIP          DIVISION
      PARTY_TYPE_DESC               An Individual     A company in         This is a business   A division of a
                                                      which the liability  owned by two or      larger
                                                      of the members in    more people who      organisation
                                                      respect of the       are personally
                                                      company’s debts      liable for all
                                                      is limited           business debts
      PARTY_TYPE_GROUP              INDIVIDUAL        ORGANISATION         ORGANISATION         UNIT
      PARTY_TYPE_START_DATE         01-JAN-1900       01-JAN-1900          01-JAN-1900          01-JAN-1900
      PARTY_TYPE_END_DATE
  Figure 5 - Example data for PARTY_TYPES

        The start date has little initial value in this context, although it is a mandatory field²⁹
        and therefore has to be completed with a date before the earliest party in this
        example. Legal types of organisation do change over time and so it is possible that
        the start and end dates of these will become significant.

        These types do not describe the role that the party is performing (i.e. Customer,
        Supplier, etc.); they describe the type of the party (e.g. Individual, etc.). Describing
        the role comes later. The type and group columns are repeated for INDIVIDUAL, as
        there is no hierarchy of information for this value but the field is mandatory.


²⁹ Start dates in _TYPES tables are mandatory as, with only a few exceptions, they are
required information. In order to be consistent they therefore have to be mandatory for all
_TYPES tables.







       GEOGRAPHY_TYPES

      Column                         Example Rows
      GEOTYP_DWK                     1                            2
      GEOGRAPHY_TYPE                 POSTAL                       LOCATION
      GEOGRAPHY_TYPE_DESC            An address as supported by   A point on the surface of the
                                     the postal service           earth defined by its longitude
                                                                  and latitude
      GEOGRAPHY_TYPE_GROUP           POSTAL                       LOCATION
      GEOGRAPHY_TYPE_START_DATE      01-JAN-1900                  01-JAN-1900
      GEOGRAPHY_TYPE_END_DATE
  Figure 6 - Example Data for GEOGRAPHY_TYPES

        The start date in this context has little initial value, although it is a mandatory field and
        therefore has to be completed with a date.

        These types do not describe the role that the geography is performing (i.e. home
        address, work address, etc.); they describe the type of the geography (postal
        address, point location, etc.).

        The type and group columns are repeated for both values, as there is no hierarchy of
        information for them.

       CALENDAR_TYPES

        The convention over configuration design aspect allows for this table; however, it is
        rarely needed and can therefore be omitted. This is an example where a table can be
        described as designed (i.e. it is known exactly what it looks like) but not implemented.

       _TYPES tables will appear in other parts of the data model but they will always have
       the same function and format.
The consequence of this design re-use is that implementing an application [30] to manage the source of _TYPE data is easy. The system that manages the type data needs only a single table with the same columns as a standard _TYPES table and an additional column called, for example, DOMAIN. This DOMAIN column holds the target system table name (e.g. PARTY_TYPES). The ETL then simply maps the data from the source system to the target system where the DOMAIN equals the target table name. This is an example of re-use generating a significant saving in the implementation.
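
For example, the ETL for the PARTY_TYPES table might be written as follows. This is a minimal sketch: the source table SRC_TYPES, its column names and the PARTYP_DWK key name are all assumptions for illustration.

    INSERT INTO PARTY_TYPES
        (PARTYP_DWK, PARTY_TYPE, PARTY_TYPE_DESC, PARTY_TYPE_GROUP,
         PARTY_TYPE_START_DATE, PARTY_TYPE_END_DATE)
    SELECT TYPE_DWK, TYPE_NAME, TYPE_DESC, TYPE_GROUP,
           TYPE_START_DATE, TYPE_END_DATE
    FROM   SRC_TYPES
    WHERE  DOMAIN = 'PARTY_TYPES';  -- the DOMAIN column routes each row to its target table

The same statement, with only the target table name and the DOMAIN value changed, loads every other _TYPES table.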




[30] This is a good use of a Warehouse Support Application as defined in "An Overview Architecture for Enterprise Data Warehouses".







         Band Tables
Whilst _TYPES tables classify information into discrete values it is sometimes necessary to classify information into ranges or bands, i.e. between one value and another. The classic example of this is telephone calls, which are classified as 'Off-Peak Rate' if they are between 00:00 and 07:59 or between 18:00 and 23:59. Calls between 08:00 and 17:59 are classified as 'Peak Rate' and charged at a premium.

         _BANDS is a special case of the _TYPES table and would store the data as follows:

Column                      Example Rows
TIMBAN_DWK                  1                2             3
TIME_BAND                   Early Off Peak   Peak          Late Off Peak
TIME_BAND_START_VALUE [31]  0                480           1080
TIME_BAND_END_VALUE         479              1079          1439
TIME_BAND_DESC              Early Off Peak   Peak          Late Off Peak
TIME_BAND_GROUP             Off Peak         Peak          Off Peak
TIME_BAND_START_DATE        01-JAN-1900      01-JAN-1900   01-JAN-1900
TIME_BAND_END_DATE
Figure 7 - Example data for TIME_BANDS

Once again the _BANDS table has a standard format, as follows:

     •   The table will have the same name as the major entity but with the suffix
         _BANDS (e.g. TIME_BANDS, etc.).
     •   The table will always have a key column that uses the six character short code
         and the _DWK suffix.
     •   The table will have a _BAND column that is the band name.
     •   The table will have a _START_VALUE and an _END_VALUE column that represent the
         starting and finishing values of the band.
     •   The table will have a _DESC column that is a description of the band.
     •   The table will have a _GROUP column that groups certain bands together.
     •   The table will have a _START_DATE column and an _END_DATE column.

         The table has to comply with this convention in order to be given the _BANDS suffix.
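
A query against a _BANDS table is a simple range lookup. The sketch below classifies calls using the TIME_BANDS data above; the CALLS table and its CALL_START_MINUTES column (minutes since midnight) are assumptions for illustration.

    SELECT c.CALL_ID,
           b.TIME_BAND_GROUP                   -- 'Peak' or 'Off Peak'
    FROM   CALLS c
    JOIN   TIME_BANDS b
      ON   c.CALL_START_MINUTES BETWEEN b.TIME_BAND_START_VALUE
                                    AND b.TIME_BAND_END_VALUE
    WHERE  b.TIME_BAND_END_DATE IS NULL;       -- only bands currently in force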




[31] Note that values are stored as a number of minutes since midnight.







      Property Tables
In the discussion of major entities and lifetime value, the data that failed to meet the lifetime value principle was omitted from the major entity tables; however it still needs to be stored. This is handled via a property table. Property tables also help to support the extensibility aspects of the data model.

If we use PARTY as an example then, as already identified, the marital status does not possess lifetime value and therefore is not included in the major entity. Everyone starts as single, some marry, some divorce and some are widowed; these 'status changes' occur through the lifetime of the individual.

      To deal with this problem the property table can be modelled as follows:




      Figure 8 - Party Properties Example

As can be seen from the example above, in order to handle the properties two new tables are created. The first is the PARTY_PROPERTIES table itself and the second a supporting PARTY_PROPERTY_TYPES table.

      In order to store the marital status of an individual a set of data needs to be entered in
      the PARTY_PROPERTY_TYPES table:

                                    TYPE            GROUP
                                    Single          Marital Status
                                    Married         Marital Status
                                    Divorced        Marital Status
                                    Co-Habiting     Marital Status
                                  Figure 9 - Example Party Property Data

The description, start and end date would be filled in appropriately. Note that the start and end date here represent the start and end date of the type and not that of the individuals' use of that type. [32]

      It is now possible to insert a row in the PARTY_PROPERTIES table that references the
      individual in the PARTY table and the appropriate PARTY_PROPERTY_TYPES (e.g.
      ‘Married’). The PARTY_PROPERTIES table can also hold the start date and end date
      of this status and optionally where appropriate a text or numeric value that relates to
      that property.
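
The two tables might be declared as follows. This is a sketch only; the PARPRT_DWK key name and the column sizes are assumptions based on the conventions described so far.

    CREATE TABLE PARTY_PROPERTY_TYPES (
        PARPRT_DWK                     INTEGER      NOT NULL,  -- assumed key name
        PARTY_PROPERTY_TYPE            VARCHAR(100) NOT NULL,  -- e.g. 'Married'
        PARTY_PROPERTY_TYPE_DESC       VARCHAR(255),
        PARTY_PROPERTY_TYPE_GROUP      VARCHAR(100),           -- e.g. 'Marital Status'
        PARTY_PROPERTY_TYPE_START_DATE DATE         NOT NULL,
        PARTY_PROPERTY_TYPE_END_DATE   DATE
    );

    CREATE TABLE PARTY_PROPERTIES (
        PARTY_DWK          INTEGER NOT NULL,   -- references the PARTIES major entity
        PARTY_PROPERTY_DWK INTEGER NOT NULL,   -- references PARTY_PROPERTY_TYPES
        START_DATE         DATE    NOT NULL,
        END_DATE           DATE,
        TEXT_VALUE         VARCHAR(255),       -- optional text value
        NUMERIC_VALUE      NUMERIC             -- optional numeric value
    );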




[32] The need for start and end dates on such items is often questioned; however experience shows that legislation changes supposedly static values in most countries over the lifetime of the data warehouse. For example in December 2005 the UK permitted a new type of relationship called a civil partnership. http://en.wikipedia.org/wiki/Civil_partnerships_in_the_United_Kingdom






       This means that not only the current marital status can be stored but also historical
       information.
PARTY_DWK [33]   PARTY_PROPERTY_DWK   START_DATE    END_DATE
John Smith       Single               01-Jan-1970   02-Feb-1990
John Smith       Married              03-Feb-1990   04-Mar-2000
John Smith       Divorced             05-Mar-2000   06-Apr-2005
John Smith       Co-Habiting          07-Apr-2005
Figure 10 - Example data for PARTY_PROPERTIES

The data shown here describes the complete history of an individual, with the last row showing the current state as the START_DATE is before 'today' and the END_DATE is null. There is also nothing to prevent future information from being held. If John Smith announces that he is going to get married on a specific date in the future then the current record can have its end date set appropriately and a new record added.
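
The current value of any property can then be retrieved with a simple predicate on the dates, for example (a sketch; the literal party key is illustrative):

    SELECT pp.PARTY_PROPERTY_DWK, pp.START_DATE, pp.END_DATE
    FROM   PARTY_PROPERTIES pp
    WHERE  pp.PARTY_DWK   = 42                 -- the party of interest
    AND    pp.START_DATE <= CURRENT_DATE
    AND    (pp.END_DATE IS NULL OR pp.END_DATE >= CURRENT_DATE);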

       If another property is required (e.g. Number of Children) then no change is required to
       the data model. New rows are entered into the PARTY_PROPERTY_TYPES table:

                                      TYPE      GROUP
                                      Male      Number of Children
                                      Female    Number of Children
                                     Figure 11 - Example Data for PARTY_PROPERTY_TYPES

       This allows data to be added to the PARTY_PROPERTIES as follows:

        PARTY_DWK        PARTY_PROPERTY_DWK              START_DATE            END_DATE      VALUE
        John Smith       Single                          01-Jan-1970           02-Feb-1990
        John Smith       Married                         03-Feb-1990           04-Mar-2000
        John Smith       Divorced                        05-Mar-2000           06-Apr-2005
        John Smith       Co-Habiting                     07-Apr-2005
        John Smith       Male                            09-Jun-2001                         1
        John Smith       Female                          10-Jul-2002                         1
       Figure 12 - Example Data for PARTY_PROPERTIES

       In fact any number of new properties can be added to the tables as business processes
       and source systems change and new data requirements come about.

The effect of this method when compared to other methods of modelling this information is to create very narrow (i.e. not many columns), long (i.e. many rows) tables instead of much wider, shorter tables. However the properties table is very effective. Firstly, unlike the example, the two _DWK columns are integers [34], as are the start and end dates. Many of the _VALUE fields will be NULL, and those that are not will be predominately numeric rather than text values.

       The PARTY_PROPERTY_TYPE acts as a natural partitioning key in those databases
       that support table partitions. This method is very effective in terms of performance and
       storage of data in databases that use column or vector type storage.
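
As an illustration only, the properties table might be list-partitioned on the property type key. The sketch below uses Oracle-style syntax; the partition names and key values are assumptions.

    CREATE TABLE PARTY_PROPERTIES (
        PARTY_DWK          INTEGER NOT NULL,
        PARTY_PROPERTY_DWK INTEGER NOT NULL,
        START_DATE         DATE    NOT NULL,
        END_DATE           DATE,
        TEXT_VALUE         VARCHAR(255),
        NUMERIC_VALUE      NUMERIC
    )
    PARTITION BY LIST (PARTY_PROPERTY_DWK) (
        PARTITION p_marital_status VALUES (1, 2, 3, 4),  -- Single .. Co-Habiting keys
        PARTITION p_children       VALUES (5, 6)         -- Male / Female child counts
    );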



[33] Text from the related table is used in the _DWK column rather than the numeric key for clarity in these examples.
[34] Integers are better than text strings for a number of reasons: they usually require less storage and there is less temptation to mix the requirements of identification and description (a problem clearly illustrated by car registration numbers in the UK). Keys are more reliable when implemented as integers because databases often have key generation mechanisms that deliver unique values. Integers do not suffer from upper/lower case ambiguities and can never contain special characters or ambiguities caused by different padding conventions (trailing spaces or leading zeros).






The real saving in the number of rows is normally less than expected when compared
to more conventional data model techniques that store duplicated rows for changed
data. The example above has seven rows of data. The alternate approach of repeated
sets of data requires six rows of data and considerably more storage because of the
duplicated data:

PARTY_DWK    START_DATE    END_DATE      MARITAL_STATUS   UNKNOWN CHILD   MALE CHILD   FEMALE CHILD
John Smith   01-Jan-1970   02-Feb-1990   Single           0               0            0
John Smith   03-Feb-1990   04-Mar-2000   Married          0               0            0
John Smith   05-Mar-2000   08-Jun-2001   Divorced         0               0            0
John Smith   09-Jun-2001   09-Jul-2002   Divorced         0               1            0
John Smith   10-Jul-2002   06-Apr-2005   Divorced         0               1            1
John Smith   07-Apr-2005                 Co-Habiting      0               1            1
Figure 13 - Example Data for the Alternate Approach

The other main objection to this technique is often described as the cost of matrix transformation of the data: that is, changing the data from rows into columns in the ETL that loads the data warehouse and then changing the columns back to rows in the ETL that loads the data mart(s). This objection is normally due to a lack of knowledge of the appropriate ETL techniques that can make this very efficient, such as the SQL set operations 'UNION', 'MINUS' and 'INTERSECT'.
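
The rows-to-columns direction can be sketched with conditional aggregation, with the set operators then used to compare the result against the previous load (the PARPRT_DWK key name is an assumption, as before):

    SELECT pp.PARTY_DWK,
           MAX(CASE WHEN pt.PARTY_PROPERTY_TYPE_GROUP = 'Marital Status'
                    THEN pt.PARTY_PROPERTY_TYPE END) AS MARITAL_STATUS,
           MAX(CASE WHEN pt.PARTY_PROPERTY_TYPE = 'Male'
                    THEN pp.NUMERIC_VALUE END)       AS MALE_CHILDREN,
           MAX(CASE WHEN pt.PARTY_PROPERTY_TYPE = 'Female'
                    THEN pp.NUMERIC_VALUE END)       AS FEMALE_CHILDREN
    FROM   PARTY_PROPERTIES pp
    JOIN   PARTY_PROPERTY_TYPES pt
      ON   pt.PARPRT_DWK = pp.PARTY_PROPERTY_DWK
    WHERE  pp.END_DATE IS NULL                       -- current rows only
    GROUP  BY pp.PARTY_DWK;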

Event Tables
An event table is almost identical to a property table except that instead of having
_START_DATE and _END_DATE columns it has a single column _EVENT_DATE. It
also has the appropriate _EVENT_TYPES table. The table name has a suffix of
_EVENTS. For example a wedding is an event (happens at a single point in time), but
‘being married’ is a property (happens over a period of time). Events can be stored in
property tables simply by storing the same value in both the start date and end date
columns and this is a more common solution than creating a separate table. The use of
_EVENTS tables is usually limited to places where events form a significant part of the
data and the cost of storing the extra field becomes significant.

It should be noted that this is only required where the event may occur many times
(e.g. a wedding date) rather than information that can only happen once (e.g. first
wedding date) which would be stored in the appropriate major entity as, once set, it
would have lifetime value.




Figure 14 - Party Events Example

_EVENTS tables are a special case of _PROPERTIES tables.








       Link Tables
Up to this point major entity attributes within a single record have been examined. It is also possible that records within a major entity will relate to other records in the same major entity (e.g. John Smith is married to Jane Smith, both of whom are records within the PARTIES table). This is called a peer-to-peer relationship and is stored in a table with the suffix _LINKS and the appropriate _LINK_TYPES table.




       Figure 15 - Party Links Example

        The significant difference in a _LINK table is that there are two relationships from the
        major entity (in this case PARTIES).

        This also allows hierarchies to be stored so that:

                 John Smith (Individual) works in Sales (Organisational Unit)
                 Sales (Organisation Unit) is a division of ACME Enterprises (Organisation)

        where ‘works in’ and ‘is a division of’ are examples of the _LINK_TYPE.

It should also be noted that there is a priority to the relationship because one of the linking fields is the main key (in this case PARTIE_DWK) and the other is the linked key (in this case LINKED_PARTIE_DWK). There are two options. One is to store the relationship in both directions (e.g. John Smith is married to Jane Smith and Jane Smith is married to John Smith); this can be made complete with a reversing view [35] but defeats both the 'Convention over Configuration' principle and the 'DRY (Don't Repeat Yourself)' principle. The second method is to have a convention and only store the relationship in one direction (e.g. John Smith is married to Jane Smith, where the convention could be that the male is stored in the main key and the female in the linked key).
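
A reversing view of this kind might look as follows (a sketch; the link type and date columns are assumed):

    CREATE VIEW PARTY_LINKS_REVERSED AS
    SELECT LINKED_PARTIE_DWK AS PARTIE_DWK,         -- the two keys swap roles
           PARTIE_DWK        AS LINKED_PARTIE_DWK,
           PARTY_LINK_DWK,                          -- assumed link type key column
           START_DATE,
           END_DATE
    FROM   PARTY_LINKS;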




[35] A reversing view is one that has all the same columns as the underlying table except that the two key columns are swapped around. In this example PARTIE_DWK would be swapped with LINKED_PARTIE_DWK.







 Segment Tables
The final type of information that might be required about a major entity is the segment. This is a collection of records from the major entity that share something in common but about which more detail is not known. The most common business example of this would be the market segmentations performed on customers. These segments are normally the result of detailed statistical analysis, with the results then stored.

 In our example John Smith and Jane Smith could both be part of a segment of
 married people along with any number of other individuals for whom it is known that
 they are married but there is no information about when or to whom they are married.

Where the _LINKS table provides the peer-to-peer relationship, the segment provides the peer-group relationship.




 Figure 16 - Party Segments Example








The Sub-Model
The major entities and the six supporting data structures (_TYPES, _BANDS,
_PROPERTIES, _EVENTS, _LINKS, and _SEGMENTS) provide sufficient design pattern
structure to hold a large part of the information in the data warehouse. This is known as a
Major Entity Sub-Model. Significantly the information that has been stored for a single major
entity sub-model is very close to the typical dimensions of a data mart. This design pattern
provides complete temporal support and the ability to re-construct a dimension or dimensions
based on a given set of business rules.

The set of a major entity and the supporting structures is known as a sub-model. For example
the designed PARTY sub-model consists of:

    •    PARTIES

    •    PARTY_TYPES
    •    PARTY_BANDS

    •    PARTY_PROPERTIES
    •    PARTY_PROPERTY_TYPES

    •    PARTY_EVENTS
    •    PARTY_EVENT_TYPES

    •    PARTY_LINKS
    •    PARTY_LINK_TYPES

    •    PARTY_SEGMENTS
    •    PARTY_SEGMENT_TYPES

Those tables shown in bold italics might represent the implemented PARTY sub-model.

Importantly what has not been provided is the relationships between major entities and the
business transactions that occur as a result of the interaction between major entities.








History Tables
Extending the example above it is noticeable that the party does not contain any
address information; this is held in the geography major entity. This is also another
example where current business processes and requirements may change. At the
outset the source system may provide a contract address and a billing address. A
change in process may require the capture of additional information e.g. contact
addresses and installation addresses.

In practice the only difference between this type of relationship between major entities
and the _LINKS relationship is that instead of two references to the same major entity
there is one relationship to each of two major entities.

The data model is therefore relatively simple to construct:




Figure 17 – Party Geography History Example

There is one minor semantic difference between links and histories. _LINKS tables join
back on to the major entity and therefore one half of the relationship has to be given
priority. In a _HISTORY table there is no need for priority as each of the two attributes
is associated with a different major entity.
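
Such a table might be sketched as follows; the GEOGRA_DWK key name and the type column are assumptions following the conventions used so far.

    CREATE TABLE PARTY_GEOGRAPHY_HISTORY (
        PARTIE_DWK               INTEGER NOT NULL,  -- references PARTIES
        GEOGRA_DWK               INTEGER NOT NULL,  -- references the geography major entity
        PARTY_GEOGRAPHY_TYPE_DWK INTEGER NOT NULL,  -- e.g. 'billing address'
        START_DATE               DATE    NOT NULL,
        END_DATE                 DATE
    );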

Finally note that in this example the major entity is shown without the rest of the sub-model, which can be assumed.








Occurrences and Transactions
The final part of the data model is to build up all the occurrence or transaction tables. In the data mart these are most akin to the fact tables, although as this is a relational model they may occur outside a pure star relationship. Like the major entities there is no standard suffix or prefix, just a meaningful name.

To demonstrate what is required an example from a retail bank is described. The example is not nearly as complex as a real bank, but is necessarily longer and more complex than most examples in order to demonstrate a number of features. Furthermore banking has been chosen as an example because the concepts will be familiar to most readers. The example only looks at some core banking functions and not at activities such as marketing or specialist products such as insurance.


      The Example

      The bank has a number of regions and a central ‘premium’ account function that
      caters for some business customers. Each region has a number of branches.
      Branches have a manager and a number of staff. Each branch manager reports
      to a regional manager.

      If a customer has a personal account then the account manager is a branch
      personal account manager, however if the individual has a net worth in excess of
      USD1M the branch manager acts as the account manager. Personal accounts
      have contact and statement addresses and a range of telephone numbers, e-
      mails, addresses, etc.

      If the account belongs to a business with less than USD1M turnover then the
      account manager is a business account manager at the branch who reports to
      the branch manager. If the account belongs to a business with a turnover of
      between USD1M and USD10M then the account manager is an individual at the
      regional office who reports to the regional manager. If the account belongs to a
      business with a turnover more than USD10M then the account managers at the
      central office are responsible for the account. Businesses have contact and
      statement addresses as well as a number of approved individuals who can use
      the company account and contact details for them.

      Branch and account managers periodically review the banding of accounts by
      income for individuals and turnover for companies and if they are likely to move
      band in the coming year then they are added to the appropriate (future) category.
      Note that this is only partially fact based, the rest being based on subjective input
      from account managers.

      The bank offers a range of services including current, loan and deposit accounts,
      credit and debit cards, EPOS (for business accounts only), foreign exchange,
      etc.

      The bank has a number of channels including branches, a call centre service, a
      web service and the ability to use ATMs for certain transactions.

      The bank offers a range of transaction types including cash, cheque, standing
      order, direct debit, interest, service charges, etc.







            After the close of business on the last working day of each month the starting
            and ending balances, the average daily balance and any interest is calculated for
            each account.

            On a daily basis the exposure (i.e. sum of all account balances) is calculated for
            each customer along with a risk factor that is a number between 0 and 100 that
            is influenced by a number of factors that are reviewed from time to time by the
            risk management department. Risk factors might include sudden large deposits
            or withdrawals, closure of a number of accounts, long-term non-use of an
            account, etc. that might influence account managers’ decisions.

            Every transaction that is made is recorded every day and has three associated
            dates, the date of the transaction, the date it appeared on the system and the
            cleared date.

            De-constructing the example

            The bank has a number of regions and a central ‘premium’ account function that
            caters for some business customers. Each region has a number of branches.
            Branches have a manager. Each branch manager reports to a regional manager.

                 •   The bank itself must be held as an organisation.
     •   The regions and central 'premium' account function are held as Organisation Units. [36]
                 •   The bank and the regions have links.
                 •   The branches are held as organisational units.
                 •   The regions and the branches have links.
                 •   The branches have addresses via a history table.
                 •   The branches have electronic addresses via a history table.
     •   There are a number of roles stored as organisation units.
     •   These roles and the individuals have links.
                 •   The roles may have addresses via a history table.
                 •   The roles may have electronic addresses via a history table.
                 •   The individuals may have addresses via a history table.
                 •   The individuals have electronic addresses via a history table.

            At this point only existing major entities and history tables have been used. Also
            this information would be re-usable in many places just like the conformed
            dimensions concept of star schemas but with more flexibility.

            If a customer has a personal account then the account manager is a branch
            personal account manager, however if the individual has a net worth in excess of
            USD1M the branch manager acts as the account manager. Personal accounts
            have contact and statement addresses and a range of telephone numbers, e-
            mails, etc.

                •    Customers are held as Parties, either Individuals or Organisations.
                •    Customers have addresses via a history table.
                •    Customers have electronic addresses via a history table.
                •    Accounts are held in the Accounts major entity.
                •    Customers are related to accounts via a history table.
                •    Branches are related to accounts via a history table.
                •    Accounts are associated with a role via a history table.
                •    An individual’s net worth is generated elsewhere and stored as a property
                     of the party.

[36] See Appendix 2 - Understanding Hierarchies for an explanation as to why the regions are organisational units and not geography.






          •    A high net worth individual is a member of a similarly named segment.
          •    The accounts may have addresses via a history table.
          •    The accounts may have electronic addresses via a history table.

      If the account belongs to a business with less than USD1M turnover then the
      account manager is a business account manager at the branch who reports to
      the branch manager. If the account belongs to a business with a turnover of
      between USD1M and USD10M then the account manager is an individual at the
      regional office who reports to the regional manager. If the account belongs to a
      business with a turnover over USD10M then the account managers at the central
      office are responsible for the account. Businesses have contact and statement
      addresses as well as a number of approved individuals who can use the
      company account, and contact details for them.

           •   Businesses are held as parties.
           •   The business turnover is held as a party property.
           •   The category membership based on turnover is held as a segment.
           •   The businesses may have addresses via a history table.
           •   The businesses may have electronic addresses via a history table.

Branch and account managers periodically review the banding of accounts by income for individuals and turnover for companies and if they are likely to move band in the coming year then they are added to the appropriate (future) category. Note that this is only partially fact based, the rest being based on subjective input from account managers.

           •   There is a need to allow manual input via a warehouse support
               application for the party segments.

      At this point only the PARTY, ADDRESS, ELECTRONIC ADDRESS sub-models
      and associated _HISTORY tables have been used.

The bank offers a range of services including current, loan and deposit accounts, credit and debit cards, EPOS (for business accounts only), foreign exchange, etc.

           •   The product services are held in the product service major entity.
           •   The product services are associated with an account via a history.

      The bank has a number of channels including branches, a call centre service, a
      web service and the ability to use ATMs for certain transactions.

           •   The channels are held in the channels major entity.
           •   The ability to use a channel for a specific product service is held in the
               history that relates the two major entities.

      This adds the PRODUCT_SERVICE and CHANNEL major entities into the
      model.

      The bank offers a range of transaction types including cash, cheque, standing
      order, direct debit, interest, service charges, etc.

           •   This requires a TRANSACTION_TYPE table that will be added to the
               transaction table, which has not yet been defined.

      After the close of business on the last working day of each month the starting
      and ending balances, the average daily balance and any interest is calculated for
      each account.

           •   This is stored as an account property (it may be an event).






      On a daily basis the exposure (i.e. sum of all account balances) is calculated for
      each customer along with a risk factor that is a number between 0 and 100 that
      is influenced by a number of factors that are reviewed from time to time by the
      risk management department. Risk factors might include sudden large deposits
      or withdrawals, closure of a number of accounts, long-term non-use of an
      account, etc. that might influence account managers’ decisions.

           •   The exposure is stored as a party property (or event).
           •   The party risk factor is stored as a party property.

      Everything that is required to describe the transaction table is now available.

      Every transaction that is made is recorded every day and has three associated
      dates, the date of the transaction, the date it appeared on the system and the
      cleared date.

     •   The Transaction Table will have the following columns (a minimal DDL sketch
         follows this list):
            o Transaction Date
            o Transaction System Date
            o Transaction Cleared Date
            o From Account
            o To Account
            o Transaction Type
            o Amount
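
Expressed as a sketch (the table and column names are assumptions based on the description above):

    CREATE TABLE RETAIL_BANKING_TRANSACTIONS (
        TRANSACTION_DATE         DATE    NOT NULL,      -- when the customer made it
        TRANSACTION_SYSTEM_DATE  DATE    NOT NULL,      -- when it appeared on the system
        TRANSACTION_CLEARED_DATE DATE,                  -- null until cleared
        FROM_ACCOUNT_DWK         INTEGER NOT NULL,      -- references ACCOUNTS
        TO_ACCOUNT_DWK           INTEGER NOT NULL,      -- references ACCOUNTS
        TRANSACTION_TYPE_DWK     INTEGER NOT NULL,      -- references TRANSACTION_TYPES
        AMOUNT                   NUMERIC(18,2) NOT NULL -- always positive, see below
    );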

      This would complete the model for the example. There are some interesting
      features to examine. The first is that all amounts would be positive. This is
      because for a credit to an account the ‘from account’ would be the sending party
      and the ‘to account’ would be the customer’s account. For a debit the ‘to account’
      would be the recipient and the ‘from account’ would be the customer’s account.

      This has a number of effects. Firstly it complies with the DRY (Don’t Repeat
      Yourself) principle and means that extra data is not stored for the transaction. It
      also means that a collection of account information not related to any current
      party (e.g. a customer at another bank) is built up. This information is useful in
      the analysis of fraud, churn, market share, competitive analysis, etc.

For a customer analysis data mart the data can be extracted and converted into the positive credit/negative debit arrangement required by the users.
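
A sketch of that extraction, assuming the transaction table above and a hypothetical CUSTOMER_ACCOUNTS list of the accounts of interest:

    SELECT TO_ACCOUNT_DWK AS ACCOUNT_DWK,
           TRANSACTION_DATE,
           AMOUNT         AS SIGNED_AMOUNT   -- credits stay positive
    FROM   RETAIL_BANKING_TRANSACTIONS
    WHERE  TO_ACCOUNT_DWK IN (SELECT ACCOUNT_DWK FROM CUSTOMER_ACCOUNTS)
    UNION ALL
    SELECT FROM_ACCOUNT_DWK,
           TRANSACTION_DATE,
           -AMOUNT                           -- debits become negative
    FROM   RETAIL_BANKING_TRANSACTIONS
    WHERE  FROM_ACCOUNT_DWK IN (SELECT ACCOUNT_DWK FROM CUSTOMER_ACCOUNTS);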

The payment of bank charges and interest would also have accounts, and this information in a different data mart could be used to look at profitability, exposure, etc.

The process has used seven major entity sub-models, an additional type table and an occurrence or transaction table. Storing this information should accommodate and absorb almost any change in business process or source system without the need to change the data warehouse model and will allow multiple data marts to be built from a single data warehouse quickly and easily. In effect the type tables act as metadata for how to use and extend the data model rather than defining the business process explicitly in the data model, hence the name process neutral data modelling.

      It also demonstrates the ability of the data model to support the requirements
      process. By knowing the major entities and using a storyboard approach similar
      to the example above, and familiar as an approach to agile developers, it is
      possible to quickly and easily identify business, data and query requirements.








[Diagram: The Party sub-model (including Individuals, Organisations, Organisation Units and Roles) is connected by History tables to the Addresses sub-model (Postal Address, Point Location) and to the Electronic Addresses sub-model (Telephone Numbers, E-Mail Addresses, Telex). Further History tables connect these to the Accounts sub-model, which is in turn connected by History tables to the Channel sub-model and the Product Service sub-model. The Retail Banking Transactions table sits at the centre, supported by the Calendar sub-model and the Transaction Types table.]

Figure 18 - The Example Bank Data Model






The model above has been almost fully described by this document: the self-similar modelling of all the sub-model components has been covered, along with the history tables, most of the retail banking transactions and some of the lifetime attributes of the major entities. Completing the model simply requires the remaining attributes to be added.

Two other effects that will influence the creation of data marts from this model can also be seen. Firstly the creation of dimensions will revolve around the de-normalisation of the attributes required from each of the major entities into one of the two dimensions associated with account, as these have the hierarchies for the customer, account manager, etc. associated with them.

The second effect is that of the natural star schema. It is clear from this diagram that the fact
tables will be based around the ‘Retail Banking Transactions’ table. As has already been
stated there are several data marts that can be built from this fact table, probably at different
levels of aggregation and with different dimensions.

The occurrence or transaction table above is one of perhaps twenty that a large enterprise would require, along with approximately thirty _HISTORY tables. This would be combined with around twenty major entity sub-models to create an enterprise data warehouse data model.

Readers who are familiar with the Data Management & Warehousing white paper 'How Data Works' [37], which describes natural star schemas in more detail along with a technique called left-to-right entity diagrams, will see a correlation as follows:

Level   Description
1       _TYPE and _BAND tables; simple, small-volume reference data.
2       Major entities; complex, low-volume data.
3       Some major entities that are dependent on others, along with _PROPERTIES and _SEGMENTS
        tables; less complex but with greater volume.
4       _HISTORY tables and some occurrence or transaction tables.
5       Occurrence or transaction tables; significant volume but low-complexity data.
Figure 19 - Volume & Complexity Correlations




[37] Available for download from http://www.datamgmt.com/whitepapers







Implementation Issues
The use of a process neutral data model and a design pattern is meant to ease the design of a system, but there will always be exceptions and things that need further explanation in order to fit them into the solution. Much of this section refers to ETL issues that can only be briefly described in this context. [38]

       The ‘Party’ Special Case
The examples throughout this document have used the PARTY table as a major entity but in practice this is one of the more difficult tables to deal with. The first issue is that in many cases the name does not have lifetime value, for example when a woman gets married or divorced and changes her name, or when a company renames itself. [39] Also individual names often have multiple parts (title, forename, surname).

There is also a requirement to track some form of state identity number. In the United Kingdom an individual has their National Insurance number and in the United States their Social Security number; other numbers (e.g. passport, ID card, etc.) are simply stored as properties. Organisations have other numbers (companies have registration numbers, charities and trusts have different registration numbers, but VAT numbers are properties as they can and do change).

Another minor issue is that people have a date of birth and a date of death. This is simply resolved: date of birth is the Individual Start Date and date of death is the Individual End Date; however this terminology can sometimes prove controversial.

The solution to the PARTY special case depends on the database technology being used. If the database supports the creation of views and the 'UNION ALL' SQL operator [40] then the preferred solution is as follows:

Create the INDIVIDUALS table as follows (a sketch of the combining view is shown after the list):

            •   PARTY_DWK
            •   PARTY_TYPE_DWK
            •   TITLE
            •   FORENAME
            •   CURRENT_SURNAME [41]
            •   PREVIOUS_SURNAME
            •   MAIDEN_SURNAME
            •   DATE_OF_BIRTH
            •   DATE_OF_DEATH
            •   STATE_ID_NUMBER
            •   Other lifetime attributes as required
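
The combining view might then be sketched as follows, assuming a corresponding ORGANISATIONS table; the organisation column names and the common PARTY column names are assumptions for illustration.

    CREATE VIEW PARTIES AS
    SELECT PARTY_DWK,
           PARTY_TYPE_DWK,
           TITLE || ' ' || FORENAME || ' ' || CURRENT_SURNAME AS PARTY_NAME,
           DATE_OF_BIRTH   AS PARTY_START_DATE,
           DATE_OF_DEATH   AS PARTY_END_DATE,
           STATE_ID_NUMBER
    FROM   INDIVIDUALS
    UNION ALL
    SELECT PARTY_DWK,
           PARTY_TYPE_DWK,
           ORGANISATION_NAME,                 -- assumed column
           ORGANISATION_START_DATE,           -- assumed column
           ORGANISATION_END_DATE,             -- assumed column
           REGISTRATION_NUMBER                -- assumed column
    FROM   ORGANISATIONS;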




[38] Data Management & Warehousing provide consultancy on ETL design and techniques to ensure that data warehouses can be loaded effectively regardless of the data modelling approach used.
[39] Interestingly, in Scotland, which has different regulations from England & Wales, birth, marriage and death certificates (also known as vital records) have, since 1855, recognised the importance of knowing the birth names of everyone on the certificate. For example a wedding certificate gives the groom's mother's maiden name, and a married woman's death certificate will also feature her maiden name. Effectively the birth name has lifetime value and all other names are additional information. http://www.scotlandspeople.gov.uk/content/help/index.aspx?r=554&628
[40] Nearly all business intelligence databases support this functionality.
[41] CURRENT_ and PREVIOUS_ are reserved prefixes; see Appendix 1 - Data Modelling Standards.




 
Connections a life in the day of - david walker
Connections   a life in the day of - david walkerConnections   a life in the day of - david walker
Connections a life in the day of - david walker
 
Conspectus data warehousing appliances – fad or future
Conspectus   data warehousing appliances – fad or futureConspectus   data warehousing appliances – fad or future
Conspectus data warehousing appliances – fad or future
 
An introduction to social network data
An introduction to social network dataAn introduction to social network data
An introduction to social network data
 
Using the right data model in a data mart
Using the right data model in a data martUsing the right data model in a data mart
Using the right data model in a data mart
 
Implementing Netezza Spatial
Implementing Netezza SpatialImplementing Netezza Spatial
Implementing Netezza Spatial
 
Storage Characteristics Of Call Data Records In Column Store Databases
Storage Characteristics Of Call Data Records In Column Store DatabasesStorage Characteristics Of Call Data Records In Column Store Databases
Storage Characteristics Of Call Data Records In Column Store Databases
 
UKOUG06 - An Introduction To Process Neutral Data Modelling - Presentation
UKOUG06 - An Introduction To Process Neutral Data Modelling - PresentationUKOUG06 - An Introduction To Process Neutral Data Modelling - Presentation
UKOUG06 - An Introduction To Process Neutral Data Modelling - Presentation
 
Oracle BI06 From Volume To Value - Presentation
Oracle BI06   From Volume To Value - PresentationOracle BI06   From Volume To Value - Presentation
Oracle BI06 From Volume To Value - Presentation
 
Openworld04 - Information Delivery - The Change In Data Management At Network...
Openworld04 - Information Delivery - The Change In Data Management At Network...Openworld04 - Information Delivery - The Change In Data Management At Network...
Openworld04 - Information Delivery - The Change In Data Management At Network...
 

Kürzlich hochgeladen

The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 

Kürzlich hochgeladen (20)

The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 

White Paper - Process Neutral Data Modelling

Synopsis

This paper describes in detail the process for creating an enterprise data warehouse physical data model that is less susceptible to change. Change is one of the largest ongoing costs in a data warehouse, so reducing change reduces the total cost of ownership of the system. This is achieved by removing business-process-specific data and concentrating on core business information.

The white paper examines why data-modelling style is important and how issues arise when using a data model for reporting. It discusses a number of techniques and proposes a specific solution. The techniques are worth considering when building a data warehouse solution even where an organisation decides against using the specific solution.

This paper is intended for a technical audience and for project managers involved with the technical aspects of a data warehouse project.

Intended Audience

Reader                   Recommended Reading
Executive                Synopsis
Business Users           Synopsis
IT Management            Synopsis
IT Strategy              Entire Document
IT Project Management    Entire Document
IT Developers            Entire Document

About Data Management & Warehousing

Data Management & Warehousing is a specialist consultancy in data warehousing, based in Wokingham, Berkshire in the United Kingdom. Founded in 1995 by David M Walker, our consultants have worked for major corporations around the world, including in the US, Europe, Africa and the Middle East. Our clients are invariably large organisations with a pressing need for business intelligence. We have worked in many industry sectors and have specialists in telcos, manufacturing, retail, financial services and transport, as well as technical expertise in many of the leading technologies.

For further information visit our website at: http://www.datamgmt.com

Crossword Clue: Expert Gives Us Real Understanding (4 letters)

Introduction

Commissioning a data warehouse system is a major undertaking, and organisations will invest significant capital in the development of the system. The data model is always a major consideration, and many projects spend a significant part of the budget developing and re-working the initial data model. Unfortunately, projects also often fail to look at the maintenance costs of the data model that they develop. A data model that is fit for purpose when developed will rapidly become an expensive overhead if it needs to change whenever the source systems change. The cost involved is not only in the change to the data model but also in the changes to the ETL that feeds the data model.

This problem is exacerbated by the fact that changes to the data model may be made in a way that is inconsistent with the original design approach. The data model loses transparency and becomes even more difficult to maintain. For many large data warehouse solutions it is not uncommon, within a short time of going live, to have one resource permanently assigned to maintaining the data model and several more assigned to managing the change in the associated ETL.

By understanding the problem and using techniques imported from other areas of systems and software development, as well as change management techniques, it is possible to define a method that greatly reduces this overhead. This white paper sets out an example of the issues from which to develop a statement of requirements for the data model, and then demonstrates a number of techniques which, when used together, can address those requirements in a sustainable way.

The Problem

Data modelling is the process of defining the database structures in which to hold information. To understand the Process Neutral Data Modelling approach, this paper first looks at why these database structures have such an impact on the data warehouse. In order to demonstrate the issues with creating a data model for a data warehouse, more experienced readers are asked to bear with the necessarily simplistic examples that follow.

The Example Company

A company supplies and installs widgets. There are a number of different widget types, each having a name and a specific colour. Each individual widget has a unique serial number and can have a number of red lamps and a number of green lamps plugged into it. The widgets are installed into cabinets at customer sites, and from time to time engineers come in and change the relative numbers of red and green lamps. Cabinets are identified by the customer name and a customer cabinet number. For operational systems [1] the data model might look something like this [2]:

Figure 1 - Initial Operational System Data Model

This simple data model describes both the widget and the cabinet and provides the current combinations. It does not provide any historical context: "What was the previous configuration and when was it changed?"

Notes:
1. Data models in this document are illustrative and should therefore be viewed as suitable for making specific points rather than as complete production-quality solutions. Some errors exist to explicitly demonstrate certain issues.
2. There are several conventions for data modelling. In this and subsequent diagrams the link with a 1 and ∞ represents a one-to-many relationship, where the '1' record is a primary key field and the '∞' represents the foreign key field.

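Figure 1 itself is not reproduced in this transcription. The following is a minimal DDL sketch of how such an operational model might look; WIDGET_LOCATIONS is named in the text (see note 3 below), but the other table and column names are assumptions made for illustration, not the contents of the original figure.

    -- Widget catalogue: each type has a name and a specific colour
    CREATE TABLE WIDGET_TYPES (
        WIDGET_TYPE_ID  INTEGER      NOT NULL PRIMARY KEY,
        NAME            VARCHAR(50)  NOT NULL,
        COLOUR          VARCHAR(20)  NOT NULL
    );

    -- Individual widgets: a unique serial number and current lamp counts
    CREATE TABLE WIDGETS (
        SERIAL_NUMBER   INTEGER      NOT NULL PRIMARY KEY,
        WIDGET_TYPE_ID  INTEGER      NOT NULL REFERENCES WIDGET_TYPES,
        RED_LAMPS       INTEGER      NOT NULL,
        GREEN_LAMPS     INTEGER      NOT NULL
    );

    -- Cabinets are identified by customer name and a customer cabinet number
    CREATE TABLE CABINETS (
        CABINET_ID      INTEGER      NOT NULL PRIMARY KEY,
        CUSTOMER_NAME   VARCHAR(100) NOT NULL,
        CABINET_NUMBER  INTEGER      NOT NULL
    );

    -- Which widget currently sits in which cabinet (current state only, no history)
    CREATE TABLE WIDGET_LOCATIONS (
        SERIAL_NUMBER   INTEGER      NOT NULL PRIMARY KEY REFERENCES WIDGETS,
        CABINET_ID      INTEGER      NOT NULL REFERENCES CABINETS
    );
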
Historical data can be recorded by simply adding a start date and an end date to each of the main tables [3]. This provides the ability to report on the historical configuration. To facilitate this, a separate reporting environment would be set up, because retaining history in the operational system would unacceptably reduce the operational system's performance. There are three consequences of doing this:

• Queries are now more complex. In order to report the information for a given date, the query has to allow for the required date falling between the start date and the end date of the record in each of the tables. The extra complexity slows the execution of the query.

• The volume of data stored has increased. The storage of dates has a minor impact on the size of each row, but this is small when compared to the number of additional rows that need to be stored [4].

• Data has to be moved from the operational system to the reporting system via an extract, transform and load (ETL) process. This process has to extract the data from the operational system, compare the records to the current records in the reporting system to determine if there are any changes and, if so, make the required adjustments to the existing record (e.g. updating the end date) and insert the new record. The process is already more complex and time-consuming than simply copying the data across [5].

Figure 2 - Initial Reporting System Data Model

When the reporting system is built, it accurately reflects the current business processes and operational systems, and it provides historical data. From a systems management perspective there is now an additional database and a series of ETL or interface scripts that have to be run reliably every day. An example of the kind of point-in-time query the first consequence describes is sketched below.

Notes:
3. Note that the WIDGET_LOCATIONS table requires an additional field called INSTALL_SEQUENCE to allow for the case where a widget is re-installed in a cabinet.
4. Assume that everything remains the same except that widgets are moved around (i.e. there are no new widgets and no new cabinet/customer combinations); then the WIDGET_LOCATIONS table grows in direct proportion to the number of changes. If each widget were modified in some way once a month, the reporting system table would be twelve times bigger than the operational system table after one year, and this before any other change is handled.
5. Additional functionality such as data cleansing will also add to the complexity of the ETL and affect performance.

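Continuing the sketch above, and assuming START_DATE and END_DATE columns have been added to each table (with an open END_DATE marking the current row), an as-of query might look like the following. The same predicate has to be repeated for every date-ranged table, which is exactly the extra complexity described above.

    -- Reconstruct the widget/cabinet configuration as it was on 30 June 2008
    SELECT w.SERIAL_NUMBER, w.RED_LAMPS, w.GREEN_LAMPS,
           c.CUSTOMER_NAME, c.CABINET_NUMBER
    FROM   WIDGETS w
    JOIN   WIDGET_LOCATIONS wl ON wl.SERIAL_NUMBER = w.SERIAL_NUMBER
    JOIN   CABINETS c          ON c.CABINET_ID     = wl.CABINET_ID
    -- the as-of predicate, once per date-ranged table:
    WHERE  w.START_DATE  <= DATE '2008-06-30'
    AND   (w.END_DATE    IS NULL OR w.END_DATE  > DATE '2008-06-30')
    AND    wl.START_DATE <= DATE '2008-06-30'
    AND   (wl.END_DATE   IS NULL OR wl.END_DATE > DATE '2008-06-30');
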
The systems architecture may be further enhanced so that the reporting system becomes a data warehouse and the users make their queries against data marts: sets of tables where the data has been re-structured in order to simplify the users' query environment. The data marts typically use star-schema or snowflake-schema data modelling techniques, or tool-specific storage strategies [6]. This adds an additional layer of ETL to move between the data warehouse and the data mart.

However, the company does not stop here. The product development team create a new type of widget. This new widget allows amber lamps and can optionally be mounted in a rack that is in turn mounted in a cabinet. The IT director also insists that the new OLTP application be more flexible for other future developments. These business process changes result in a new data model for the operational system.

Figure 3 - Second Version Operational System Data Model

The reporting system is by now also a live system with a large amount of historical information, and it too has to be re-designed. The operational system will be implemented to meet the business requirements and timescales regardless of whether the reporting system is ready, and it may not be possible to create the history required by the new data model when it is changed [7]. If a data mart is built from the data warehouse there are two impacts: firstly, the data mart model will need to be changed to exploit the new data; secondly, the change to the data warehouse model will require the data mart ETL to be modified regardless of any changes to the data mart data model.

The example company does not stop here, however, as senior management decide to acquire a smaller competitor. The new subsidiary has its own systems that reflect its own business processes. The data warehouse was built with a promise of providing integrated management reporting, so there is an expectation that the data from the new source system will be quickly and seamlessly integrated into the data warehouse. From a technical perspective this presents issues around mapping the new source system data model to the existing data warehouse data model, critical information data types [8], duplication of keys [9], etc., all of which cause problems with the integration of data and therefore slow down the processing.

Within a few short iterations of change it is possible to see the dramatic impact on the data warehouse, and that the system is likely to run into issues.

Notes:
6. This is accepted good practice; the design and implementation of data marts is outside the scope of this paper.
7. A common example of this is an organisation that captures whether an individual is married or not. Later the organisation decides to capture the name of the partner if someone is married. It is not possible to create the historical information systemically, so for a period of time the system has to support the continued use of the marital status and then possibly run other activities, such as outbound calling, to complete the missing historical data.
8. The example database assumed that the serial number was numeric and used it as a primary key, but what happens if the acquired company uses alphanumeric serial numbers?
9. If both companies use numbers starting from 1 for their customer IDs then there will be two customers who have the same 'unique' ID, and customers that have two 'unique' IDs.

The Real World

The example above is designed to illustrate some of the issues that affect data warehouse data modelling. In reality business and technical analysts will handle some of these issues in the design phase, but how big is the data-modelling problem in the real world?

• A UK transport industry organisation has three mainframes, each of which is only allowed to perform one release a quarter. Each system also feeds the data warehouse. As a consequence the mainframe feeds require validation and change every month. Whilst the main data comes from these three systems, there are sixty-five other Unix-based operational systems that feed the data warehouse, and several hundred desktop-based applications that also provide data. Most of these source systems do not have good change control or governance procedures to assist in impact analysis. Change for this organisation is business as usual.

• A global ERP vendor supplies a system with over five thousand database objects and typically makes a major release every two years and a 'dot' release every six months, with numerous patches and fixes in between each major release. This type of ERP system is in use in nearly every major company, and the data is a critical source for most data warehouses.

• A global food and drink manufacturer that came into existence as a result of numerous mergers and acquisitions, and that also divested some assets, found itself with one hundred and thirty-seven general ledger instances in ten countries with seventeen different ERP packages. Even where the ERP packages were the same, they were not necessarily at the same version. The business intelligence requirement was for a single data warehouse and a single data model.

• A European telco purchased a three-hundred-table 'industry standard' enterprise data model from a major business intelligence vendor and then spent two years analysing it before starting the implementation. Within six months of implementation they had changed some sixty percent of the tables as a result of analysis omissions.

• A UK-based banking and insurance business outsources all of its product management to business partners and only maintains the unified customer management systems (website, call centres and marketing). As a result nearly all of the 'source systems' are external to the organisation, and whilst there are contractual agreements that the format and data remain fixed, in practice there is significant regular change in the format and information provided to both operational and reporting systems.

Obviously these issues cannot be fixed just by creating the correct data model for the data warehouse [10], but the objective of the data model design should be twofold:

• To ensure that all the required data can be stored effectively in the data warehouse.
• To ensure that the design of the data model does not impose cost and, where possible, actively reduces the cost of change on the system.

Notes:
10. Data Management & Warehousing have published a number of other white papers, available at http://www.datamgmt.com, that look at other aspects of data warehousing and address some of these issues. See Further Reading at the end of this document for more details.

The Customer Paradigm

Data warehouse developments often start with a requirements gathering exercise. This may take the form of interviews or workshops where people try to define what the customer is. If a number of different parts of the business are involved, the definition of customer soon becomes confused and controversial, and this negatively impacts the project.

Most organisations have a sales funnel that describes the process of capturing, qualifying, converting and retaining customers. Marketing say that the customer is anyone and everyone that they communicate with. The sales teams view the customer as those organisations in their qualified lead database, or for whom they have account management responsibility post-sales. The customer services team are clear that the customer is only those organisations who have purchased a product and, where appropriate, have purchased a support agreement as well. Other questions are asked in the workshops, such as "What about customers who are also suppliers or partners?" and "How do we deal with customers who have gone away and then come back after a long period of time?"

Figure 4 - The Sales Funnel

The most common solutions that result either add 'flag' or 'indicator' columns to the customer table to represent each category, or create multiple tables for the different categories required and repeat the data in each of the tables. This clearly demonstrates the business process being embedded into the data model: the current business process definition(s) of customer are dictating how the data model is created.

What has been forgotten is that these 'customers' exist outside the organisation, and it is their interaction with different parts of the organisation that defines their status as customer, supplier, etc. In legal documents there is the concept of a 'party', where a party is a person or group of persons that compose a single entity that can be identified as one for the purposes of the law [11]. This definition is one that should be borrowed and used in the data model.

If users query a data mart that is loaded with data extracted from the transaction repository, and data marts are built for a specific team or function that only requires one definition of the data, then the current definition can be used to build that data mart, and different definitions can be used for other departments [12]. The sketch below illustrates the difference between the two approaches.

Notes:
11. http://en.wikipedia.org/wiki/Party_(law)
12. This also allows flexibility as business processes change: it is possible, at a cost, to change the rules by which data is extracted. The cost of this change is much lower than trying to rebuild the data warehouse and data mart with a new definition.

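As a hedged illustration (all table, column and view names here are hypothetical, not taken from the paper's own model): the first block bakes the current process definitions into the table as flags; the second keeps a neutral PARTY table and derives one department's definition of 'customer' in the data mart layer.

    -- Process-embedded approach: role flags multiply as processes change
    CREATE TABLE CUSTOMERS (
        CUSTOMER_ID  INTEGER      NOT NULL PRIMARY KEY,
        NAME         VARCHAR(100) NOT NULL,
        IS_PROSPECT  CHAR(1),
        IS_CUSTOMER  CHAR(1),
        IS_SUPPLIER  CHAR(1)
    );

    -- Process-neutral approach: a party exists independently of any role
    CREATE TABLE PARTIES (
        PARTY_DWK    INTEGER      NOT NULL PRIMARY KEY,
        PARTY_NAME   VARCHAR(100) NOT NULL
    );

    CREATE TABLE CONTRACTS (
        CONTRACT_DWK  INTEGER     NOT NULL PRIMARY KEY,
        PARTY_DWK     INTEGER     NOT NULL REFERENCES PARTIES,
        CONTRACT_TYPE VARCHAR(20) NOT NULL,
        END_DATE      DATE                 -- null while the contract is active
    );

    -- Customer Services' qualified definition, derived for their data mart:
    -- "parties with an active service contract"
    CREATE VIEW CS_ACTIVE_CUSTOMERS AS
    SELECT p.PARTY_DWK, p.PARTY_NAME
    FROM   PARTIES p
    JOIN   CONTRACTS c ON c.PARTY_DWK = p.PARTY_DWK
    WHERE  c.CONTRACT_TYPE = 'SERVICE'
    AND    c.END_DATE IS NULL;

Other departments would derive their own views from the same PARTIES table, so the warehouse itself never has to change when a definition of 'customer' does.
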
As a result of this approach two questions are common:

• Isn't one of the purposes of building a data warehouse to have a single version of the truth?
Yes. There is a single version of the truth in the data warehouse, and this single version is perpetuated into the data marts; the difference is that the information in the data mart is qualified. Asking the question "How many customers do we have?" should get the answer "Customer Services have X active service contract customers" and not the answer "X" without any further qualification.

• What happens if different teams or departments have different data?
People within the organisation work within different processes and with the same terminology but often different definitions. It is unlikely and impractical in the short term to change this, although it is possible that in the long term the data warehouse project will help with the standardisation process. In the meantime it is an education process to ensure that answers are qualified. It is important to recognise that different departments legitimately have different definitions, and therefore to recognise and understand the differences rather than fighting about who is right.

It might be argued that there are too many differences to put all individuals and organisations in a single table; this and other issues are discussed later in the paper.

Requirements of a Data Warehouse Data Model

Having looked at the problems that can affect a data warehouse data model, it is possible to describe the requirements that should be made of any data model design.

Assumptions

1. The data model is for use in the architectural component called the transaction repository [13] or data warehouse.
2. As the data model is used in the data warehouse, it will not be a place where users go to query the data; instead users will query separate dependent data marts.
3. As the data model is used in the data warehouse, data will be extracted from it by ETL tools to populate the data marts.
4. As the data model is used in the data warehouse, data will be loaded into it from the source systems by ETL tools.
5. Direct updates (i.e. not through formally released ETL processes) will be prohibited; instead a separate application or applications will exist as a surrogate source.
6. The data model will not be used in a 'mixed mode' where some parts use one data modelling convention and other parts use another. (This is generally bad practice with any modelling technique, but it is often the outcome where the responsibility for data modelling is distributed or re-assigned over time.)

Requirements

1. The data model will work on any standard business intelligence relational database [14]. This is to ensure that it can be deployed on any current platform and, if necessary, re-deployed on a future platform.
2. The data model will be process neutral, i.e. it will not reflect current business processes, practices or dependencies, but will instead store the data items and relationships as defined by their use at the point in time when the information is acquired.
3. The data model will use a design pattern [15], i.e. a general reusable solution to a commonly occurring problem. A design pattern is not a finished design but a description or template for how to solve a problem that can be used in many different situations.

Notes:
13. For further information on transaction repositories see the Data Management & Warehousing white paper "An Overview Architecture For Enterprise Data Warehouses".
14. A typical list would (at the time of writing) include IBM DB2, Microsoft SQL Server, Netezza, Oracle, Sybase, Sybase IQ and Teradata. For the purposes of this document it implies compliance with at least the SQL92 standard.
15. http://en.wikipedia.org/wiki/Software_design_pattern

4. Convention over configuration [16]: this is a software design paradigm that seeks to decrease the number of decisions developers need to make, gaining simplicity without necessarily losing flexibility. It can be applied successfully to data modelling, reducing the number of decisions the data modeller makes by ensuring that tables and columns use a standard naming convention and are populated and queried in a consistent fashion. This also has a significant impact on the effort required of an ETL developer.
5. The design should also follow the DRY (Don't Repeat Yourself) principle. This is a process philosophy aimed at reducing duplication. The philosophy emphasises that information should not be duplicated, because duplication increases the difficulty of change, may decrease clarity, and leads to opportunities for inconsistency [17].
6. The data model should be significantly static over a long period of time, i.e. there should not be a need to add or modify tables on a regular basis. Here there is a difference between designed and implemented: it is possible to have designed a table but not to implement it until it is actually required. This does not affect the static nature of the data model, as the placeholder already exists.
7. The data model should store data at the lowest possible level [18] and avoid the storage of aggregates.
8. The data model should support the best use of platform-specific features whilst not compromising the design [19].
9. The data model should be completely time-variant, i.e. it should be possible to reconstruct the information at any available point in time [20].
10. The data model should act as a communication tool to aid the refinement of requirements and the explanation of possibilities.

Notes:
16. For further information see http://en.wikipedia.org/wiki/Convention_over_Configuration and http://softwareengineering.vazexqi.com/files/pattern.html. The Ruby on Rails framework (http://www.rubyonrails.org/) makes extensive use of this principle.
17. DRY is a core principle of Andy Hunt and Dave Thomas's book The Pragmatic Programmer. They apply it quite broadly to include "database schemas, test plans, the build system, even documentation." When the DRY principle is applied successfully, a modification of any single element of a system does not change other logically unrelated elements. Additionally, elements that are logically related all change predictably and uniformly, and are thus kept in sync (http://en.wikipedia.org/wiki/DRY). This does not automatically imply database normalisation, but database normalisation is one method for ensuring 'dryness'.
18. This is the origin of the term 'transaction repository' rather than 'data warehouse' in Data Management & Warehousing documentation: the transaction repository stores the lowest level of data that is practical and/or available. (See "An Overview Architecture for Enterprise Data Warehouses".)
19. This turns out to be both simple and very effective. For Oracle the most common features that need support include partitioning and materialized views. For Sybase IQ and Netezza there is a preference for inserts over updates due to their internal storage mechanisms. For all databases there is variation in indexing strategies. These and other features should be easily accommodated.
20. Also known as temporal. Most data warehouses are not linearly time-variant but quantum time-variant. If a status field is updated three times in a day and the data warehouse reflects all changes, then it is linearly time-variant. If a data warehouse holds only the first and last values because a batch process loads it once a day, then it is quantum time-variant where the quantum is, in this case, one day. Quantum time-variant solutions can only resolve data to the level of the quantum unit of measure.

The Data Model

Having defined the requirements for the data model, it is now possible to start designing it. This is done by breaking down the tables that will be created into different groups depending on how they are used. The section below discusses the main elements of the data model. Some basics, such as naming conventions, standard short names and the keys used in the data model, are not described here; a complete set of data modelling rules and example models can be found in the appendices.

Major Entities

Party is, as described in the customer paradigm section above, an example of a type of table within the Process Neutral Data Modelling method known as a 'major entity'. These are tables that provide the placeholders for all major subject areas of the data model and around which other information is grouped. Each business transaction will relate to a number of major entities. Some major entities are global, i.e. they apply to all types of organisation (e.g. Calendar), and a number of major entities are industry specific (e.g. for telcos, manufacturing, retail, banking, etc.). It would be very unusual for an organisation to need a major entity that was not industry-wide. Below is a list of some of the most common:

• Calendar
Every data warehouse will need a calendar. It should always contain data to the day level and never to parts of the day. In some cases there is a need to support sub-types of calendar for non-Gregorian calendars [21].

• Party
Every organisation will have dealings between parties. This will normally include three major sub-types: individuals, organisations (any formal organisation such as a company, charity, trust, partnership, etc.) and organisational units (the components within an organisation, including the system owner's organisation).

• Geography
The information about where. This is normally sub-typed into two components: address and location. Address information is often limited to postal addresses [22], whilst location is normally described by longitude and latitude via GPS co-ordinates. Other specialist geographic models exist that may need to be taken into account [23].

• Product_Service (also known as Product or as Service)
This is the catalogue of the products and/or services that an organisation supplies.

• Account
Every customer will have at least one account if financial transactions are involved (even organisations that do not think they currently use the concept of an account will do so, as accounting systems always have the concept of a customer with one or more accounts).

Notes:
21. See http://www.qppstudio.net/footnotes/non-gregorian.htm for various calendars; notably 2008 is the Muslim year 1429 and the Jewish year 5768.
22. Some countries, such as the UK, have validated lists of all addresses (see the UK Post Office Postcode Address File at http://www.royalmail.com/portal/rm/jump2?mediaId=400085&catId=400084).
23. Network Rail in the UK use an Engineer's Line Reference, which is based on a linear reference model and refers to a known distance from a fixed point on a track. Switzerland has an entire national co-ordinate system (http://en.wikipedia.org/wiki/Swiss_coordinate_system).

• Electronic_Address
Any electronic address, such as a telephone number, email address, web address, IP address, etc. This is normally sub-typed by the categories used.

• Asset (also known as Equipment)
A physical object that can be uniquely identified (normally by a serial number or similar). This may be used or incorporated in a PRODUCT_SERVICE, or sold to a customer, etc. In the example company, Cabinet, Rack and Widget were all examples of Asset, whilst Widget Type was an example of PRODUCT_SERVICE.

• Component
A physical object that cannot be uniquely identified by a serial number but has a part number, and is used in the make-up of either an asset or a product service. In the example company there was no particular record of the serial numbers of the lamps; however, they would all have had a part number that described the type of lamp to be used.

• Channel
A conceptual route to market (e.g. direct, indirect, web-based, call-centre, etc.).

• Campaign
A marketing exercise that is designed to promote the organisation, e.g. the running of a series of adverts on television.

• Campaign Activities
The running of a specific advert as part of a larger campaign.

• Contract
Depending on the type of business, the relationship between the organisation and its supplier or its customer may require the concept of a contract as well as that of an account.

• Tariff (also known as Price_List)
A set of charges and discounts that can be applied to product services at a point in time.

This list is not comprehensive, but if an organisation can effectively describe its major entities and combine this information with the interactions between them (the occurrences or transactions) then it has the basis of a very successful data warehouse. Major entities can have any meaningful name provided it is not a reserved word in the database or (as will be seen below) a reserved word within the design pattern of Process Neutral Data Modelling.

Readers who are familiar with the concepts of star schemas and data marts will notice that these major entities are very close to the basic dimensions that most data marts use. This should come as no surprise, as these are the major data items of any business regardless of its business processes or specific industry sector, and a data mart is only a simplification of the data presented to the user. This effect is called "natural star schemas" and will be explored in more detail later; a query sketch after this list illustrates the idea.

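The occurrence and transaction tables that record the interactions between major entities are described later in the paper, but a hypothetical query shows why the major entities behave as natural dimensions. SALES_EVENTS and the short names CALEND and PROSER are assumptions made for illustration only.

    -- Each _DWK on the hypothetical event table is a foreign key to a major
    -- entity, so the major entities fall out as the dimensions of a star query
    SELECT cal.CALENDAR_DATE,
           pty.PARTY_NAME,
           prd.PRODUCT_SERVICE_NAME,
           SUM(evt.QUANTITY) AS UNITS
    FROM   SALES_EVENTS     evt
    JOIN   CALENDAR         cal ON evt.CALEND_DWK = cal.CALEND_DWK
    JOIN   PARTIES          pty ON evt.PARTY_DWK  = pty.PARTY_DWK
    JOIN   PRODUCT_SERVICES prd ON evt.PROSER_DWK = prd.PROSER_DWK
    GROUP BY cal.CALENDAR_DATE, pty.PARTY_NAME, prd.PRODUCT_SERVICE_NAME;
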
Lifetime Value

The next decision is which columns (attributes) should be included in the table. Much as in the process of normalising a database [24], the objective is to minimise duplication of data; there is also a requirement to minimise updates. To this end, the attributes that are included should have 'lifetime value', i.e. they should remain constant once they have been inserted into the database. This means that variable data needs to be handled elsewhere. Using some of the major entities above as examples:

Calendar:
  Lifetime value attributes: Date, Public Holiday Flag

Geography:
  Lifetime value attributes: Address Line 1, Address Line 2, City, Postcode [25], County, Country
  Non-lifetime value attributes: Population

Party (Individuals):
  Lifetime value attributes: Forename, Surname [26], Date of Birth, Date of Death, Gender [27], State ID Number
  Non-lifetime value attributes: Marital Status, Number of Children, Income

Party (Organisations):
  Lifetime value attributes: Name, Start Date, End Date, State ID Number
  Non-lifetime value attributes: Number of Employees, Turnover, Shares Issued

Account:
  Lifetime value attributes: Account Number, Start Date, End Date
  Non-lifetime value attributes: Balance

Other than this lifetime value requirement for columns, every table must comply with the general rules for any table. For example, every table will have a key column that uses the table short name made up of six characters and the suffix _DWK [28], a TIMESTAMP column and an ORIGIN column. A sketch of a major entity built to these rules follows.

Notes:
24. http://en.wikipedia.org/wiki/Database_normalization: database normalisation is a technique for designing relational database tables to minimise duplication of information and, in so doing, to safeguard the database against certain types of logical or structural problems, namely data anomalies.
25. This may occasionally be a special case, as postal services do, from time to time, change postal codes that are normally static.
26. There is a specific special case that deals with the change of name for married women; this is dealt with in the section 'The Party Special Case' later.
27. One insurance company had to deal with updatable genders because underwriting rules require assessment based on birth gender and not gender as a result of re-assignment surgery. For marketing it therefore had to handle 'current' gender and for underwriting 'birth' gender.
28. See the data modelling rules appendix for how this name is created.

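A minimal sketch of a major entity built to these rules, using the individual attributes listed above. The exact six-character short name for PARTY and the column sizes are assumptions; the authoritative conventions are in Appendix 1.

    CREATE TABLE PARTIES (
        PARTY_DWK      INTEGER     NOT NULL PRIMARY KEY,  -- short name + _DWK (assumed here)
        PARTYP_DWK     INTEGER     NOT NULL,              -- type: individual, organisation, ...
        FORENAME       VARCHAR(50),                       -- lifetime value attributes only;
        SURNAME        VARCHAR(50),                       -- marital status, income, etc. are
        DATE_OF_BIRTH  DATE,                              -- variable and so held elsewhere
        DATE_OF_DEATH  DATE,
        GENDER         CHAR(1),
        STATE_ID_NUM   VARCHAR(20),
        TIMESTAMP      TIMESTAMP   NOT NULL,              -- when the row was loaded
        ORIGIN         VARCHAR(30) NOT NULL               -- which source system supplied it
    );
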
Type Tables

There is often a need to categorise information into discrete sets of values. The valid set of categories will probably change over time and therefore each category record also needs to have lifetime value. Examples of this categorisation have already occurred with some of the major entities:

• Party: Individual, Organisation, Organisation Unit
• Geography: Postal Address, Location
• Electronic Address: Telephone, E-Mail

To support this, and to comply with the requirement for convention over configuration, all _TYPES tables of this format have a standard data model as follows:

• The table will have the same name as the major entity but with the suffix _TYPES (e.g. PARTY_TYPES, GEOGRAPHY_TYPES, etc.).
• The table will always have a key column that uses the six character short code and the _DWK suffix.
• The table will have a _TYPE column that is the type name.
• The table will have a _DESC column that is a description of the type.
• The table will have a _GROUP column that groups certain types together.
• The table will have a _START_DATE column and an _END_DATE column.

This is a type table in its entirety. If a table needs more information (i.e. columns) then it is not a _TYPES table and must not have the _TYPES suffix, as it does not comply with the rules for a _TYPES table. Examples of data in _TYPES tables might include:

PARTY_TYPES
Row 1: PARTYP_DWK = 1; PARTY_TYPE = INDIVIDUAL; PARTY_TYPE_DESC = "An Individual"; PARTY_TYPE_GROUP = INDIVIDUAL; PARTY_TYPE_START_DATE = 01-JAN-1900; PARTY_TYPE_END_DATE = (null)
Row 2: PARTYP_DWK = 2; PARTY_TYPE = LTD COMPANY; PARTY_TYPE_DESC = "A company in which the liability of the members in respect of the company's debts is limited"; PARTY_TYPE_GROUP = ORGANISATION; PARTY_TYPE_START_DATE = 01-JAN-1900; PARTY_TYPE_END_DATE = (null)
Row 3: PARTYP_DWK = 3; PARTY_TYPE = PARTNERSHIP; PARTY_TYPE_DESC = "A business owned by two or more people who are personally liable for all business debts"; PARTY_TYPE_GROUP = ORGANISATION; PARTY_TYPE_START_DATE = 01-JAN-1900; PARTY_TYPE_END_DATE = (null)
Row 4: PARTYP_DWK = 4; PARTY_TYPE = DIVISION; PARTY_TYPE_DESC = "A division of a larger organisation"; PARTY_TYPE_GROUP = ORGANISATION UNIT; PARTY_TYPE_START_DATE = 01-JAN-1900; PARTY_TYPE_END_DATE = (null)
Figure 5 - Example data for PARTY_TYPES

The start date has little initial value in this context, although it is a mandatory field [29] and therefore has to be completed with a date before the earliest party in this example. Legal types of organisation do change over time, so it is possible that the start and end dates of these will become significant. These types do not describe the type of role that the party is performing (i.e. Customer, Supplier, etc.); they describe the type of the party (e.g. Individual, etc.). Describing the role comes later. The type and group columns are repeated for INDIVIDUAL, as there is no hierarchy of information for this value but the field is mandatory.

[29] Start Dates in _TYPES tables are mandatory as, with only a few exceptions, they are required information. In order to be consistent they therefore have to be mandatory for all _TYPES tables.
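Because the convention fixes every column, a _TYPES table can be written down mechanically. The sketch below assumes the six-character short code PARTYP for PARTY_TYPES; the data types are illustrative only.

-- Hypothetical _TYPES table following the standard pattern.
-- Every _TYPES table has exactly these columns and no more.
CREATE TABLE PARTY_TYPES (
    PARTYP_DWK             INTEGER      NOT NULL,  -- six-character short code + _DWK
    PARTY_TYPE             VARCHAR(30)  NOT NULL,  -- the type name
    PARTY_TYPE_DESC        VARCHAR(255),           -- description of the type
    PARTY_TYPE_GROUP       VARCHAR(30),            -- groups related types together
    PARTY_TYPE_START_DATE  DATE         NOT NULL,  -- mandatory (see footnote above)
    PARTY_TYPE_END_DATE    DATE,                   -- null while the type is current
    PRIMARY KEY (PARTYP_DWK)
);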
GEOGRAPHY_TYPES
Row 1: GEOTYP_DWK = 1; GEOGRAPHY_TYPE = POSTAL; GEOGRAPHY_TYPE_DESC = "An address as supported by the postal service"; GEOGRAPHY_TYPE_GROUP = POSTAL; GEOGRAPHY_TYPE_START_DATE = 01-JAN-1900; GEOGRAPHY_TYPE_END_DATE = (null)
Row 2: GEOTYP_DWK = 2; GEOGRAPHY_TYPE = LOCATION; GEOGRAPHY_TYPE_DESC = "A point on the surface of the earth defined by its longitude and latitude"; GEOGRAPHY_TYPE_GROUP = LOCATION; GEOGRAPHY_TYPE_START_DATE = 01-JAN-1900; GEOGRAPHY_TYPE_END_DATE = (null)
Figure 6 - Example Data for GEOGRAPHY_TYPES

The start date in this context has little initial value, although it is a mandatory field and therefore has to be completed with a date. These types do not describe the type of role that the geography is performing (i.e. home address, work address, etc.); they describe the type of the geography (postal address, point location, etc.). The type and group columns are repeated for both values, as there is no hierarchy of information for them.

CALENDAR_TYPES
The convention over configuration design aspect allows for this table; however it is rarely needed and can therefore be omitted. This is an example where a table can be described as designed (i.e. it is known exactly what it looks like) but not implemented.

_TYPES tables will appear in other parts of the data model but they will always have the same function and format. The consequence of this design re-use is that implementing an application [30] to manage the source of _TYPES data is easy. The system that manages the type data needs to have a single table with the same columns as a standard _TYPES table and an additional column called, for example, DOMAIN. This DOMAIN column holds the target system table name (e.g. PARTY_TYPES). The ETL then simply maps the data from the source system to the target system where the DOMAIN equals the target table name. This is an example of re-use generating a significant saving in the implementation.

[30] This is a good use of a Warehouse Support Application as defined in "An Overview Architecture for Enterprise Data Warehouses".
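This mapping can be illustrated with a short sketch. It assumes a single source table SRC_TYPES maintained by the warehouse support application; the source table name and its column names are assumptions for the example.

-- Hypothetical ETL mapping: one generic source table feeds every
-- _TYPES table, with the DOMAIN column selecting the target.
INSERT INTO PARTY_TYPES
    (PARTYP_DWK, PARTY_TYPE, PARTY_TYPE_DESC,
     PARTY_TYPE_GROUP, PARTY_TYPE_START_DATE, PARTY_TYPE_END_DATE)
SELECT type_dwk, type_name, type_desc,
       type_group, type_start_date, type_end_date
FROM   SRC_TYPES
WHERE  DOMAIN = 'PARTY_TYPES';   -- the target table name selects the rows

The same statement, with only the target table and the DOMAIN literal changed, loads every other _TYPES table, which is where the saving comes from.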
Band Tables

Whilst _TYPES tables classify information into discrete values, it is sometimes necessary to classify information into ranges or bands, i.e. between one value and another. The classic example of this is telephone calls, which are classified as 'Off-Peak Rate' if they are between 00:00 and 07:59 or between 18:00 and 23:59. Calls between 08:00 and 17:59 are classified as 'Peak Rate' and charged at a premium. _BANDS is a special case of the _TYPES table and would store the data as follows:

TIME_BANDS
Row 1: TIMBAN_DWK = 1; TIME_BAND = Early Off Peak; TIME_BAND_START_VALUE = 0 [31]; TIME_BAND_END_VALUE = 479; TIME_BAND_DESC = Early Off Peak; TIME_BAND_GROUP = Off Peak; TIME_BAND_START_DATE = 01-JAN-1900; TIME_BAND_END_DATE = (null)
Row 2: TIMBAN_DWK = 2; TIME_BAND = Peak; TIME_BAND_START_VALUE = 480; TIME_BAND_END_VALUE = 1079; TIME_BAND_DESC = Peak; TIME_BAND_GROUP = Peak; TIME_BAND_START_DATE = 01-JAN-1900; TIME_BAND_END_DATE = (null)
Row 3: TIMBAN_DWK = 3; TIME_BAND = Late Off Peak; TIME_BAND_START_VALUE = 1080; TIME_BAND_END_VALUE = 1439; TIME_BAND_DESC = Late Off Peak; TIME_BAND_GROUP = Off Peak; TIME_BAND_START_DATE = 01-JAN-1900; TIME_BAND_END_DATE = (null)
Figure 7 - Example data for TIME_BANDS

Once again the _BANDS table has a standard format as follows:

• The table will have the same name as the major entity but with the suffix _BANDS (e.g. TIME_BANDS, etc.).
• The table will always have a key column that uses the six character short code and the _DWK suffix.
• The table will have a _BAND column that is the band name.
• The table will have a _START_VALUE column and an _END_VALUE column that represent the starting and finishing values of the band.
• The table will have a _DESC column that is a description of the band.
• The table will have a _GROUP column that groups certain bands together.
• The table will have a _START_DATE column and an _END_DATE column.

The table has to comply with this convention in order to be given the _BANDS suffix.

[31] Note that values are stored as a number of minutes since midnight.
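A band is resolved with a range join rather than an equality join. A minimal sketch follows, classifying call records by time of day; the CALLS table and its columns are assumptions for the example, with call_minute holding minutes since midnight as in the footnote above.

-- Hypothetical range join: classify each call into its time band.
SELECT c.call_id,
       b.TIME_BAND,
       b.TIME_BAND_GROUP                      -- 'Peak' or 'Off Peak'
FROM   CALLS c
JOIN   TIME_BANDS b
  ON   c.call_minute BETWEEN b.TIME_BAND_START_VALUE
                         AND b.TIME_BAND_END_VALUE
WHERE  b.TIME_BAND_END_DATE IS NULL;          -- only bands current today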
Property Tables

In the discussion of major entities and lifetime value, the data that failed to meet the lifetime value principle was omitted from the major entity tables; however it still needs to be stored. This is handled via a property table. Property tables also help to support the extensibility aspects of the data model.

If we use PARTY as an example then, as already identified, marital status does not possess lifetime value and therefore is not included in the major entity. Everyone starts as single; some marry, some divorce and some are widowed. These 'status changes' occur throughout the lifetime of the individual. To deal with this problem the property table can be modelled as follows:

[Figure 8 - Party Properties Example]

As can be seen from the example above, two new tables are created in order to handle the properties. The first is the PARTY_PROPERTIES table itself and the second a supporting PARTY_PROPERTY_TYPES table. In order to store the marital status of an individual a set of data needs to be entered in the PARTY_PROPERTY_TYPES table:

TYPE = Single; GROUP = Marital Status
TYPE = Married; GROUP = Marital Status
TYPE = Divorced; GROUP = Marital Status
TYPE = Co-Habiting; GROUP = Marital Status
Figure 9 - Example Party Property Data

The description, start and end date would be filled in appropriately. Note that the start and end date here represent the start and end date of the type and not that of an individual's use of that type [32]. It is now possible to insert a row in the PARTY_PROPERTIES table that references the individual in the PARTY table and the appropriate PARTY_PROPERTY_TYPES row (e.g. 'Married'). The PARTY_PROPERTIES table can also hold the start date and end date of this status and, optionally, where appropriate, a text or numeric value that relates to that property.

[32] The need for start and end dates on such items is often questioned; however experience shows that legislation changes supposedly static values in most countries over the lifetime of the data warehouse. For example in December 2005 the UK permitted a new type of relationship called a civil partnership. http://en.wikipedia.org/wiki/Civil_partnerships_in_the_United_Kingdom.
This means that not only the current marital status can be stored but also historical information [33].

PARTY_PROPERTIES
PARTY_DWK = John Smith; PARTY_PROPERTY_DWK = Single; START_DATE = 01-Jan-1970; END_DATE = 02-Feb-1990
PARTY_DWK = John Smith; PARTY_PROPERTY_DWK = Married; START_DATE = 03-Feb-1990; END_DATE = 04-Mar-2000
PARTY_DWK = John Smith; PARTY_PROPERTY_DWK = Divorced; START_DATE = 05-Mar-2000; END_DATE = 06-Apr-2005
PARTY_DWK = John Smith; PARTY_PROPERTY_DWK = Co-Habiting; START_DATE = 07-Apr-2005; END_DATE = (null)
Figure 10 - Example data for PARTY_PROPERTIES

The data shown here describes the complete history of an individual, with the last row showing the current state as the START_DATE is before 'today' and the END_DATE is null. There is also nothing to prevent future information from being held. If John Smith announces that he is going to get married on a specific date in the future then the current record can have its end date set appropriately and a new record added.

If another property is required (e.g. Number of Children) then no change is required to the data model. New rows are entered into the PARTY_PROPERTY_TYPES table:

TYPE = Male; GROUP = Number of Children
TYPE = Female; GROUP = Number of Children
Figure 11 - Example Data for PARTY_PROPERTY_TYPES

This allows data to be added to PARTY_PROPERTIES as follows:

PARTY_DWK = John Smith; PARTY_PROPERTY_DWK = Single; START_DATE = 01-Jan-1970; END_DATE = 02-Feb-1990; VALUE = (null)
PARTY_DWK = John Smith; PARTY_PROPERTY_DWK = Married; START_DATE = 03-Feb-1990; END_DATE = 04-Mar-2000; VALUE = (null)
PARTY_DWK = John Smith; PARTY_PROPERTY_DWK = Divorced; START_DATE = 05-Mar-2000; END_DATE = 06-Apr-2005; VALUE = (null)
PARTY_DWK = John Smith; PARTY_PROPERTY_DWK = Co-Habiting; START_DATE = 07-Apr-2005; END_DATE = (null); VALUE = (null)
PARTY_DWK = John Smith; PARTY_PROPERTY_DWK = Male; START_DATE = 09-Jun-2001; END_DATE = (null); VALUE = 1
PARTY_DWK = John Smith; PARTY_PROPERTY_DWK = Female; START_DATE = 10-Jul-2002; END_DATE = (null); VALUE = 1
Figure 12 - Example Data for PARTY_PROPERTIES

In fact any number of new properties can be added to the tables as business processes and source systems change and new data requirements come about. The effect of this method, when compared to other methods of modelling this information, is to create very narrow (i.e. not many columns), long (i.e. many rows) tables instead of much wider, shorter tables. However the properties table is very efficient. Firstly, unlike the example, the two _DWK columns are integers [34], as are the start and end dates. Many of the _VALUE fields will be NULL, and those that are not will be predominantly numeric rather than text values. The PARTY_PROPERTY_TYPE acts as a natural partitioning key in those databases that support table partitions. This method is also very effective in terms of performance and storage of data in databases that use column or vector type storage.

[33] Text from the related table is used in the _DWK columns rather than the numeric key for clarity in these examples.
[34] Integers are better than text strings for a number of reasons: they usually require less storage and there is less temptation to mix the requirements of identification and description (a problem clearly illustrated by car registration numbers in the UK). Keys are more reliable when implemented as integers because databases often have key generation mechanisms that deliver unique values. Integers do not suffer from upper/lower case ambiguities and can never contain special characters or ambiguities caused by different padding conventions (trailing spaces or leading zeros).
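Retrieving the current value of a property from this structure is a simple predicate on the dates. A minimal sketch follows; the six-character short code PARPRT and the full column names of PARTY_PROPERTY_TYPES follow the _TYPES conventions described earlier but are assumptions here.

-- Hypothetical query: the current marital status of each party.
SELECT pp.PARTY_DWK,
       ppt.PARTY_PROPERTY_TYPE                          -- e.g. 'Married'
FROM   PARTY_PROPERTIES pp
JOIN   PARTY_PROPERTY_TYPES ppt
  ON   ppt.PARPRT_DWK = pp.PARTY_PROPERTY_DWK
WHERE  ppt.PARTY_PROPERTY_TYPE_GROUP = 'Marital Status'
  AND  pp.START_DATE <= CURRENT_DATE                    -- ignore future-dated rows
  AND (pp.END_DATE IS NULL OR pp.END_DATE >= CURRENT_DATE);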
When compared to more conventional data model techniques that store duplicated rows for changed data, the saving in the number of rows offered by those techniques is normally less than expected. The example above has seven rows of data. The alternate approach of repeated sets of data requires six rows of data and considerably more storage because of the duplicated data:

PARTY_DWK = John Smith; START_DATE = 01-Jan-1970; END_DATE = 02-Feb-1990; MARITAL_STATUS = Single; UNKNOWN CHILD = 0; MALE CHILD = 0; FEMALE CHILD = 0
PARTY_DWK = John Smith; START_DATE = 03-Feb-1990; END_DATE = 04-Mar-2000; MARITAL_STATUS = Married; UNKNOWN CHILD = 0; MALE CHILD = 0; FEMALE CHILD = 0
PARTY_DWK = John Smith; START_DATE = 05-Mar-2000; END_DATE = 08-Jun-2001; MARITAL_STATUS = Divorced; UNKNOWN CHILD = 0; MALE CHILD = 0; FEMALE CHILD = 0
PARTY_DWK = John Smith; START_DATE = 09-Jun-2001; END_DATE = 09-Jul-2002; MARITAL_STATUS = Divorced; UNKNOWN CHILD = 0; MALE CHILD = 1; FEMALE CHILD = 0
PARTY_DWK = John Smith; START_DATE = 10-Jul-2002; END_DATE = 06-Apr-2005; MARITAL_STATUS = Divorced; UNKNOWN CHILD = 0; MALE CHILD = 1; FEMALE CHILD = 1
PARTY_DWK = John Smith; START_DATE = 07-Apr-2005; END_DATE = (null); MARITAL_STATUS = Co-Habiting; UNKNOWN CHILD = 0; MALE CHILD = 1; FEMALE CHILD = 1
Figure 13 - Example Data for the repeated-rows alternative to PARTY_PROPERTIES

The other main objection to this technique is often described as the cost of matrix transformation of the data, that is, the changing of the data from columns into rows in the ETL to load the data warehouse and then changing the rows back into columns in the ETL to load the data mart(s). This objection is normally due to a lack of knowledge of appropriate ETL techniques that can make this very efficient, such as using SQL set operations like 'UNION', 'MINUS' and 'INTERSECT' (a sketch follows at the end of this section).

Event Tables

An event table is almost identical to a property table except that instead of having _START_DATE and _END_DATE columns it has a single _EVENT_DATE column. It also has the appropriate _EVENT_TYPES table. The table name has a suffix of _EVENTS. For example a wedding is an event (it happens at a single point in time), but 'being married' is a property (it happens over a period of time).

Events can be stored in property tables simply by storing the same value in both the start date and end date columns, and this is a more common solution than creating a separate table. The use of _EVENTS tables is usually limited to places where events form a significant part of the data and the cost of storing the extra field becomes significant. It should be noted that this is only required where the event may occur many times (e.g. a wedding date) rather than for information that can only happen once (e.g. first wedding date), which would be stored in the appropriate major entity as, once set, it would have lifetime value.

[Figure 14 - Party Events Example]

_EVENTS tables are a special case of _PROPERTIES tables.
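Returning to the matrix transformation: the columns-to-rows direction can be sketched with plain set operations. The wide source table SRC_PARTY and all of its column names below are assumptions for the example.

-- Hypothetical columns-to-rows transformation using UNION ALL:
-- each property column of the wide source becomes a candidate row
-- for PARTY_PROPERTIES.
SELECT party_id,
       'Marital Status'                        AS prop_group,
       marital_status                          AS prop_value,
       effective_date
FROM   SRC_PARTY
UNION ALL
SELECT party_id,
       'Number of Children',
       CAST(number_of_children AS VARCHAR(10)),
       effective_date
FROM   SRC_PARTY;
-- Taking the MINUS of this result against the rows already held in
-- the warehouse then leaves only the changes that need to be applied.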
Link Tables

Up to this point, attributes of a single record within a major entity have been examined. It is also possible that records within a major entity will relate to other records in the same major entity (e.g. John Smith is married to Jane Smith, both of whom are records within the PARTIES table). This is called a peer-to-peer relationship and is stored in a table with the suffix _LINKS, supported by the appropriate _LINK_TYPES table.

[Figure 15 - Party Links Example]

The significant difference in a _LINKS table is that there are two relationships from the major entity (in this case PARTIES). This also allows hierarchies to be stored, so that:

John Smith (Individual) works in Sales (Organisation Unit)
Sales (Organisation Unit) is a division of ACME Enterprises (Organisation)

where 'works in' and 'is a division of' are examples of the _LINK_TYPE. It should also be noted that there is a priority to the relationship, because one of the linking fields is the main key (in this case PARTIE_DWK) and the other is the linked key (in this case LINKED_PARTIE_DWK). There are two options. One is to store the relationship in both directions (e.g. John Smith is married to Jane Smith and Jane Smith is married to John Smith); this can be made complete with a reversing view [35] but defeats both the 'Convention over Configuration' principle and the 'DRY (Don't Repeat Yourself)' principle. The second method is to have a convention and only store the relationship in one direction (e.g. John Smith is married to Jane Smith, where the convention could be that the male is stored in the main key and the female in the linked key).

[35] A reversing view is one that has all the same columns as the underlying table except that the two key columns are swapped around. In this example PARTIE_DWK would be swapped with LINKED_PARTIE_DWK.
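For completeness, the reversing view mentioned in the footnote can be sketched as follows. It assumes a PARTY_LINKS table whose remaining columns (link type key and dates) follow the conventions already described.

-- Hypothetical reversing view: the same columns as PARTY_LINKS but
-- with the two key columns swapped, so a relationship stored once
-- can be read from either side.
CREATE VIEW PARTY_LINKS_REVERSED AS
SELECT LINKED_PARTIE_DWK AS PARTIE_DWK,
       PARTIE_DWK        AS LINKED_PARTIE_DWK,
       PARTY_LINK_TYPE_DWK,
       START_DATE,
       END_DATE
FROM   PARTY_LINKS;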
Segment Tables

The final type of information that might be required about a major entity is the segment. This is a collection of records from the major entity that share something in common, where more detail is not known. The most common business example of this would be the market segmentation performed on customers. These segments are normally the result of detailed statistical analysis, with the results then stored. In our example John Smith and Jane Smith could both be part of a segment of married people, along with any number of other individuals who are known to be married but for whom there is no information about when or to whom they are married. Where the _LINKS table provides the peer-to-peer relationship, the segment provides the peer-group relationship.

[Figure 16 - Party Segments Example]
The Sub-Model

The major entities and the six supporting data structures (_TYPES, _BANDS, _PROPERTIES, _EVENTS, _LINKS and _SEGMENTS) provide sufficient design pattern structure to hold a large part of the information in the data warehouse. The set of a major entity and its supporting structures is known as a Major Entity Sub-Model. Significantly, the information stored for a single major entity sub-model is very close to the typical dimensions of a data mart. This design pattern provides complete temporal support and the ability to re-construct a dimension or dimensions based on a given set of business rules.

For example the designed PARTY sub-model consists of:

• PARTIES
• PARTY_TYPES
• PARTY_BANDS
• PARTY_PROPERTIES
• PARTY_PROPERTY_TYPES
• PARTY_EVENTS
• PARTY_EVENT_TYPES
• PARTY_LINKS
• PARTY_LINK_TYPES
• PARTY_SEGMENTS
• PARTY_SEGMENT_TYPES

Only a subset of these tables might represent the implemented PARTY sub-model; the remainder stay designed but not implemented.

Importantly, what has not yet been provided is the relationships between major entities and the business transactions that occur as a result of the interaction between major entities.
History Tables

Extending the example above, it is noticeable that the party does not contain any address information; this is held in the geography major entity. This is also another example where current business processes and requirements may change. At the outset the source system may provide a contract address and a billing address; a change in process may then require the capture of additional information, e.g. contact addresses and installation addresses.

In practice the only difference between this type of relationship between major entities and the _LINKS relationship is that instead of two references to the same major entity there is one reference to each of two major entities. The data model is therefore relatively simple to construct:

[Figure 17 - Party Geography History Example]

There is one minor semantic difference between links and histories. _LINKS tables join back on to the major entity and therefore one half of the relationship has to be given priority. In a _HISTORY table there is no need for priority, as each of the two attributes is associated with a different major entity.

Finally, note that in this example the major entity is shown without the rest of the sub-model, which can be assumed.
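A minimal sketch of such a history table follows, assuming the six-character short codes PARTIE, GEOGRA and PARGEO and illustrative data types; the exact column names are assumptions consistent with the conventions described earlier.

-- Hypothetical _HISTORY table relating two major entities.
-- Unlike a _LINKS table, neither key takes priority: each
-- references a different major entity.
CREATE TABLE PARTY_GEOGRAPHY_HISTORY (
    PARGEO_DWK                       INTEGER NOT NULL,
    PARTIE_DWK                       INTEGER NOT NULL,  -- the party
    GEOGRA_DWK                       INTEGER NOT NULL,  -- the geography
    PARTY_GEOGRAPHY_HISTORY_TYPE_DWK INTEGER NOT NULL,  -- e.g. billing address
    START_DATE                       DATE    NOT NULL,
    END_DATE                         DATE,               -- null while current
    PRIMARY KEY (PARGEO_DWK)
);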
Occurrences and Transactions

The final part of the data model is to build up all the occurrence or transaction tables. In the data mart these are most akin to the fact tables, although as this is a relational model they may occur outside a pure star relationship. Like the major entities they have no standard suffix or prefix, just a meaningful name. To demonstrate what is required, an example from a retail bank is described. The example is not nearly as complex as a real bank, but is necessarily longer and more complex than most examples in order to demonstrate a number of features. Banking has been chosen as the example because the concepts will be familiar to most readers. The example only looks at some core banking functions and not at activities such as marketing or specialist products such as insurance.

The Example

The bank has a number of regions and a central 'premium' account function that caters for some business customers. Each region has a number of branches. Branches have a manager and a number of staff. Each branch manager reports to a regional manager.

If a customer has a personal account then the account manager is a branch personal account manager; however if the individual has a net worth in excess of USD1M the branch manager acts as the account manager. Personal accounts have contact and statement addresses and a range of telephone numbers, e-mails, addresses, etc.

If the account belongs to a business with less than USD1M turnover then the account manager is a business account manager at the branch who reports to the branch manager. If the account belongs to a business with a turnover of between USD1M and USD10M then the account manager is an individual at the regional office who reports to the regional manager. If the account belongs to a business with a turnover of more than USD10M then the account managers at the central office are responsible for the account. Businesses have contact and statement addresses as well as a number of approved individuals who can use the company account, and contact details for them.

Branch and account managers periodically review the banding of accounts by income for individuals and turnover for companies, and if an account is likely to move band in the coming year then it is added to the appropriate (future) category. Note that this is only partially fact based, the rest being based on subjective input from account managers.

The bank offers a range of services including current, loan and deposit accounts, credit and debit cards, EPOS (for business accounts only), foreign exchange, etc. The bank has a number of channels including branches, a call centre service, a web service and the ability to use ATMs for certain transactions. The bank offers a range of transaction types including cash, cheque, standing order, direct debit, interest, service charges, etc.
After the close of business on the last working day of each month the starting and ending balances, the average daily balance and any interest are calculated for each account.

On a daily basis the exposure (i.e. the sum of all account balances) is calculated for each customer along with a risk factor, a number between 0 and 100 that is influenced by a number of factors that are reviewed from time to time by the risk management department. Risk factors might include sudden large deposits or withdrawals, closure of a number of accounts, long-term non-use of an account, etc. that might influence account managers' decisions.

Every transaction that is made is recorded every day and has three associated dates: the date of the transaction, the date it appeared on the system and the cleared date.

De-constructing the example

The bank has a number of regions and a central 'premium' account function that caters for some business customers. Each region has a number of branches. Branches have a manager. Each branch manager reports to a regional manager.

• The bank itself must be held as an organisation.
• The regions and the central 'premium' account function are held as organisation units [36].
• The bank and the regions have links.
• The branches are held as organisation units.
• The regions and the branches have links.
• The branches have addresses via a history table.
• The branches have electronic addresses via a history table.
• There are a number of roles stored as organisation units.
• These roles and the individuals have links.
• The roles may have addresses via a history table.
• The roles may have electronic addresses via a history table.
• The individuals may have addresses via a history table.
• The individuals have electronic addresses via a history table.

At this point only existing major entities and history tables have been used. This information would also be re-usable in many places, just like the conformed dimensions concept of star schemas but with more flexibility.

If a customer has a personal account then the account manager is a branch personal account manager; however if the individual has a net worth in excess of USD1M the branch manager acts as the account manager. Personal accounts have contact and statement addresses and a range of telephone numbers, e-mails, etc.

• Customers are held as parties, either individuals or organisations.
• Customers have addresses via a history table.
• Customers have electronic addresses via a history table.
• Accounts are held in the Accounts major entity.
• Customers are related to accounts via a history table.
• Branches are related to accounts via a history table.
• Accounts are associated with a role via a history table.
• An individual's net worth is generated elsewhere and stored as a property of the party.

[36] See Appendix 2 - Understanding Hierarchies for an explanation as to why the regions are organisation units and not geography.
• A high net worth individual is a member of a similarly named segment.
• The accounts may have addresses via a history table.
• The accounts may have electronic addresses via a history table.

If the account belongs to a business with less than USD1M turnover then the account manager is a business account manager at the branch who reports to the branch manager. If the account belongs to a business with a turnover of between USD1M and USD10M then the account manager is an individual at the regional office who reports to the regional manager. If the account belongs to a business with a turnover of more than USD10M then the account managers at the central office are responsible for the account. Businesses have contact and statement addresses as well as a number of approved individuals who can use the company account, and contact details for them.

• Businesses are held as parties.
• The business turnover is held as a party property.
• The category membership based on turnover is held as a segment.
• The businesses may have addresses via a history table.
• The businesses may have electronic addresses via a history table.

Branch and account managers periodically review the banding of accounts by income for individuals and turnover for companies, and if an account is likely to move band in the coming year then it is added to the appropriate (future) category. Note that this is only partially fact based, the rest being based on subjective input from account managers.

• There is a need to allow manual input via a warehouse support application for the party segments.

At this point only the PARTY, ADDRESS and ELECTRONIC ADDRESS sub-models and associated _HISTORY tables have been used.

The bank offers a range of services including current, loan and deposit accounts, credit and debit cards, EPOS (for business accounts only), foreign exchange, etc.

• The product services are held in the product service major entity.
• The product services are associated with an account via a history table.

The bank has a number of channels including branches, a call centre service, a web service and the ability to use ATMs for certain transactions.

• The channels are held in the channels major entity.
• The ability to use a channel for a specific product service is held in the history table that relates the two major entities.

This adds the PRODUCT_SERVICE and CHANNEL major entities into the model.

The bank offers a range of transaction types including cash, cheque, standing order, direct debit, interest, service charges, etc.

• This requires a TRANSACTION_TYPES table that will be related to the transaction table, which has not yet been defined.

After the close of business on the last working day of each month the starting and ending balances, the average daily balance and any interest are calculated for each account.

• This is stored as an account property (it may be an event).
On a daily basis the exposure (i.e. the sum of all account balances) is calculated for each customer along with a risk factor, a number between 0 and 100 that is influenced by a number of factors that are reviewed from time to time by the risk management department. Risk factors might include sudden large deposits or withdrawals, closure of a number of accounts, long-term non-use of an account, etc. that might influence account managers' decisions.

• The exposure is stored as a party property (or event).
• The party risk factor is stored as a party property.

Everything that is required to describe the transaction table is now available.

Every transaction that is made is recorded every day and has three associated dates: the date of the transaction, the date it appeared on the system and the cleared date.

• The transaction table will have the following columns:
  o Transaction Date
  o Transaction System Date
  o Transaction Cleared Date
  o From Account
  o To Account
  o Transaction Type
  o Amount

This would complete the model for the example. There are some interesting features to examine. The first is that all amounts would be positive. This is because for a credit to an account the 'from account' would be the sending party and the 'to account' would be the customer's account, while for a debit the 'to account' would be the recipient and the 'from account' would be the customer's account.

This has a number of effects. Firstly it complies with the DRY (Don't Repeat Yourself) principle and means that extra data is not stored for the transaction. It also means that a collection of account information not related to any current party (e.g. a customer at another bank) is built up. This information is useful in the analysis of fraud, churn, market share, competitive analysis, etc. For a customer analysis data mart the data can be extracted and converted into the positive credit/negative debit arrangement required by the users, as sketched below. The payment of bank charges and interest would also have accounts, and this information in a different data mart could be used to look at profitability, exposure, etc.

The process has used seven major entities' sub-models, an additional type table and an occurrence or transaction table. Storing this information should accommodate and absorb almost any change in business process or source system without the need to change the data warehouse model, and will allow multiple data marts to be built from a single data warehouse quickly and easily. In effect the type tables act as metadata for how to use and extend the data model rather than defining the business process explicitly in the data model, hence the name process neutral data modelling.

It also demonstrates the ability of the data model to support the requirements process. By knowing the major entities and using a storyboard approach similar to the example above, an approach familiar to agile developers, it is possible to quickly and easily identify business, data and query requirements.
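The positive-amount convention can be unwound for the customer analysis data mart with a simple transformation. A minimal sketch follows, assuming the transaction table is named RETAIL_BANKING_TRANSACTIONS, that the account short code is ACCOUN, and that CUSTOMER_ACCOUNTS lists the accounts belonging to the customers of interest; all of these names are assumptions for the example.

-- Hypothetical extraction into the credit/debit convention required
-- by data mart users: money arriving is positive, money leaving is
-- negative, from the customer account's point of view.
SELECT t.TRANSACTION_DATE,
       a.ACCOUN_DWK                          AS account_key,
       CASE WHEN t.TO_ACCOUN_DWK = a.ACCOUN_DWK
            THEN  t.AMOUNT                   -- credit to the customer account
            ELSE -t.AMOUNT                   -- debit from the customer account
       END                                   AS signed_amount,
       t.TRANSACTION_TYPE_DWK
FROM   RETAIL_BANKING_TRANSACTIONS t
JOIN   CUSTOMER_ACCOUNTS a
  ON   a.ACCOUN_DWK IN (t.FROM_ACCOUN_DWK, t.TO_ACCOUN_DWK);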
[Figure 18 - The Example Bank Data Model: a diagram showing the Party sub-model (Individuals, Organisations, Organisation Units, Roles), the Addresses sub-model (Postal Address, Point Location), the Electronic Addresses sub-model (Telephone Numbers, E-Mail Addresses, Telex), and the Accounts, Channel, Product Service and Calendar sub-models, all connected via history tables to the Retail Banking Transactions table and its Transaction Types.]
The model above has been almost fully described in detail by this document, since the self-similar modelling for all the sub-model components has been described along with the history tables, most of the retail banking transactions and some of the lifetime attributes of the major entities. Completing the model just requires these additional attributes to be added.

Two other effects that will influence the creation of data marts from this model can also be seen. Firstly, the creation of dimensions will revolve around the de-normalisation of the attributes that are required from each of the major entities into one of the two dimensions associated with account, as these have the hierarchies for the customer, account manager, etc. associated with them. The second effect is that of the natural star schema. It is clear from this diagram that the fact tables will be based around the 'Retail Banking Transactions' table. As has already been stated, there are several data marts that can be built from this fact table, probably at different levels of aggregation and with different dimensions.

The occurrence or transaction table above is one of perhaps twenty that a large enterprise would require, along with approximately thirty _HISTORY tables. These would be combined with around twenty major entity sub-models to create an enterprise data warehouse data model.

Readers familiar with the Data Management & Warehousing white paper 'How Data Works' [37], which describes natural star schemas in more detail along with a technique called left-to-right entity diagrams, will see a correlation as follows:

Level 1: _TYPE and _BAND tables; simple, small volume reference data.
Level 2: Major entities; complex, low volume data.
Level 3: Some major entities that are dependent on others, along with _PROPERTIES and _SEGMENTS tables; less complex but with greater volume.
Level 4: _HISTORY tables and some occurrence or transaction tables.
Level 5: Occurrence or transaction tables; significant volume but low complexity data.
Figure 19 - Volume & Complexity Correlations

[37] Available for download from http://www.datamgmt.com/whitepapers
Implementation Issues

The use of a process neutral data model and a design pattern is meant to ease the design of a system, but there will always be exceptions and things that need further explanation in order to fit them into the solution. Much of this section refers to ETL issues that can only be briefly described in this context [38].

The 'Party' Special Case

The examples throughout this document have used the PARTY table as a major entity, but in practice this is one of the more difficult tables to deal with. The first issue is that in many cases name does not have lifetime value, for example when a woman gets married or divorced and changes her name, or when a company renames itself [39]. Also, individual names often have multiple parts (title, forename, surname).

There is also a requirement to track some form of state identity number. In the United Kingdom an individual has a National Insurance number and in the United States a Social Security number; other numbers (e.g. passport, ID card, etc.) are simply stored as properties. Organisations have other numbers (companies have registration numbers, charities and trusts have different registration numbers, but VAT numbers are properties as they can and do change).

Another minor issue is that people have a date of birth and a date of death. This is simply resolved: date of birth is the Individual Start Date and date of death is the Individual End Date, although this terminology can sometimes prove controversial.

The solution to the PARTY special case depends on the database technology being used. If the database supports the creation of views and the 'UNION ALL' SQL operator [40] then the preferred solution is as follows:

Create the INDIVIDUALS table as follows:

• PARTY_DWK
• PARTY_TYPE_DWK
• TITLE
• FORENAME
• CURRENT_SURNAME [41]
• PREVIOUS_SURNAME
• MAIDEN_SURNAME
• DATE_OF_BIRTH
• DATE_OF_DEATH
• STATE_ID_NUMBER
• Other lifetime attributes as required

[38] Data Management & Warehousing provide consultancy on ETL design and techniques to ensure that data warehouses can be loaded effectively regardless of the data modelling approach used.
[39] Interestingly, in Scotland, which has different regulations from England & Wales, birth, marriage and death certificates (also known as vital records) have, since 1855, understood the importance of knowing the birth names of everyone on the certificate. For example a wedding certificate will give the groom's mother's maiden name, and a married woman's death certificate will also feature her maiden name. Effectively the birth name has lifetime value and all other names are additional information. http://www.scotlandspeople.gov.uk/content/help/index.aspx?r=554&628
[40] Nearly all business intelligence databases support this functionality.
[41] CURRENT_ and PREVIOUS_ are reserved prefixes; see Appendix 1 Data Modelling Standards.
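A minimal sketch of how such a view might then present the separate tables as a single PARTIES major entity is shown below. It assumes a parallel ORGANISATIONS table holding the organisation lifetime attributes; the table name, its columns and the column mappings here are assumptions for illustration, not the definitive implementation.

-- Hypothetical PARTIES view built with UNION ALL over the separate
-- INDIVIDUALS and ORGANISATIONS tables, presenting one major entity
-- to the rest of the model while each side keeps its own columns.
CREATE VIEW PARTIES AS
SELECT PARTY_DWK,
       PARTY_TYPE_DWK,
       FORENAME || ' ' || CURRENT_SURNAME AS PARTY_NAME,
       DATE_OF_BIRTH                      AS PARTY_START_DATE,
       DATE_OF_DEATH                      AS PARTY_END_DATE,
       STATE_ID_NUMBER
FROM   INDIVIDUALS
UNION ALL
SELECT PARTY_DWK,
       PARTY_TYPE_DWK,
       ORGANISATION_NAME,
       ORGANISATION_START_DATE,
       ORGANISATION_END_DATE,
       STATE_ID_NUMBER
FROM   ORGANISATIONS;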