Adapting Data Warehouse Architecture to
Benefit from Agile Methodologies

Badarinath Boyina & Tom Breur
March 2013


INTRODUCTION
Data warehouse (DW) projects are different from other software
development projects in that a data warehouse is a program, not a
project with a fixed set of start and end dates. Each phase of a DW
project can have a start and end date, but this doesn’t hold for the
DW in its entirety. It should therefore be considered an ongoing
process (Kimball et al., 2008). One could argue that a “purely”
departmental DW project could be limited in scope so that it
resembles a more traditional project, but more often than not at
some point in time a need for query and reporting across business
units arises, and the project will turn into a “normal” DW program
again.

The traditional waterfall methodology doesn’t seem to work very
well for controlling DW projects, because of their ever-changing
requirements. Analyst firms such as Gartner and the Standish Group
have reported appallingly bad success statistics for DW projects.
This is why management might favour Agile methodologies for DW
projects: to mitigate risks and ensure better results.

There is uncertainty in any DW project, and that can come in one of
two “flavours”: means and ends uncertainty (Cohn, 2005). The
former refers to uncertainty as to how to deliver requested
functionality. The latter refers to uncertainty as to what exactly
should be delivered. Any software project has to cope with means
uncertainty, but in DW projects we are also faced with ends
uncertainty. Delivering (part of) the requested functionality
invariably leads to new information requests that then trigger calls
for new, unforeseen functionality. Lawrence Corr (Corr & Stagnitto,
2011) refers to the accretive nature of business intelligence
requirements. Such is the intrinsic nature of DW projects, and this
reality and accompanying change needs to be built into the
development process.

One of the characteristics of agile software development is
“continuous refactoring.” This requires special attention in a DW
because new iterations of the data model should not invalidate
historical data that were previously loaded based on a prior data
model. Besides refactoring, changing requirements can also trigger
(unforeseen) expansion of scope or alterations of the current
version of the data model. As a result, data conversion (ETL) and
data validation (Testing) become necessary activities in subsequent
data warehouse iterations. Successful adaptation of Agile
development methodologies to data warehousing boils down to
minimising these ETL and Testing activities by choosing an
appropriate (efficient) architecture for the DW system.

This paper looks into specific challenges when applying Agile
development methodologies to traditional DW data models (i.e. 3NF
and dimensional). We will elaborate on a way to overcome these
problems to more successfully adapt Agile methodologies to DW
projects.

We will also look into two-tier versus three-tier DW architecture,
and why three-tier architectures are better suited to adapt Agile
development methodologies to data warehousing. We will highlight
how Inmon’s initial suggestion of modelling the central hub in third
normal form (3NF) provides insufficient flexibility to deal with
changes. Instead we will make a case for a hyper normalized hub to
connect source system data with end-user owned data marts.
Currently two popular data modelling paradigms for such a central
hub DW that employ hyper normalization are Data Vault (Linstedt,
2011) and Anchor Modelling (Rönnbäck,
www.anchormodeling.com).


AGILE METHODOLOGY
Agile software development refers to a group of software
development methodologies based on iterative development, where
requirements and solutions evolve through close collaboration
within self-organizing, cross-functional teams. The term “Agile”
was coined in 2001 when the Agile Manifesto was formulated (Agile
Manifesto, 2001). Many different flavours of agile can be employed,
such as Extreme Programming, DSDM, Lean Software Development,
Feature Driven Development, or Scrum.

Software methods are considered more agile when they adhere to
the four value statements of the Agile Manifesto (2001):
   1. Individuals and interactions over processes and tools
   2. Working software over comprehensive documentation
   3. Customer collaboration over contract negotiation
   4. Responding to change over following a plan

For this paper we look predominantly at the fourth point:
responding to change in the DW. How quickly and easily can one
respond to requirement changes, and how much effort does that cost?




APPLYING AGILE METHODOLOGIES TO BUILD A DW
The “great debate” that raged between Ralph Kimball and Bill
Inmon in the late 90’s concerned whether one should build DW
systems using 3NF or dimensional modelling. We will look at
adapting Agile methodologies to fulfil business requirements for
both of these DW modelling techniques. Using the case study below,
we first highlight the pain points for 3NF modelling and then for
dimensional modelling. After that we look into a possible solution
to eliminate those pain points.


CASE STUDY
Let’s consider a case study for a retail business that (initially)
wants to analyse the popularity of its products. Assume they have
been analysing their popular products in a certain region, and
notice that the popularity of those products degrades rapidly six
months down the line. Now the business wonders whether this is due
to a change of supplier for some of those products, and wants to
see evidence for this. To let the data speak for itself, the
requirement is to bring supplier data into the data warehouse.
Later, as the business expands, they decide to procure products
from multiple suppliers in order to gain a competitive edge.

Speaking in Agile terminology, the first User Story is to fulfil
the requirement of analysing product popularity by customer region
over a period of time. The Story card might read something along
the lines of: “As a product marketer I want to compare sales across
regions so that I know where to concentrate my marketing activities
to increase sales.”

The second User Story is to enable the business to analyse product
popularity by region and supplier over the same period of time.
The Story card might read something like: “As a product marketer I
want to compare sales across regions and suppliers so that I can
identify whether a decline in sales is due to a change of
supplier.”

The third User Story is to make sure the business is still able to
perform the analyses from the two stories above, even after it
starts procuring the same products from multiple suppliers. The
Story card might read something like: “As a product marketer I want
to compare sales across regions and suppliers so that I can migrate
procurement to more cost-effective parties.”




Product Sales By Region (3NF and dimensional data models)
Considering the business requirement, one could come up with the
data model as depicted in Fig. 1 for a DW using 3NF and as depicted
in Fig. 2 for a DW using dimensional modelling.

                            Fig. 1: 3NF




                    Fig. 2: Dimensional model
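
As a minimal sketch of how these two models could be expressed in
DDL (table and column names are illustrative, chosen to fit the
case study rather than taken from the figures):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Sketch of the Fig. 1 idea: a normalised (3NF) model. Region is a
# parent of Customer; Product and Customer are parents of Sale.
con.executescript("""
CREATE TABLE region   (region_id   INTEGER PRIMARY KEY,
                       region_name TEXT);
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY,
                       customer_name TEXT,
                       region_id   INTEGER REFERENCES region(region_id));
CREATE TABLE product  (product_id  INTEGER PRIMARY KEY,
                       product_name TEXT);
CREATE TABLE sale     (sale_id     INTEGER PRIMARY KEY,
                       sale_date   TEXT,
                       quantity    INTEGER,
                       amount      NUMERIC,
                       product_id  INTEGER REFERENCES product(product_id),
                       customer_id INTEGER REFERENCES customer(customer_id));
""")

# Sketch of the Fig. 2 idea: the equivalent star schema. Region is
# denormalised into the customer dimension; the fact table grain is
# one row per sale line.
con.executescript("""
CREATE TABLE dim_date     (date_key      INTEGER PRIMARY KEY,
                           full_date     TEXT);
CREATE TABLE dim_customer (customer_key  INTEGER PRIMARY KEY,
                           customer_name TEXT,
                           region_name   TEXT);
CREATE TABLE dim_product  (product_key   INTEGER PRIMARY KEY,
                           product_name  TEXT);
CREATE TABLE fact_sales   (date_key      INTEGER REFERENCES dim_date(date_key),
                           customer_key  INTEGER REFERENCES dim_customer(customer_key),
                           product_key   INTEGER REFERENCES dim_product(product_key),
                           quantity      INTEGER,
                           amount        NUMERIC);
""")
```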




Product Sales By Region and Supplier (3NF and dimensional
data models)
Considering the business requirement of analysing sales by
supplier, one needs to bring the supplier data into the DW. The
expanded data model (including suppliers) for a 3NF DW would look
like Fig. 3, and for a dimensional DW like Fig. 4.

                   Fig. 3: 3NF with Supplier data




This implies that all child tables of Supplier are impacted, and
as a result the following activities need to be considered:

- ETL for all impacted tables has to be changed.
- Data needs to be reloaded, provided the (valid) history is still
  present in the source systems.
- Data marts or reports built on this structure are impacted.
- Testing needs to be done on all deliverables, from start to end.

Considering the size of the data model here, all those activities
may look easy. But imagine the effort required for this small
change in a DW system containing terabytes of data, or where a lot
of time has passed since the requirements changed: it would require
considerable (re)processing.
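
To make the cascade concrete, here is a minimal sketch of the kind
of 3NF change behind Fig. 3 (names are illustrative): Supplier
arrives as a new parent of Product, so Product itself changes, its
ETL changes, history has to be reloaded to back-fill the new key,
and everything downstream of Product must be retested.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical starting point: the Fig. 1-style product table, loaded.
con.execute("CREATE TABLE product (product_id INTEGER PRIMARY KEY, "
            "product_name TEXT)")

# The Fig. 3-style change: a new parent table, plus a foreign key on
# Product. The structural change itself is small; the pain is in the
# changed product ETL, the historical reload that must back-fill
# supplier_id, and the retesting of Product's children and reports.
con.executescript("""
CREATE TABLE supplier (supplier_id   INTEGER PRIMARY KEY,
                       supplier_name TEXT);
ALTER TABLE product ADD COLUMN
    supplier_id INTEGER REFERENCES supplier(supplier_id);
""")
```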

When it comes to dimensional modelling one could think of solving
this in one of two ways, as the grain of the fact table hasn’t
changed.

One way to solve this is to have a separate supplier dimension
table; this means that not only is there new ETL for the supplier
dimension, but also changed ETL for the fact table. As a result,
testing needs to be done on existing reports and ETLs.

A second way of solving this problem is to attach the supplier
details to the product dimension. As a result it only impacts the
product dimension ETL.


We would imagine that this second option would be the preferred
choice for Agile practitioners, as it impacts the existing system
(much) less. So the dimensional model would look as depicted in
Fig. 4 once supplier data is brought into the model.

           Fig. 4: Dimensional model with Supplier data




This means the only change here is in the product dimension. So
the ETL for the product dimension needs to be modified, and all
existing dependent objects need to be retested. Even so, the impact
of the change is less painful than in 3NF. The reason is that in a
3NF model, any change to a parent table cascades down to all of its
child (and grandchild) tables.
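
A minimal sketch of this Fig. 4 approach (the supplier attributes
shown are hypothetical): the supplier details simply ride along on
the product dimension, leaving the fact table and its grain
untouched.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Supplier details are denormalised into the product dimension, so
# only the product-dimension ETL changes; the fact table keeps its
# grain and is not touched at all.
con.execute("""
CREATE TABLE dim_product (
    product_key   INTEGER PRIMARY KEY,
    product_name  TEXT,
    supplier_name TEXT,   -- new: supplier attributes on product
    supplier_city TEXT    -- hypothetical extra supplier attribute
)
""")
```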




Product Sales By Region and Supplier with a many to many
relationship between product and supplier (3NF and
dimensional data models)
Considering that the business has expanded and started procuring
products from multiple suppliers, one has to make sure this
many-to-many relationship between supplier and product is taken
care of in data storage and in the corresponding report/dashboard
presentation.

This means no change in the data model for a 3NF DW as shown in
Fig. 5. However, the impact on the dimensional model DW is
significant, as the grain of the fact table has now changed. The
dimensional data model would look like Fig. 6.

                            Fig. 5: 3NF




This means there is little impact on the 3NF data model; however,
one may have to test the reports/dashboards due to the business
change.




Fig. 6: Dimensional model




As the grain changes, one has to rebuild the fact table, which
involves changing the ETL for this fact table, reloading the
historical data, and handling the impact of these changes on
dependent reports; as a result one has to do end-to-end testing.
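
A minimal sketch of the regrained fact table of Fig. 6 (names are
illustrative): supplier now has to appear on every fact row, which
is why the whole table, including its history, must be rebuilt.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# The fact table is rebuilt with a new grain: one row per sale line,
# product AND supplier. Every historical fact row must be reloaded
# to carry the correct supplier_key.
con.executescript("""
CREATE TABLE dim_supplier (supplier_key  INTEGER PRIMARY KEY,
                           supplier_name TEXT);
CREATE TABLE fact_sales   (date_key      INTEGER,
                           customer_key  INTEGER,
                           product_key   INTEGER,
                           supplier_key  INTEGER REFERENCES dim_supplier(supplier_key),
                           quantity      INTEGER,
                           amount        NUMERIC);
""")
```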


CASE STUDY CONCLUSION
In both 3NF and dimensional modelling it is a challenge to respond
to requirement changes, because such changes impact data structures
from previous iterations and hence affect ETLs and reports.
Consequently, the timeline for delivering changes increases
exponentially as the DW expands in scope and over time.

Considering the challenges in implementing data warehouses in a
more Agile way using traditional data models, it makes sense to
look for alternatives to 3NF and dimensional models. The objective
here would be to implement the DW in such a way that it can accept
changes more gracefully. It reminds us of the Chinese proverb:
“Unless you change direction, you are apt to end up where you are
headed.”


A HYPER NORMALIZED DATA MODELLING ALTERNATIVE:
DATA VAULT
In the search for an alternative approach, we came across Data
Vault data modelling as originally promoted by Dan Linstedt for DW
systems (Linstedt, 2011; see also Hultgren, 2012). Data Vault (DV)
is built on the concept of separating business keys from
descriptive attributes. Business keys are generally static, whereas
descriptive attributes tend to be dynamic. A business key is the
information that the business (not IT) uses to uniquely identify an
important entity like “Customer”, “Sale”, or “Shipment”. Because
the descriptive context around a stable key keeps changing, DV
keeps the history of all descriptive attributes at an atomic level,
irrespective of current business requirements, so that changing
demands can be met inexpensively later. End-user requirements are
reporting requests, not storage directives.

The Data Vault approach to data warehousing is similar to Bill
Inmon’s three-tiered (3NF) architecture, but (vastly) different in
the way source system facts get stored in the central hub. Among
other things, the choice to connect Hub tables via many-to-many
Link tables provides “insulation” against changing requirements
(data model changes/expansions), because it prevents change from
cascading from parent to child tables.

In Data Vault modelling you decompose 3NF tables into a set of
flexible component parts. Business keys are stored independently
from descriptive context and history, and independently from
relations. This practice is called Unified Decomposition (Hultgren,
2012): the shared business key is what joins the decomposed tables
back together. Generically, the practice of hyper normalizing by
breaking out business keys from descriptive attributes (and
history), and from relations, is called Ensemble Modelling.

Hultgren (2012): “… data vault modeling will result in more tables
and more joins. While this is largely mitigated by repeating
patterns, templates, selective joins, and automation, the truth is
that these things must be built and maintained.”

The highly standardized data model components make for
straightforward and maintainable ETL that is amenable to
automation. Such metadata (model) driven development dramatically
improves speed of delivery and code hygiene.
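
As a toy sketch of such automation (the staging-table convention
and all names are hypothetical), a repeating load pattern can be
rendered from metadata rather than hand-coded per table:

```python
def satellite_load_sql(sat: str, hub_key: str, attrs: list[str]) -> str:
    """Render the standard satellite-load pattern for one table."""
    cols = ", ".join(attrs)
    # Real templates would add delta detection (load only changed rows);
    # this toy version shows just the repeating INSERT...SELECT skeleton.
    return (f"INSERT INTO {sat} ({hub_key}, load_dts, {cols})\n"
            f"SELECT {hub_key}, CURRENT_TIMESTAMP, {cols} FROM stg_{sat};")

print(satellite_load_sql("sat_product", "product_key", ["product_name"]))
```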




If we think about it, that is what data warehouse systems are for:
to record every single event that happens in source systems. This
approach also partly eliminates the need for preliminary business
requirement gathering to find out for which attributes history is
to be tracked. By “simply” tracking history for every descriptive
attribute, you wind up with a small number of highly standardized
loading patterns, and therefore ETL that is easier to maintain.
Later, reports can be developed with or without reference to
historical changes (Type I or Type II Slowly Changing Dimensions,
SCDs).
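
A minimal sketch of that idea, assuming a hypothetical full-history
satellite sat_product: the stored history itself is the Type II
view, while a Type I “current values” view can be derived at query
time rather than stored.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Every attribute change is kept, stamped with its load timestamp.
CREATE TABLE sat_product (
    product_key  INTEGER,
    load_dts     TEXT,
    product_name TEXT,
    PRIMARY KEY (product_key, load_dts)
);
-- Type I behaviour is derived, not stored: latest row per key.
CREATE VIEW product_current AS
SELECT product_key, product_name
FROM sat_product AS s
WHERE load_dts = (SELECT MAX(load_dts)
                  FROM sat_product
                  WHERE product_key = s.product_key);
""")
```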

If one builds the DW using a DV data model, it would look like
Fig. 7 for User Story 1, like Fig. 8 for User Story 2, and like
Fig. 9 for User Story 3, which is identical to the data model for
User Story 2.

DV Model for User Story 1:

                   Fig. 7: DV model for User Story 1
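
As a minimal sketch of what such a DV model could look like in DDL
(all table and column names are hypothetical): hubs hold only
business keys, the link records the sale as a many-to-many
association, and satellites hold all descriptive history.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hub_product  (product_key  INTEGER PRIMARY KEY,
                           product_no   TEXT UNIQUE,  -- business key
                           load_dts     TEXT, rec_src TEXT);
CREATE TABLE hub_customer (customer_key INTEGER PRIMARY KEY,
                           customer_no  TEXT UNIQUE,  -- business key
                           load_dts     TEXT, rec_src TEXT);
CREATE TABLE link_sale    (sale_key     INTEGER PRIMARY KEY,
                           product_key  INTEGER REFERENCES hub_product(product_key),
                           customer_key INTEGER REFERENCES hub_customer(customer_key),
                           load_dts     TEXT, rec_src TEXT);
CREATE TABLE sat_sale     (sale_key     INTEGER REFERENCES link_sale(sale_key),
                           load_dts     TEXT,
                           sale_date    TEXT,
                           quantity     INTEGER,
                           amount       NUMERIC,
                           PRIMARY KEY (sale_key, load_dts));
CREATE TABLE sat_customer (customer_key INTEGER REFERENCES hub_customer(customer_key),
                           load_dts     TEXT,
                           customer_name TEXT,
                           region_name   TEXT,
                           PRIMARY KEY (customer_key, load_dts));
""")
```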




DV model for User Story 2:

                Fig. 8: DV model with Supplier data




As you can see in Fig. 8, the changes do not impact the existing
structure. Rather, adding Supplier data extends the existing
structure. This example also shows why this architecture provides
superior support for incremental development in tiny iterations
(without triggering prohibitive overhead).

As a result the changes remain local, and hence the additional ETL
and testing effort grows merely linearly over time. This
demonstrates that the response to change is more graceful than it
would have been in traditional DW data models. The impact on the
existing data model as we progress from User Story 1 to User Story
2 is negligible.
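
A minimal sketch of that User Story 2 change in the same
hypothetical DDL: everything is additive; no existing table is
altered.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Existing structure (abbreviated): the User Story 1 product hub.
con.execute("CREATE TABLE hub_product (product_key INTEGER PRIMARY KEY)")

# Supplier arrives as one new hub, one new link, and one new
# satellite. All three follow the standard loading patterns, so the
# extra ETL and testing effort stays local to these tables.
con.executescript("""
CREATE TABLE hub_supplier (supplier_key INTEGER PRIMARY KEY,
                           supplier_no  TEXT UNIQUE,  -- business key
                           load_dts     TEXT, rec_src TEXT);
CREATE TABLE link_product_supplier (
    product_key  INTEGER REFERENCES hub_product(product_key),
    supplier_key INTEGER REFERENCES hub_supplier(supplier_key),
    load_dts     TEXT,
    PRIMARY KEY (product_key, supplier_key));
CREATE TABLE sat_supplier (supplier_key  INTEGER REFERENCES hub_supplier(supplier_key),
                           load_dts      TEXT,
                           supplier_name TEXT,
                           PRIMARY KEY (supplier_key, load_dts));
""")
```

And because link_product_supplier is modelled many-to-many from the
start, the move to multiple suppliers per product in User Story 3
requires no further model change at all, as the next section shows.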




DV model for User Story 3:

                Fig. 9: DV model for User Story 3




The impact of the business change on the existing DV data model as
we progress from User Story 2 to User Story 3 is non-existent: the
models are identical.




CONCLUSION
Agile methodologies can indeed be adapted to DW systems.
However, as we have demonstrated, traditional modelling paradigms
(3NF, dimensional) are unfavourably impacted by “late arriving
requirements”: changes in business needs that have to be
incorporated into an existing data model. It therefore becomes
imperative to model the data differently in order to truly embrace
change, and not let the cost of change grow as the BI solution
expands.

An Agile credo is to deliver value early and continuously. As we
have shown, traditional DW architectures may be able to deliver
early, but their design does not hold up well against changes over
time. That is why “continuous” delivery tends to slow down as the
solution grows. To overcome this problem, we recommend a revival of
three-tiered DW architectures, provided you model the central hub
in some hyper normalized fashion. The result is a design approach
that is resilient to change and that can grow indefinitely at
(approximately) linearly rising cost.

The case study we presented clearly shows that one can adapt Agile
methodologies to DW systems successfully. The proviso is that you
have to design the architecture differently, avoiding accrued
technical debt, so that you can truly embrace change in DW
systems.


REFERENCES

Agile Manifesto (2001) www.agilemanifesto.org

Ralph Kimball, Margy Ross, Warren Thornthwaite, Joy Mundy & Bob
Becker (2008) The Data Warehouse Lifecycle Toolkit, 2nd Ed. ISBN#
0470149779

Mike Cohn (2005) Agile Estimating and Planning. ISBN#
0131479415

Lawrence Corr & Jim Stagnitto (2011) Agile Data Warehouse
Design: Collaborative Dimensional Modeling, from Whiteboard to
Star Schema. ISBN# 0956817203

Dan Linstedt (2011) Super Charge Your Data Warehouse:
Invaluable Data Modeling Rules to Implement Your Data Vault.
ISBN# 1463778686

Hans Hultgren (2012) Modeling the Agile Data Warehouse with Data
Vault. ISBN# 9780615723082


