DAMA, Oregon Chapter, 2012 presentation: an introduction to Data Vault modeling. I will cover parts of the methodology and compare and contrast general issues in the EDW space, followed by a brief technical introduction to the Data Vault modeling method.
After the presentation I will give a live demonstration of the ETL loading layers.
You can find more on-line training at: http://LearnDataVault.com/training
3. The Experts Say…
“The Data Vault is the optimal choice for modeling the EDW in the DW 2.0 framework.” Bill Inmon
“The Data Vault is foundationally strong and exceptionally scalable architecture.” Stephen Brobst
“The Data Vault is a technique which some industry experts have predicted may spark a revolution as the next big thing in data modeling for enterprise warehousing....” Doug Laney
4. More Notables…
“This enables organizations to take control of their data warehousing destiny, supporting better and more relevant data warehouses in less time than before.” Howard Dresner
“[The Data Vault] captures a practical body of knowledge for data warehouse development which both agile and traditional practitioners will benefit from..” Scott Ambler
5. Agenda
• Introduce Yourselves…
• What is a Data Vault? Where does it come from?
• Pros & Cons of Data Modeling for EDW
• Current EDW Issues & Pains
• Consequences of Implementing the Pains…
• How do we “Fix” This?
• Keys to Success
• When “NOT” to use a Data Vault
• Ontologies and Data Vault
• A Working Example
• Query Performance (PIT & Bridge)
• Conclusion (break)
• Live Demo
6. Introduce Yourselves
• Your Expectations?
• Your Questions?
• Your Background?
• Areas of Interest?
• What are the top 3 pains your EDW/BI solution is experiencing?
• About Me…
o http://www.LinkedIn.com/dlinstedt
• Learn More Data Vault on-line at:
o http://LearnDataVault.com/training
7. Where did it come from?
What is it?
Defining the Data Vault Space
8. Data Warehousing Time Line
The Data Vault Model & Methodology took 10 years of R&D to become consistent, flexible, and scalable.
9. What IS a Data Vault? (Business Definition)
• Data Vault Model
o Detail oriented
o Historical traceability
o Uniquely linked set of normalized tables
o Supports one or more functional areas of business
• Data Vault Methodology
– CMMI, Project Plan
– Risk, Governance, Versioning
– Peer Reviews, Release Cycles
– Repeatable, Consistent, Optimized
– Complete with Best Practices for BI/DW
• Data Vault Architecture
– 3 Tier Architecture (for including Batch & Unstructured Data)
– 2 Tier Architecture (for Real-Time only)
10. The Data Vault Model
Elements:
• Hub = List of Unique Business Keys
• Link = List of Relationships, Associations
• Satellite = Descriptive Data
[Diagram: Customer, Product, and Order hubs, each with its own satellites, joined by a single link. The link's satellite records a history of the interaction.]
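The three element types can be sketched as plain records. This is a minimal illustration rather than the physical design used in the presentation: the class names, field names, and sample keys below are all assumptions chosen for readability.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Hub:
    """One unique business key, e.g. a customer number."""
    name: str            # which hub the key belongs to (hypothetical label)
    business_key: str


@dataclass(frozen=True)
class Link:
    """A relationship or association between hub rows."""
    name: str
    hubs: tuple          # the Hub rows this association connects


@dataclass
class Satellite:
    """Descriptive data about a hub or link, kept as history."""
    parent: object       # the Hub or Link being described
    load_date: date
    attributes: dict = field(default_factory=dict)


# Made-up sample keys, for illustration only.
customer = Hub("customer", "CUST-001")
product = Hub("product", "PROD-042")
order = Hub("order", "ORD-9000")

# The link records that these three business keys interacted.
line_item = Link("customer_product_order", (customer, product, order))

# Satellites hang off hubs and links and carry the descriptive context.
customer_name = Satellite(customer, date(2012, 1, 15), {"name": "Example Customer"})
line_item_detail = Satellite(line_item, date(2012, 1, 15), {"quantity": 3})
```

Note that the hubs hold nothing but keys and the link holds nothing but references to hubs; everything descriptive lives in a satellite.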
11. Data Vault Methodology
Follows: SEI/CMMI Level 5, PMP, Six Sigma, TQM, and Agile elements
CMMI maturity levels:
• Level 5: Optimized business processes, repeatable, scalable, fault-tolerant. Automatable (generatable)
• Level 4: Metrics, Estimates vs Actuals, Function Point Analysis, Identification of broken processes
• Level 3: Defined Business Processes, Defined Goals, Defined Objectives
• Level 2: Risk assessments / analysis, managed processes, basic alignment efforts
• Level 1: Process unpredictable and poorly controlled
12. Data Vault Architecture
Enterprise BI Solution
[Diagram: source systems (Sales, Finance, Contracts) arrive in batch through Staging, or in real time via SOA, into the EDW (Data Vault). Unstructured Data (Hadoop NoSQL) also connects to the EDW. Complex Business Rules sit downstream of the EDW and feed Star Schemas, Error Marts, and Report Collections.]
FUNDAMENTAL GOALS
• Repeatable
• Consistent
• Fault-tolerant
• Supports phased release
• Scalable
• Auditable
The business rules are moved closer to the business, improving IT reaction time, reducing cost and minimizing impacts to the enterprise data warehouse (EDW).
13. Star Schemas, 3NF, Data Vault: Pros & Cons
Defining the Data Vault Space
Why NOT use Star Schemas as an EDW?
Why NOT use 3NF as an EDW?
Why NOT use Data Vault as a Data Delivery Model?
14. Star Schema Pros/Cons as an EDW
PROS
• Good for multi-dimensional analysis
• Subject oriented answers
• Excellent for aggregation points
• Rapid development / deployment
• Great for some historical storage
CONS
• Not cross-business functional
• Use of junk / helper tables
• Trouble with VLDW
• Unable to provide integrated enterprise information
• Can’t handle ODS or exploration warehouse requirements
• Trouble with data explosion in near-real-time environments
• Trouble with updates to type 2 dimension primary keys
• Trouble with late arriving data in dimensions to support real-time arriving transactions
• Not granular enough information to support real-time data integration
15. 3NF Pros/Cons as an EDW
PROS
• Many to many linkages
• Handle lots of information
• Tightly integrated information
• Highly structured
• Conducive to near-real time loads
• Relatively easy to extend
CONS
• Time driven PK issues
• Parent-child complexities
• Cascading change impacts
• Difficult to load
• Not conducive to BI tools
• Not conducive to drill-down
• Difficult to architect for an enterprise
• Not conducive to spiral/scope controlled implementation
• Physical design usually doesn’t follow business processes
16. Data Vault Pros/Cons as an EDW
PROS
• Supports near-real time and batch feeds
• Supports functional business linking
• Extensible / flexible
• Provides rapid build / delivery of star schemas
• Supports VLDB / VLDW
• Designed for EDW
• Supports data mining and AI
• Provides granular detail
• Incrementally built
CONS
• Not conducive to OLAP processing
• Requires business analysis to be firm
• Introduces many join operations
17. Analogy: The Porsche, the SUV and the Big Rig
• Which would you use to win a race?
• Which would you use to move a house?
• Would you adapt the truck and enter a race with Porsches and expect to win?
18. Current EDW Issues and Pains
Business Rule Processing, Lack of Agility, and Future proofing your new solution
19. Current EDW Project Issues
This is NOT what you want happening to your project!
THE GAP!!
20. 2 Tier EDW Architecture
Enterprise BI Solution
[Diagram: Sales (batch), Finance, and Contracts sources pass through Complex Business Rules + Dependencies into Staging (EDW), which holds Staging + History, and then through Complex Business Rules #2 into Star Schemas: Conformed Dimensions, Junk Tables, Helper Tables, Factless Facts.]
• Quality routines
• Cross-system dependencies
• Source data filtering
• In-process data manipulation
• High risk of incorrect data aggregation
• Larger system = increased impact
• Often re-engineered at the SOURCE
• History can be destroyed (completely re-computed)
21. #1 Cause of BI Initiative Failure
Let’s take a look at one example…
22. Re-Engineering
[Diagram: Business Rules Data Flow (Mapping). Current Sources (Sales, Finance) supply Customer and Customer Transactions data to a Source Join. A NEW SYSTEM supplying Customer Purchases must be worked into the same mapping.]
23. Federated Star Schema Inhibiting Agility
[Chart: Effort & Cost (Low to High) against Time, from Start through the point where the Maintenance Cycle Begins. The curve rises with Data Mart 1, Data Mart 2, and Data Mart 3.]
Changing and adjusting conformed dimensions causes an exponential rise in the cost curve over time.
RESULT: Business builds their own Data Marts!
The main driver for this is the maintenance costs, and re-engineering of the existing system which occurs for each new “federated/conformed” effort. This increases delivery time, difficulty, and maintenance costs.
24. What are the ROOT Causes?
The root causes of RE-ENGINEERING are:
26. Deformed Dimensions
• Deformity: The URGE to continue “slamming data” into an existing conformed dimension until it simply cannot sustain any further changes. The result: a deformed dimension and a HUGE re-engineering cost / nightmare.
Business Wants a Change!
Business said: Just add that to the existing Dimension, it will be easy, right?
[Diagram: each Business Change produces a new dimension version (V1, V2, V3) fed by an increasingly Complex Load: 90 days, $125k; then 120 days, $200k; then 180 days, $275k.]
Re-Engineering the Load Processes EACH TIME!
27. Dimension-itis
• DimensionItis: Incurable Disease. The symptoms are the creation of new dimensions because the cost and time to conform existing dimensions with new attributes rises beyond the business's ability to pay…
Business Says: Avoid the re-engineering costs, just “copy” the dimensions and create a new one for OUR department…
What happens when we (IT) give in to this? …
28. Result: Silo Data Junkyards!
• Business Says: Take the dimension you have, copy it, and change it… This should be cheap and easy, right?
Business Change: To Modify Existing Star = 180 days, $275k
[Diagram: the First Star (Fact_ABC, Fact_DEF, Fact_PDQ, Fact_MYFACT) has a customer dimension with columns Customer_ID, Customer_Name, Customer_Addr, Customer_Addr1, Customer_City, Customer_State, Customer_Zip, Customer_Phone, Customer_Tag, Customer_Score, Customer_Region, Customer_Stats, Customer_Type. SALES, FINANCE, and MARKETING each hold their own modified copy of it.]
• SALES: We built our own because IT costs too much…
• FINANCE: We built our own because IT took too long…
• MARKETING: We built our own because we needed customized dimension data…
29. Accountability In Question?
Corporate Fraud Accountability: Title XI consists of seven sections. Section 1101 recommends a name for this title as “Corporate Fraud Accountability Act of 2002”. It identifies corporate fraud and records tampering as criminal offenses and joins those offenses to specific penalties. It also revises sentencing guidelines and strengthens their penalties. This enables the SEC to temporarily freeze large or unusual payments.
[Diagram: Source 1, Source 2, and Source 3 flow into Staging, where Business Rules Change Data!, and then into the HR Mart, Sales Mart, and Finance Mart.]
Are changes to data ON THE WAY IN to the EDW equivalent to records tampering?
30. How do we “fix” this?
Answer: Move the business rules downstream, AND no longer be forced to conform dimensions.
32. Move the Business Rules Downstream
• No “Conforming” of Dimensions on the way in to the EDW
• Hold on… We do distinguish between HARD and SOFT business rules…
33. Hard & Soft Business Rules
Hard Business Rules
• Data Domain Alignment (Data Type Matching)
• Normalization (where necessary)
• System Column Computation
Soft Business Rules
• Any requirement the business user states that, when applied, CHANGES the data or CHANGES the meaning of the data (the grain or interpretation)
• Simple example that will knock your socks off!
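The distinction can be sketched in a few lines. This is an illustration under stated assumptions, not the presenter's example: the function names, the column names, and the "minor/major" classification are all hypothetical. The hard rules only align data types and add system columns; the soft rule adds an interpretation, so it belongs downstream of the EDW.

```python
from datetime import datetime, timezone
from decimal import Decimal


def apply_hard_rules(raw: dict, record_source: str, load_date: datetime) -> dict:
    """Hard rules: align data types and add system columns; values are unchanged."""
    return {
        "customer_id": str(raw["customer_id"]),
        "amount": Decimal(raw["amount"]),     # string -> decimal, same value
        "load_date": load_date,               # system column
        "record_source": record_source,       # system column
    }


def apply_soft_rule(row: dict) -> dict:
    """Soft rule (hypothetical): the business classifies amounts under 10 as 'minor'.

    This changes the meaning of the data, so it is applied on the way OUT
    of the EDW, not on the way in.
    """
    size_class = "minor" if row["amount"] < Decimal("10") else "major"
    return {**row, "size_class": size_class}


# What gets loaded into the EDW: the source values, typed and stamped.
loaded = apply_hard_rules(
    {"customer_id": "C1", "amount": "7.50"},
    record_source="SALES.CRM",
    load_date=datetime(2012, 1, 15, tzinfo=timezone.utc),
)

# What a data mart sees after the soft rule is applied downstream.
mart_row = apply_soft_rule(loaded)
```

Because the soft rule runs after loading, the business can change the threshold later without touching what is stored in the EDW.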
34. Progressive Agility and Responsiveness of IT
[Chart: Effort & Cost (Low to High) against Time, from Start through the point where the Maintenance Cycle Begins. Stages marked on the curve: Initial DV Build Out, Foundational Base Built, New Functional Areas Added.]
Re-Engineering does NOT occur with a Data Vault Model. This keeps costs down, and maintenance easy. It also reduces complexity of the existing architecture.
35. NO Re-Engineering
[Diagram: Current Sources (Sales, Finance) are copied into Stage tables for Customer and Customer Transactions, which load Hub Customer, Hub Acct, Hub Product, and Link Transaction in the Data Vault. The NEW SYSTEM supplying Customer Purchases gets its own Stage Copy and loads alongside them.]
NO IMPACT!!! NO RE-ENGINEERING!
37. Key: Flexibility
Adding new components to the EDW has NEAR ZERO impact to:
• Existing Loading Processes
• Existing Data Model
• Existing Reporting & BI Functions
• Existing Source Systems
• Existing Star Schemas and Data Marts
38. Case In Point:
The flexibility of the Data Vault Model allowed them to merge 3 companies in 90 days. That is ALL systems, ALL DATA!
39. Key: Scalability in Architecture
Scaling is easy; it's based on the following principles:
• Hub and spoke design
• MPP Shared-Nothing Architecture
• Scale Free Networks
40. Case In Point:
Result of scalability was to produce a Data Vault model that scaled to 3 Petabytes in size, and is still growing today!
41. Key: Scalability in Team Size
You should be able to SCALE your TEAM as well!
With the Data Vault methodology, you can:
Scale your team when desired, at different points in the project!
42. Case In Point:
(Dutch Tax Authority)
Result of scalability was to increase ETL developers for each new source system, and reassign them when the system was completely loaded to the Data Vault.
43. Key: Productivity
Increasing Productivity requires a reduction in complexity.
The Data Vault Model simplifies all of the following:
• ETL Loading Routines
• Real-Time Ingestion of Data
• Data Modeling for the EDW
• Enhancing and Adapting for Change to the Model
• Ease of Monitoring, managing and optimizing processes
44. Case in Point:
Result of Productivity was: 2 people in 2 weeks merged 3 systems, built a full Data Vault EDW, 5 star schemas and 3 reports.
These individuals generated:
• 90% of the ETL code for moving the data set
• 100% of the Staging Data Model
• 75% of the finished EDW data Model
• 75% of the star schema data model
45. The Competing Bid?
The competition bid this with 15 people and 3 months to completion, at a cost of $250k! (They bid a very complex system.)
Our total cost? $30k and 2 weeks!
48. When NOT to use the Data Vault
A review of some reasons why not to use a Data Vault Model
49. When NOT to Use the Data Vault
• You have:
o a small set of point solution requirements
o a very short time-frame for delivery
o To use the data one-time, then throw it away
o a single source system, single source application
o A single business analyst in the entire company
• You do NOT have:
o audit requirements forcing you to keep history
o multiple data center consolidation efforts
o near-real-time to worry about
o massive batch data to integrate
o External data feeds outside your control
o Requirements to do trend analysis of all your data
o Pain – that forces you to reengineer every time you ask for a change to your current data warehousing systems
51. Business Keys = Ontology
Business Keys should be arranged in an ontology in order to learn the dependencies of the data set.
NOTE: Different Ontologies represent different business views of the data!
[Diagram: business key hierarchy: Firm Name, Drug Listing, Product Number, Dose Form Code, NDA Application #, Drug Label Code, Patent Number, Patent Use Code.]
52. Associations = Ontological Hooks
[Diagram: Firm Name, Drug Listing, Product Number, and NDA Application # are connected by the associations “Firms Generate Product Listings”, “Firms Manufacture Products”, and “Listings for Products are in NDA Applications”.]
Business Keys are associated by many linking factors; these links comprise the associations in the hierarchy.
53. Descriptors = Context
[Diagram: the keys Firm Name, Drug Listing, and Product Number and the associations “Firms Generate Product Listings” and “Firms Manufacture Products” gain descriptors: Firm Locations, Listing Formulation, Product Ingredients, and Start & End of manufacturing.]
Descriptors provide the context at a specific point in time. They are the warehousing portion of the Data Vault.
54. A working Example
National Drug Codes + Orange Book of Drug Patent Applications
http://www.accessdata.fda.gov/scripts/cder/ndc/default.cfm
http://www.fda.gov/Drugs/InformationOnDrugs/ucm129662.htm
55. Hub Table Structures
SQN = Sequence (insertion order)
LDTS = Load Date (when the Warehouse first sees the data)
RSRC = Record Source (System + App where the data ORIGINATED)
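A hub with these three system columns might be loaded as sketched below. This is a minimal example using Python's built-in `sqlite3` module, not the loading pattern shown in the live demo: the table name `hub_customer`, the business key column `customer_num`, and the sample keys and record sources are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE hub_customer (
        sqn          INTEGER PRIMARY KEY,    -- SQN: sequence (insertion order)
        customer_num TEXT NOT NULL UNIQUE,   -- the business key (hypothetical name)
        ldts         TEXT NOT NULL,          -- LDTS: when the warehouse first saw the key
        rsrc         TEXT NOT NULL           -- RSRC: system + app where the key originated
    )
    """
)


def load_hub(keys, ldts, rsrc):
    """Insert only the business keys the hub has not seen before."""
    for key in keys:
        conn.execute(
            "INSERT OR IGNORE INTO hub_customer (customer_num, ldts, rsrc) "
            "VALUES (?, ?, ?)",
            (key, ldts, rsrc),
        )
    conn.commit()


load_hub(["C-100", "C-200"], "2012-01-15", "SALES.CRM")
load_hub(["C-200", "C-300"], "2012-01-16", "FINANCE.GL")  # C-200 is already known

rows = conn.execute(
    "SELECT sqn, customer_num, ldts, rsrc FROM hub_customer ORDER BY sqn"
).fetchall()
```

The second load leaves the existing `C-200` row alone, so its load date and record source still describe the first time the warehouse saw that key.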
57. Satellite Table Structures
SQN = Sequence (parent identity number)
LDTS = Load Date (when the Warehouse first sees the data)
LEDTS = End of lifecycle for superseded record
RSRC = Record Source (System + App where the data ORIGINATED)
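The role of LEDTS can be sketched with a small satellite load, again using Python's built-in `sqlite3` module. The table and column names and the record source are assumptions, and setting LEDTS equal to the load date of the superseding record is one possible convention, not necessarily the one used in the demo. The names and load dates come from the PIT table example later in the deck.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sat_customer_name (
        sqn   INTEGER NOT NULL,  -- SQN: parent hub's identity number
        ldts  TEXT NOT NULL,     -- LDTS: when the warehouse first saw this version
        ledts TEXT,              -- LEDTS: set once a newer version supersedes it
        rsrc  TEXT NOT NULL,     -- RSRC: system + app where the data originated
        name  TEXT,              -- descriptive attribute
        PRIMARY KEY (sqn, ldts)
    )
    """
)


def load_satellite(sqn, name, ldts, rsrc):
    """Add a new version only when the descriptive data has changed."""
    current = conn.execute(
        "SELECT name FROM sat_customer_name WHERE sqn = ? AND ledts IS NULL",
        (sqn,),
    ).fetchone()
    if current is not None and current[0] == name:
        return  # nothing changed, so no new row
    # End-date the superseded record, then insert the new current record.
    conn.execute(
        "UPDATE sat_customer_name SET ledts = ? WHERE sqn = ? AND ledts IS NULL",
        (ldts, sqn),
    )
    conn.execute(
        "INSERT INTO sat_customer_name (sqn, ldts, ledts, rsrc, name) "
        "VALUES (?, ?, NULL, ?, ?)",
        (sqn, ldts, rsrc, name),
    )
    conn.commit()


load_satellite(1, "Dan L", "2000-10-14", "CRM")
load_satellite(1, "Dan Linedt", "2000-11-01", "CRM")
load_satellite(1, "Dan Linstedt", "2000-12-31", "CRM")
load_satellite(1, "Dan Linstedt", "2001-01-05", "CRM")  # unchanged: no new row

history = conn.execute(
    "SELECT ldts, ledts, name FROM sat_customer_name ORDER BY ldts"
).fetchall()
```

Only the newest row has a NULL LEDTS, so the current version of the descriptive data can be found without comparing load dates.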
58. In Review…
• Data Vault is…
o A Data Warehouse Model & Methodology
o Hub and Spoke Design
o Simple, Easy, Repeatable Structures
o Comprised of Standards, Rules & Procedures
o Made up of Ontological Metadata
o AUTOMATABLE!!!
• Hubs = Business Keys
• Links = Associations / Transactions
• Satellites = Descriptors
60. History Teaches Us…
The EDW is designed to handle TODAY’S relationship. As soon as history is loaded, it breaks the model!
[Diagram: the relationship between Hub Portfolio and Hub Customer changes over time. 10 years ago: Portfolio M to Customer 1. Today: Portfolio 1 to Customer M. 5 years from now: Portfolio M to Customer M.]
This situation forces re-engineering of the model, load routines, and queries!
61. History Teaches Us…
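[Diagram: Hub Portfolio and Hub Customer are joined through LNK Cust-Port, with each hub 1-to-M to the link. The same structure holds the relationship as it was 10 years ago (Portfolio M to Customer 1), as it is today (Portfolio 1 to Customer M), and as it will be 5 years from now (Portfolio M to Customer M).]
This design is flexible, and handles past, present, and future relationship changes with NO RE-ENGINEERING!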
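Why the link absorbs these changes can be sketched with a list of key pairs. This is an illustration with made-up keys and a hypothetical helper, `cardinality`: because each link row is just one (customer, portfolio) association, the cardinality is a property of the data that has been loaded, not of the table structure.

```python
from collections import defaultdict

# Each link row is one (customer, portfolio) association; keys are made up.
lnk_cust_port = [
    ("CUST-1", "PORT-A"),   # one customer holding two portfolios ...
    ("CUST-1", "PORT-B"),   # ... each portfolio with a single customer
    ("CUST-2", "PORT-B"),   # later, PORT-B is shared by two customers
]


def cardinality(link_rows):
    """Describe the customer:portfolio relationship the rows currently exhibit."""
    ports_per_cust = defaultdict(set)
    custs_per_port = defaultdict(set)
    for cust, port in link_rows:
        ports_per_cust[cust].add(port)
        custs_per_port[port].add(cust)
    many_ports = any(len(ports) > 1 for ports in ports_per_cust.values())
    many_custs = any(len(custs) > 1 for custs in custs_per_port.values())
    left = "M" if many_custs else "1"     # customers per portfolio
    right = "M" if many_ports else "1"    # portfolios per customer
    return f"{left}:{right}"


before = cardinality(lnk_cust_port[:2])   # only the first two rows loaded
after = cardinality(lnk_cust_port)        # all three rows loaded
```

Loading the third row changes the observed relationship from one-to-many to many-to-many without any change to the link's structure or to the code that loads it.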
62. Applying the Data Vault to Global DW
[Diagram: a Base EDW of hubs, links, and satellites is created in Corporate. Regional efforts extend it with their own hubs, links, and satellites: Manufacturing EDW in China, Planning in Brazil, Financials in USA.]
64. PIT Table Architecture
Satellite: Point In Time
• Primary Key: PARENT SEQUENCE, LOAD DATE
• {Satellite 1 Load Date}
• {Satellite 2 Load Date}
• {Satellite 3 Load Date}
• {…}
• {Satellite N Load Date}
[Diagram: Hub Customer, Hub Product, and Hub Order are joined by Link Line Item, which has a Satellite Line Item. Hubs with several satellites (Sat 1 to Sat 4) also carry a PIT Sat.]
65. PIT Table Example
SAT_CUST_CONTACT_NAME
SQN  LOAD_DTS    NAME
1    10-14-2000  Dan L
1    11-01-2000  Dan Linedt
1    12-31-2000  Dan Linstedt
SAT_CUST_CONTACT_CELL
SQN  LOAD_DTS    CELL
1    10-14-2000  999-555-1212
1    10-15-2000  999-111-1234
1    10-16-2000  999-252-2834
1    10-17-2000  999.257-2837
1    10-18-2000  999-273-5555
SAT_CUST_CONTACT_ADDR
SQN  LOAD_DTS    ADDR
1    08-01-2000  26 Prospect
1    09-29-2000  26 Prosp St.
1    12-17-2000  28 November
1    01-01-2001  26 Prospect St
PIT table (LOAD_DTS is the Snapshot Date)
SQN  LOAD_DTS    SAT_NAME_LDTS  SAT_CELL_LDTS  SAT_ADDR_LDTS
1    08-01-2000  NULL           NULL           08-01-2000
1    09-01-2000  NULL           NULL           08-01-2000
1    10-01-2000  NULL           NULL           09-29-2000
1    11-01-2000  11-01-2000     10-18-2000     09-29-2000
1    12-01-2000  11-01-2000     10-18-2000     09-29-2000
1    01-01-2001  12-31-2000     10-18-2000     01-01-2001
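The PIT rows in this example can be derived mechanically: for each snapshot date, take from each satellite the latest load date on or before the snapshot. The sketch below reproduces the example using the load dates from the three satellites; the function and variable names are assumptions, and `None` stands in for NULL.

```python
from datetime import date

# Load dates per satellite for customer SQN 1, taken from the example tables.
sat_load_dates = {
    "name": [date(2000, 10, 14), date(2000, 11, 1), date(2000, 12, 31)],
    "cell": [date(2000, 10, 14), date(2000, 10, 15), date(2000, 10, 16),
             date(2000, 10, 17), date(2000, 10, 18)],
    "addr": [date(2000, 8, 1), date(2000, 9, 29), date(2000, 12, 17),
             date(2001, 1, 1)],
}


def pit_row(snapshot, load_dates_by_sat):
    """For each satellite, pick the latest load date on or before the snapshot."""
    row = {"snapshot": snapshot}
    for sat, load_dates in load_dates_by_sat.items():
        eligible = [d for d in load_dates if d <= snapshot]
        row[sat] = max(eligible) if eligible else None  # None plays the role of NULL
    return row


# Monthly snapshots, matching the snapshot dates in the example.
snapshots = [date(2000, m, 1) for m in (8, 9, 10, 11, 12)] + [date(2001, 1, 1)]
pit = [pit_row(s, sat_load_dates) for s in snapshots]
```

A query for a given snapshot can then join each satellite on an exact (SQN, load date) match from the PIT row instead of searching each satellite for the right version.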
66. Bridge Table Architecture
Satellite: Bridge
• Primary Key: UNIQUE SEQUENCE, LOAD DATE
• {Hub 1 Sequence #}
• {Hub 2 Sequence #}
• {Hub 3 Sequence #}
• {Link 1 Sequence #}
• {Link 2 Sequence #}
• {…}
• {Link N Sequence #}
• {Hub 1 Business Key}
• {Hub 2 Business Key}
• {…}
• {Hub N Business Key}
[Diagram: Hub Seller, Hub Product, and Hub Parts are connected by two Links, each with its own Satellite; satellites Sat 1 to Sat 4 hang off the hubs. The Bridge spans the hubs and links.]
69. Where To Learn More
• The Technical Modeling Book:
http://LearnDataVault.com/
• On-Line Training direct from me:
http://LearnDataVault.com/training
• The Discussion Forums & events:
http://LinkedIn.com – Data Vault Discussions
• Contact me:
http://DanLinstedt.com - web site
DanLinstedt@gmail.com - email
70. LIVE DEMONSTRATION
Physical Demonstration, Loading Processes and Execution
Editor's notes
You’re not the first, nor will you be the last one to use it. Some of the world's biggest companies are implementing Data Vaults, from Daimler Motors to Lockheed Martin to the Department of Defense. JPMorgan and Chase used the Data Vault model to merge 3 companies in 90 days!
Beginning: 5 advanced ETL developers. By the 1st month, they had 5 advanced and 15 basic/intro. By the 6th month, they had 5 advanced, but 50 basic. By the end of the 8th month they went to production with 10 MF sources, and their team size was 12 people (5 advanced, 7 basic, for support).