This is a presentation I gave at the IRM UK conference in November 2009. It covers the steps you should take to build your Data Vault, and gives an overview of why re-engineering creeps into your existing silo solutions.
3. A bit about me… Author, Inventor, Speaker, and part-time photographer… 25+ years in the IT industry. Worked in DoD, US Gov’t, Fortune 50, and so on… Find out more about the Data Vault: http://www.youtube.com/LearnDataVault and http://LearnDataVault.com ; full profile on http://www.LinkedIn.com/dlinstedt
9. Complete with Best Practices for BI/DW. Business Keys Span / Cross Lines of Business. Functional Area: Sales, Contracts, Planning, Delivery, Finance, Operations, Procurement.
14. The PAIN!! Issues in Current EDW Projects 10/6/2011 LearnDataVault.com 7
15. EDW Architecture: Generation 1. [Diagram; labels: Enterprise BI Solution; (batch); Sales, Finance, Contracts; Staging; (EDW) Star Schemas; Complex Business Rules; Conformed Dimensions; Junk Tables; Helper Tables; Factless Facts; Staging + History; Complex Business Rules + Dependencies]
16. Kick-Starting Data Warehousing. 1. HR asks IT to build the FIRST Data Warehouse / Prototype. 2. IT Says… OK: $125k and 90 days… 3. HR Says: Great! Get Started.
19. The PAIN is RIGHT HERE!! 1. Contracts sees success, wants the same for their systems. 2. IT Says… Ok, but… It won’t be $125k and 90 days… Because we have to “merge it” with HR, it will be $250k and 180 days. 3. Contracts Says: Ouch! That’s not reasonable, but we need it, so go ahead…
20. And HERE…. Finance, Sales, and Marketing want in…. IT Says… Ok, but… It won’t be $250k and 180 days… Because we have to “merge it” with HR and Contracts, it will be $350k and 250 days. And this continues…. Business Says... “Can’t you just make a copy of the Star Schema, and give me my own for cheaper & less time?”
21. Silo Building / IT Non-Agility. First Star. SALES: We built our own because IT costs too much. FINANCE: We built our own because IT took too long. MARKETING: We built our own because we need customized dimension data. Why is this happening? What’s causing this problem?
25. Why Re-Engineering? Adding fields to a conformed dimension…. Adding fields to a shared fact…. Changing code to match new business rules… These require adding/changing fields in target tables! They require Re-Engineering!
26. Other Pains? Dimension-Itis? IT Non-Agility? Deformed Dimensions? What about the “data” you don’t see? What about the “BAD” data left in the source systems?
27. The Solution: Go the Data Vault Route!
28. EDW Architecture: Generation 2. [Diagram; labels: SOA; Enterprise BI Solution; Star Schemas; (real-time); (batch); Sales, Finance, Contracts; Staging; DV EDW; Error Marts; Report Collections; Business Rules Downstream! (the Lens Filter)]
36. Start new phase. 1. Fast Load & Fast Integration. 3. IT Implementation of Business Rules.
37. What are the Facts, Jack? Generation 1 EDWs tried to provide “One version of the truth.” Generation 2 (Data Vaults) provide… “One version of the facts, for each point in time.”
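One way to picture "one version of the facts, for each point in time" is a history that is only ever appended to, so any past state can be read back by date. The sketch below is a minimal illustration under assumptions: the `customer_sat` rows, the attribute names, and the `facts_as_of` helper are all hypothetical and do not come from the slides.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical history for one business key: (load_date, attributes),
# kept sorted by load_date. Rows are appended, never updated in place.
customer_sat = [
    (datetime(2009, 1, 1), {"name": "Acme", "city": "Denver"}),
    (datetime(2009, 6, 1), {"name": "Acme", "city": "Boston"}),
    (datetime(2009, 11, 1), {"name": "Acme Corp", "city": "Boston"}),
]


def facts_as_of(history, as_of):
    """Return the attributes in effect at `as_of`, or None if nothing was loaded yet."""
    load_dates = [load_date for load_date, _ in history]
    # Index of the first row loaded strictly after `as_of`.
    idx = bisect_right(load_dates, as_of)
    return history[idx - 1][1] if idx else None
```

Calling `facts_as_of(customer_sat, datetime(2009, 7, 4))` returns the row loaded on 1 June 2009, while a date before the first load returns `None`.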
38. Business Gap Analysis. [Diagram; labels: The way the business perceives its business to be running; Gap Analysis; Operational Reports; Dynamic Cubes (Data Marts); The way the source systems see the business running]
41. Where’s the Solution? Re-Engineering. Handle Changes Wherever… Whenever… with EASE!
42. The Three Vehicles… Pros and Cons of the Modeling Methodologies
43. 3rd Normal Form Pros/Cons as an EDW. PROS (as 3NF): Many to many linkages; Handle lots of information; Tightly integrated information; Highly structured; Conducive to near-real time loads; Relatively easy to extend. CONS (as EDW): Time driven PK issues; Parent-child complexities; Cascading change impacts; Difficult to load; Not conducive to BI tools; Not conducive to drill-down; Difficult to architect for an enterprise; Not conducive to spiral/scope controlled implementation; Physical design usually doesn’t follow business processes.
44. Star Schema Pros/Cons as an EDW. PROS (as Data Mart): Good for multi-dimensional analysis; Subject oriented answers; Excellent for aggregation points; Rapid development / deployment; Great for some historical storage. CONS (as EDW): Not cross-business functional; Use of junk / helper tables; Trouble with VLDW; Unable to provide integrated enterprise information; Can’t handle ODS or exploration warehouse requirements; Trouble with data explosion in near-real-time environments; Trouble with updates to type 2 dimension primary keys; Trouble with late arriving data in dimensions to support real-time arriving transactions; Not granular enough information to support real-time data integration.
45. Data Vault Pros/Cons as an EDW. PROS (as EDW): Supports near-real time and batch feeds; Supports functional business linking; Extensible / flexible; Provides rapid build / delivery of star schemas; Supports VLDB / VLDW; Designed for EDW; Supports data mining and AI; Provides granular detail; Incrementally built. CONS (as EDW): Not conducive to OLAP processing; Requires business analysis to be firm; Introduces many join operations.
46. The Three Vehicles… Which would you use to win a race? Which would you use to move a house? Would you adapt the truck and enter a race with Porsches and expect to win?
47. #1 complaint about DV architecture: So you want to deal with Joins, do you?
49. Not enough rows being queried (the overhead of starting the threads takes longer than an original scan). End Result? The DV scales to the petabyte levels when necessary…
50. Mathematics Behind the Data Vault Model. *** The Data Vault is BACKED by Mathematical Principles *** Parallel versus sequential execution models; Set Logic; I/O Bandwidth & Throughput; Compression (for query performance gains); Process Repeatability (tuning & predictability measurements); RAM versus electromagnetic disk (Solid-State Drives are not measured). http://osl.cs.uiuc.edu/docs/IPDPS-TR04/TCA_TR04.pdf
51. Know when to hold ‘em, know when to fold ‘em: When to use DV, and when not…
60. My incoming data sets don’t change. I Say… That’s wonderful, don’t fix what isn’t broken. Have a nice day, oh, but call me when or if you ever run into these problems…
71. Step 1: Identify your business processes, followed by your business keys (that are used to identify the data that flows through the business processes). ** NOTE: Along the way, document your assumptions, document your reasons for choosing keys and modeling designs, and develop a list of questions to be answered by business users…
72. Step 2: Identify the issues/problems that might be carried with the identified business keys, annotate the risks, and mitigate each one.
73. Step 3: Identify the units of work, the associations (LINK tables), where keys combine to form a notion, a concept, and a relationship.
74. Step 4: Identify the descriptive data that belongs to SINGLE Hub Keys; ensure that the data doesn’t represent or rely on a relationship.
75. Step 5: Identify the Satellite data that depends on relationships, and move it to the appropriate LINK table. HINT: If you “want” to put a Foreign Key in a Satellite, you have a clear sign that the Satellite is in the WRONG place, and needs to be assigned to a LINK table rather than a HUB.
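The modeling steps so far (business keys into Hubs, units of work into Links, descriptive data into Satellites on the right parent) can be sketched as a small schema. This is a minimal illustration using Python's built-in `sqlite3`, not the author's reference design: every table and column name is hypothetical, and the `load_dts` and `record_source` columns are an assumption based on common Data Vault convention rather than anything stated on the slides.

```python
import sqlite3

# Hypothetical names throughout; load_dts and record_source are assumed
# housekeeping columns, not taken from the slides.
DDL = """
CREATE TABLE hub_customer (
    customer_sqn  INTEGER PRIMARY KEY,   -- surrogate key
    customer_num  TEXT NOT NULL UNIQUE,  -- business key (Step 1)
    load_dts      TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE hub_product (
    product_sqn   INTEGER PRIMARY KEY,
    product_code  TEXT NOT NULL UNIQUE,
    load_dts      TEXT NOT NULL,
    record_source TEXT NOT NULL
);
-- Step 3: a Link is a unit of work where business keys combine.
CREATE TABLE link_order_line (
    order_line_sqn INTEGER PRIMARY KEY,
    customer_sqn   INTEGER NOT NULL REFERENCES hub_customer (customer_sqn),
    product_sqn    INTEGER NOT NULL REFERENCES hub_product (product_sqn),
    load_dts       TEXT NOT NULL,
    record_source  TEXT NOT NULL,
    UNIQUE (customer_sqn, product_sqn)
);
-- Step 4: descriptive data that belongs to a single Hub key.
CREATE TABLE sat_customer (
    customer_sqn  INTEGER NOT NULL REFERENCES hub_customer (customer_sqn),
    load_dts      TEXT NOT NULL,
    name          TEXT,
    city          TEXT,
    record_source TEXT NOT NULL,
    PRIMARY KEY (customer_sqn, load_dts)
);
-- Step 5: quantity depends on the customer/product relationship,
-- so its Satellite hangs off the Link, not off either Hub.
CREATE TABLE sat_order_line (
    order_line_sqn INTEGER NOT NULL REFERENCES link_order_line (order_line_sqn),
    load_dts       TEXT NOT NULL,
    quantity       INTEGER,
    record_source  TEXT NOT NULL,
    PRIMARY KEY (order_line_sqn, load_dts)
);
"""


def build_model():
    """Create the sketch schema in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    return conn


def table_names(conn):
    """List the tables that were created, in alphabetical order."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]
```

Note that each Satellite carries exactly one reference, to its own parent. If `sat_customer` needed a `product_sqn` column, that would be the sign from the HINT in Step 5 that the data belongs on a Link Satellite instead.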
76. Step 6: Scope the model down to a manageable chunk. Implement the first two Hubs, Hub Satellites, and first Link. BUILD IN INCREMENTS!
77. Step 7: Set up the key generation load routines, set up the staging area, and begin loading data.
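A key generation load routine for a Hub might look like the sketch below: it inserts only business keys the Hub has not seen before and lets the database assign the surrogate key. This is an illustration under assumptions, using `sqlite3` with a hypothetical `hub_customer` table; the slides do not prescribe a specific mechanism or column layout.

```python
import sqlite3


def load_hub(conn, business_keys, load_dts, record_source):
    """Insert only unseen business keys into the Hub; return how many were new."""
    before = conn.execute("SELECT COUNT(*) FROM hub_customer").fetchone()[0]
    conn.executemany(
        # OR IGNORE skips keys that already exist (UNIQUE on customer_num),
        # so existing rows keep their original surrogate key and load date.
        "INSERT OR IGNORE INTO hub_customer (customer_num, load_dts, record_source) "
        "VALUES (?, ?, ?)",
        [(key, load_dts, record_source) for key in sorted(set(business_keys))],
    )
    after = conn.execute("SELECT COUNT(*) FROM hub_customer").fetchone()[0]
    return after - before


# Hypothetical Hub layout: surrogate key generated by SQLite's rowid.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hub_customer ("
    " customer_sqn INTEGER PRIMARY KEY,"
    " customer_num TEXT NOT NULL UNIQUE,"
    " load_dts TEXT NOT NULL,"
    " record_source TEXT NOT NULL)"
)

first_load = load_hub(conn, ["C-100", "C-200", "C-100"], "2009-11-01", "CRM")
second_load = load_hub(conn, ["C-200", "C-300"], "2009-11-02", "CRM")
```

Running the routine twice with overlapping keys is safe: the second load adds only `C-300`, and `C-200` keeps the load date from its first arrival.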
78. Step 8: Review any “truncation” errors, or any data-type conversion problems, fix the staging area, and remove duplicates.
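The staging review in this step can be sketched as a single pass that flags values too wide for their target column and drops exact duplicate rows. The function name, the row format (a list of dicts), and the `max_lengths` mapping are all assumptions made for illustration.

```python
def review_staging(rows, max_lengths):
    """Split staged rows into clean rows and rejects.

    rows: list of dicts, one per staged record (hypothetical format).
    max_lengths: column name -> width of the target column.
    Returns (clean, rejects), where rejects is a list of (row, reason).
    """
    clean, rejects, seen = [], [], set()
    for row in rows:
        # Columns whose value would not fit in the target column.
        too_long = [
            column
            for column, width in max_lengths.items()
            if row.get(column) is not None and len(str(row[column])) > width
        ]
        # Exact-duplicate detection: same columns with the same values.
        fingerprint = tuple(sorted(row.items()))
        if too_long:
            rejects.append((row, "would truncate: " + ", ".join(too_long)))
        elif fingerprint in seen:
            rejects.append((row, "duplicate"))
        else:
            seen.add(fingerprint)
            clean.append(row)
    return clean, rejects


staged = [
    {"customer_num": "C-100", "city": "Denver"},
    {"customer_num": "C-100", "city": "Denver"},
    {"customer_num": "C-200", "city": "A very long city name"},
]
clean, rejects = review_staging(staged, {"customer_num": 10, "city": 12})
```

Keeping the rejects with a reason, rather than silently discarding them, gives you something concrete to take back to the staging definitions when fixing column widths.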
79. Step 9: Begin loading the Data Vault. Load all Hubs, then all Hub Satellites, then all Links, and finish with all Link Satellites.
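The load order stated on this slide can be written down as a small planning helper, so a batch job never tries to load a Link before the Hubs it references. The table names and the `kind` labels are hypothetical; only the ordering itself comes from the slide.

```python
# Order from the slide: Hubs, Hub Satellites, Links, Link Satellites.
LOAD_ORDER = ("hub", "hub_satellite", "link", "link_satellite")


def plan_load(tables):
    """Return table names in load order (alphabetical within each kind).

    tables: dict mapping table name -> one of the kinds in LOAD_ORDER.
    """
    unknown = sorted(name for name, kind in tables.items() if kind not in LOAD_ORDER)
    if unknown:
        raise ValueError("unknown table kind for: " + ", ".join(unknown))
    return sorted(tables, key=lambda name: (LOAD_ORDER.index(tables[name]), name))


# Hypothetical model, deliberately listed out of order.
model = {
    "sat_order_line": "link_satellite",
    "link_order_line": "link",
    "sat_customer": "hub_satellite",
    "hub_product": "hub",
    "hub_customer": "hub",
}
load_plan = plan_load(model)
```

Here `load_plan` starts with the two Hubs and ends with `sat_order_line`.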
80. Step 10: Reconcile the Data Vault to the source system, then build a first data mart from the results. Bring business value FAST!
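A first reconciliation check could simply compare the business keys found in the source extract with the keys that landed in the Hub. The sketch below assumes both sides are available as plain lists of keys; the function name and the report fields are hypothetical, and a real reconciliation would likely also compare row counts and attribute values.

```python
def reconcile_keys(source_keys, hub_keys):
    """Compare distinct business keys in the source with those in the Hub."""
    source, hub = set(source_keys), set(hub_keys)
    return {
        "missing_from_vault": sorted(source - hub),  # in source, never loaded
        "not_in_source": sorted(hub - source),       # loaded, but source lacks it
        "reconciled": source == hub,
    }


report = reconcile_keys(
    source_keys=["C-100", "C-200", "C-200", "C-300"],
    hub_keys=["C-100", "C-200"],
)
```

In this example `report["missing_from_vault"]` is `["C-300"]`, which points back at the load routine or the staging rejects before any data mart is built on top.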
88. What did we learn? We often deal with more than 1 system at a time… this was a lab with only one model. We didn’t have any business requirements that we might need to answer questions, but doesn’t that reflect real life? The data set is extremely dirty (you never have that in your systems, right?). Time Zone based data can be a problem. Lack of metadata causes integration issues and modeling decisions.
89. The Experts Say… “The Data Vault is the optimal choice for modeling the EDW in the DW 2.0 framework.” (Bill Inmon) “The Data Vault is foundationally strong and exceptionally scalable architecture.” (Stephen Brobst) “The Data Vault is a technique which some industry experts have predicted may spark a revolution as the next big thing in data modeling for enterprise warehousing....” (Doug Laney)
90. More Notables… “This enables organizations to take control of their data warehousing destiny, supporting better and more relevant data warehouses in less time than before.” (Howard Dresner) “[The Data Vault] captures a practical body of knowledge for data warehouse development which both agile and traditional practitioners will benefit from.” (Scott Ambler)
91. Where To Learn More. The Technical Modeling Book: http://LearnDataVault.com ; The Discussion Forums & events: http://LinkedIn.com (Data Vault Discussions); Contact me: http://DanLinstedt.com (web site), DanLinstedt@gmail.com (email); World wide User Group (Free): http://dvusergroup.com