This is a presentation I gave in 2006 for Bill Inmon. The presentation covers Data Vault and how it integrates with Bill Inmon's DW2.0 vision. This is focused on the business intelligence side of the house.
If you want to use these slides, please include the notice: (C) Dan Linstedt, all rights reserved, http://LearnDataVault.com
2. A bit about me… Author, Inventor, Speaker, and part-time photographer… 25+ years in the IT industry. Worked in DoD, US Gov’t, Fortune 50, and so on… Find out more about the Data Vault: http://www.youtube.com/LearnDataVault and http://LearnDataVault.com. Full profile on http://www.LinkedIn.com/dlinstedt
3. Agenda: Defining the Needs for the Data Vault; DW2.0 Architecture; DW2.0 Drivers for Data Modeling; Divergence of Data Models over Time; Data Vault in DW2.0; Defining the Data Vault; What does one look like?; Modeling in DW2.0; Applying Data Vault to Global DW2.0; Applying Data Vault to Time-Value DW2.0; Compliance in DW2.0; Applying Data Vault to System of Record; The Paradox of DW2.0; Volume, Latency, Complexity, Normalization, and Transformation ability. 10/5/2011 Do Not Duplicate Without Written Permission
15. DW2.0 Drivers for Data Modeling [Diagram: Technical Drivers and Business Drivers around the Data Model; labels: Flexibility, Compliance, Volume, Frequency, Understandability, Granularity.] Data models are one of the main integration points between technical and business drivers. Business keys drive understandability and granularity. Normalization drives flexibility and frequency of load. Raw data sets in the EDW/ADW drive compliance and volume.
16. Divergence of Data Models over Time Data models (both logical and physical) have diverged from business drivers and direction over time. They have been driven towards physical improvements instead of towards business improvements. The Data Vault Architecture drives data modeling back to the business side of the house.
17. Agenda (repeated from slide 3). Image is from What The Bleep Do We Know?
18. Defining the Data Vault The Data Vault is a detail-oriented, historical-tracking, and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent, and adaptable to the needs of the enterprise. It is a data model that is architected specifically to meet the needs of today’s enterprise data warehouses. Source: "Defining the Data Vault", TDAN.com article.
21. [Diagram labels: Satellite; Sat Customer; F(x) Sat.] The impact of linking disparate systems together is inside the shaded area.
22. Modeling in DW2.0 Bill says: DW2.0 must be brought down to a very finite level of detail. The starting point for DW2.0 is the modeling process. The data model applies to the integrated sector, the near-line sector, and the archival sector. Data warehouses are built incrementally. The Data Vault specializes in: providing finite grain at the lowest level possible; mapping business process models to data models; existing in all sectors simultaneously without changes; flexibility and managing change so that impacts are not a mile wide and 10 miles deep.
23. Elements in a Data Vault Hub: a unique list of business keys, tracked by the first time the warehouse saw them appear. Link: relationships between business keys, also representing a grain shift or a hierarchical roll-up. Satellite: data over time, granular and descriptive about the business key; also set up according to type of information and rate of change.
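The three elements above can be sketched as plain Python structures. This is a minimal illustration only, not the author's physical design: the class names, field names, and the in-memory dictionary are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class HubRow:
    business_key: str        # e.g. a customer number (hypothetical)
    load_date: datetime      # first time the warehouse saw the key


@dataclass(frozen=True)
class LinkRow:
    business_keys: tuple     # the hub keys this relationship connects
    load_date: datetime


@dataclass(frozen=True)
class SatelliteRow:
    parent_key: str          # hub (or link) key being described
    load_date: datetime      # when this version of the data arrived
    attributes: tuple        # (name, value) pairs of descriptive data


class Hub:
    """Unique list of business keys, tracked by first appearance."""

    def __init__(self):
        self.rows = {}

    def load(self, business_key, load_date):
        # Insert only when the key is new, so the first-seen date is kept.
        if business_key not in self.rows:
            self.rows[business_key] = HubRow(business_key, load_date)
        return self.rows[business_key]


hub_customer = Hub()
hub_customer.load("CUST-1", datetime(2006, 1, 1))
hub_customer.load("CUST-1", datetime(2006, 2, 1))  # seen again: no change
```

Loading the same key twice leaves one hub row carrying the earlier date, which is the "first time the warehouse saw them appear" behavior the slide describes.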
24. Applying the Data Vault to Global DW2.0 [Diagram: Hubs, Links, and Satellites connected across locations. Labels: Manufacturing EDW in China; Planning in Brazil; Base EDW Created in Corporate; Financials in USA.]
25. Applying the Data Vault to Time-Value DW2.0 [Table: Satellite Data Over Time, Rows 1 to 4; values not preserved.] Satellite entities in the Data Vault house data over time. They are split by type of information and rate of change. This is an example set of data for a customer name satellite.
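Since the slide's table values did not survive, here is a small sketch of what "data over time" in a customer name satellite could look like, with a point-in-time lookup. All rows, dates, names, and the function name `name_as_of` are hypothetical and only illustrate the idea.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical customer-name satellite rows for one business key.
# Each change is a new row (load_date, name); rows are never updated.
customer_name_sat = [
    (date(2000, 1, 1), "Abc Trading"),
    (date(2002, 6, 15), "ABC Trading Co."),
    (date(2004, 3, 1), "ABC Trading Company"),
    (date(2005, 9, 30), "ABC Holdings"),
]


def name_as_of(rows, as_of):
    """Return the name that was current on `as_of`, or None if none yet."""
    load_dates = [load_date for load_date, _ in rows]
    position = bisect_right(load_dates, as_of)  # rows loaded on or before as_of
    return rows[position - 1][1] if position else None


current = name_as_of(customer_name_sat, date(2003, 1, 1))
```

Because history is kept as separate rows, any past state of the customer name can be reconstructed without touching existing data.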
26. Batch and Real-Time Data Arrival All inserts, all the time. [Diagram: incoming transaction columns Transaction ID, Date Stamp, Customer, Account #, Amount, Type; target tables Hub Customer, Hub Acct, Link Transaction, Sat Transaction, Sat Customer, Sat Acct; Batch Load of Customer Info and Acct Data in a 3, 6 or 12 hr load window.]
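The "all inserts, all the time" idea can be sketched as follows. This is an assumption-laden illustration, not the author's loader: the table names follow the slide, but the dictionary layout and the functions `new_vault`, `load_transaction`, and `load_customer_batch` are invented for the example.

```python
from datetime import datetime


def new_vault():
    # Hubs map business key -> first-seen load date; the rest are append-only.
    return {
        "hub_customer": {}, "hub_acct": {},
        "link_transaction": [], "sat_transaction": [],
        "sat_customer": [], "sat_acct": [],
    }


def load_transaction(vault, txn, load_date):
    """Real-time arrival: insert keys, relationship, and details. No updates."""
    vault["hub_customer"].setdefault(txn["customer"], load_date)
    vault["hub_acct"].setdefault(txn["account"], load_date)
    vault["link_transaction"].append(
        (txn["transaction_id"], txn["customer"], txn["account"], load_date))
    vault["sat_transaction"].append(
        (txn["transaction_id"], load_date,
         txn["date_stamp"], txn["amount"], txn["type"]))


def load_customer_batch(vault, customer, info, load_date):
    """Batch arrival hours later: descriptive data is simply inserted."""
    vault["hub_customer"].setdefault(customer, load_date)
    vault["sat_customer"].append((customer, load_date, info))


vault = new_vault()
txn = {"transaction_id": "T1", "date_stamp": datetime(2011, 10, 5, 9, 0),
       "customer": "C1", "account": "A1", "amount": 25.0, "type": "debit"}
load_transaction(vault, txn, datetime(2011, 10, 5, 9, 0))
load_customer_batch(vault, "C1", {"name": "Example Customer"},
                    datetime(2011, 10, 5, 21, 0))
```

The transaction lands before the batch delivers the customer details, and nothing already loaded has to be updated when the batch arrives, which is the contrast with the star schema case on the next slide.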
27. Star Schema Real-Time Data Issues Updates are REQUIRED! [Diagram: incoming transaction columns Transaction ID, Date Stamp, Customer, Account #, Amount, Type; target tables Dimension Customer, Fact Transaction, Dimension Account; Batch Load of Customer Info and Acct Data in a 3, 6 or 12 hr load window.] Cleansing and quality must occur before the data can reach the target tables, and both introduce unwanted latency!
28. Compliance in DW2.0 [Diagram labels: Source Systems; EDW / ADW Data Vault; Data Marts; Data Delivery; User or Auditor; Changes to Source Information; Raw Integration; Business Rules; True Marts; Error Mart; Quality; Continuous Data Improvement; Master Data (Operational); Direction of Information Flow.] Points listed on the slide: Raw Detail = auditable. Loads in Real-Time or in Batch. Integrated by Business Key. Flexible, allows business changes (with little to no impact). No delay in loading data. Data type conformity. Semantic Integration.
29. Applying the Data Vault to System Of Record [Diagram labels: Source Systems; Normalized EDW; Master Data or Conformed Dimensions; SOR Definitions 1, 2, and 3.] SOR 1: Data capture; data produced by system algorithms. SOR 2: Raw, detailed, integrated data over time, integrated by horizontal (functional) business key. Auditable. SOR 3: Current view of the business: merged, quality cleansed, single copy, single source; feeds operational systems.
30. DW2.0 Paradoxes DW2.0 incorporates unstructured, semi-structured, real-time, and batch data, plus global views, all of which drive volumes of data. Volume causes latency in transformation. Volume is directly proportional to transformation complexity. Real-time data arrival is inversely proportional to complexity and volume. Time for “quality, cleansing, and transformation” on the way in to the EDW diminishes as near-real-time is approached, or as massive volumes of batch data meet a shrinking batch window. Transformation can destroy the auditability and compliance of the EDW / ADW.
31. DW2.0 Paradoxes - Imagery [Diagram nodes: DW2.0; Real-Time Transactions; Unstructured Data; Low-Level Grain; Low Latency; Volume; Merging, Quality, Cleansing; Data Model Denormalization; Data Model Normalization & Raw Details; Auditability & Compliance. Arrow labels: Drives, Pushes, Increases, Fights, Requires, Inhibits, Provides.]
32. DW2.0 Paradox Hypothesis As we approach near-real-time, the ability to transform data and “wait” for parent dependencies decreases and data decay rates increase, which can cause data death if the data is not processed in time. Normalization of the data model increases flexibility and scalability. The closer we get to near-real-time, the more normalized the data model in the EDW/ADW must become. In order to process high volumes of batch data extremely fast, the “business transformations” must be removed from the load stream of the EDW.
33. Data Vault Volumetrics Volumetrics (10% null data). [Table not preserved.] Upon initial investigation, the 12-month growth rate for new customers is 197.4 MB per year. Now let’s factor in the deltas.
34. Data Vault Growth Volumetrics (10% null data), delta growth only. Original Dimension: 497.16 MB per year. New Data Vault: 317.03 MB per year.
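To put the two delta growth figures side by side, the difference can be worked out directly. The numbers come from the slide; the variable names are just for illustration, and the underlying volumetrics table is not available here.

```python
# Delta growth figures quoted on the slide (MB per year).
dimension_mb_per_year = 497.16
data_vault_mb_per_year = 317.03

# Absolute and relative difference in yearly delta growth.
saving_mb = dimension_mb_per_year - data_vault_mb_per_year
saving_pct = 100 * saving_mb / dimension_mb_per_year

print(f"{saving_mb:.2f} MB per year less, about {saving_pct:.1f}% lower")
```

On these figures the Data Vault grows by roughly 180 MB per year less than the original dimension, a bit over a third lower.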
35. Data Vault VS Dimension Growth How does the extensive growth rate affect queries?
36. Summarization Business: Lack of a single view of a customer, product, service, etc. Lack of visibility into ALL information across the enterprise. Competition does it better, faster, cheaper. Unable to identify and forecast business trends and their impacts. WHERE’S THE KNOWLEDGE? OR IS IT JUST ALL DATA? Technical: Near-Real-Time (Active). Huge Data Volumes. Massive Data Dis-Integration. Spread-Marts. Convergence of Operational and Strategic Questions. Duplication of data in the ODS, Warehouse, and Data Marts! Dimension-itis!! ODS Ulcer! Fact Table Granularity. JUNK tables, Helper Tables.
37. Where To Learn More The Technical Modeling Book: http://LearnDataVault.com The Discussion Forums & events: http://LinkedIn.com (Data Vault Discussions) Contact me: http://DanLinstedt.com (web site), DanLinstedt@gmail.com (email) Worldwide User Group (free): http://dvusergroup.com