SQL Bits X

Temporal Snapshot Fact
Table
A new approach to data snapshots
Davide Mauri
dmauri@solidq.com
Mentor - SolidQ
EXEC sp_help 'Davide Mauri'
• Microsoft SQL Server MVP
• Works with SQL Server since 6.5, on BI since
  2003
• Specialized in Data Solution Architecture,
  Database Design, Performance Tuning, BI
• President of UGISS (Italian SQL Server UG)
• Mentor @ SolidQ
• Twitter: mauridb
• Blog: http://sqlblog.com/blogs/davide_mauri
Agenda
• The problem
• The possible «classic» solutions
• Limits of «classic» solutions
• A temporal approach
• Implementing the solution
• Technical & Functional Challenges
     (and their resolution)
• Conclusions
The Problem
The request
• Our customer (an insurance company)
  needed a BI solution that would allow
  them to do some “simple” things:
  – Analyze the value/situation of all insurances
    at a specific point in time
  – Analyze all the data from 1970 onwards
  – Analyze data on a daily basis
    • Absolutely no weekly or monthly aggregation
The environment
• On average they have 3,000,000 insurance-
  related documents that need to be stored.
  – Each day.
• Data is mainly stored in a DB2 mainframe
  – Data is (sometimes) stored in a “temporal”
    fashion
    • For each document a new row is created each
      time something changes
    • Each row has “valid_from” and “valid_to” columns
Let's do the math
• To keep the daily snapshot
  – of 3,000,000 documents
  – for 365 days
  – for 40 years


• A fact table of nearly 44 BILLION rows
  would be needed
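  (Quick check: 3,000,000 × 365 × 40 = 43,800,000,000 rows, i.e. roughly 44 billion.)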
The possible solutions
• How can the problem be solved?
  – PDW and/or brute force were not an option


• Unfortunately none of the three known
  approaches can work here
  – Transactional Fact Table
  – Periodic Snapshot Fact Table
  – Accumulating Snapshot Fact Table


• A new approach is needed!
The classic solutions
 And why they cannot be used here
Transaction Fact Table
• One row per fact occurring at a certain point
  in time
• We don't have facts for every day. For
  example, for the document 123, no changes
  were made on 20 August 2011
  – Analysis on that date would not show the
    document 123
  – “LastNonEmpty” aggregation cannot help since
    dates don't have children
  – MDX can come to the rescue here but…
     • Very high complexity
     • Potential performance problems
Accumulating Snapshot F.T.
• One fact row per entity in the fact table
  – Each row has a lot of date columns that store
    all the changes that happen during the entity's
    lifetime
  – Each date column represents a point in time
    where something changed
     • The number of changes must be known in
       advance
• With so many date columns it can be
  difficult to manage the “Analysis Date”
Periodic Snapshot Fact Table
• The fact table holds a «snapshot» of all
  data for a specific point in time
      – That's the perfect solution, since doing a
       snapshot each day would completely solve
       the problem
     – Unfortunately there is just too much data
     – We need to keep the idea but reduce the
       amount of data



A temporal approach
 And how it can be applied to BI
Temporal Data
• The snapshot fact table is a good starting
  point
  – Keeping a snapshot of a document for each
    day is just a waste of space
  – And also a big performance killer
• Changes to a document don't happen too
  often
  – Fact tables should be temporal in order to
    avoid data duplication
  – Ideally, each fact row needs a “valid_from”
    and “valid_to” column
Temporal Data
• With a Temporal Snapshot Fact Table
  (TSFT), each row represents a fact that
  occurred during a time interval, not at a
  point in time.
• Time Intervals will be right-open in order
  to simplify calculations:
            [valid_from, valid_to)
      Which means:
            valid_from <= x < valid_to
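
As a minimal sketch (table and column names here are hypothetical, not taken from the customer's model), a TSFT and the point-in-time filter over a right-open interval could look like this:

  -- Hypothetical TSFT: one row per document version, valid over [valid_from, valid_to)
  CREATE TABLE dbo.FactInsuranceSnapshot
  (
      document_id   int   NOT NULL,
      valid_from    date  NOT NULL,  -- inclusive
      valid_to      date  NOT NULL,  -- exclusive
      insured_value money NOT NULL
  );

  -- "Snapshot" of all documents as they were on a given Analysis Date
  DECLARE @AnalysisDate date = '2009-08-10';

  SELECT document_id, insured_value
  FROM dbo.FactInsuranceSnapshot
  WHERE valid_from <= @AnalysisDate
    AND @AnalysisDate < valid_to;   -- right-open: valid_from <= x < valid_to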
Temporal Data
• Now that the idea is set, we have to solve
  several problems:
  – Functional Challenges
      • SSAS doesn't support the concept of intervals, nor
        the between filter
      • The user just wants to define the Analysis Date,
        and doesn't want to deal with ranges
  – Technical Challenges
     • Source data may come from several tables and
       has to be consolidated in a small number of TSFT
     • The concept of interval changes the rules of the
       “join” game a little bit…
Technical Challenges
• Source data may come from tables with
  different designs:
  – temporal tables
  – non-temporal tables
• Temporal Tables
  – Ranges from two different temporal tables
    may overlap
• Non-temporal tables
  – There may be some business rules that
    require us to «break» the existing ranges
Technical Challenges
• Before starting to solve the problem, let's
  generalize it a bit (in order to be able to study
  it theoretically)
• Imagine a company with this environment
  (a minimal T-SQL sketch of these tables follows this list):
  – There are Items & Sub Items
  – Items have 1 or more Sub Items
     • Each Sub Item has a value in money
     • Value may change during its lifetime
  – Customers have to pay an amount equal to the
    sum of all Sub Items of an Item
     • Customers may pay when they want during the year,
       as long as the total amount is paid by the end of the
       year
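
To make the following examples concrete, here is a minimal T-SQL sketch of this toy model; all table and column names are hypothetical and simply mirror the description above.

  -- Hypothetical toy schema (not the customer's real model)
  CREATE TABLE dbo.Item
  (
      ItemId     int     NOT NULL,
      ValidFrom  date    NOT NULL,   -- inclusive
      ValidTo    date    NOT NULL,   -- exclusive (right-open interval)
      StatusCode char(1) NOT NULL,   -- e.g. 'A', 'B', 'C' as in the timeline
      PRIMARY KEY (ItemId, ValidFrom)
  );

  CREATE TABLE dbo.SubItem
  (
      ItemId     int   NOT NULL,
      SubItemId  int   NOT NULL,
      ValidFrom  date  NOT NULL,
      ValidTo    date  NOT NULL,
      Amount     money NOT NULL,     -- value may change during the Sub Item's lifetime
      PRIMARY KEY (ItemId, SubItemId, ValidFrom)
  );

  -- Non-temporal table: a payment happens at a single point in time
  CREATE TABLE dbo.Payment
  (
      ItemId      int   NOT NULL,
      PaymentDate date  NOT NULL,
      Amount      money NOT NULL
  );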
Data Diving
Technical Challenges
• When joining data, keys are no longer
  sufficient
• Time Intervals must also be managed
  correctly, otherwise incorrect results may
  be created
• Luckily time operators are well-known (but
  not yet implemented)
  – overlaps, contains, meets, etc.
Technical Challenges
• CONTAINS (or DURING) Operator:

  (diagram: interval i1 = [b1, e1) fully contains interval i2 = [b2, e2))

• Means:
            b1 <= b2 AND e1 >= e2
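
As a sketch, the predicate can be checked directly in T-SQL; the two intervals below are made-up values, not data from the talk.

  -- Does i1 = [b1, e1) contain i2 = [b2, e2)?
  DECLARE @b1 date = '2009-01-01', @e1 date = '2010-01-01';  -- i1
  DECLARE @b2 date = '2009-03-15', @e2 date = '2009-07-01';  -- i2

  SELECT CASE WHEN @b1 <= @b2 AND @e1 >= @e2
              THEN 'i1 CONTAINS i2'
              ELSE 'i1 does not contain i2'
         END AS Result;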
Technical Challenges
• OVERLAPS Operator:
  (diagram: intervals i1 = [b1, e1) and i2 = [b2, e2) partially overlap)

• Means:
            b1 <= e2 AND b2 <= e1
• Since we're using right-open intervals, a
  little change is needed here:
            b1 < e2 AND b2 < e1
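
Again as a sketch, with made-up interval values:

  -- Do the right-open intervals i1 = [b1, e1) and i2 = [b2, e2) overlap?
  DECLARE @b1 date = '2009-01-01', @e1 date = '2009-06-01';  -- i1
  DECLARE @b2 date = '2009-05-15', @e2 date = '2009-09-01';  -- i2

  SELECT CASE WHEN @b1 < @e2 AND @b2 < @e1
              THEN 'i1 OVERLAPS i2'
              ELSE 'no overlap'
         END AS Result;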
Technical Challenges
• Before going any further it's best we start
  to «see» the data we're going to use


  (timeline from Nov 2008 to Mar 2010: Item 1 moves through states A, B, C;
   Item 1.1 through A, B, C, D, B, C; Item 1.2 through A, B, C, D, E, C;
   Item 2 and Item 2.1 through A, B, C, D, C)
JOINs & Co.
Technical Challenges
• As you have seen, even with just three
  tables, the joins are utterly complex!

• But bearing this in mind, we can now find
  an easier, logically equivalent, solution.

• Drawing a timeline and visualizing the
  result helps a lot here.
Technical Challenges


  (diagram: the range of one Item, the ranges of its Sub Item and the
   payment dates are drawn on a single timeline; the «Summarized» Status at
   the bottom splits it into six numbered intervals)

  Intervals represent a time in which nothing changed; the other way round,
  dates represent the points in time at which something changed!
Technical Challenges
• In simpler words, we need to find the
  minimum set of time intervals for each
  entity
  – The concept of "granularity" must be changed
    in order to take into account the idea of time
    intervals
  – All fact tables, for each stored entity, must
    have the same granularity for the time interval
     • interval granularity must be the same “horizontally”
       (across tables)
     • interval granularity must be specific for each entity
       (across rows: “vertically”)
Technical Challenges
• The problem can be solved by
  «refactoring» data.
• Just like refactoring code, but working on
  data
  – Rephrased idea: “disciplined technique for
    restructuring existing temporal data, altering
    its internal structure without changing its
    value”
     • http://en.wikipedia.org/wiki/Code_refactoring
Technical Challenges
• Here's how (a T-SQL sketch follows each step):
• Step 1
  – For each item, for
    each temporal table,
    union all the distinct
    ranges
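
A sketch of Step 1, using the hypothetical Item and SubItem tables introduced earlier:

  -- Step 1 (sketch): all distinct ranges per item, from every temporal table
  SELECT ItemId, ValidFrom, ValidTo FROM dbo.Item
  UNION                              -- UNION (not UNION ALL) removes duplicate ranges
  SELECT ItemId, ValidFrom, ValidTo FROM dbo.SubItem;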
Technical Challenges
• Step 2
  – For each item, take all the
    distinct:
  – Dates from the previous step
  – Dates that are known to
    «break» a range
  – The first day of each year
    used in ranges
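
Step 2 could then be sketched like this; the payment dates act as the dates that «break» a range, and the last line approximates the first-day-of-year rule (a full implementation would cover every year spanned by the ranges):

  -- Step 2 (sketch): all distinct dates per item that will bound the new ranges
  SELECT ItemId, ValidFrom AS BreakDate FROM dbo.Item
  UNION SELECT ItemId, ValidTo          FROM dbo.Item
  UNION SELECT ItemId, ValidFrom        FROM dbo.SubItem
  UNION SELECT ItemId, ValidTo          FROM dbo.SubItem
  UNION SELECT ItemId, PaymentDate      FROM dbo.Payment   -- dates that break a range
  UNION SELECT ItemId, DATEFROMPARTS(YEAR(ValidFrom), 1, 1) FROM dbo.Item;  -- 1 Jan (simplified)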
Technical Challenges
• Step 3
  – For each item, create
    the new ranges, doing
    the following steps
     • Order the rows by date
     • Take the «k-th» row and
       the «k+1-th» row
     • Create the range
            [Date(k), Date(k+1))

  The new LEAD operator (available from SQL
  Server 2012) is exactly what we need!
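
Step 3 could be sketched with LEAD; ItemBreakDates is a hypothetical name for the result of Step 2.

  -- Step 3 (sketch): pair each date with the next one to form [Date(k), Date(k+1))
  SELECT ItemId,
         BreakDate                                   AS ValidFrom,
         LEAD(BreakDate) OVER (PARTITION BY ItemId
                               ORDER BY BreakDate)   AS ValidTo
  FROM dbo.ItemBreakDates;   -- hypothetical: output of Step 2
  -- the last row per item has ValidTo = NULL: discard it or close it with a sentinel date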
Technical Challenges
• When the time granularity is the day, the
  right-open range helps a lot here.
  – E.g. let's say that at some point, for an entity,
    you have the following dates that have to be
    turned into ranges:
Technical Challenges
• With closed ranges, intervals can be
  ambiguous, especially single-day intervals:




  – The meaning is not explicitly clear and we
    would have to take care of it, making the ETL
    phase more complex.
  – It's better to make everything explicit
Technical Challenges
• So, to avoid ambiguity, with a closed range
  the resulting set of intervals also needs to
  take time into account:




  – What granularity should be used for time?
     • Hour? Minute? Microsecond?
  – Any choice is just a waste of space
  – The Date or Int data types cannot be used
Technical Challenges
• With a right-open interval everything is
  easier:




  – The Date or Int data types are enough
  – No time component is used, so there is no need to deal with it
Technical Challenges
• Step 4
  – «Explode» all the temporal source tables so
    that all data will use the same (new) ranges,
    using the CONTAINS operator:
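
Step 4 could then be sketched as a join of each temporal source table against the new ranges, using the CONTAINS predicate seen earlier; NewItemRange is a hypothetical name for the output of Step 3.

  -- Step 4 (sketch): rewrite SubItem rows onto the new, common ranges
  SELECT s.ItemId,
         s.SubItemId,
         r.ValidFrom,                   -- the new, finer range
         r.ValidTo,
         s.Amount
  FROM dbo.SubItem      AS s
  JOIN dbo.NewItemRange AS r            -- hypothetical: ranges produced in Step 3
    ON  r.ItemId = s.ItemId
    AND s.ValidFrom <= r.ValidFrom      -- the original SubItem interval
    AND s.ValidTo   >= r.ValidTo;       -- CONTAINS the new range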
Technical Challenges
• The original Range:


Becomes the equivalent
set of ranges:



If we had chosen a daily
snapshot, 425 rows would have been
generated, not only 12!
(That's a 35:1 ratio!)
Technical Challenges
• Now we have all the source data
  conforming to a common set of intervals
  – This means that we can just join tables
    without having to do complex temporal joins


• All fact tables will also have the same
  interval granularity

• We solved all the technical problems!
SSIS In Action
Functional Challenges
• SSAS doesn't support time intervals but…

• Using some imagination we can say that
  – An interval is made of 1 to “n” dates
  – A date belongs to 1 to “n” intervals


• We can model this situation with a
  Many-to-Many relationship!
Functional Challenges
• The «interval dimension» will then be
  hidden and the user will just see the
  ÂŤAnalysis DateÂť dimension

• When such date is selected, all the fact
  rows that have an interval containing that
  date will be selected too, giving us the
  solution we‟re looking for 
Functional Challenges

  (diagram: the user picks 10 Aug 2009 on the Date dimension; through the
   factless Date-DateRange table, SSAS finds the two DateRange members that
   contain that date and reads all the (three) fact table rows related to
   those ranges)
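
A relational sketch of that lookup (dimension, bridge and fact table names are hypothetical, mirroring the diagram) could be:

  -- Resolve the Analysis Date to its ranges via the factless bridge,
  -- then read every fact row attached to those ranges
  DECLARE @AnalysisDate date = '2009-08-10';

  SELECT f.*
  FROM dbo.FactTemporalSnapshot  AS f
  JOIN dbo.DimDateRange          AS r ON f.DateRangeKey = r.DateRangeKey
  JOIN dbo.FactlessDateDateRange AS b ON b.DateRangeKey = r.DateRangeKey
  JOIN dbo.DimDate               AS d ON d.DateKey      = b.DateKey
  WHERE d.CalendarDate = @AnalysisDate;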
All Together!
The final recap
 And a quick recipe to make everything easier
Steps for the DWH
• For each entity:
  – Get all “valid_from” and “valid_to” columns
    from all tables
  – Get all the dates that will break intervals from
    all tables
  – Take all the gathered dates one time (remove
    duplicates) and generate new intervals
  – «Unpack» original tables in order to use the
    newly generated intervals
Steps for the CUBE
• Generate the usual Date dimension
  – Generate a «Date Range» dimension
  – Generate a Factless Date-DataRange table
  – Generate the fact table with reference to the
    Date Range dimension
  – Create the M:N relationship
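
A sketch of how the factless Date-DateRange table could be loaded, assuming a standard dbo.DimDate calendar table and the hypothetical names used above:

  -- One bridge row per date covered by each range (right-open on the range side)
  INSERT INTO dbo.FactlessDateDateRange (DateKey, DateRangeKey)
  SELECT d.DateKey, r.DateRangeKey
  FROM dbo.DimDateRange AS r
  JOIN dbo.DimDate      AS d
    ON  d.CalendarDate >= r.ValidFrom
    AND d.CalendarDate <  r.ValidTo;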
Conclusion and Improvements
 Some ideas for the future
Conclusion and improvements
• The defined pattern can be applied each time
  a daily analysis is needed and data is not
  additive
  – So each time you need to “snapshot” something

• The approach can also be used if source
  data is not temporal, as long as you can turn
  it into a temporal format

• Performance can be improved by
  partitioning the cube by year or even by
  month
Conclusion and improvements
• In the end it is a very simple solution!

• Everything you've seen is the summary of
  several months of work by the Italian
  team. Thanks guys!
Questions?
Thanks!




© 2012 SolidQ



Speaker notes

  1. From BOL: “The member value is evaluated as the value of its last child along the time dimension that contains data”
  2. http://www.kimballgroup.com/html/designtipsPDF/DesignTips2002/KimballDT37ModelingPipeline.pdf
  3. Show SQL queries
  4. Show the SSIS solution to explode the tables
  5. Show why the M2M solution works (using T-SQL queries); show the OLAP views; show the SSAS solution
  6. Review the complete solution from start to end (maybe using *much* more data?)