A company of Daimler AG
LECTURE @DHBW: DATA WAREHOUSE
PART III: ETL AND DB SPECIFICS
ANDREAS BUCKENHOFER, DAIMLER TSS
ABOUT ME
https://de.linkedin.com/in/buckenhofer
https://twitter.com/ABuckenhofer
https://www.doag.org/de/themen/datenbank/in-memory/
http://wwwlehre.dhbw-stuttgart.de/~buckenhofer/
https://www.xing.com/profile/Andreas_Buckenhofer2
Andreas Buckenhofer
Senior DB Professional
andreas.buckenhofer@daimler.com
Since 2009 at Daimler TSS
Department: Big Data
Business Unit: Analytics
DAIMLER TSS. IT EXCELLENCE: COMPREHENSIVE, INNOVATIVE, CLOSE.
We're a specialist and strategic business partner for innovative IT Solutions within Daimler –
not just another supplier!
As a 100% subsidiary of Daimler, we live the culture of excellence and aspire to take an
innovative and technological lead.
With our outstanding technological and methodical competence we are a competent provider of
services that help those who benefit from them to stand out from the competition. When it
comes to demanding IT questions we create impetus, especially in the core fields car IT and
mobility, information security, analytics, shared services and digital customer
experience.
Data Warehouse / DHBWDaimler TSS GmbH 3
TSS 2020: ALWAYS ON THE MOVE.
LOCATIONS
Daimler TSS China
Hub Beijing
6 Employees
Daimler TSS Malaysia
Hub Kuala Lumpur
38 Employees
Daimler TSS India
Hub Bangalore
16 Employees
Daimler TSS Germany
6 Locations
More than 1000 Employees
Ulm (Headquarters)
Stuttgart Area
Böblingen, Echterdingen,
Leinfelden, Möhringen
Berlin
After the end of this lecture you will be able to
Understand concepts behind ETL
WHAT YOU WILL LEARN TODAY
LOGICAL STANDARD DATA WAREHOUSE ARCHITECTURE
[Figure: internal and external data sources (e.g. OLTP systems) feed the data warehouse backend, which consists of the Staging Layer (Input Layer), Integration Layer (Cleansing Layer), Core Warehouse Layer (Storage Layer), Aggregation Layer and Mart Layer (Output Layer / Reporting Layer); the Mart Layer serves the frontend. Metadata Management, Security and the DWH Manager incl. Monitor accompany the layers. Four "?" marks are placed in the diagram.]
Extract – Transform - Load
Other term: Data integration (better, more neutral)
ETL PROCESS
• capture and copy data from source systems (e.g. operational systems)
• many different types of sources
• Relational, hierarchical DBMSs
• Flat files
• Other internal/external sources
TASKS OF THE ETL PROCESS - EXTRACT
• Filter data
• Integrate data
• Check and cleanse data
TASKS OF THE ETL PROCESS - TRANSFORM
• Original meaning: Fast load into staging area
• General meaning: Loading data into staging area or another layer
TASKS OF THE ETL PROCESS - LOAD
The term ETL is often used for data integration in general (covering both ETL and ELT)
But: when ELT is mentioned explicitly, it is meant as distinct from ETL
ETL VS ELT
[Figure: two data flow diagrams, each with a Source DB and a Target DB; in one the data flows through an ETL Server, in the other an ELT Server coordinates the load.]
ETL VS ELT
ETL
• Data is transferred to the ETL server and transferred back to the DB. High network bandwidth required
• Transformations are performed in the ETL server
• Proprietary code is executed in the ETL server
• Typically used for source to target transfer, compute-intensive transformations, small amounts of data
ELT
• Data remains in the DB except for cross-database loads (e.g. source to target)
• Transformations are performed (in the source or) in the target
• Generated code, e.g. SQL, PL/SQL, SQLT
• Typically used for high amounts of data
ETL/ELT TOOL VS MANUAL ETL/ELT
ETL Tool
• Informatica, Talend, Oracle ODI, etc.
• Separate license
• Workflow, error handling, and restart/recovery functionality included
• Impact analysis and where-used (lineage) functionality available
• Faster development, easier maintenance
• Additional (tool) know-how required
Manual ETL
• SQL, PL/SQL, SQLT, etc.
• No additional license
• Workflow, error handling, and restart/recovery functionality must be implemented manually
• Impact analysis and where-used (lineage) functionality difficult
• Slower development, more difficult maintenance
• Know-how often available
ETL/ELT TOOL VS MANUAL ETL/ELT
[Figure: services of an ETL tool]
• Extract services: Connectors, Sorter
• Load services: Connector, Sorter, Bulk Loader
• Operations management services: Scheduler, Control, Repository Management
• Data Profiling services: Source analysis
• Data Quality services: Data cleansing
• Data Transformation and Integration services: Data mapping, Business rules, Slowly Changing Dimensions, Datatype conversion, Lookups
• Further components: Job Monitoring, Auditing, Error Handling, Security
MAPPING - INFORMATICA
[Screenshot: Informatica mapping from Source to Target with Filter and Lookup]
MAPPING WITH TRANSFORMATIONS - INFORMATICA
[Screenshot: Informatica mapping with Sorter, Aggregator Transformation and Union Transformation]
WORKFLOW - INFORMATICA
[Screenshot: Informatica workflow with a decision & coordination step and sessions containing mappings]
JOB MONITORING - INFORMATICA
Extracts from source systems
Initial extract for setting up the data warehouse
• Initial Load
Periodical extracts for adding new/changed information to the data
warehouse
• Incremental Load
Question: How to determine what is new or what has changed in the source
systems?
Task of „monitoring“
MONITORING (DATA CHANGE DETECTION)
Discovery of all changes vs. determining the net effect at extract/load
time only
• Example: an attribute value can be changed in two ways:
• by one update operation
• by one delete and one insert operation
The net effect of both is the same
However, history information is lost if the net effect is recorded only
MONITORING: NET EFFECT OF CHANGES
Which techniques can be used to identify changes in a source system
(RDBMS)?
• E.g. in OLTP system
• new products are inserted
• customer address changes
• Product is deleted because it is out of stock
How would you identify such changes? List advantages / disadvantages of
possible solutions
Think about making changes in the source system. Think also about other
solutions without any change in the source system.
EXERCISE MONITORING
Monitoring techniques depend on the characteristics of the data sources
The following techniques assume a modern relational DBMS
Types of techniques
Based on DBMS
• Trigger-based
• Log-based discovery
• Replication techniques
Controlled by application
• Timestamp-based discovery
• Snapshot-based discovery
MONITORING TECHNIQUES
Active monitoring mechanisms
Based on (database) triggers
• Example:
• If new record is inserted in sales transaction table then insert transaction id and
timestamp in change table
Advantage:
• Triggers do not change operational applications
Disadvantage:
• Performance impact on operational systems if triggers are used extensively
• Triggers have to be implemented for every table in the source systems
TRIGGER-BASED
Sample Trigger Code, Oracle
CREATE [OR REPLACE] TRIGGER <trigger_name>
{BEFORE|AFTER} {INSERT|DELETE|UPDATE}
ON <table_name>
[REFERENCING [NEW AS <new_row_name>] [OLD AS <old_row_name>]]
[FOR EACH ROW [WHEN (<trigger_condition>)]]
<trigger_body>
Trigger is created for each source table in OLTP DB and stores
insert/update/delete changes in a “log/journal table”
• trigger body contains insert statements into log/journal table
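The journal-table idea can be tried out with a minimal sketch. It uses Python's built-in sqlite3 module instead of Oracle, so the trigger syntax differs slightly from the template above; the table names sales and sales_journal and the operation codes I/U/D are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (transaction_id INTEGER PRIMARY KEY, amount REAL);

-- Hypothetical journal table: one row per captured change
CREATE TABLE sales_journal (
    transaction_id INTEGER,
    operation      TEXT,
    changed_at     TEXT DEFAULT CURRENT_TIMESTAMP
);

CREATE TRIGGER sales_ins AFTER INSERT ON sales FOR EACH ROW
BEGIN
    INSERT INTO sales_journal (transaction_id, operation)
    VALUES (NEW.transaction_id, 'I');
END;

CREATE TRIGGER sales_upd AFTER UPDATE ON sales FOR EACH ROW
BEGIN
    INSERT INTO sales_journal (transaction_id, operation)
    VALUES (NEW.transaction_id, 'U');
END;

CREATE TRIGGER sales_del AFTER DELETE ON sales FOR EACH ROW
BEGIN
    INSERT INTO sales_journal (transaction_id, operation)
    VALUES (OLD.transaction_id, 'D');
END;
""")

# The application only touches the sales table; the triggers fill the journal.
conn.execute("INSERT INTO sales VALUES (1, 9.99)")
conn.execute("UPDATE sales SET amount = 12.50 WHERE transaction_id = 1")
conn.execute("DELETE FROM sales WHERE transaction_id = 1")

journal = conn.execute(
    "SELECT transaction_id, operation FROM sales_journal ORDER BY rowid"
).fetchall()
print(journal)
```

The journal keeps every single change, including the delete, so the ETL process can read sales_journal instead of scanning the source table.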
TRIGGER-BASED
Log-based discovery
Also often referenced as CDC (Change Data Capture)
Usage of database transaction logs to determine changes
• DBMSs write transaction logs in order to be able to undo partially executed
transactions
• This information can be used to determine all changes
• Log reader identifies insert, update, delete, truncates and writes the changes as
inserts into staging layer
Transaction Log files can be transferred to other systems to avoid additional
load on source systems
LOG-BASED
LOG-BASED (SAMPLE PRODUCT ARCHITECTURE IIDR)
[Figure: an IIDR replication engine on the source datastore reads the transaction logs of the source OLTP DB; an IIDR replication engine on the DWH datastore writes into the DWH DB, which holds the Staging Layer, Core Layer and Mart Layer. The frontend serves standard reports and ad hoc reports.]
Replication techniques
Data replication
• Target tables not necessarily on local system
• Uses typically Transaction Logs
• Log reader identifies insert, update, delete, truncates and writes the changes into
replicated tables (insert remains insert, update remains update, etc)
• Useful for 1:1 copies (e.g. ODS, Operational Data Store), but detecting changes for loading the data mart remains a challenge
REPLICATION-BASED
REPLICATION-BASED (SAMPLE PRODUCT ARCHITECTURE IIDR)
[Figure: IIDR replication engines connect the source datastore (source OLTP DB with its transaction logs) and the DWH datastore (DWH DB with Staging Layer, Core Layer and Mart Layer). The frontend serves standard reports and ad hoc reports.]
Timestamp-based discovery
• Every data item in a table is associated with timestamp information about its
validity period
• Changed data can be determined from this timestamp information
TIMESTAMP-BASED
Sample customer table in OLTP
• Each table gets Change timestamp
• Delta process reads latest data only (e.g. ChangeTimestamp >= <yesterday>)
• Problem: it is not possible to identify deleted rows
TIMESTAMP-BASED
CustomerID | Name | Department | Change Timestamp
1 | Miller | DWH | 15.01.2015 17:00:01
2 | Powell | DB | 22.03.2016 08:30:22
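A minimal sketch of such a delta process, using Python's built-in sqlite3 module. The column name change_ts, the ISO timestamp format and the extraction dates are assumptions for illustration. The second call shows the stated problem: a deleted row simply disappears and never shows up in a delta.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    "customer_id INTEGER PRIMARY KEY, name TEXT, department TEXT, change_ts TEXT)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?, ?)",
    [
        (1, "Miller", "DWH", "2015-01-15 17:00:01"),
        (2, "Powell", "DB", "2016-03-22 08:30:22"),
    ],
)


def extract_delta(connection, last_extract_ts):
    """Return all rows changed at or after the previous extraction timestamp."""
    return connection.execute(
        "SELECT customer_id, name FROM customer "
        "WHERE change_ts >= ? ORDER BY customer_id",
        (last_extract_ts,),
    ).fetchall()


# Delta since a (hypothetical) previous extraction on 2016-03-01.
delta = extract_delta(conn, "2016-03-01 00:00:00")

# A delete leaves no row and no timestamp behind, so the next delta is empty.
conn.execute("DELETE FROM customer WHERE customer_id = 1")
delta_after_delete = extract_delta(conn, "2016-03-23 00:00:00")

print(delta, delta_after_delete)
```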
Data comparison
Comparison of snapshots of the operational data at different points in time
• Compute difference between two latest snapshots
• E.g. unload all data from a table into a file and diff the newest file with the previous one
Can be very complex
Sometimes the only possibility, for instance for legacy applications
High performance impact on source
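The snapshot comparison can be sketched in a few lines of Python. The example assumes that each snapshot fits into memory as a dictionary keyed by primary key; the sample rows are made up for illustration.

```python
def diff_snapshots(previous, current):
    """Compare two snapshots keyed by primary key.

    Returns three dictionaries: inserted, updated and deleted rows.
    """
    inserted = {k: v for k, v in current.items() if k not in previous}
    deleted = {k: v for k, v in previous.items() if k not in current}
    updated = {
        k: v for k, v in current.items() if k in previous and previous[k] != v
    }
    return inserted, updated, deleted


# Hypothetical unloads of the same table on two consecutive days.
yesterday = {1: ("Miller", "DWH"), 2: ("Powell", "DB")}
today = {2: ("Powell", "Analytics"), 3: ("Smith", "DB")}

inserted, updated, deleted = diff_snapshots(yesterday, today)
print(inserted, updated, deleted)
```

Deletes are detected, but only the net effect between the two snapshots is visible: intermediate changes between the unloads are lost.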
SNAPSHOT-BASED
MONITORING TECHNIQUES COMPARISON
Criterion | Trigger-based | Replication techniques | Log-based discovery | Timestamp-based discovery | Snapshot-based discovery
Performance impact on source system | Medium | Low | Low | Medium | High
Performance impact on target system | Low | Low | Low | Low | High
Load on network | Low | Low | Low | Low | High
Data loss if nologging operations | No | Yes | Yes | No | No
MONITORING TECHNIQUES COMPARISON
Criterion | Trigger-based | Replication techniques | Log-based discovery | Timestamp-based discovery | Snapshot-based discovery
Identify DELETE operations | Yes | Yes | Yes | No | Yes
Identify ALL changes (changes between extractions) | Yes | Yes | Yes | No | No
Direct Access
• Source writes data into target or
• Target reads data from source
• Security concerns
• High coupling / dependencies
DATA TRANSPORT – DIRECT ACCESS
[Figure: Source connected directly to Target]
File transfer (or other transport medium)
• csv, json, xml, binary, etc
• Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise service
bus), SOA (service oriented architecture), etc
• Often high amounts of data, therefore bulk transfer of compressed data most
widely used
• Better decoupling of source and target
DATA TRANSPORT – FILE TRANSFER
[Figure: Source and Target connected via files]
Extraction intervals
• Periodically – in regular intervals
• Every day, week, etc.
• Instantly / Continuous
• Every change is directly propagated into the data warehouse
• „real time data warehouse“
• Depends on the requirements on timeliness of the data warehouse data
EXTRACTION INTERVALS
Triggered by a specific request
• Addition of a new product
• Query which involves more recent data
Triggered by specific events
• Number of changes in operational data exceeds threshold
EXTRACTION INTERVALS
• Profile Existing Data Sources, Extracted Data
• Analyze data structure, content, and quality
• Find data relationships across systems
• Often badly documented or missing foreign keys
• Uncover data issues that can affect subsequent transformation steps
• Missing values
• Duplicates
• Inconsistencies
PREREQUISITE OF ETL - UNDERSTANDING THE DATA
DATA QUALITY ISSUES
CustomerNo | Name | Birthdate | Age | Gender | Zip code
1 | Miller, Tom | 33.01.2001 | 15 | M | NULL
1 | John Mayor | 15.01.2001 | 15 | M | 98144
2 | Mrs. Bush | 31.10.1988 | 22 | Q | 00000
3 | Martin | 31.10.1988 | 22 | M | 75890
Issues marked in the figure: PK / Unique Key violated, Data not uniform, Not valid, Inconsistent, Wrong value, Unknown / missing, FK violated
DATA QUALITY ISSUES AND POSSIBLE SOLUTIONS IN
THE SOURCE RDBMS
Issue | Solution
Wrong data, e.g. 31.02.2016 | Proper data type definition
Wrong values, e.g. number out of range | Check constraint
Missing values | NOT NULL constraint
Violated references | FOREIGN KEY constraint
Duplicates | PRIMARY or UNIQUE KEY constraint
Inconsistent data | ACID transactions, business logic, additional checks
Correcting the data
• Automatically during ETL
• E.g., address of a customer if a correct reference table exists
• Manually after ETL is finished
• ETL stores bad data in error log tables or files
• ETL flags bad data (e.g. invalid)
DATA QUALITY ISSUES: WORKAROUNDS IN DWH
Correcting the data
• In the source systems
• Common master data management across all operational applications
• Dedicated systems are “master” of e.g. customer data
• Correcting the data at the source is the best approach, but it is slow and often not feasible
DATA QUALITY ISSUES: CORRECT DATA IN THE SOURCE
• Column is null
• Reject data
• Use default values
• Missing values can represent
• an unknown value like date of birth of a customer
• a missing value like engine_id for a car (logical not null constraint)
• Dimension tables can include some dummy values:
DATA QUALITY ISSUES: MISSING DATA
DimensionTable_X Description
-1 Unknown
-2 Missing
• Data is inaccurate
e.g. wrong date 32.12.2015
or wrong number 55U
• Reject data
• Replace with value that represents „Invalid“
• Dimension tables can include some dummy values:
DATA QUALITY ISSUES: INACCURATE DATA
DimensionTable_X Description
-1 Unknown
-2 Missing
-3 Invalid
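How the dummy keys could be assigned during ETL is sketched below. The function name date_key, the mandatory flag and the YYYYMMDD integer key format are assumptions for illustration; only the values -1, -2 and -3 come from the table above.

```python
from datetime import datetime

# Dummy dimension keys as listed in the table above.
UNKNOWN, MISSING, INVALID = -1, -2, -3


def date_key(raw, mandatory=False):
    """Map a raw DD.MM.YYYY string to a date key or to one of the dummy keys.

    Assumption: the date dimension uses integer keys of the form YYYYMMDD.
    """
    if raw is None or raw.strip() == "":
        # A mandatory attribute without a value is "missing",
        # an optional one is merely "unknown".
        return MISSING if mandatory else UNKNOWN
    try:
        parsed = datetime.strptime(raw.strip(), "%d.%m.%Y")
    except ValueError:
        return INVALID  # e.g. 32.12.2015
    return int(parsed.strftime("%Y%m%d"))


print(date_key("15.01.2001"), date_key("32.12.2015"), date_key(None))
```

Rejecting the row instead of mapping it to a dummy key is the alternative named above; which one fits depends on the reporting requirements.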
• Data has conflicts, e.g. wrong postal code 80995 Stuttgart
• Reject data
• Replace one of the values with a value that represents „Invalid“ or with
corrected value
Which value to replace? Rules necessary
DATA QUALITY ISSUES: CONFLICTING DATA
• Data is inconsistent, e.g. Order date after payment date or unlikely high
price for a product
• Can be discovered by statistical and data mining methods
DATA QUALITY ISSUES: INCONSISTENT DATA
• Data is duplicated, e.g. „Martin Miller” vs “Miller, Martin” vs “M.Miller”
• Multiple representations for one entity
• Different keys
• Different encodings
• Duplicate detection can be very difficult / tricky
• Products are available for e.g.
address duplicate detection
address validation (Kingstreet = does this address actually exist?)
address harmonization (Kingstr, Kingstreet, King Street, etc)
• Standardize / Harmonize data during ETL flow: “unification” for better
duplicate detection
DATA QUALITY ISSUES: DUPLICATES
• Unification of data types
• Character string to date: „20.01.2006“ → 20.01.2006
• Character string to number: „12345“ → 12345
• Unification of encodings
• For instance for gender: F and M
• Lookup tables contain the mapping from old to new encodings
• Combination of different attributes into one attribute
• day, month, year → date
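These unification steps can be sketched as small Python functions. The source encodings in GENDER_LOOKUP are hypothetical; in a real ETL flow the mapping would come from a lookup table.

```python
from datetime import date, datetime

# Hypothetical mapping from source encodings to the unified codes F and M.
GENDER_LOOKUP = {"f": "F", "female": "F", "w": "F", "m": "M", "male": "M"}


def to_date(text):
    """Character string -> date, assuming the DD.MM.YYYY format."""
    return datetime.strptime(text, "%d.%m.%Y").date()


def to_number(text):
    """Character string -> integer."""
    return int(text)


def unify_gender(code):
    """Look up the unified encoding; returns None for unmapped codes."""
    return GENDER_LOOKUP.get(code.strip().lower())


def combine_date(day, month, year):
    """Combine three separate attributes into one date attribute."""
    return date(year, month, day)


print(to_date("20.01.2006"), to_number("12345"), unify_gender("Female"))
```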
TRANSFORM - UNIFICATION OF DATA
• Split of one attribute into two or more
• Name → first name, last name (“Herr Prof. Dr. Hans M. vom und zum Stein”)
• Unification of names can become very challenging “Herr Prof. Dr. Hans M. vom
und zum Stein” or “Werner Martin” or “Mariae Gloria … Wilhelmine Huberta Gräfin
von Schönburg-Glauchau“
• Product name - „Cola, 0.33 l“
Product short name - „Cola“, size in liters - 0.33
TRANSFORM - UNIFICATION OF DATA
• Unification of dates and timestamps
• Rules for representing incomplete date information
If only month and year are known
• Dates and timestamps with regard to one specific timezone
Important for multi-national organizations
UTC (Coordinated Universal Time) has no daylight saving time
• What can happen if clock is changed to winter time if no UTC is used?
- Update arrives at 02:15 in staging layer (CDC / log-based monitor)
- Clock is changed to winter time: -1h
- Update of the same row arrives at 02:10 in staging layer (CDC / log-based)
- How can batch load running the next night discover which update is the most
recent one?
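The clock-change scenario can be replayed in Python. The date 30.10.2016 (end of daylight saving time in Germany) and the fixed offsets for CEST and CET are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

CEST = timezone(timedelta(hours=2), "CEST")  # summer time, before the switch
CET = timezone(timedelta(hours=1), "CET")    # winter time, after the switch

# First update arrives at 02:15 summer time, the second at 02:10 winter time.
first_update = datetime(2016, 10, 30, 2, 15, tzinfo=CEST)
second_update = datetime(2016, 10, 30, 2, 10, tzinfo=CET)

# Local wall-clock values as they would be stored without any offset.
local_first = first_update.replace(tzinfo=None)
local_second = second_update.replace(tzinfo=None)
wrong_order = local_second < local_first  # the later update looks older

# Stored in UTC, the order is unambiguous.
utc_first = first_update.astimezone(timezone.utc)
utc_second = second_update.astimezone(timezone.utc)
right_order = utc_first < utc_second

print(wrong_order, right_order, utc_second - utc_first)
```

With local timestamps the batch load would treat the 02:15 update as the most recent one, although the 02:10 update happened 55 minutes later.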
TRANSFORM - UNIFICATION OF DATA
• Computation of derived values
• Profit = sales price – purchase price
Without clear definition, different interpretations possible
• Net or gross sales price?
• Net or gross purchase price?
• Aggregations
• Revenue of the year computed from revenues of the day
Without clear definition, different interpretations possible
• Calendar year?
• Fiscal year?
TRANSFORM - UNIFICATION OF DATA
• Specification of the mapping between source and target columns
• Source tables + columns
• Target table + columns
• Join rules
• Filter criteria
• Transformation rules
DATA MAPPING
• Efficient load operations are important
• bulk load: Single row processing vs set based processing
• Online load
• Data warehouse (especially Data Mart) is still accessible
• Offline load
• Data warehouse (especially Data Mart) is offline
• For updates that require the recomputation of a cube
• Offline load is often a tool limitation because the tool locks data structures; however, an offline load can be faster.
LOAD
• Specific Bulk load operations provided by RDBMS, e.g. External tables in
Oracle or LOAD command in DB2
• Single row vs set based processing
BULK PROCESSING
Single row processing
Cursor curs = SELECT * FROM <source>
WHILE NOT EOF(curs)
FETCH NEXT ROW INTO myRow;
INSERT INTO <target> VALUES(myRow);
LOOP
• Error handling easy
• Slow for high amounts of data
• More coding
Set based processing
INSERT INTO <target>
SELECT * FROM <source>
• All or nothing if there are errors
• Performs well for small and high amounts of data
• Less code = fewer errors
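Both processing styles can be compared with Python's built-in sqlite3 module. The table names source, target_row and target_set are placeholders, and no timing is measured here; the sketch only shows that one set-based statement replaces the whole loop.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source (id INTEGER, val TEXT);
CREATE TABLE target_row (id INTEGER, val TEXT);
CREATE TABLE target_set (id INTEGER, val TEXT);
""")
conn.executemany(
    "INSERT INTO source VALUES (?, ?)", [(i, f"v{i}") for i in range(1000)]
)

# Single row processing: fetch the rows and insert them one by one.
# Errors can be handled per row, but every row costs one statement.
for row in conn.execute("SELECT id, val FROM source").fetchall():
    conn.execute("INSERT INTO target_row VALUES (?, ?)", row)

# Set based processing: a single statement moves all rows (all or nothing).
conn.execute("INSERT INTO target_set SELECT id, val FROM source")

row_count = conn.execute("SELECT COUNT(*) FROM target_row").fetchone()[0]
set_count = conn.execute("SELECT COUNT(*) FROM target_set").fetchone()[0]
print(row_count, set_count)
```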
ETL-JOB PARALLELISM FOR LOADING DATA INTO CORE
WAREHOUSE LAYER
[Figure: job sequences of a Data Vault load and a classical load within a time window for loads (e.g. 00:00-06:00), and the integration of new jobs]
Data Vault load: HUB loaded, then LINK and HUB-SAT loaded, then LINK-SAT loaded
• Systematic / methodical
• Few, well-defined dependencies
• Massively parallel
Classical load
• Complex
• Many dependencies
• Many sequential jobs
Draw a flow diagram how to load a HUB, LINK and SAT table and describe the
SQL statements
EXERCISE: LOAD DATA VAULT TABLE
EXERCISE: LOAD HUB TABLE
[Flow diagram]
• Source data exist → Load distinct business keys → Does the business key exist in the HUB?
• yes → Reject data
• no → Insert row into HUB (conflict if PK HashKey collision!)
• End: Data loaded into HUB
-- Insert only business keys that are not yet in the hub
INSERT INTO core.hub_fahrzeug (vehicle_hk, fin, loaddate, recordsource)
SELECT DISTINCT f.fahrzeug_hashkey
, f.fin_bk
, f.loaddate
, f.recordsource
FROM staging.fahrzeugdaten f
WHERE f.fin_bk NOT IN (SELECT fin FROM core.hub_fahrzeug)
AND f.loaddate = <date to load>;
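The hub load can be tried out with Python's built-in sqlite3 module. SQLite has no core/staging schemas here, so the table names stg_fahrzeugdaten and hub_fahrzeug, the sample keys and the load date are assumptions. The sketch uses NOT EXISTS instead of NOT IN, a swapped-in variant that also behaves correctly if the subquery returns NULLs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stg_fahrzeugdaten (
    fahrzeug_hashkey TEXT, fin_bk TEXT, loaddate TEXT, recordsource TEXT);
CREATE TABLE hub_fahrzeug (
    vehicle_hk TEXT PRIMARY KEY, fin TEXT UNIQUE,
    loaddate TEXT, recordsource TEXT);
""")
conn.executemany(
    "INSERT INTO stg_fahrzeugdaten VALUES (?, ?, ?, ?)",
    [
        ("hk_a", "FIN_A", "2016-01-01", "src"),
        ("hk_a", "FIN_A", "2016-01-01", "src"),  # duplicate business key
        ("hk_b", "FIN_B", "2016-01-01", "src"),
    ],
)

# Load distinct business keys that do not exist in the hub yet.
HUB_LOAD = """
INSERT INTO hub_fahrzeug (vehicle_hk, fin, loaddate, recordsource)
SELECT DISTINCT f.fahrzeug_hashkey, f.fin_bk, f.loaddate, f.recordsource
FROM stg_fahrzeugdaten f
WHERE f.loaddate = ?
  AND NOT EXISTS (SELECT 1 FROM hub_fahrzeug h WHERE h.fin = f.fin_bk)
"""

conn.execute(HUB_LOAD, ("2016-01-01",))
first_run = conn.execute("SELECT COUNT(*) FROM hub_fahrzeug").fetchone()[0]

# Re-running the same load rejects all keys because they already exist.
conn.execute(HUB_LOAD, ("2016-01-01",))
second_run = conn.execute("SELECT COUNT(*) FROM hub_fahrzeug").fetchone()[0]
print(first_run, second_run)
```

The load is restartable: running it twice for the same load date leaves the hub unchanged.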
EXERCISE: LOAD HUB TABLE
EXERCISE: LOAD LINK TABLE
[Flow diagram]
• Source data exist → Load distinct business keys → Does the hash key relationship exist in the LINK?
• yes → Reject data
• no → Insert row into LINK (conflict if PK HashKey collision!)
• End: Data loaded into LINK
-- Insert only relationships that are not yet in the link
INSERT INTO core.link_verbaut (verbaut_hk, motor_hk, vehicle_hk, loaddate, recordsource)
SELECT DISTINCT f.verbaut_hashkey -- assumed column name for the link hash key in staging
, f.motor_hashkey
, f.fahrzeug_hashkey
, f.loaddate
, f.recordsource
FROM staging.fahrzeugdaten f
WHERE (f.motor_hashkey, f.fahrzeug_hashkey) NOT IN (SELECT v.motor_hk, v.vehicle_hk FROM
core.link_verbaut v)
AND f.loaddate = <date to load>;
EXERCISE: LOAD LINK TABLE
EXERCISE: LOAD SAT TABLE
[Flow diagram]
• Source data exist → Load distinct source data and load current/latest row from SAT table → MD5 hash diff identical?
• yes → Reject data
• no → Insert row into SAT
• End: Data loaded into SAT
INSERT INTO core.sat_fahrzeug_text (vehicle_hk, loaddate, recordsource, md5_hash, codeleiste, kommentar)
SELECT DISTINCT f.fahrzeug_hashkey
, f.loaddate
, f.recordsource
, f.md5hash
, f.codeleiste
, f.kommentar
FROM staging.fahrzeugdaten f
-- k = current (latest) satellite row per vehicle
LEFT OUTER JOIN (SELECT s.vehicle_hk, s.md5_hash
FROM core.sat_fahrzeug_text s
JOIN (SELECT i.vehicle_hk, MAX(i.loaddate) AS loaddate
FROM core.sat_fahrzeug_text i
GROUP BY i.vehicle_hk) m
ON s.vehicle_hk = m.vehicle_hk AND s.loaddate = m.loaddate) k
ON f.fahrzeug_hashkey = k.vehicle_hk
-- insert only new vehicles or rows whose attributes changed
WHERE (k.md5_hash IS NULL OR f.md5hash <> k.md5_hash)
AND f.loaddate = <date to load>;
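The hash-diff comparison behind the SAT load can be sketched row by row in Python. This is a simplification for illustration, not the set-based statement above: the reduced table layout (one descriptive attribute), the "|" separator and the function names are assumptions.

```python
import hashlib
import sqlite3


def hash_diff(*attributes):
    """MD5 over the descriptive attributes, used only to detect changes."""
    return hashlib.md5("|".join(attributes).encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sat_fahrzeug_text ("
    "vehicle_hk TEXT, loaddate TEXT, md5_hash TEXT, kommentar TEXT, "
    "PRIMARY KEY (vehicle_hk, loaddate))"
)


def load_sat(connection, vehicle_hk, loaddate, kommentar):
    """Insert a satellite row only if it differs from the latest stored row."""
    new_hash = hash_diff(kommentar)
    latest = connection.execute(
        "SELECT md5_hash FROM sat_fahrzeug_text "
        "WHERE vehicle_hk = ? ORDER BY loaddate DESC LIMIT 1",
        (vehicle_hk,),
    ).fetchone()
    if latest is not None and latest[0] == new_hash:
        return False  # hash diff identical: reject
    connection.execute(
        "INSERT INTO sat_fahrzeug_text VALUES (?, ?, ?, ?)",
        (vehicle_hk, loaddate, new_hash, kommentar),
    )
    return True


results = [
    load_sat(conn, "hk_a", "2016-01-01", "first comment"),    # new vehicle
    load_sat(conn, "hk_a", "2016-01-02", "first comment"),    # unchanged
    load_sat(conn, "hk_a", "2016-01-03", "changed comment"),  # changed
]
sat_rows = conn.execute("SELECT COUNT(*) FROM sat_fahrzeug_text").fetchone()[0]
print(results, sat_rows)
```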
EXERCISE: LOAD SAT TABLE
DB SPECIFICS FOR DWH
• After the end of this lecture you will be able to
• Understand DB techniques that are specific for DWH
• Bitemporal data
• Indexing, Partitioning, Parallelism
WHAT YOU WILL LEARN TODAY
TEMPORAL DATA STORAGE (BITEMPORAL DATA)
[Figure: timeline from 10.09. to 10.10.; the price changes from 15 EUR to 16 EUR. The new price of 16 EUR is entered into the DB on 10.09. (transaction time) and becomes valid on 20.09. (valid time).]
Valid time is the time period during which a fact is true in the real world.
Transaction time is the time period during which a fact stored in the
database was known.
Bitemporal data combines both Valid and Transaction Time.
Source: (Wikipedia, https://en.wikipedia.org/wiki/Temporal_database)
TEMPORAL DATA STORAGE (BITEMPORAL DATA)
• Part of the SQL standard since SQL:2011
• But different implementations by RDBMSes like Oracle, DB2, SQL Server
and others
• Different syntax!
• Different coverage of standard!
• Very useful for slowly changing dimensions type 2, but also for other
purposes
TEMPORAL DATA STORAGE (BITEMPORAL DATA)
CREATE TABLE customer_address
( customerID INTEGER NOT NULL
, name VARCHAR(100)
, city VARCHAR(100)
, valid_start DATE NOT NULL
, valid_end DATE NOT NULL
, PERIOD BUSINESS_TIME(valid_start, valid_end)
, PRIMARY KEY(customerID, BUSINESS_TIME WITHOUT OVERLAPS) );
DB2 VALID TIME EXAMPLE
INSERT INTO customer_address VALUES
(1, 'Miller', 'Seattle', '01.01.2013', '31.12.2013');
UPDATE customer_address FOR PORTION OF BUSINESS_TIME
FROM '22.05.2013' TO '31.12.2013'
SET city = 'San Diego' WHERE customerID = 1;
DB2 VALID TIME EXAMPLE
customerID | Name | City | Valid_start | Valid_end
1 | Miller | Seattle | 01.01.2013 | 22.05.2013
1 | Miller | San Diego | 22.05.2013 | 31.12.2013
SELECT *
FROM customer_address
FOR BUSINESS_TIME AS OF '17.05.2013';
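What the AS OF query returns can be emulated in plain Python. The sketch assumes the two rows that result from the update and treats the period as inclusive start, exclusive end, which is how DB2 defines BUSINESS_TIME periods; the function name as_of is hypothetical.

```python
from datetime import date

# Rows after the update: (customer_id, name, city, valid_start, valid_end).
customer_address = [
    (1, "Miller", "Seattle", date(2013, 1, 1), date(2013, 5, 22)),
    (1, "Miller", "San Diego", date(2013, 5, 22), date(2013, 12, 31)),
]


def as_of(rows, day):
    """Return the rows whose period [valid_start, valid_end) contains day."""
    return [row for row in rows if row[3] <= day < row[4]]


cities_17_may = [row[2] for row in as_of(customer_address, date(2013, 5, 17))]
cities_22_may = [row[2] for row in as_of(customer_address, date(2013, 5, 22))]
print(cities_17_may, cities_22_may)
```

Because the end of the period is exclusive, the two rows do not overlap on 22.05.2013 and every day maps to at most one row per customer.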
DB2 VALID TIME EXAMPLE
CREATE TABLE customer_info(
customerId INTEGER NOT NULL,
comment VARCHAR(1000) NOT NULL,
sys_start TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW BEGIN,
sys_end TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW END,
PERIOD SYSTEM_TIME (sys_start, sys_end)
);
DB2 TRANSACTION TIME EXAMPLE
Transaction on 15.10.2013:
INSERT INTO customer_info VALUES( 1, 'comment 1');
Transaction on 31.10.2013:
UPDATE customer_info SET comment = 'comment 2'
WHERE customerID = 1;
DB2 TRANSACTION TIME EXAMPLE
CustomerId | comment | Sys_start | Sys_end
1 | comment 2 | 31.10.2013 | 31.12.2999
SELECT *
FROM customer_info FOR SYSTEM_TIME AS OF '17.10.2013';
Data comes from a history table:
CustomerId | comment | Sys_start | Sys_end
1 | comment 1 | 15.10.2013 | 31.10.2013
Valid Time and Transaction Time can be combined = Bitemporal table
DB2 TRANSACTION TIME EXAMPLE
• Very important performance improvement technique
• Good for many reads with high selectivity, write penalty
• B-trees most common
INDEXING - WHY
[Figure: B-tree with root, branch and leaf blocks; the leaf entries point to rows in the table]
• DBs index Primary Keys by default
• Dimension table columns that are regularly used in WHERE clauses are candidates
• Possibly foreign key columns in the fact table (see also Star Transformation later)
INDEXING A STAR SCHEMA – WHICH COLUMNS ARE
CANDIDATES FOR AN INDEX?
• The fact table normally has many more rows than the dimension tables
• Common join techniques would first join one dimension table with the fact table
• Alternative technique: evaluate all dimensions first (cartesian join)
• Then join the result to the fact table in the last step
• Oracle uses bitmap indexes on the foreign key columns of the fact table to achieve the star join; many other DBs do not support this
STAR TRANSFORMATION
• Suppose you have a fact table containing data for the last 10 years with millions of rows, but you are interested only in
• Data from yesterday
• From last 2 years
How could you improve performance?
EXERCISE: PERFORMANCE
• Suppose you have a fact table containing data for the last 10 years with millions of rows, but you are interested only in
• Data from yesterday
• Indexing might be a good choice as not many rows are read
• Data from the last 2 years
• Indexing is most likely a bad choice, as reading a rather high amount of data via an index quickly becomes inefficient
• Partitioning
• In general, a columnar in-memory DB may be an option (this option has already been discussed during the lecture)
EXERCISE: PERFORMANCE
PARTITIONING
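Original table:
Col1 | Col2 | Col3 | Col4
1 | A | AA | AAA
2 | B | BB | BBB
3 | C | CC | CCC
Vertical partitioning (split by columns):
Col1 | Col2
1 | A
2 | B
3 | C
Col3 | Col4
AA | AAA
BB | BBB
CC | CCC
Horizontal partitioning (split by rows; called sharding when the partitions are distributed across servers):
Col1 | Col2 | Col3 | Col4
1 | A | AA | AAA
2 | B | BB | BBB
Col1 | Col2 | Col3 | Col4
3 | C | CC | CCC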
• Very powerful feature in a DWH to reduce workload
• Split a table into smaller logical tables
• Avoidance of full table scans
• How could a table be split?
HORIZONTAL PARTITIONING
• By range
• Most common
• Use a date field like order date to partition the table into months, days, etc.
• By list
• Use field that has limited number of different values, e.g. split customer data by
country if end users most likely select customers from within a country
• By hash
• Use a field that most likely splits the data into evenly distributed chunks
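The three splitting options can be illustrated by functions that assign a row to a partition. The partition names, the country list and the choice of CRC32 as hash function are assumptions for illustration; a DBMS uses its own internal hash function.

```python
import zlib
from collections import Counter
from datetime import date


def range_partition(order_date):
    """Range partitioning: one partition per month of the order date."""
    return f"p_{order_date:%Y_%m}"


def list_partition(country):
    """List partitioning: a fixed (hypothetical) list of countries."""
    known = {"DE": "p_de", "US": "p_us", "CN": "p_cn"}
    return known.get(country, "p_other")


def hash_partition(customer_id, partitions=4):
    """Hash partitioning: a stable hash modulo the number of partitions."""
    return zlib.crc32(str(customer_id).encode("utf-8")) % partitions


# Distribution of 1000 consecutive customer ids over 4 hash partitions.
sizes = Counter(hash_partition(i) for i in range(1000))
print(range_partition(date(2016, 3, 22)), list_partition("FR"), sorted(sizes))
```

A query that filters on the partitioning column only has to read the matching partitions, which is what avoids the full table scan.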
HORIZONTAL PARTITIONING – SPLITTING OPTIONS
• Statements are normally executed on one CPU
• Parallelism allows the DB to distribute the execution to several CPUs
• Powerful combination with partitioning
• Parallelism is limited by the number of CPUs: if parallelism is too high,
performance will degrade
• Intra-query parallelism and inter-query parallelism
PARALLELISM
• Relational columnar In-Memory DB
• Materialized Views / Query Tables
ALREADY COVERED IN A PREVIOUS LECTURE
Daimler TSS GmbH
Wilhelm-Runge-Straße 11, 89081 Ulm / Phone +49 731 505-06 / Fax +49 731 505-65 99
tss@daimler.com / Internet: www.daimler-tss.com/ Intranet-Portal-Code: @TSS
Domicile and Court of Registry: Ulm / HRB-Nr.: 3844 / Management: Christoph Röger (CEO), Steffen Bäuerle
THANK YOU

Weitere ähnliche Inhalte

Was ist angesagt?

Designing high performance datawarehouse
Designing high performance datawarehouseDesigning high performance datawarehouse
Designing high performance datawarehouse
Uday Kothari
 
introduction to datawarehouse
introduction to datawarehouseintroduction to datawarehouse
introduction to datawarehouse
kiran14360
 

Was ist angesagt? (20)

Designing high performance datawarehouse
Designing high performance datawarehouseDesigning high performance datawarehouse
Designing high performance datawarehouse
 
Module 02 teradata basics
Module 02 teradata basicsModule 02 teradata basics
Module 02 teradata basics
 
TDWI Roundtable: The HANA EDW
TDWI Roundtable: The HANA EDWTDWI Roundtable: The HANA EDW
TDWI Roundtable: The HANA EDW
 
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALADATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
 
An Introduction To BI
An Introduction To BIAn Introduction To BI
An Introduction To BI
 
introduction to datawarehouse
introduction to datawarehouseintroduction to datawarehouse
introduction to datawarehouse
 
Datawarehouse and OLAP
Datawarehouse and OLAPDatawarehouse and OLAP
Datawarehouse and OLAP
 
HANA Modeling
HANA Modeling HANA Modeling
HANA Modeling
 
Data warehouse system and its concepts
Data warehouse system and its conceptsData warehouse system and its concepts
Data warehouse system and its concepts
 
Introduction to Data Warehousing
Introduction to Data WarehousingIntroduction to Data Warehousing
Introduction to Data Warehousing
 
SAP BOBJ Rapid Mart Overview & Implementation
SAP BOBJ Rapid Mart Overview & ImplementationSAP BOBJ Rapid Mart Overview & Implementation
SAP BOBJ Rapid Mart Overview & Implementation
 
SAP BOBJ Rapid Marts Overview I
SAP BOBJ Rapid Marts Overview ISAP BOBJ Rapid Marts Overview I
SAP BOBJ Rapid Marts Overview I
 
Tableau Architecture
Tableau ArchitectureTableau Architecture
Tableau Architecture
 
Data warehousing
Data warehousingData warehousing
Data warehousing
 
SAS/Tableau integration
SAS/Tableau integrationSAS/Tableau integration
SAS/Tableau integration
 
SAS Programming For Beginners | SAS Programming Tutorial | SAS Tutorial | SAS...
SAS Programming For Beginners | SAS Programming Tutorial | SAS Tutorial | SAS...SAS Programming For Beginners | SAS Programming Tutorial | SAS Tutorial | SAS...
SAS Programming For Beginners | SAS Programming Tutorial | SAS Tutorial | SAS...
 
Oracle: DW Design
Oracle: DW DesignOracle: DW Design
Oracle: DW Design
 
Delta machenism with db connect
Delta machenism with db connectDelta machenism with db connect
Delta machenism with db connect
 
Dwh faqs
Dwh faqsDwh faqs
Dwh faqs
 
Chapter 2
Chapter 2Chapter 2
Chapter 2
 

Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)

  • 1. LECTURE @DHBW: DATA WAREHOUSE, PART III: ETL AND DB SPECIFICS
      Andreas Buckenhofer, Daimler TSS (a company of Daimler AG)
  • 3. DAIMLER TSS. IT EXCELLENCE: COMPREHENSIVE, INNOVATIVE, CLOSE.
      We're a specialist and strategic business partner for innovative IT solutions within Daimler, not just another supplier! As a 100% subsidiary of Daimler, we live the culture of excellence and aspire to take an innovative and technological lead. With our outstanding technological and methodical competence, we are a competent provider of services that help those who benefit from them to stand out from the competition. When it comes to demanding IT questions, we create impetus, especially in the core fields of car IT and mobility, information security, analytics, shared services, and digital customer experience.
      TSS 2020: ALWAYS ON THE MOVE.
  • 4. LOCATIONS
      • Daimler TSS Germany: 6 locations, more than 1000 employees
        Ulm (headquarters), Stuttgart area (Böblingen, Echterdingen, Leinfelden, Möhringen), Berlin
      • Daimler TSS China: Hub Beijing, 6 employees
      • Daimler TSS Malaysia: Hub Kuala Lumpur, 38 employees
      • Daimler TSS India: Hub Bangalore, 16 employees
  • 5. WHAT YOU WILL LEARN TODAY
      After the end of this lecture you will be able to understand the concepts behind ETL.
  • 6. LOGICAL STANDARD DATA WAREHOUSE ARCHITECTURE
      [Diagram: internal and external data sources (e.g. OLTP systems) feed the data warehouse, which is split into backend and frontend.
      Layers shown: Staging Layer (Input Layer), Integration Layer (Cleansing Layer), Core Warehouse Layer (Storage Layer), Aggregation Layer, Mart Layer (Output Layer / Reporting Layer).
      Cross-cutting components: Metadata Management, Security, DWH Manager incl. Monitor.]
  • 7. ETL PROCESS
      Extract, Transform, Load
      Other term: data integration (better, more neutral)
  • 8. TASKS OF THE ETL PROCESS - EXTRACT
      • Capture and copy data from source systems (e.g. operational systems)
      • Many different types of sources
        • Relational and hierarchical DBMSs
        • Flat files
        • Other internal/external sources
  • 9. TASKS OF THE ETL PROCESS - TRANSFORM
      • Filter data
      • Integrate data
      • Check and cleanse data
  • 10. TASKS OF THE ETL PROCESS - LOAD
      • Original meaning: fast load into the staging area
      • General meaning: loading data into the staging area or another layer
  • 11. ETL VS ELT
      "ETL" is often used for data integration in general (covering both ETL and ELT). But when ELT is mentioned explicitly, it is differentiated from ETL.
      [Diagram: data flow between source DB, ETL server, and target DB, compared with data flow between source DB, ELT server, and target DB]
  • 12. ETL VS ELT
      ETL:
      • Data is transferred to the ETL server and transferred back to the DB; high network bandwidth required
      • Transformations are performed in the ETL server
      • Proprietary code is executed in the ETL server
      • Typically used for source-to-target transfer, compute-intensive transformations, and small amounts of data
      ELT:
      • Data remains in the DB, except for cross-database loads (e.g. source to target)
      • Transformations are performed (in the source or) in the target
      • Generated code, e.g. SQL, PL/SQL, SQLT
      • Typically used for high amounts of data
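The difference between the two styles can be sketched in a few lines. This is a minimal illustration, not taken from the lecture: an in-memory SQLite database stands in for source and target, the Python process plays the role of the "ETL server", and the table names `src_sales`, `tgt_etl`, and `tgt_elt` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_sales (id INTEGER PRIMARY KEY, amount_cents INTEGER);
    INSERT INTO src_sales VALUES (1, 1250), (2, 990), (3, 40000);
    CREATE TABLE tgt_etl (id INTEGER PRIMARY KEY, amount_eur REAL);
    CREATE TABLE tgt_elt (id INTEGER PRIMARY KEY, amount_eur REAL);
""")

# ETL style: rows leave the database, are transformed in the "ETL server"
# (here: this Python process), and are written back to the target table.
rows = conn.execute("SELECT id, amount_cents FROM src_sales").fetchall()
transformed = [(row_id, cents / 100.0) for row_id, cents in rows]
conn.executemany("INSERT INTO tgt_etl VALUES (?, ?)", transformed)

# ELT style: the transformation is a SQL statement executed inside the
# database; no rows travel through the client process.
conn.execute("""
    INSERT INTO tgt_elt (id, amount_eur)
    SELECT id, amount_cents / 100.0 FROM src_sales
""")

etl_result = conn.execute("SELECT * FROM tgt_etl ORDER BY id").fetchall()
elt_result = conn.execute("SELECT * FROM tgt_elt ORDER BY id").fetchall()
```

Both target tables end up with the same content; what differs is where the work happens and how much data crosses the network.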
  • 13. ETL/ELT TOOL VS MANUAL ETL/ELT
      ETL tool:
      • Informatica, Talend, Oracle ODI, etc.
      • Separate license
      • Workflow, error handling, and restart/recovery functionality included
      • Impact analysis and where-used (lineage) functionality available
      • Faster development, easier maintenance
      • Additional (tool) know-how required
      Manual ETL:
      • SQL, PL/SQL, SQLT, etc.
      • No additional license
      • Workflow, error handling, and restart/recovery functionality must be implemented manually
      • Impact analysis and where-used (lineage) functionality difficult
      • Slower development, more difficult maintenance
      • Know-how often available
  • 14. ETL/ELT TOOL VS MANUAL ETL/ELT
      [Diagram: building blocks of an ETL tool]
      • Data Profiling services: source analysis
      • Data Quality services: data cleansing
      • Data Transformation and Integration services: data mapping, business rules, Slowly Changing Dimensions, datatype conversion, lookups
      • Further components shown: Extract services, Load services, Operations management services, Scheduler, Control, Repository Management, Connectors, Sorter, Bulk Loader, Job Monitoring, Auditing, Error Handling, Security
  • 15. MAPPING - INFORMATICA
      [Screenshot: mapping with Source, Filter, Lookup, and Target]
  • 16. MAPPING WITH TRANSFORMATIONS - INFORMATICA
      [Screenshot: mapping with Sorter, Aggregator Transformation, and Union Transformation]
  • 17. WORKFLOW - INFORMATICA
      [Screenshot: workflow with a decision & coordination step and a session containing a mapping]
  • 18. JOB MONITORING - INFORMATICA
      [Screenshot: job monitoring]
  • 19. MONITORING (DATA CHANGE DETECTION)
      Extracts from source systems:
      • Initial load: initial extract for setting up the data warehouse
      • Incremental load: periodic extracts for adding new/changed information to the data warehouse
      Question: how to determine what is new or what has changed in the source systems? This is the task of "monitoring".
  • 20. MONITORING: NET EFFECT OF CHANGES
      Discovery of all changes vs. determining only the net effect at extract/load time
      • Example: an attribute value can be changed in two ways:
        • by one update operation
        • by one delete and one insert operation
      The net effect of both is the same. However, history information is lost if only the net effect is recorded.
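A small sketch of the distinction between recording all changes and recording the net effect. The change log, the key, and the city values are hypothetical; the point is only that replaying the log yields one final state while the intermediate values disappear.

```python
# Hypothetical change log for one customer row: (operation, key, city)
change_log = [
    ("UPDATE", 1, "Ulm"),
    ("UPDATE", 1, "Berlin"),
    ("DELETE", 1, None),
    ("INSERT", 1, "Stuttgart"),
]


def net_effect(initial, log):
    """Replay a change log and return only the resulting state."""
    state = dict(initial)
    for operation, key, value in log:
        if operation == "DELETE":
            state.pop(key, None)
        else:  # INSERT and UPDATE both set the current value
            state[key] = value
    return state


before = {1: "Böblingen"}
after = net_effect(before, change_log)

# The full history is only available if every change was captured.
history = [value for operation, _, value in change_log if operation != "DELETE"]
```

An extract that compares only `before` and `after` sees a single change of city; the stays in Ulm and Berlin are visible only in `history`.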
  • 21. EXERCISE MONITORING
      Which techniques can be used to identify changes in a source system (RDBMS)?
      • E.g. in an OLTP system:
        • New products are inserted
        • A customer address changes
        • A product is deleted because it is out of stock
      How would you identify such changes? List advantages and disadvantages of possible solutions.
      Think about making changes in the source system. Think also about other solutions without any change in the source system.
  • 22. MONITORING TECHNIQUES
      Depend on the characteristics of the data sources. The following techniques are based on modern relational DBMSs.
      Types of techniques:
      • Based on DBMS: trigger-based, log-based discovery, replication techniques
      • Controlled by application: timestamp-based discovery, snapshot-based discovery
  • 23. TRIGGER-BASED
      Active monitoring mechanisms based on (database) triggers
      • Example: if a new record is inserted into the sales transaction table, insert the transaction id and a timestamp into a change table
      Advantage:
      • Triggers do not change operational applications
      Disadvantages:
      • Performance impact on operational systems if triggers are used extensively
      • Triggers have to be implemented for every table in the source systems
  • 24. TRIGGER-BASED
      Sample trigger code, Oracle:
          CREATE [OR REPLACE] TRIGGER <trigger_name>
            {BEFORE|AFTER} {INSERT|DELETE|UPDATE} ON <table_name>
            [REFERENCING [NEW AS <new_row_name>] [OLD AS <old_row_name>]]
            [FOR EACH ROW [WHEN (<trigger_condition>)]]
          <trigger_body>
      A trigger is created for each source table in the OLTP DB and stores insert/update/delete changes in a "log/journal table".
      • The trigger body contains insert statements into the log/journal table
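The journal-table idea can be tried out without an Oracle instance. The following sketch uses SQLite through Python's `sqlite3` module, so the trigger syntax differs slightly from the Oracle template above; the tables `sales` and `sales_journal` and the trigger names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL);

    -- Journal table filled by the triggers below
    CREATE TABLE sales_journal (
        change_id  INTEGER PRIMARY KEY AUTOINCREMENT,
        operation  TEXT NOT NULL,
        sales_id   INTEGER NOT NULL,
        changed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    );

    CREATE TRIGGER sales_ai AFTER INSERT ON sales FOR EACH ROW
    BEGIN
        INSERT INTO sales_journal (operation, sales_id) VALUES ('I', NEW.id);
    END;

    CREATE TRIGGER sales_au AFTER UPDATE ON sales FOR EACH ROW
    BEGIN
        INSERT INTO sales_journal (operation, sales_id) VALUES ('U', NEW.id);
    END;

    CREATE TRIGGER sales_ad AFTER DELETE ON sales FOR EACH ROW
    BEGIN
        INSERT INTO sales_journal (operation, sales_id) VALUES ('D', OLD.id);
    END;
""")

# The operational application works on the sales table only.
conn.execute("INSERT INTO sales VALUES (1, 10.0)")
conn.execute("UPDATE sales SET amount = 12.0 WHERE id = 1")
conn.execute("DELETE FROM sales WHERE id = 1")

# The ETL process reads the journal instead of the operational table.
journal = conn.execute(
    "SELECT operation, sales_id FROM sales_journal ORDER BY change_id"
).fetchall()
```

The journal holds one entry per operation, including the delete, although the `sales` table itself is empty at the end. Every write to `sales` now causes a second write, which is the performance cost mentioned for this technique.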
  • 25. LOG-BASED
      Log-based discovery, also often referred to as CDC (Change Data Capture)
      Usage of database transaction logs to determine changes
      • DBMSs write transaction logs in order to be able to undo partially executed transactions
      • This information can be used to determine all changes
      • A log reader identifies insert, update, delete, and truncate operations and writes the changes as inserts into the staging layer
      Transaction log files can be transferred to other systems to avoid additional load on source systems
  • 26. LOG-BASED (SAMPLE PRODUCT ARCHITECTURE IIDR)
      [Diagram: Source OLTP DB with transaction logs, IIDR ReplEngine with source datastore, IIDR ReplEngine with DWH datastore, DWH DB (Staging Layer, Core Layer, Mart Layer), frontend with standard reports and ad hoc reports]
  • 27. REPLICATION-BASED
      Replication techniques (data replication)
      • Target tables are not necessarily on the local system
      • Typically uses transaction logs
      • A log reader identifies insert, update, delete, and truncate operations and writes the changes into replicated tables (an insert remains an insert, an update remains an update, etc.)
      • Useful for 1:1 copies (e.g. ODS, Operational Data Store), but detecting changes for loading the data mart remains a challenge
  • 28. REPLICATION-BASED (SAMPLE PRODUCT ARCHITECTURE IIDR)
      [Diagram: Source OLTP DB with transaction logs, IIDR ReplEngine with source datastore, IIDR ReplEngine with DWH datastore, DWH DB (Staging Layer, Core Layer, Mart Layer), frontend with standard reports and ad hoc reports]
  • 29. TIMESTAMP-BASED
      Timestamp-based discovery
      • Every data item in a table is associated with timestamp information about its validity period
      • Changed data can be determined from this timestamp information
  • 30. TIMESTAMP-BASED
      Sample customer table in OLTP
      • Each table gets a change timestamp
      • The delta process reads the latest data only (e.g. ChangeTimestamp >= <yesterday>)
      • Problem: it is not possible to identify deleted rows
          CustomerID | Name   | Department | Change Timestamp
          1          | Miller | DWH        | 15.01.2015 17:00:01
          2          | Powell | DB         | 22.03.2016 08:30:22
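A minimal sketch of such a delta process, using the sample rows above in an in-memory SQLite table. The column names, the ISO 8601 timestamp format, and the helper `extract_delta` are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id      INTEGER PRIMARY KEY,
        name             TEXT,
        department       TEXT,
        change_timestamp TEXT  -- ISO 8601, so string comparison follows time order
    );
    INSERT INTO customer VALUES
        (1, 'Miller', 'DWH', '2015-01-15 17:00:01'),
        (2, 'Powell', 'DB',  '2016-03-22 08:30:22');
""")


def extract_delta(connection, last_extract):
    """Return all rows changed at or after the time of the last extract."""
    return connection.execute(
        "SELECT customer_id, name FROM customer "
        "WHERE change_timestamp >= ? ORDER BY customer_id",
        (last_extract,),
    ).fetchall()


delta = extract_delta(conn, "2016-03-21 00:00:00")

# A deleted row leaves no timestamp behind, so the next delta cannot see it.
conn.execute("DELETE FROM customer WHERE customer_id = 1")
delta_after_delete = extract_delta(conn, "2016-03-21 00:00:00")
```

Both extracts return only customer 2. The deletion of customer 1 is invisible to the delta process, which is the problem stated on the slide.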
  • 31. SNAPSHOT-BASED
      Data comparison: comparison of snapshots of the operational data at different points in time
      • Compute the difference between the two latest snapshots
      • E.g. unload all data from a table into a file and diff the newest file against the previous one
      Can be very complex
      Sometimes the only possibility, for instance for legacy applications
      High performance impact on the source
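The comparison step can be sketched as a dictionary diff. This assumes each snapshot fits in memory and is keyed by primary key; the function name `diff_snapshots` and the sample rows are hypothetical.

```python
def diff_snapshots(previous, current):
    """Compare two snapshots given as {primary_key: row_tuple} dictionaries."""
    inserted = {k: current[k] for k in current.keys() - previous.keys()}
    deleted = {k: previous[k] for k in previous.keys() - current.keys()}
    updated = {
        k: current[k]
        for k in current.keys() & previous.keys()
        if current[k] != previous[k]
    }
    return inserted, updated, deleted


yesterday = {1: ("Miller", "DWH"), 2: ("Powell", "DB"), 3: ("Smith", "BI")}
today = {1: ("Miller", "DWH"), 2: ("Powell", "Analytics"), 4: ("Jones", "DB")}

inserted, updated, deleted = diff_snapshots(yesterday, today)
```

Unlike the timestamp approach, the diff finds the deleted row. It still reports only the net effect between the two snapshots, and both full snapshots have to be unloaded and transferred first.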
  • 32. MONITORING TECHNIQUES COMPARISON
          Criterion                           | Trigger-based | Replication techniques | Log-based discovery | Timestamp-based discovery | Snapshot-based discovery
          Performance impact on source system | Medium        | Low                    | Low                 | Medium                    | High
          Performance impact on target system | Low           | Low                    | Low                 | Low                       | High
          Load on network                     | Low           | Low                    | Low                 | Low                       | High
          Data loss if nologging operations   | No            | Yes                    | Yes                 | No                        | No
  • 33. MONITORING TECHNIQUES COMPARISON
          Criterion                                          | Trigger-based | Replication techniques | Log-based discovery | Timestamp-based discovery | Snapshot-based discovery
          Identify DELETE operations                         | Yes           | Yes                    | Yes                 | No                        | Yes
          Identify ALL changes (changes between extractions) | Yes           | Yes                    | Yes                 | No                        | No
  • 34. DATA TRANSPORT: DIRECT ACCESS
      • Source writes data into target, or
      • Target reads data from source
      • Security concerns
      • High coupling / dependencies
      [Diagram: Source connected directly to Target]
  • 35. DATA TRANSPORT: FILE TRANSFER
      File transfer (or other transport medium)
      • csv, json, xml, binary, etc.
      • Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise service bus), SOA (service-oriented architecture), etc.
      • Often high amounts of data, therefore bulk transfer of compressed data is most widely used
      • Better decoupling of source and target
      [Diagram: Source, files, Target]
  • 36. EXTRACTION INTERVALS
      • Periodically, in regular intervals
        • Every day, week, etc.
      • Instantly / continuously
        • Every change is directly propagated into the data warehouse
        • "Real-time data warehouse"
      • Depends on the requirements on the timeliness of the data warehouse data
  • 37. EXTRACTION INTERVALS
      Triggered by a specific request
      • Addition of a new product
      • Query which involves more recent data
      Triggered by specific events
      • Number of changes in operational data exceeds a threshold
  • 38. PREREQUISITE OF ETL: UNDERSTANDING THE DATA
      • Profile existing data sources and extracted data
      • Analyze data structure, content, and quality
      • Find data relationships across systems
        • Foreign keys are often badly documented or missing
      • Uncover data issues that can affect subsequent transformation steps
        • Missing values
        • Duplicates
        • Inconsistencies
  • 39. DATA QUALITY ISSUES
          CustomerNo | Name        | Birthdate  | Age | Gender | Zip code
          1          | Miller, Tom | 33.01.2001 | 15  | M      | NULL
          1          | John Mayor  | 15.01.2001 | 15  | M      | 98144
          2          | Mrs. Bush   | 31.10.1988 | 22  | Q      | 00000
          3          | Martin      | 31.10.1988 | 22  | M      | 75890
      Issues highlighted in the table: PK / unique key violated, data not uniform, not valid, inconsistent, wrong value, unknown / missing, FK violated
  • 40. DATA QUALITY ISSUES AND POSSIBLE SOLUTIONS IN THE SOURCE RDBMS
          Issue                                   | Solution
          Wrong data, e.g. 31.02.2016             | Proper data type definition
          Wrong values, e.g. number out of range  | Check constraint
          Missing values                          | NOT NULL constraint
          Violated references                     | FOREIGN KEY constraint
          Duplicates                              | PRIMARY or UNIQUE KEY constraint
          Inconsistent data                       | ACID transactions, business logic, additional checks
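To see the constraints from the table at work, the following sketch declares them on a small SQLite schema and tries to insert rows similar to the bad examples shown earlier. The schema (`customer`, `zip_code`) and the allowed gender codes are assumptions for the illustration; SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE zip_code (zip TEXT PRIMARY KEY);
    INSERT INTO zip_code VALUES ('98144'), ('75890');

    CREATE TABLE customer (
        customer_no INTEGER PRIMARY KEY,                 -- no duplicates
        name        TEXT NOT NULL,                       -- no missing values
        gender      TEXT CHECK (gender IN ('M', 'F')),   -- no wrong values
        zip         TEXT REFERENCES zip_code (zip)       -- no violated references
    );
    INSERT INTO customer VALUES (1, 'John Mayor', 'M', '98144');
""")

bad_rows = {
    "duplicate key": (1, "Tom Miller", "M", "98144"),
    "missing value": (2, None, "F", "75890"),
    "wrong value": (3, "Mrs. Bush", "Q", "75890"),
    "violated reference": (4, "Martin", "M", "00000"),
}

rejected = {}
for issue, row in bad_rows.items():
    try:
        conn.execute("INSERT INTO customer VALUES (?, ?, ?, ?)", row)
    except sqlite3.IntegrityError as exc:
        rejected[issue] = str(exc)  # the database names the violated constraint

row_count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
```

All four bad rows are rejected by the database itself, so they never reach the extract. Inconsistent data (the last row of the table above) cannot be caught this way and needs business logic or additional checks.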
  • 42. Correcting the data • Automatically during ETL • E.g., address of a customer if a correct reference table exists • Manually after ETL is finished • ETL stores bad data in error log tables or files • ETL flags bad data (e.g. as invalid) DATA QUALITY ISSUES: WORKAROUNDS IN DWH Data Warehouse / DHBWDaimler TSS 42
  • 43. Correcting the data • In the source systems • Common master data management across all operational applications • Dedicated systems are “master” of e.g. customer data • Correcting the data at the source is the best approach, but it is slow and often not feasible DATA QUALITY ISSUES: CORRECT DATA IN THE SOURCE Data Warehouse / DHBWDaimler TSS 43
  • 44. • Column is null • Reject data • Use default values • Missing values can represent • an unknown value like the date of birth of a customer • a missing value like engine_id for a car (logical not null constraint) • Dimension tables can include some dummy values: DATA QUALITY ISSUES: MISSING DATA Data Warehouse / DHBWDaimler TSS 44 DimensionTable_X Description -1 Unknown -2 Missing
  • 45. • Data is inaccurate e.g. wrong date 32.12.2015 or wrong number 55U • Reject data • Replace with value that represents „Invalid“ • Dimension tables can include some dummy values: DATA QUALITY ISSUES: MISSING DATA Data Warehouse / DHBWDaimler TSS 45 DimensionTable_X Description -1 Unknown -2 Missing -3 Invalid
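The dummy dimension values can be assigned during the ETL lookup step. The sketch below assumes a date dimension whose regular keys are integers of the form YYYYMMDD; that key format, the `mandatory` flag, and the function name `date_dimension_key` are illustrative assumptions, only the keys -1, -2 and -3 come from the slides.

```python
from datetime import datetime

# Dummy keys as listed in the dimension table on the slides
UNKNOWN_KEY = -1   # value is optional and simply not known
MISSING_KEY = -2   # value should exist (logical NOT NULL) but is absent
INVALID_KEY = -3   # value is present but cannot be interpreted


def date_dimension_key(raw, mandatory):
    """Map a raw DD.MM.YYYY string to a dimension key.

    Regular keys use the assumed format YYYYMMDD; bad input is mapped
    to one of the dummy keys instead of being rejected.
    """
    if raw is None or raw.strip() == "":
        return MISSING_KEY if mandatory else UNKNOWN_KEY
    try:
        parsed = datetime.strptime(raw.strip(), "%d.%m.%Y")
    except ValueError:
        return INVALID_KEY
    return parsed.year * 10000 + parsed.month * 100 + parsed.day
```

With this approach fact rows keep a valid foreign key into the dimension, and reports can still show how many rows are unknown, missing, or invalid.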
  • 46. • Data has conflicts, e.g. wrong postal code 80995 Stuttgart • Reject data • Replace one of the values with a value that represents „Invalid“ or with corrected value Which value to replace? Rules necessary DATA QUALITY ISSUES: CONFLICTING DATA Data Warehouse / DHBWDaimler TSS 46
  • 47. • Data is inconsistent, e.g. Order date after payment date or unlikely high price for a product • Can be discovered by statistical and data mining methods DATA QUALITY ISSUES: INCONSISTENT DATA Data Warehouse / DHBWDaimler TSS 47
  • 48. • Data is duplicated, e.g. „Martin Miller” vs “Miller, Martin” vs “M.Miller” • Multiple representations for one entity • Different keys • Different encodings • Duplicate detection can be very difficult / tricky • Products are available for e.g. address duplicate detection address validation (Kingstreet = does this address actually exist?) address harmonization (Kingstr, Kingstreet, King Street, etc) • Standardize / Harmonize data during ETL flow: “unification” for better duplicate detection DATA QUALITY ISSUES: DUPLICATES Data Warehouse / DHBWDaimler TSS 48
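Unification before duplicate detection can be sketched with the name variants from the slide. This is a deliberately simple illustration: the functions `normalize_name` and `match_key` are hypothetical, they assume non-empty Western-style names, and the resulting key only yields duplicate candidates that still need review (for example, "Maria Miller" would produce the same key).

```python
import re


def normalize_name(raw):
    """Return a canonical lower-case 'first last' form of a name."""
    text = raw.strip()
    if "," in text:
        # "Miller, Martin" -> "Martin Miller"
        last, first = (part.strip() for part in text.split(",", 1))
        text = f"{first} {last}"
    # Treat dots and runs of whitespace as a single separator
    return re.sub(r"[.\s]+", " ", text).strip().lower()


def match_key(raw):
    """Coarse matching key: first initial plus last name."""
    parts = normalize_name(raw).split(" ")
    return f"{parts[0][0]}|{parts[-1]}"


variants = ["Martin Miller", "Miller, Martin", "M.Miller"]
keys = {match_key(name) for name in variants}
```

All three variants collapse to one key, so they would be grouped as candidates for the same entity. Commercial products add address validation and fuzzy matching on top of this kind of harmonization.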
  • 49. • Unification of data types • Character string date „20.01.2006“ → date 20.01.2006 • Character string number „12345“ → number 12345 • Unification of encodings • For instance for gender F and M • Lookup-tables contain the mapping from old to new encodings • Combination of different attributes to one attribute • day, month, year → date TRANSFORM - UNIFICATION OF DATA Data Warehouse / DHBWDaimler TSS 49
  • 50. • Split of one attribute into two or more • Name → first name, last name (“Herr Prof. Dr. Hans M. vom und zum Stein”) • Unification of names can become very challenging: “Herr Prof. Dr. Hans M. vom und zum Stein” or “Werner Martin” or “Mariae Gloria … Wilhelmine Huberta Gräfin von Schönburg-Glauchau“ • Product name „Cola, 0.33 l“ → product short name „Cola“, size in liters 0.33 TRANSFORM - UNIFICATION OF DATA Data Warehouse / DHBWDaimler TSS 50
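The unification steps on this slide can be sketched as small conversion functions. The source encodings in `GENDER_LOOKUP` and all function names are assumptions for illustration; in a real ETL flow the lookup would come from a maintained mapping table.

```python
from datetime import date, datetime

# Assumed mapping from source system encodings to the unified codes F and M
GENDER_LOOKUP = {"f": "F", "female": "F", "w": "F", "m": "M", "male": "M"}


def to_date(text):
    """Character string date 'DD.MM.YYYY' -> date value."""
    return datetime.strptime(text, "%d.%m.%Y").date()


def to_int(text):
    """Character string number -> integer."""
    return int(text)


def unify_gender(code):
    """Map a source encoding to the unified code; None if it is unknown."""
    return GENDER_LOOKUP.get(code.strip().lower())


def combine_date(day, month, year):
    """Combine three separate attributes into one date attribute."""
    return date(year, month, day)
```

Returning `None` for an unmapped gender code leaves the decision to the caller, who might reject the row or assign a dummy value.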
  • 51. • Unification of dates and timestamps • Rules for representing incomplete date information, e.g. if only month and year are known • Dates and timestamps with regard to one specific time zone • Important for multi-national organizations • UTC (Coordinated Universal Time) has no daylight saving time • What can happen if the clock is changed to winter time and UTC is not used? - Update arrives at 02:15 in the staging layer (CDC / log-based monitor) - Clock is changed to winter time: -1h - Update of the same row arrives at 02:10 in the staging layer (CDC / log-based) - How can the batch load running the next night discover which update is the most recent one? TRANSFORM - UNIFICATION OF DATA Data Warehouse / DHBWDaimler TSS 51
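The winter time scenario can be reproduced in a few lines. The sketch assumes Central European time and uses 29.10.2023 as an example date for the clock change; fixed UTC offsets are used instead of a time zone database so that the example runs anywhere.

```python
from datetime import datetime, timedelta, timezone

CEST = timezone(timedelta(hours=2), "CEST")  # summer time, before the change
CET = timezone(timedelta(hours=1), "CET")    # winter time, after the change

# First update arrives at 02:15 local time, still in summer time.
first_update = datetime(2023, 10, 29, 2, 15, tzinfo=CEST)
# The clock is set back by one hour; the second update arrives at 02:10.
second_update = datetime(2023, 10, 29, 2, 10, tzinfo=CET)


def local_wall_clock(ts):
    """What a staging table stores if only local time is recorded."""
    return ts.replace(tzinfo=None)


updates = [first_update, second_update]
latest_by_local = max(updates, key=local_wall_clock)
latest_by_utc = max(updates, key=lambda ts: ts.astimezone(timezone.utc))
```

Ordered by local wall clock time, the first update (02:15) wrongly looks like the most recent one. Converted to UTC, the updates fall at 00:15 and 01:10, so the second update is correctly identified as the latest.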
  • 52. • Computation of derived values • Profit = sales price – purchase price Without clear definition, different interpretations possible • Net or gross sales price? • Net or gross purchase price? • Aggregations • Revenue of the year computed from revenues of the day Without clear definition, different interpretations possible • Calendar year? • Fiscal year? TRANSFORM - UNIFICATION OF DATA Data Warehouse / DHBWDaimler TSS 52
  • 53. • Specification between source and target columns • Source tables + columns • Target table + columns • Join rules • Filter criteria • Transformation rules DATA MAPPING Data Warehouse / DHBWDaimler TSS 53
  • 54. • Efficient load operations are important • bulk load: Single row processing vs set based processing • Online load • Data warehouse (especially Data Mart) is still accessible • Offline load • Data warehouse (especially Data Mart) is offline • For updates that require the recomputation of a cube • Offline load is often a Tool limit because the Tool locks data structures. But offline load could be faster. LOAD Data Warehouse / DHBWDaimler TSS 54
  • 55. • Specific bulk load operations provided by the RDBMS, e.g. external tables in Oracle or the LOAD command in DB2 • Single row vs set based processing BULK PROCESSING Data Warehouse / DHBWDaimler TSS 55
      Single row processing: Cursor curs = SELECT * FROM <source>; WHILE NOT EOF(curs) LOOP FETCH NEXT ROW INTO myRow; INSERT INTO <target> VALUES(myRow); END LOOP
      Set based processing: INSERT INTO <target> SELECT * FROM <source>
      Single row processing: error handling easy • slow for high amounts of data • more coding
      Set based processing: all or nothing if there are errors • performs well for small and high amounts of data • less code = fewer errors
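The difference in error handling between the two styles can be observed with an in-memory SQLite database. This is a sketch under assumptions: the table names and the CHECK constraint are invented for the demonstration, and the performance difference is not measured here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source (id INTEGER PRIMARY KEY, amount INTEGER);
    CREATE TABLE target_row_by_row (
        id INTEGER PRIMARY KEY, amount INTEGER CHECK (amount >= 0));
    CREATE TABLE target_set_based (
        id INTEGER PRIMARY KEY, amount INTEGER CHECK (amount >= 0));
    INSERT INTO source VALUES (1, 10), (2, -5), (3, 30);  -- row 2 is bad
""")

# Single row processing: each row is inserted and checked on its own,
# so a bad row can be logged while the good rows are still loaded.
rejected = []
for row in conn.execute("SELECT id, amount FROM source").fetchall():
    try:
        with conn:
            conn.execute("INSERT INTO target_row_by_row VALUES (?, ?)", row)
    except sqlite3.IntegrityError:
        rejected.append(row)

# Set based processing: one statement, all or nothing.
try:
    with conn:
        conn.execute("INSERT INTO target_set_based SELECT id, amount FROM source")
    set_based_succeeded = True
except sqlite3.IntegrityError:
    set_based_succeeded = False


def count(table):
    """Number of rows in one of the demo tables."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

The row-by-row load ends with two rows and one rejected row, while the set-based statement fails as a whole and loads nothing. In practice set-based loads are therefore often combined with a validation step beforehand.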
  • 56. ETL-JOB PARALLELISM FOR LOADING DATA INTO CORE WAREHOUSE LAYER Data Warehouse / DHBWDaimler TSS 56 Time windows for loads, e.g. 00:00-06:00 • Integration of new jobs: ? ? ? • Classical load: complex, many dependencies, many sequential jobs • Data Vault load (HUBs loaded → LINKs and HUB-SATs loaded → LINK-SATs loaded): systematic / methodic, few, well-defined dependencies, massively parallel
  • 57. Draw a flow diagram how to load a HUB, LINK and SAT table and describe the SQL statements EXERCISE: LOAD DATA VAULT TABLE Data Warehouse / DHBWDaimler TSS 57
  • 58. EXERCISE: LOAD HUB TABLE Data Warehouse / DHBWDaimler TSS 58 Source data exist Load distinct business keys Does business Key exist in HUB? Insert row into HUB Conflict if PK HashKey collision! no Reject data Data loaded into HUB yes
  • 59. INSERT INTO core.hub_fahrzeug (vehicle_hk, fin, loaddate, recordsource) SELECT DISTINCT f.fahrzeug_hashkey , f.fin_bk , f.loaddate , f.recordsource FROM staging.fahrzeugdaten f WHERE f.fin_bk NOT IN (SELECT fin FROM core.hub_fahrzeug) AND f.loaddate = <date to load>; EXERCISE: LOAD HUB TABLE Data Warehouse / DHBWDaimler TSS 59
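The HUB load pattern can be run end to end against an in-memory SQLite database. This is a sketch under assumptions: the schema prefixes `staging.` and `core.` are replaced by plain table names, the hash key is computed with MD5 over the upper-cased business key, and the helpers `hash_key`, `stage` and `load_hub` are hypothetical.

```python
import hashlib
import sqlite3


def hash_key(business_key):
    """Assumed hash key rule: MD5 of the trimmed, upper-cased business key."""
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging_fahrzeugdaten (
        fahrzeug_hashkey TEXT, fin_bk TEXT, loaddate TEXT, recordsource TEXT);
    CREATE TABLE hub_fahrzeug (
        vehicle_hk TEXT PRIMARY KEY, fin TEXT UNIQUE,
        loaddate TEXT, recordsource TEXT);
""")


def stage(fins, loaddate):
    """Put business keys (FINs) for one load date into the staging table."""
    with conn:
        conn.executemany(
            "INSERT INTO staging_fahrzeugdaten VALUES (?, ?, ?, ?)",
            [(hash_key(fin), fin, loaddate, "demo") for fin in fins],
        )


def load_hub(loaddate):
    """Insert business keys not yet in the HUB; return the HUB row count."""
    with conn:
        conn.execute(
            """
            INSERT INTO hub_fahrzeug (vehicle_hk, fin, loaddate, recordsource)
            SELECT DISTINCT f.fahrzeug_hashkey, f.fin_bk, f.loaddate, f.recordsource
            FROM staging_fahrzeugdaten f
            WHERE f.fin_bk NOT IN (SELECT fin FROM hub_fahrzeug)
              AND f.loaddate = ?
            """,
            (loaddate,),
        )
    return conn.execute("SELECT COUNT(*) FROM hub_fahrzeug").fetchone()[0]


stage(["FIN1", "FIN2", "FIN1"], "2024-01-01")
after_first_load = load_hub("2024-01-01")
stage(["FIN2", "FIN3"], "2024-01-02")
after_second_load = load_hub("2024-01-02")
after_rerun = load_hub("2024-01-02")
```

Because only unknown business keys are inserted, rerunning the same load does not change the HUB. A hash collision between two different business keys would surface as a primary key violation, which matches the "conflict" branch of the flow diagram.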
  • 60. EXERCISE: LOAD LINK TABLE Data Warehouse / DHBWDaimler TSS 60 Source data exist Load distinct business keys Does Hash Key relationship exist in LINK? Insert row into LINK Conflict if PK HashKey collision! no Reject data Data loaded into LINK yes
  • 61. INSERT INTO core.link_verbaut (verbaut_hk, motor_hk, vehicle_hk, loaddate, recordsource) SELECT DISTINCT f.verbaut_hashkey /* assumed name of the staged link hash key column */ , f.motor_hashkey , f.fahrzeug_hashkey , f.loaddate , f.recordsource FROM staging.fahrzeugdaten f WHERE (f.motor_hashkey, f.fahrzeug_hashkey) NOT IN (SELECT v.motor_hk, v.vehicle_hk FROM core.link_verbaut v) AND f.loaddate = <date to load>; EXERCISE: LOAD LINK TABLE Data Warehouse / DHBWDaimler TSS 61
  • 62. EXERCISE: LOAD SAT TABLE Data Warehouse / DHBWDaimler TSS 62 Source data exist Load distinct source data MD5- HASH Diff identical? Insert row into SAT no Reject data Data loaded into SAT yes Load current/ latest row from SAT table
  • 63. INSERT INTO core.sat_fahrzeug_text (vehicle_hk, loaddate, recordsource, md5_hash, codeleiste, kommentar) SELECT DISTINCT f.fahrzeug_hashkey , f.loaddate , f.recordsource , f.md5hash , f.codeleiste , f.kommentar FROM staging.fahrzeugdaten f LEFT OUTER JOIN (SELECT s.vehicle_hk, s.md5_hash FROM core.sat_fahrzeug_text s JOIN (SELECT i.vehicle_hk, MAX(i.loaddate) AS loaddate FROM core.sat_fahrzeug_text i GROUP BY i.vehicle_hk) m ON s.vehicle_hk = m.vehicle_hk AND s.loaddate = m.loaddate) k ON f.fahrzeug_hashkey = k.vehicle_hk WHERE (k.md5_hash IS NULL OR f.md5hash <> k.md5_hash) AND f.loaddate = <date to load>; EXERCISE: LOAD SAT TABLE Data Warehouse / DHBWDaimler TSS 63
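The satellite load with hash-diff comparison can also be tried with SQLite. This sketch makes several assumptions: schema prefixes are dropped, the `codeleiste` column is omitted for brevity, load dates are ISO strings so that `MAX` orders them correctly, and the helpers `stage` and `load_sat` are hypothetical.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging_fahrzeugdaten (
        fahrzeug_hashkey TEXT, loaddate TEXT, recordsource TEXT,
        md5hash TEXT, kommentar TEXT);
    CREATE TABLE sat_fahrzeug_text (
        vehicle_hk TEXT, loaddate TEXT, recordsource TEXT,
        md5_hash TEXT, kommentar TEXT,
        PRIMARY KEY (vehicle_hk, loaddate));
""")


def stage(vehicle_hk, loaddate, kommentar):
    """Stage one descriptive row; the hash diff covers the payload column."""
    md5hash = hashlib.md5(kommentar.encode("utf-8")).hexdigest()
    with conn:
        conn.execute(
            "INSERT INTO staging_fahrzeugdaten VALUES (?, ?, ?, ?, ?)",
            (vehicle_hk, loaddate, "demo", md5hash, kommentar),
        )


def load_sat(loaddate):
    """Insert staged rows whose hash differs from the latest SAT row."""
    with conn:
        conn.execute(
            """
            INSERT INTO sat_fahrzeug_text
                (vehicle_hk, loaddate, recordsource, md5_hash, kommentar)
            SELECT DISTINCT f.fahrzeug_hashkey, f.loaddate, f.recordsource,
                            f.md5hash, f.kommentar
            FROM staging_fahrzeugdaten f
            LEFT OUTER JOIN (
                SELECT s.vehicle_hk, s.md5_hash
                FROM sat_fahrzeug_text s
                JOIN (SELECT i.vehicle_hk, MAX(i.loaddate) AS loaddate
                      FROM sat_fahrzeug_text i
                      GROUP BY i.vehicle_hk) m
                  ON s.vehicle_hk = m.vehicle_hk AND s.loaddate = m.loaddate
            ) k ON f.fahrzeug_hashkey = k.vehicle_hk
            WHERE (k.md5_hash IS NULL OR f.md5hash <> k.md5_hash)
              AND f.loaddate = ?
            """,
            (loaddate,),
        )
    return conn.execute("SELECT COUNT(*) FROM sat_fahrzeug_text").fetchone()[0]


stage("V1", "2024-01-01", "red")
rows_after_new_key = load_sat("2024-01-01")
stage("V1", "2024-01-02", "red")
rows_after_unchanged = load_sat("2024-01-02")
stage("V1", "2024-01-03", "blue")
rows_after_change = load_sat("2024-01-03")
```

A new vehicle and a changed comment each add a satellite row, while the unchanged delivery on the second day is skipped because its hash matches the latest stored row.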
  • 64. LOGICAL STANDARD DATA WAREHOUSE ARCHITECTURE Data Warehouse / DHBWDaimler TSS 64 Data Warehouse FrontendBackend External data sources Internal data sources Staging Layer (Input Layer) OLTP OLTP Core Warehouse Layer (Storage Layer) Mart Layer (Output Layer) (Reporting Layer) Integration Layer (Cleansing Layer) Aggregation Layer Metadata Management Security DWH Manager incl. Monitor ? ? ? ?
  • 66. • After the end of this lecture you will be able to • Understand DB techniques that are specific for DWH • Bitemporal data • Indexing, Partitioning, Parallelism WHAT YOU WILL LEARN TODAY Data Warehouse / DHBWDaimler TSS 66
  • 67. TEMPORAL DATA STORAGE (BITEMPORAL DATA) Data Warehouse / DHBWDaimler TSS 67 [Timeline figure: time axis with 10.09., 20.09., 30.09., 10.10.; price 15 EUR followed by price 16 EUR. The new price of 16 EUR is entered into the DB on 10.09. (transaction time) and is valid from 20.09. (valid time).]
  • 68. Valid time is the time period during which a fact is true in the real world. Transaction time is the time period during which a fact stored in the database was known. Bitemporal data combines both Valid and Transaction Time. Source: (Wikipedia, https://en.wikipedia.org/wiki/Temporal_database) TEMPORAL DATA STORAGE (BITEMPORAL DATA) Data Warehouse / DHBWDaimler TSS 68
  • 69. • SQL standard SQL:2011 • But different implementations by RDBMSes like Oracle, DB2, SQL Server and others • Different syntax! • Different coverage of standard! • Very useful for slowly changing dimensions type 2, but also for other purposes TEMPORAL DATA STORAGE (BITEMPORAL DATA) Data Warehouse / DHBWDaimler TSS 69
  • 70. CREATE TABLE customer_address ( customerID INTEGER NOT NULL , name VARCHAR(100) , city VARCHAR(100) , valid_start DATE NOT NULL , valid_end DATE NOT NULL , PERIOD BUSINESS_TIME(valid_start, valid_end) , PRIMARY KEY(customerID, BUSINESS_TIME WITHOUT OVERLAPS) ); DB2 VALID TIME EXAMPLE Data Warehouse / DHBWDaimler TSS 70
  • 71. INSERT INTO customer_address VALUES (1, 'Miller', 'Seattle', '01.01.2013', '31.12.2013'); UPDATE customer_address FOR PORTION OF BUSINESS_TIME FROM '22.05.2013' TO '31.12.2013' SET city = 'San Diego' WHERE customerID = 1; DB2 VALID TIME EXAMPLE Data Warehouse / DHBWDaimler TSS 71
      Result: customerID | Name   | City      | Valid_start | Valid_end
              1          | Miller | Seattle   | 01.01.2013  | 22.05.2013
              1          | Miller | San Diego | 22.05.2013  | 31.12.2013
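What the database does for an update on a portion of the business time can be emulated in plain Python. The sketch below is an illustration of the row splitting only, not DB2's implementation; it assumes periods are inclusive at the start and exclusive at the end, and the functions `update_for_portion` and `as_of` are hypothetical.

```python
from datetime import date


def update_for_portion(rows, customer_id, start, end, **changes):
    """Emulate UPDATE ... FOR PORTION OF BUSINESS_TIME FROM start TO end.

    Rows are dicts with valid_start (inclusive) and valid_end (exclusive).
    Overlapping rows are split so that only the requested portion changes.
    """
    result = []
    for row in rows:
        overlaps = row["valid_start"] < end and start < row["valid_end"]
        if row["customerID"] != customer_id or not overlaps:
            result.append(row)
            continue
        if row["valid_start"] < start:   # part before the portion stays as is
            result.append({**row, "valid_end": start})
        result.append({                   # the updated portion
            **row, **changes,
            "valid_start": max(row["valid_start"], start),
            "valid_end": min(row["valid_end"], end),
        })
        if end < row["valid_end"]:        # part after the portion stays as is
            result.append({**row, "valid_start": end})
    return result


def as_of(rows, day):
    """Emulate FOR BUSINESS_TIME AS OF day."""
    return [r for r in rows if r["valid_start"] <= day < r["valid_end"]]


rows = [{"customerID": 1, "name": "Miller", "city": "Seattle",
         "valid_start": date(2013, 1, 1), "valid_end": date(2013, 12, 31)}]
rows = update_for_portion(rows, 1, date(2013, 5, 22), date(2013, 12, 31),
                          city="San Diego")
```

The single Seattle row is split into a Seattle row ending on 22.05.2013 and a San Diego row starting on that date, and a lookup as of 17.05.2013 still returns Seattle.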
  • 72. SELECT * FROM customer_address FOR BUSINESS_TIME AS OF '17.05.2013'; DB2 VALID TIME EXAMPLE Data Warehouse / DHBWDaimler TSS 72
  • 73. CREATE TABLE customer_info( customerId INTEGER NOT NULL, comment VARCHAR(1000) NOT NULL, sys_start TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW BEGIN, sys_end TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW END, PERIOD SYSTEM_TIME (sys_start, sys_end) ); DB2 TRANSACTION TIME EXAMPLE Data Warehouse / DHBWDaimler TSS 73
  • 74. Transaction on 15.10.2013: INSERT INTO customer_info VALUES( 1, 'comment 1'); Transaction on 31.10.2013: UPDATE customer_info SET comment = 'comment 2' WHERE customerID = 1; DB2 TRANSACTION TIME EXAMPLE Data Warehouse / DHBWDaimler TSS 74 CustomerId comment Sys_start Sys_end 1 Comment 2 31.10.2013 31.12.2999
  • 75. SELECT * FROM customer_info FOR SYSTEM_TIME AS OF '17.10.2013'; Data comes from a history table: Valid Time and Transaction Time can be combined = Bitemporal table DB2 TRANSACTION TIME EXAMPLE Data Warehouse / DHBWDaimler TSS 75 CustomerId comment Sys_start Sys_end 1 Comment 1 15.10.2013 31.10.2013
  • 76. • Very important performance improvement technique • Good for many reads with high selectivity, write penalty • B-trees most common INDEXING - WHY Data Warehouse / DHBWDaimler TSS 76 root branch branch leaf leaf leaf … … Table
  • 77. • DBs index Primary Keys by default • Dimension table columns that are regularly used in where clauses are candidates • Maybe foreign Key columns in Fact table (see also later Star Transformation) INDEXING A STAR SCHEMA – WHICH COLUMNS ARE CANDIDATES FOR AN INDEX? Data Warehouse / DHBWDaimler TSS 77
  • 78. • The fact table normally has many more rows than the dimension tables • Common join techniques would have to join the first dimension table with the fact table • Alternative technique: evaluate all dimensions first (cartesian join) • Then join the result with the fact table in the last step • Oracle uses bitmap indexes on the foreign key columns of the fact table to achieve a star join; many DBs do not support this STAR TRANSFORMATION Data Warehouse / DHBWDaimler TSS 78
  • 79. • Suppose you have a fact table containing data for the last 10 years with millions of rows, but you are interested only in • Data from yesterday • Data from the last 2 years How could you improve performance? EXERCISE: PERFORMANCE Data Warehouse / DHBWDaimler TSS 79
  • 80. • Suppose you have a fact table containing data for the last 10 years with millions of rows, but you are interested only in a subset • A columnar in-memory DB may be an option in general (the option has already been discussed during the lecture) • Data from yesterday • Indexing might be a good choice as not many rows are read • Data from the last 2 years • Indexing is most likely a bad choice as reading a rather high amount of data via an index quickly becomes inefficient • Partitioning EXERCISE: PERFORMANCE Data Warehouse / DHBWDaimler TSS 80
  • 81. PARTITIONING Data Warehouse / DHBWDaimler TSS 81
      Original table (Col1, Col2, Col3, col4): (1, A, AA, AAA), (2, B, BB, BBB), (3, C, CC, CCC)
      Vertical partitioning splits the columns: (Col1, Col2) with rows (1, A), (2, B), (3, C) and (Col3, col4) with rows (AA, AAA), (BB, BBB), (CC, CCC)
      Horizontal partitioning (sharding) splits the rows: rows 1 and 2 in one partition, row 3 in another
  • 82. • Very powerful feature in a DWH to reduce workload • Split table into logical smaller tables • Avoidance of full table scans • How could a table be split? HORIZONTAL PARTITIONING Data Warehouse / DHBWDaimler TSS 82
  • 83. • By range • Most common • Use a date field like order date to partition the table into months, days, etc. • By list • Use a field that has a limited number of different values, e.g. split customer data by country if end users most likely select customers from within a country • By hash • Use a field that most likely splits the data into evenly distributed chunks HORIZONTAL PARTITIONING – SPLITTING OPTIONS Data Warehouse / DHBWDaimler TSS 83
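Why range partitioning avoids full table scans can be illustrated with a toy in-memory table. This is a conceptual sketch, not how an RDBMS stores partitions: the class `RangePartitionedTable`, the monthly partition key, and the sample data are all assumptions.

```python
from collections import defaultdict
from datetime import date


def partition_key(order_date):
    """Range partitioning by month: the key is (year, month)."""
    return (order_date.year, order_date.month)


class RangePartitionedTable:
    """Toy fact table that keeps one list of rows per monthly partition."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def insert(self, order_date, amount):
        self.partitions[partition_key(order_date)].append((order_date, amount))

    def scan(self, start, end):
        """Return rows with start <= order_date <= end and the number of
        partitions that had to be read (partition pruning)."""
        wanted = [key for key in self.partitions
                  if partition_key(start) <= key <= partition_key(end)]
        rows = [row for key in wanted for row in self.partitions[key]
                if start <= row[0] <= end]
        return rows, len(wanted)


# Ten years of data, one row per month for brevity
table = RangePartitionedTable()
for year in range(2015, 2025):
    for month in range(1, 13):
        table.insert(date(year, month, 15), 100)

rows, partitions_read = table.scan(date(2024, 12, 1), date(2024, 12, 31))
two_year_rows, two_year_partitions = table.scan(date(2023, 1, 1), date(2024, 12, 31))
```

A query for one month touches 1 of 120 partitions, and a query for the last two years touches 24, so in both cases most of the table is never read.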
  • 84. • Statements are normally executed on one CPU • Parallelism allows the DB to distribute the execution to several CPUs • Powerful combination with partitioning • Parallelism is limited by the number of CPUs: if parallelism is too high, performance will degrade • Intra-query parallelism and inter-query parallelism PARALLELISM Data Warehouse / DHBWDaimler TSS 84
  • 85. • Relational columnar In-Memory DB • Materialized Views / Query Tables ALREADY COVERED IN A PREVIOUS LECTURE Data Warehouse / DHBWDaimler TSS 85
  • 86. Daimler TSS GmbH Wilhelm-Runge-Straße 11, 89081 Ulm / Telefon +49 731 505-06 / Fax +49 731 505-65 99 tss@daimler.com / Internet: www.daimler-tss.com/ Intranet-Portal-Code: @TSS Domicile and Court of Registry: Ulm / HRB-Nr.: 3844 / Management: Christoph Röger (CEO), Steffen Bäuerle Data Warehouse / DHBWDaimler TSS 86 THANK YOU

Editor's notes

  1. Mission: We are a specialist and strategic business partner for innovative end-to-end IT solutions within the Daimler Group: not just another supplier, more than another supplier!
  2. Master data management (MDM) comprises all strategic, organizational, methodical, and technological activities relating to a company's master data.
  3. Passenger seat