Abdul Khaliq
ETL
Need To Know:
• What Is a Data Warehouse?
• Why Are They Necessary?
• How Are They Constructed?
Data Warehouse:
• A physical repository where relational data are specially organized to
provide enterprise-wide, cleansed data in a standardized format
– Subject-oriented: e.g. customers, patients, students, products
– Integrated: Consistent naming conventions, formats, encoding
structures; from multiple data sources
– Time-variant: Can study trends and changes
– Non-updatable: Read-only, periodically refreshed
What Is a Data Warehouse?
– A data warehouse centralizes data that are scattered
throughout disparate operational systems and makes them
available for Decision Support.
• A subject-oriented, integrated, time-variant, non-updatable collection of
data used in support of management decision-making processes
Why Are They Necessary?
Operational Systems (OLTP) and Informational Systems (OLAP)
need the reconciliation of data!
Reconciled data: detailed,
current data intended to be the
single, authoritative source for
all decision support.
Data Reconciliation
How Are They Constructed?
• Extract
• Transform
• Load
The ETL Process
Data is:
extracted from an OLTP database (relational)
transformed to match the data warehouse schema
loaded into the data warehouse
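As a minimal illustration of these three steps, here is a hedged Python sketch; the database files, table names, and columns are hypothetical, and the warehouse table is assumed to already exist.

import sqlite3

# Hypothetical source (OLTP) and target (warehouse) databases.
source = sqlite3.connect("oltp.db")
target = sqlite3.connect("warehouse.db")

# Extract: pull the required rows from the operational system.
rows = source.execute(
    "SELECT order_id, customer_id, order_date, amount FROM orders"
).fetchall()

# Transform: reshape the rows to match the warehouse schema
# (here just a stand-in customer key format and a rounded amount).
transformed = [
    (order_id, "CUST-%06d" % customer_id, order_date, round(amount, 2))
    for order_id, customer_id, order_date, amount in rows
]

# Load: write the transformed rows into the warehouse fact table.
target.executemany(
    "INSERT INTO fact_orders (order_id, customer_key, order_date, amount) "
    "VALUES (?, ?, ?, ?)",
    transformed,
)
target.commit()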
Typical operational data is:
– Transient – not historical
– Not normalized (perhaps due
to de-normalization for
performance)
– Restricted in scope – not
comprehensive
– Sometimes poor quality –
inconsistencies and errors
Purpose Of ETL Process:
After ETL, data should be:
Detailed – not summarized yet
Historical – periodic
Normalized – 3rd normal form or
higher
Comprehensive – enterprise-wide
perspective
Timely – data should be current
enough to assist decision-making
Quality controlled – accurate with
full integrity
Extraction
The Extract step covers the data extraction from the source system and
makes it accessible for further processing. The main objective of the
extract step is to retrieve all the required data from the source system
with as few resources as possible.
Logically, data can be extracted in two ways before physical data extraction:
Full Extraction: Full extraction is used when the data needs to be extracted and
loaded for the first time. In full extraction, the data from the source is extracted
completely. This extraction reflects the current data available in the source system.
Incremental Extraction: In incremental extraction, the changes in source data
need to be tracked since the last successful extraction. Only these changes in data
will be extracted and then loaded. These changes can be detected from the source
data if it carries a last-changed timestamp. Also, a change table can be created
in the source system, which keeps track of the changes in the source data.
One more method to get the incremental changes is to extract the complete source
data and then take the difference (a minus operation) between the current extraction and
the last extraction. This approach can cause performance issues.
Logical Extraction:
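A hedged sketch of both logical modes, assuming a hypothetical customers table with a last_modified timestamp column: a full extraction reads everything, while an incremental extraction only reads rows changed since the last successful run.

import sqlite3

def extract_customers(conn, last_extract_time=None):
    """Full extraction when last_extract_time is None, incremental otherwise."""
    if last_extract_time is None:
        # Full extraction: all data currently available in the source system.
        sql = "SELECT customer_id, name, email, last_modified FROM customers"
        return conn.execute(sql).fetchall()
    # Incremental extraction: only rows whose timestamp is newer than the
    # last successful extraction.
    sql = ("SELECT customer_id, name, email, last_modified FROM customers "
           "WHERE last_modified > ?")
    return conn.execute(sql, (last_extract_time,)).fetchall()

conn = sqlite3.connect("oltp.db")                         # hypothetical source
all_rows = extract_customers(conn)                        # first-time full extraction
changed = extract_customers(conn, "2024-01-01 00:00:00")  # later incremental run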
The data can be extracted physically by two methods:
Online Extraction: In online extraction the data is extracted directly from the
source system. The extraction process connects to the source system and extracts
the source data. Here the data is extracted directly from the source for processing
in the staging area, which is why it is called online extraction. During extraction we
connect directly to the source system and access the source tables; there is
no need for any external intermediate area.
Offline Extraction: The data from the source system is dumped outside of the
source system into a flat file. This flat file is used to extract the data. The flat file
can be created by a routine process daily. Here the data is not extracted directly
from the source, but instead it’s taken from another external area which keeps the
copy of the source. The external area can be flat files or dump files in a
specific format. So when we need to process the data, we can fetch the records
from the external copy instead of the actual source.
Physical extraction:
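The two physical methods can be contrasted in a short sketch (the connection string, file path, and table name are hypothetical): the online path queries the source directly, while the offline path only reads a flat file that some routine dump process has already produced.

import csv
import sqlite3

def extract_online(source_db):
    # Online: connect to the source system and read the source table directly.
    conn = sqlite3.connect(source_db)
    return conn.execute("SELECT product_id, name, price FROM products").fetchall()

def extract_offline(dump_path):
    # Offline: the source has already been dumped to a flat file by a scheduled
    # routine; extraction reads that external copy instead of the source itself.
    with open(dump_path, newline="") as f:
        return list(csv.DictReader(f))

online_rows = extract_online("oltp.db")               # hypothetical source database
offline_rows = extract_offline("products_dump.csv")   # hypothetical daily dump file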
What Data Should Be Extracted?
• The selection and analysis of the source system is usually broken into
two major phases:
– The data discovery phase
– The anomaly detection phase
Extraction - Data Discovery Phase
• A key criterion for the success of the data warehouse is the cleanliness
and cohesiveness of the data within it
• Once you understand what the target needs to look like, you need to
identify and examine the data sources
Data Discovery Phase
• It is up to the ETL team to drill down further into the data requirements to determine each
and every source system, table, and attribute required to load the data warehouse
– Collecting and Documenting Source Systems
– Keeping track of source systems
– Determining the system of record - the point of origin of the data
– Definition of the system-of-record is important because in most enterprises data is stored
redundantly across many different systems.
– Enterprises do this to make nonintegrated systems share data. It is very common that the same
piece of data is copied, moved, manipulated, transformed, altered, cleansed, or made corrupt
throughout the enterprise, resulting in varying versions of the same data
Data Content Analysis - Extraction
• Understanding the content of the data is crucial for determining the best approach for retrieval
- NULL values. An unhandled NULL value can destroy any ETL process. NULL values pose
the biggest risk when they are in foreign key columns. Joining two or more tables based on a
column that contains NULL values will cause data loss! Remember, in a relational database
NULL is not equal to NULL. That is why those joins fail. Check for NULL values in every foreign
key in the source database. When NULL values are present, you must outer join the tables
- Dates in non-date fields. Dates are peculiar elements because they are the only logical
elements that can come in various formats, literally containing different values and having the
exact same meaning. Fortunately, most database systems support most of the various formats
for display purposes but store them in a single standard format
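The NULL risk can be seen in a small pandas sketch with made-up data: an inner join silently drops the order whose foreign key is NULL, while a left outer join preserves it so it can be handled explicitly.

import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10.0, None, 20.0],   # order 2 has a NULL foreign key
})
customers = pd.DataFrame({
    "customer_id": [10.0, 20.0],
    "name": ["Alice", "Bob"],
})

# Inner join: the row with the NULL foreign key disappears -- data loss.
inner = orders.merge(customers, on="customer_id", how="inner")

# Left outer join: the row survives with a missing name and can be routed
# to an "unknown customer" member instead of being lost.
outer = orders.merge(customers, on="customer_id", how="left")

print(len(inner), len(outer))   # 2 vs. 3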
Data Transformation
• Data transformation is the component of data reconciliation that converts
data from the format of the source operational systems to the format of the
enterprise data warehouse.
• Data transformation consists of a variety of different functions:
– record-level functions,
– field-level functions and
– more complex transformations.
Record-level functions & Field-level functions
• Record-level functions
– Selection: data partitioning
– Joining: data combining
– Normalization
– Aggregation: data summarization, Aggregates
• Field-level functions
– Single-field transformation: from one field to one field
– Multi-field transformation: from many fields to one, or one field to many
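A short sketch of these function types on hypothetical sales records: selection and aggregation are record-level functions, lower-casing a code is a single-field transformation, and splitting a name is a multi-field (one-to-many) transformation.

from collections import defaultdict

sales = [
    {"region": "EU", "product": "A", "amount": 120.0, "customer_name": "Ada Lovelace"},
    {"region": "EU", "product": "B", "amount": 80.0,  "customer_name": "Alan Turing"},
    {"region": "US", "product": "A", "amount": 200.0, "customer_name": "Grace Hopper"},
]

# Record-level selection: partition the data, keeping only EU rows.
eu_sales = [r for r in sales if r["region"] == "EU"]

# Record-level aggregation: summarize amounts per region.
totals = defaultdict(float)
for r in sales:
    totals[r["region"]] += r["amount"]

for r in sales:
    # Single-field transformation: one field maps to one field.
    r["product_code"] = r["product"].lower()
    # Multi-field transformation: one field maps to many fields.
    first, last = r["customer_name"].split(" ", 1)
    r["first_name"], r["last_name"] = first, last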
Transformation
• Main step where the ETL adds value
• Actually changes data and provides guidance on whether data can be
used for its intended purposes
• Performed in staging area
Need Of Staging Area:
It makes it possible to restart at least some of the phases independently from the others.
For example, if the transformation step fails, it should not be necessary to restart the extract step.
The staging area should be accessed by the ETL process only. It should never be available to
anyone else, particularly not to end users, as it is not intended for data presentation and may
contain incomplete or in-process data.
Staging means that the data is simply dumped to a location (called the staging area) so that it
can then be read by the next processing phase. The staging area is used during the ETL process
to store intermediate results of processing.
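A minimal sketch of that restartability, assuming a local staging directory and a hypothetical extract function: the extract result is staged to disk, so a failed transform can be rerun from the staged file without touching the source again.

import json
import os

STAGING_DIR = "staging"                               # hypothetical staging location
EXTRACT_FILE = os.path.join(STAGING_DIR, "extracted.json")

def stage_extract(extract_fn):
    """Run the extract phase only if its staged output does not already exist."""
    os.makedirs(STAGING_DIR, exist_ok=True)
    if not os.path.exists(EXTRACT_FILE):
        with open(EXTRACT_FILE, "w") as f:
            json.dump(extract_fn(), f)
    with open(EXTRACT_FILE) as f:
        return json.load(f)

# If the transform step fails later, the next run reuses staging/extracted.json
# instead of re-running the extract against the source system.
rows = stage_extract(lambda: [{"id": 1, "value": "raw"}])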
Data Quality paradigm
• Correct
• Unambiguous/Clear
• Consistent
• Complete
• Data quality checks are run at two places: after extraction, and after
cleaning and confirming, where additional checks are run
Transformation
Transformation - Cleaning Data
• Anomaly Detection
– Data sampling – count(*) of the rows for a department column
• Column Property Enforcement
– Null Values in columns
– Numeric values that fall outside of expected highs and lows
– Columns whose lengths are exceptionally short/long
– Columns with certain values outside of discrete valid value sets
• The cleaning step is one of the most important as it ensures the quality of the data in
the data warehouse. Cleaning should perform basic data unification rules, such as:
• Making identifiers unique (sex categories Male/Female/Unknown, M/F/null,
Man/Woman/Not Available are translated to standard Male/Female/Unknown)
• Convert null values into a standardized Not Available/Not Provided value
• Convert phone numbers, ZIP codes to a standardized form
• Validate address fields, convert them into proper naming, e.g. Street/St/St./Str./Str.
• Validate address fields against each other (State/Country, City/State, City/ZIP code,
City/Street).
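A hedged sketch of such unification rules (the mappings, field names, and regular expressions are illustrative only, not a complete cleansing engine):

import re

SEX_MAP = {"m": "Male", "man": "Male", "male": "Male",
           "f": "Female", "woman": "Female", "female": "Female"}

def clean_record(rec):
    cleaned = dict(rec)

    # Unify sex/gender codes to the standard Male/Female/Unknown set.
    raw = (rec.get("sex") or "").strip().lower()
    cleaned["sex"] = SEX_MAP.get(raw, "Unknown")

    # Convert null-ish values into a standardized marker.
    for field in ("email", "phone"):
        if not rec.get(field):
            cleaned[field] = "Not Available"

    # Standardize phone numbers to digits only.
    if cleaned["phone"] != "Not Available":
        cleaned["phone"] = re.sub(r"\D", "", str(cleaned["phone"]))

    # Normalize street abbreviations (St, St., Str, Str. -> Street).
    if rec.get("address"):
        cleaned["address"] = re.sub(r"\bStr?\.?(?=\s|$)", "Street", rec["address"])

    return cleaned

print(clean_record({"sex": "f", "phone": "(555) 123-4567", "address": "12 Main St."}))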
Cleansing
• Fixing errors: misspellings, erroneous dates, incorrect field usage,
mismatched addresses, missing data, duplicate data, inconsistencies
Cleansing
Also: decoding, reformatting, time stamping,
conversion, key generation, merging, error
detection/logging, locating missing data
Cleansing Further Leads to the ETL Process
(Flow: staged data passes through cleaning and confirming; if errors are found,
loading stops; otherwise the data proceeds to loading.)
Transformation - Confirming
• Structure Enforcement
– Tables have proper primary and foreign keys
– Obey referential integrity
• Data and Rule value enforcement
– Simple business rules
– Logical data checks
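As a sketch of what such enforcement might look like in code (the rules and field names are hypothetical), rows failing structural or business-rule checks are separated out before loading:

VALID_COUNTRIES = frozenset({"US", "DE", "PK"})       # hypothetical valid value set

def confirm(rows):
    """Split rows into those passing simple structure/rule checks and those failing."""
    passed, rejected = [], []
    for row in rows:
        errors = []
        # Structure enforcement: the key column must be present.
        if row.get("order_id") is None:
            errors.append("missing key")
        # Valid-value check standing in for referential integrity.
        if row.get("country") not in VALID_COUNTRIES:
            errors.append("unknown country code")
        # Simple business rule: amounts cannot be negative.
        if row.get("amount", 0) < 0:
            errors.append("negative amount")
        (rejected if errors else passed).append((row, errors))
    return passed, rejected

ok, bad = confirm([{"order_id": 1, "country": "US", "amount": 9.5},
                   {"order_id": None, "country": "XX", "amount": -1}])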
During the load step, it is necessary to ensure that the load is performed
correctly and with as few resources as possible. The target of the load
process is often a database. In order to make the load process efficient,
it is helpful to disable any constraints and indexes before the load and
re-enable them only after the load completes. Referential integrity
needs to be maintained by the ETL tool to ensure consistency.
Data Loading:
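How constraints and indexes are disabled is database-specific; as one hedged example, SQLite has no disable command, so a sketch there drops an index before the bulk insert and recreates it afterwards (the table and index names are hypothetical):

import sqlite3

conn = sqlite3.connect("warehouse.db")        # hypothetical warehouse database

# Drop the index before the bulk load so inserts are not slowed down by it.
conn.execute("DROP INDEX IF EXISTS idx_fact_orders_date")

staged_rows = [(1, "2024-01-01", 10.0), (2, "2024-01-02", 20.0)]
conn.executemany(
    "INSERT INTO fact_orders (order_id, order_date, amount) VALUES (?, ?, ?)",
    staged_rows,
)

# Recreate the index only after the load completes.
conn.execute("CREATE INDEX idx_fact_orders_date ON fact_orders (order_date)")
conn.commit()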
Loading Can Be…
Full Load - the entire data dump, which takes place the very first time. Here we
leave the last extract date empty so that all the data gets loaded.
Incremental - where the delta or difference between target and source data is
loaded at regular intervals. Here we supply the last extract date so that only
records after this date are loaded.
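The two modes can be sketched as one load routine whose behaviour depends on whether a last extract date is supplied; the dimension table, columns, and connection are hypothetical, and a real incremental load would also update existing rows rather than only inserting.

def load_customers(target, rows, last_extract_date=None):
    """Full load when last_extract_date is None, incremental otherwise."""
    if last_extract_date is None:
        # Full load: remove everything and reload from scratch.
        target.execute("DELETE FROM dim_customer")
        to_load = rows
    else:
        # Incremental load: only records changed after the last extract date.
        to_load = [r for r in rows if r["updated_at"] > last_extract_date]
    target.executemany(
        "INSERT INTO dim_customer (customer_id, name, updated_at) VALUES (?, ?, ?)",
        [(r["customer_id"], r["name"], r["updated_at"]) for r in to_load],
    )
    target.commit()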
Why Incremental?
Speed. Opting to do a full load on larger datasets will take a
great amount of time and other server resources. Ideally all
the data loads are performed overnight with the expectation
of completing them before users can see the data the next
day. The overnight window may not be enough time for the
full load to complete.
Preserving history. When dealing with an OLTP source that is
not designed to keep history, a full load will remove history
from the destination as well, since a full load removes all the
records first. So a full load will not allow you to
preserve history in the data warehouse.
Full Load vs. Incremental Load:
• Full load: truncates all rows and loads from scratch. Incremental load: only new and updated records are loaded.
• Full load: requires more time. Incremental load: requires less time.
• Full load: can easily be guaranteed. Incremental load: difficult; the ETL must check for new/updated rows.
• Full load: history can be lost. Incremental load: history is retained.
Loading Includes:
• Loading Dimensions
• Loading Facts
Dimensions
• Qualifying characteristics that provide additional perspectives to a
given fact
– DSS data is almost always viewed in relation to other data
• Dimensions are normally stored in dimension tables
Facts
• Numeric measurements (values) that represent a specific
business aspect or activity
• Stored in a fact table at the center of the star schema
• Contains facts that are linked through their dimensions
• Can be computed or derived at run time
• Updated periodically with data from operational databases
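A hedged sketch of loading a fact row, assuming a hypothetical star schema where dim_customer has already been loaded: the natural customer_id is first resolved to its surrogate key, and that key is what links the fact to its dimension.

import sqlite3

conn = sqlite3.connect("warehouse.db")        # hypothetical star schema

def lookup_customer_key(conn, customer_id):
    """Resolve the surrogate key from the customer dimension."""
    row = conn.execute(
        "SELECT customer_key FROM dim_customer WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return row[0] if row else None            # a real load might add a new member here

def load_facts(conn, sales):
    for sale in sales:
        key = lookup_customer_key(conn, sale["customer_id"])
        conn.execute(
            "INSERT INTO fact_sales (customer_key, sale_date, amount) VALUES (?, ?, ?)",
            (key, sale["date"], sale["amount"]),
        )
    conn.commit()

load_facts(conn, [{"customer_id": 10, "date": "2024-01-05", "amount": 42.0}])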
ETL
Thank You!