Introduction to ETL process

  1. Introduction to ETL process (Omid Vahdaty)
  2. Assumptions
     ● ETL = extract, transform, load
     ● SQL knowledge
     ● DW (data warehouse) concepts
  3. Concepts
     ● Dimensions
     ● Facts
     ● Aggregate facts
     ● Data mart
  4. BI vs ETL?
     ● ETL moves data from DB to DB
       ○ Tools: Talend, Informatica, SAP BODS, Oracle Data Integrator, Microsoft SSIS
     ● BI is
       ○ Ad hoc queries
       ○ Dashboarding
       ○ Tools: SAP BO, IBM Cognos, Jaspersoft, Tableau, Oracle BI
  5. ETL
     ● Extract: pull data from the source DB via jobs.
     ● Transform:
       ○ Change the format of the data before loading.
       ○ Clean the data.
       ○ Remove bad data or fix it.
       ○ Enforce data integrity.
     ● Load: simply load the transformed data into the target.
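The three steps on this slide can be sketched end to end. This is a minimal illustration, not the author's job definition: it uses Python's built-in `sqlite3` in place of a real source and target database, and the table names `source_orders` and `target_orders`, their columns, and the cleaning rules are all assumptions made up for the example.

```python
import sqlite3

def run_etl(conn):
    """Extract rows from source_orders, clean them, and load target_orders."""
    # Extract: read the raw rows from the (hypothetical) source table.
    rows = conn.execute("SELECT id, customer, amount FROM source_orders").fetchall()

    cleaned = []
    for row_id, customer, amount in rows:
        # Transform: normalise the text format of the customer name.
        name = (customer or "").strip().title()
        try:
            value = round(float(amount), 2)
        except (TypeError, ValueError):
            continue  # bad data: amount is not numeric, so the row is dropped
        if not name:
            continue  # bad data: missing customer, so the row is dropped
        cleaned.append((row_id, name, value))

    # Load: write the cleaned rows to the target table.
    conn.executemany(
        "INSERT INTO target_orders (id, customer, amount) VALUES (?, ?, ?)", cleaned
    )
    conn.commit()
    return len(rows), len(cleaned)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_orders (id INTEGER, customer TEXT, amount TEXT)")
conn.execute(
    "CREATE TABLE target_orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO source_orders VALUES (?, ?, ?)",
    [(1, "  alice ", "10.50"), (2, "bob", "n/a"), (3, None, "7"), (4, "carol", "3.14159")],
)
extracted, loaded = run_etl(conn)
loaded_rows = conn.execute("SELECT * FROM target_orders ORDER BY id").fetchall()
```

Returning both counts makes it easy to see how many rows the transform step rejected, which feeds directly into the count checks discussed on the testing slides.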
  6. ETL tool layers
     1. Staging: where the extracted data is saved.
     2. Integration: where the staged data is processed and loaded.
     3. Access: where the data is queried.
  7. ETL tasks
     ● Understand the data to be used for reporting
     ● Review the data model
     ● Source-to-target mapping
     ● Data checks on the source data
     ● Package and schema validation
     ● Data verification in the target system
     ● Verification of data transformation calculations and aggregation rules
     ● Sample data comparison between the source and the target system
     ● Data integrity and quality checks in the target system
     ● Performance testing on the data
  8. ETL testing
     ● Validate data movement from the source to the target system.
     ● Verify the data count in the source and the target system.
     ● Verify that data extraction and transformation meet requirements and expectations.
     ● Verify that table relations (joins and keys) are preserved during the transformation.
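The count verification and the check that keys survive the move can be sketched as follows. The tables `src` and `tgt` are hypothetical, and `sqlite3` stands in for the real source and target systems.

```python
import sqlite3

def count_rows(conn, table):
    # Table names cannot be bound as SQL parameters; pass only trusted names.
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

def missing_in_target(conn):
    """Return the source ids that never arrived in the target."""
    sql = """
        SELECT s.id
        FROM src s
        LEFT JOIN tgt t ON t.id = s.id
        WHERE t.id IS NULL
        ORDER BY s.id
    """
    return [row[0] for row in conn.execute(sql)]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (id INTEGER PRIMARY KEY, v TEXT);
    CREATE TABLE tgt (id INTEGER PRIMARY KEY, v TEXT);
    INSERT INTO src VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO tgt VALUES (1, 'a'), (3, 'c');
""")
source_count = count_rows(conn, "src")
target_count = count_rows(conn, "tgt")
lost = missing_in_target(conn)
```

A count mismatch only says that something went wrong; the anti-join identifies which rows were lost.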
  9. Database testing
     ● Verify that primary and foreign keys are maintained.
     ● Verify that the columns in a table have valid data values.
     ● Verify data accuracy in columns. Example: a "number of months" column shouldn't have a value greater than 12.
     ● Verify missing data in columns: check for null columns that should actually hold a valid value.
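These checks translate into short queries. The sketch below assumes two hypothetical tables, `customer` and `subscription`, and runs on `sqlite3`; the same SQL shape should carry over to other engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE subscription (
        sub_id  INTEGER PRIMARY KEY,
        cust_id INTEGER,   -- should reference customer.cust_id
        months  INTEGER    -- should never exceed 12
    );
    INSERT INTO customer VALUES (1, 'Ann'), (2, NULL);
    INSERT INTO subscription VALUES (10, 1, 6), (11, 2, 14), (12, 9, 3), (13, 1, NULL);
""")

def ids(sql):
    """Run a query and return the first column of every row."""
    return [row[0] for row in conn.execute(sql)]

# Accuracy check: the months column shouldn't hold a value greater than 12.
bad_months = ids("SELECT sub_id FROM subscription WHERE months > 12 ORDER BY sub_id")

# Missing data check: columns that should have a value but are NULL.
null_months = ids("SELECT sub_id FROM subscription WHERE months IS NULL ORDER BY sub_id")
null_names = ids("SELECT cust_id FROM customer WHERE name IS NULL ORDER BY cust_id")

# Foreign key check: subscriptions pointing at a customer that does not exist.
orphans = ids("""
    SELECT s.sub_id
    FROM subscription s
    LEFT JOIN customer c ON c.cust_id = s.cust_id
    WHERE c.cust_id IS NULL
    ORDER BY s.sub_id
""")
```

Note that `months > 12` does not flag NULL values, which is why the null check is a separate query.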
  10. ETL testing categories
     ● Source to target
       ○ Count testing
       ○ Data validation testing (duplicates? data integrity?)
       ○ Data transformation
       ○ Constraint testing (null, unique, keys, ranges)
     ● Change/delta testing
     ● End report test
  11. ETL challenges
     ● Data loss during ETL
     ● Incorrect, incomplete, or duplicate data
     ● The DW system contains historical data, so the data volume is too large and too complex for exhaustive ETL testing in the target system.
     ● Performance
     ● Checking critical columns
     ● Supporting date-time formats and time zone conversion
     ● Supported text encodings
     ● Ignoring headers in CSV
     ● Incorrect column count caused by the separator appearing inside a text field
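Three of these pitfalls (CSV headers, separators inside text fields, and time zone conversion) can be shown in a few lines. This is a sketch using only Python's standard library; the sample data and column names are invented for illustration.

```python
import csv
import io
from datetime import datetime, timezone

RAW = (
    "id,comment,created_at\n"
    '1,"late, but delivered",2024-03-01T10:30:00+02:00\n'
    "2,ok,2024-03-01T09:00:00+00:00\n"
)

def parse_rows(text):
    """Parse CSV text, honouring quoted separators, and normalise timestamps to UTC."""
    reader = csv.DictReader(io.StringIO(text))  # the first line is consumed as the header
    rows = []
    for rec in reader:
        ts = datetime.fromisoformat(rec["created_at"]).astimezone(timezone.utc)
        rows.append((int(rec["id"]), rec["comment"], ts.isoformat()))
    return rows

# Naive splitting breaks on the comma inside the quoted comment: 4 fields instead of 3.
naive_fields = RAW.splitlines()[1].split(",")

rows = parse_rows(RAW)
```

When reading from a file rather than a string, passing an explicit `encoding` (for example `open(path, encoding="utf-8", newline="")`) addresses the text encoding item on the same slide.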
  12. Extract validation
     ● Count check
     ● Reconcile records with the source data
     ● Data type check
     ● Ensure no spam data is loaded
     ● Remove duplicate data
     ● Check that all the keys are in place
  13. Transform validation
     ● Data threshold validation check; for example, an age value shouldn't be more than 100.
     ● Record count check before and after the transformation logic is applied.
     ● Data flow validation from the staging area to the intermediate tables.
     ● Surrogate key check.
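The threshold, record count, and surrogate key checks can be combined into one validation pass. The record layout (`sk`, `id`, `age`, `country`) and the `transform` function below are hypothetical; only the age limit of 100 comes from the slide.

```python
from collections import Counter

def transform(records):
    """Hypothetical transformation: upper-case the country code."""
    return [{**r, "country": r["country"].upper()} for r in records]

def validate_transform(before, after, max_age=100):
    """Return a list of human-readable problems found after the transformation."""
    problems = []
    # Record count check: this transformation should neither drop nor add rows.
    if len(before) != len(after):
        problems.append(f"record count changed: {len(before)} -> {len(after)}")
    # Threshold check: age must be present and within 0..max_age.
    for r in after:
        if r["age"] is None or not 0 <= r["age"] <= max_age:
            problems.append(f"age out of range for id {r['id']}: {r['age']}")
    # Surrogate key check: every key must be present and unique.
    counts = Counter(r["sk"] for r in after)
    for key, n in sorted(counts.items(), key=lambda kv: str(kv[0])):
        if key is None or n > 1:
            problems.append(f"bad surrogate key {key}: seen {n} times")
    return problems

staged = [
    {"sk": 1, "id": "A", "age": 34, "country": "de"},
    {"sk": 2, "id": "B", "age": 130, "country": "il"},
    {"sk": 2, "id": "C", "age": 51, "country": "us"},
]
problems = validate_transform(staged, transform(staged))
```

Collecting problems in a list instead of failing on the first one lets a single run report every issue in the batch.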
  14. Load verification
     ● Record count check from the intermediate table to the target system.
     ● Ensure the key field data is not missing or null.
     ● Check that the aggregate values and calculated measures are loaded in the fact tables.
     ● Check the modeling views based on the target tables.
     ● Check that CDC (change data capture) has been applied on the incremental load table.
     ● Data check in the dimension table and history table check.
     ● Check the BI reports based on the loaded fact and dimension tables against the expected results.
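Two of the load checks, missing keys and aggregate values in the fact table, might look like this. The tables `stage_sales` and `fact_sales` are assumptions for the example, and `sqlite3` stands in for the warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stage_sales (cust_id INTEGER, amount REAL);
    CREATE TABLE fact_sales  (cust_id INTEGER, total_amount REAL);
    INSERT INTO stage_sales VALUES (1, 10.0), (1, 5.0), (2, 7.5);
    INSERT INTO fact_sales  VALUES (1, 15.0), (2, 7.0), (NULL, 3.0);
""")

# Key check: the key field in the fact table must not be NULL.
null_keys = conn.execute(
    "SELECT COUNT(*) FROM fact_sales WHERE cust_id IS NULL"
).fetchone()[0]

# Aggregate check: recompute the totals from the intermediate table and
# compare them with what was loaded, allowing a small rounding tolerance.
mismatches = conn.execute("""
    SELECT s.cust_id, s.expected, f.total_amount
    FROM (SELECT cust_id, SUM(amount) AS expected
          FROM stage_sales GROUP BY cust_id) AS s
    LEFT JOIN fact_sales f ON f.cust_id = s.cust_id
    WHERE f.total_amount IS NULL
       OR ABS(f.total_amount - s.expected) > 0.001
    ORDER BY s.cust_id
""").fetchall()
```

The `LEFT JOIN` also catches customers that have staged rows but no fact row at all, since their `total_amount` comes back as NULL.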
  15. Data duplication validation
     ● Example:
         SELECT Cust_Id, Cust_NAME, Quantity, COUNT(*)
         FROM Customer
         GROUP BY Cust_Id, Cust_NAME, Quantity
         HAVING COUNT(*) > 1;
     ● Reasons for duplicate data:
       ○ No primary key is defined, so duplicate values can be inserted.
       ○ Incorrect mapping or environmental issues.
       ○ Manual errors while transferring data from the source to the target system.
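To try the duplicate query without a database server, it can be run against an in-memory SQLite table. The sample rows are invented; the query is the one from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- No primary key is defined, so exact duplicates can be inserted.
    CREATE TABLE Customer (Cust_Id INTEGER, Cust_NAME TEXT, Quantity INTEGER);
    INSERT INTO Customer VALUES
        (1, 'Ann', 2), (1, 'Ann', 2), (2, 'Bob', 1), (1, 'Ann', 3);
""")

duplicates = conn.execute("""
    SELECT Cust_Id, Cust_NAME, Quantity, COUNT(*)
    FROM Customer
    GROUP BY Cust_Id, Cust_NAME, Quantity
    HAVING COUNT(*) > 1
""").fetchall()
```

Only rows that match on all three grouped columns count as duplicates, so `(1, 'Ann', 3)` is not reported.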
  16. Data integrity testing
     ● Number check
     ● Date check
     ● Null check
     ● Precision check
     ● Invalid characters
     ● Incorrect upper/lower case
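Each item in this list can be expressed as a small predicate. The functions below are one possible reading of the checks, not a standard: the date format, the two-decimal precision limit, the allowed character set, and the title-case rule are all assumptions chosen for the example.

```python
import re
from datetime import datetime
from decimal import Decimal, InvalidOperation

def is_number(value):
    """Number check: the value parses as a finite decimal."""
    try:
        return Decimal(value).is_finite()
    except (InvalidOperation, TypeError, ValueError):
        return False

def is_date(value, fmt="%Y-%m-%d"):
    """Date check: the value is a real calendar date in the assumed format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except (TypeError, ValueError):
        return False

def is_null(value):
    """Null check: None or a blank string."""
    return value is None or value.strip() == ""

def has_precision(value, max_decimals=2):
    """Precision check: at most max_decimals digits after the decimal point."""
    if not is_number(value):
        return False
    return -Decimal(value).as_tuple().exponent <= max_decimals

def has_invalid_chars(value, allowed=r"[A-Za-z0-9 .,'-]*"):
    """Invalid character check: anything outside the assumed allowed set."""
    return re.fullmatch(allowed, value) is None

def is_title_case(value):
    """Case check: each word starts upper-case and continues lower-case."""
    return value == value.title()
```

For instance, `is_date("2024-02-30")` is false because February has no 30th day, and `has_precision("10.505")` is false under the two-decimal assumption.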
  17. Detailed use cases for testing: https://www.tutorialspoint.com/etl_testing/etl_testing_scenarios.htm
  18. Best practices
     ● Analyze the data
     ● Fix bad data in the source
     ● Find a compatible ETL tool
     ● Monitor the ETL jobs
     ● Apply incremental ETL techniques when a timestamp column is available
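One common way to do incremental ETL with a timestamp is a high-water mark: each run copies only the rows newer than the newest timestamp already in the target. The sketch below assumes hypothetical tables `src` and `tgt` with an `updated_at` column stored as ISO-8601 text, which sorts correctly as a string.

```python
import sqlite3

def incremental_load(conn):
    """Copy only the rows newer than the newest timestamp already in the target."""
    # High-water mark: the latest timestamp loaded so far ('' when the target is empty).
    watermark = conn.execute(
        "SELECT COALESCE(MAX(updated_at), '') FROM tgt"
    ).fetchone()[0]
    before = conn.total_changes
    conn.execute(
        "INSERT INTO tgt (id, updated_at) "
        "SELECT id, updated_at FROM src WHERE updated_at > ?",
        (watermark,),
    )
    conn.commit()
    return conn.total_changes - before  # number of rows inserted by this run

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (id INTEGER PRIMARY KEY, updated_at TEXT NOT NULL);
    CREATE TABLE tgt (id INTEGER PRIMARY KEY, updated_at TEXT NOT NULL);
    INSERT INTO src VALUES (1, '2024-01-01T08:00:00'), (2, '2024-01-01T09:00:00');
""")
first = incremental_load(conn)    # initial load copies everything
conn.execute("INSERT INTO src VALUES (3, '2024-01-01T10:00:00')")
second = incremental_load(conn)   # only the new row is copied
third = incremental_load(conn)    # nothing new, nothing copied
```

A strict `>` comparison misses rows that arrive late with a timestamp equal to the current watermark, so this approach fits best when timestamps are unique, as the deck's exercise also assumes.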
  19. Courses & books
     ● https://www.udemy.com/automatingetl/
     ● https://books.google.co.il/books?id=TCLfzU2ilVkC&pg=PA205&lpg=PA205&dq=example+of+time+series+etl&source=bl&ots=86zwrsmHtF&sig=ssNKHMSph9L2_N_wBI5OVmB1rqg&hl=en&sa=X&redir_esc=y#v=onepage&q=time%20serias&f=false
  20. Courses & books
     ● http://www.robertomarchetto.com/talend_data_integration_free_book
     ● Basic time series ETL by Omid: https://docs.google.com/document/d/1KoFMeFtxXDGiZIswcGS1o8zmp2ZlPGbX0yj6ceQ_zlU/edit?usp=sharing
  21. Exercise: original table (create, drop, and add new data)
        /*
        drop table t;
        create table t (
            i int IDENTITY(1,1) NOT NULL,
            d datetime
        );
        */
        -- assuming the datetime values in t are unique and not null
        insert into t (d) values (getdate());
        -- rows added during the last minute
        select * from t where d >= DATEADD(minute, -1, GETDATE());
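  22. Exercise: staging table
        -- assumes t_staging and t_presentation already exist with plain (non-IDENTITY) columns i and d
        -- copy the last minute of rows from t into staging, naming the columns explicitly
        insert into t_staging (i, d)
        select i, d from t where d >= DATEADD(minute, -1, GETDATE());
        -- DISTINCT applies to the whole (d, i) row; sort when reading, not when inserting
        insert into t_presentation (d, i)
        select distinct d, i from t_staging;
        truncate table t_staging;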
  23. Exercise: presentation
        select count(*) from t_presentation;
        select count(*) from t;
        select * from t_presentation order by d desc;
  24. Talend
  25. Talend sources
     ● Add the MSSQL JDBC driver: https://www.talendforge.org/forum/viewtopic.php?id=54068
     ● How to connect two components: https://www.talendforge.org/forum/viewtopic.php?id=6493
     ● How to create a loop of a job (FYI: right-click on the project name, create project; under Job Design, far left, upper corner): https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide62EN/tLoop
     ● Running in parallel: https://www.talendbyexample.com/talend-job-parallelization-reference.html
     ● Running several queries for ETL, such as insert into and truncate: http://www.vikramtakkar.com/2013/05/example-to-execute-multiple-sql-queries.html
