This document is confidential and contains proprietary information, including trade secrets of CitiusTech. Neither the document nor any of the information
contained in it may be reproduced or disclosed to any unauthorized person under any circumstances without the express written permission of CitiusTech.
Testing Strategies for Data Lake Hosted on Hadoop
August 2019 | Authors: Vaibhav Shahane and Vaibhavi Indap
CitiusTech Thought Leadership
Agenda
▪ Overview
▪ Data Lake v/s Data Warehouse
▪ Structured Data
▪ Semi-structured Data
▪ References
Overview
▪ A data lake is a repository that stores massive amounts of data in its native form
▪ Businesses have disparate sources of data that are difficult to analyze unless brought together on a single platform (a common pool of data)
▪ A data lake allows business decision makers, data analysts and data scientists to get a holistic view of data coming in from heterogeneous sources
Data Lake v/s Data Warehouse

Similarities
▪ Data lakes maintain heterogeneous sources in a single pool
▪ They provide better access to enterprise-wide data for analysts and data scientists

Differences
▪ Data is highly organized and structured in data warehouses, whereas a data lake uses a flat structure and the original schema
▪ Data present in data warehouses is transformed, aggregated and may lose its original schema
▪ Data warehouses provide transactional solutions by enabling analysts to drill down/up/through specific areas of the business
▪ Data lakes answer questions that aren't structured but need discovery using iterative algorithms and/or complex mathematical functions
Structured Data (1/9)
Testing Gates
▪ Data lake creation and testing can be organized around the following areas:
• Schema validations
• Data Masking validations
• Data Reconciliation at each load frequency
• ELT Framework (Extract, Load and Transform)
• On-premise vs. on-cloud validations (in case the data lake is hosted on the cloud)
• Data Quality and Standardization validations
• Data partitioning and compaction
Structured Data (2/9)
Schema Validation
▪ Data from heterogeneous sources is present in the data lake, and the table schema defined in the source needs to be preserved
▪ As a part of schema validation, QA team can cover the following pointers:
• Data type
• Data length
• Null/Not Null constraints
• Delimiters (pay attention to delimiters coming as part of data)
• Special characters (visible and invisible), e.g., hex code ‘c2 ad’ is a soft hyphen that appears as
white space and does not go away when the TRIM function is applied during data comparison
▪ Source metadata and metadata from data lake can be extracted from respective metastores and
compared
• If the source is SQL Server, metadata can be retrieved using the sp_help stored procedure (e.g., from SSMS).
• If data lake is on HDFS, Hive has its own metadata tables such as DBS, TBLS, COLUMNS_V2,
SDS, etc.
▪ Visit https://utf8-chartable.de/unicode-utf8-table.pl for more characters
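As an illustration, the two sides of the comparison can be pulled with queries like the ones below. This is only a sketch with hypothetical database and table names: the SQL Server query uses INFORMATION_SCHEMA (an alternative to sp_help), and the Hive query assumes direct read access to the metastore backing database.

-- SQL Server side: column metadata for a source table (hypothetical names)
SELECT COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH, IS_NULLABLE
FROM   INFORMATION_SCHEMA.COLUMNS
WHERE  TABLE_SCHEMA = 'dbo' AND TABLE_NAME = 'patient'
ORDER  BY ORDINAL_POSITION;

-- Hive metastore side: column metadata for the corresponding data lake table
-- (run against the metastore RDBMS, using the standard metastore join keys)
SELECT c.COLUMN_NAME, c.TYPE_NAME, c.INTEGER_IDX
FROM   DBS d
JOIN   TBLS t       ON t.DB_ID = d.DB_ID
JOIN   SDS  s       ON s.SD_ID = t.SD_ID
JOIN   COLUMNS_V2 c ON c.CD_ID = s.CD_ID
WHERE  d.NAME = 'lake' AND t.TBL_NAME = 'patient'
ORDER  BY c.INTEGER_IDX;

The two result sets can then be diffed column by column to catch any type or length drift.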
Structured Data (3/9)
Data Masking Validations
▪ Source systems might have PHI/PII data in unmasked form unless the client has anonymized it beforehand
▪ Data masking logic implementation can be tested based on pre-agreed masking logic
▪ Masking logic can be written as a SQL query / Excel formula and the output compared with the data
masked by the ETL code that has flowed into the data lake
▪ E.g., unmasked SSN 123-45-6789 needs to be masked as XXX-XX-6789
▪ E.g., unmasked email abc.def@xyz.com needs to be masked as axx.xxx@xyz.com
▪ Pay attention to unmasked data that does not arrive in the expected format, which causes the masking logic to fail
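For illustration, the pre-agreed masking rules can be replayed in SQL and compared against what the ETL code loaded into the lake. The table and column names below are hypothetical, and the email expression assumes the local part follows the pattern in the example above (first character kept, remainder masked).

-- SSN mask check: recompute the expected mask from an unmasked source extract and flag mismatches
SELECT s.patient_id,
       'XXX-XX-' + RIGHT(s.ssn, 4) AS expected_ssn,
       l.ssn                       AS lake_ssn
FROM   src.dbo.patient s
JOIN   lake_extract.dbo.patient l ON l.patient_id = s.patient_id
WHERE  l.ssn <> 'XXX-XX-' + RIGHT(s.ssn, 4);

-- Email mask check: keep the first character and the domain, mask the rest of the local part
SELECT s.patient_id, l.email
FROM   src.dbo.patient s
JOIN   lake_extract.dbo.patient l ON l.patient_id = s.patient_id
WHERE  l.email <> LEFT(s.email, 1) + 'xx.xxx'
                  + SUBSTRING(s.email, CHARINDEX('@', s.email), LEN(s.email));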
Structured Data (4/9)
Data Reconciliation at Each Load Frequency
▪ Data Reconciliation is a testing gate wherein the data loaded in the target is compared against
the data in the source to ensure that no data is dropped or corrupted in the migration process
▪ Record count for each table between source and data lake:
• Initial load
• Incremental load
▪ Truncate load v/s. Append load:
• Truncate load: Data in target table is truncated every time a new data feed is about to be
loaded
• Append load: A new data feed is appended to already existing data in target table
▪ Data validation for each table between source and data lake: Data in each row and column of the
source table is to be compared with the data lake
• To use MS Excel, data needs to be batched into manageable chunks
• To compare the entire dataset, a custom-built automation tool can be used
▪ Duplicates in data lake: SQL queries can be used to identify whether any duplicates were introduced
during the data lake load (illustrated below)
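The count reconciliation and duplicate checks can be expressed as simple queries; the table and column names below are hypothetical, and the count query has to be run on both sides (source RDBMS and Hive/Impala) for the same load window.

-- Record count for a given load (run the same query on the source and on the lake)
SELECT COUNT(*) AS row_cnt
FROM   encounter
WHERE  load_date = '2019-08-01';

-- Duplicate check on the data lake table (Hive/Impala SQL)
SELECT encounter_id, COUNT(*) AS dup_cnt
FROM   lake.encounter
GROUP  BY encounter_id
HAVING COUNT(*) > 1;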
Structured Data (5/9)
ELT Framework Validation
▪ Logging of type of source informs whether the source data comes from a table or a flat file
▪ Logging of source connection string (DB link, file path, etc.):
• Indicates the database connection string if source data is coming from database table
• If source is a flat file, this informs the landing location of the file and where to read the data
from
▪ Generation of batch IDs on a fresh run and on rerun upon failure helps identify the data loaded
in a batch every day
▪ Flags such as primary key flag, truncate load flag, critical table flag, upload to cloud flag, etc. help
define the behavior of ELT jobs
▪ Logging of records processed, loaded and rejected for each table shows the number of records
extracted from the source and rejected/loaded into the target data lake
▪ Polling frequency, trigger check, email notification etc. indicate the frequency to poll for
incoming file/data, trigger next batch, send notifications of batch status, etc.
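If the framework writes its batch metadata to an audit table, simple consistency checks can be automated against it. The audit table and its columns below are purely hypothetical, named only to illustrate the kind of check.

-- Latest batch: processed should equal loaded + rejected for every table
SELECT batch_id, table_name, records_processed, records_loaded, records_rejected
FROM   elt_audit_log
WHERE  batch_id = (SELECT MAX(batch_id) FROM elt_audit_log)
  AND  records_processed <> records_loaded + records_rejected;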
Structured Data (6/9)
On-Premise vs. On-Cloud Validation
▪ Additional testing is required when sources are hosted on-premise and the data lake is created
on the cloud
▪ Validate data types supported by the on-cloud application (through which data analysts/scientists will
be querying data)
• For e.g., Azure SQL Data Warehouse doesn't support the timestamp/text data types; source data needs
to be cast to datetime2/varchar respectively
• For e.g., Impala does not support the DOUBLE data type, which has to be converted to NUMERIC
▪ Validate user group access (who can see what type of data)
▪ Validate masked/unmasked views based on type of users
▪ Validate attribute names with spaces/special characters/reserved keywords between on-premise
and on-cloud
• For e.g., source attributes named LOCATION and DIV are reserved keywords in Impala; hence, they
must be changed to LOC and DIVISION to preserve the meaning
▪ Validate external tables created on HDFS files published through ELT jobs
• For e.g., validate whether the external table is pointing to the correct location on HDFS where
the files are being published by the ELT jobs
▪ Validate the data consistency between on-premise and on-cloud
• For e.g., Use custom built validation tools to compare each attribute
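As a sketch of the external-table validation, the DDL below (Hive/Impala syntax, hypothetical table names and path) shows reserved-keyword columns renamed on the cloud side, and DESCRIBE FORMATTED confirms where the table actually points.

-- External table over the ELT-published Parquet files (hypothetical names and path)
CREATE EXTERNAL TABLE IF NOT EXISTS lake.encounter_ext (
  encounter_id BIGINT,
  loc          STRING,    -- source column LOCATION renamed (reserved keyword in Impala)
  division     STRING,    -- source column DIV renamed
  create_ts    TIMESTAMP
)
STORED AS PARQUET
LOCATION '/data/lake/encounter/';

-- Verify the table points to the HDFS location the ELT jobs publish to
DESCRIBE FORMATTED lake.encounter_ext;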
Structured Data (7/9)
Data Quality and Standardization Validation
▪ Data in the data lake needs to be cleansed and standardized for better analysis
▪ Actual data isn't removed or updated, but flags (Warning / Invalid) can highlight data quality issues
▪ Validate various DQ/DS rules with SQL queries on source data and compare output with DQ/DS
tools
▪ For e.g., Null Check DQ rule on MRN (Medical Record Number): When data is processed through
a tool for DQ, all the records with NULL MRN are flagged as WARNING/INVALID along with a
remark column that states MRN is NULL
▪ For e.g., Missing Parent DQ rule on Encounter ID with respect to Patient: When encounter
doesn’t have associated patient in patient table, Encounter ID is flagged as WARNING/INVALID
along with a remark column that states PATIENT MISSING
▪ For e.g., Race Data Standardization:
• Race data in the source with codes such as 1001, 1002 needs to be standardized with the
corresponding descriptions, such as Hispanic, Asian, etc.
• Based on requirements, standardization can be applied to reference tables or transaction
(data) tables as well
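The DQ rules described above can be cross-checked with plain SQL run against the lake data and then compared with the DQ tool's output; the table and column names below are hypothetical.

-- Null Check rule: records the DQ tool should flag as WARNING/INVALID
SELECT patient_id
FROM   lake.patient
WHERE  mrn IS NULL;

-- Missing Parent rule: encounters with no matching patient record
SELECT e.encounter_id
FROM   lake.encounter e
LEFT JOIN lake.patient p ON p.patient_id = e.patient_id
WHERE  p.patient_id IS NULL;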
Structured Data (8/9)
Data Partitioning and Compaction
▪ When a data lake is created on the cloud using the Hadoop Distributed File System (HDFS), it is preferable
to store data in partitions (based on user-provided partition criteria) and apply compaction in order
to minimize repeated read operations from the underlying HDFS
▪ Validate whether the data in the source is getting partitioned appropriately while being stored in the
data lake on HDFS. For e.g., partitioning based on Encounter_Create_Date:
• This creates a folder structure in the output by year, with each folder containing the
encounters for that year
• Data retrieval will be faster when analysts/scientists query on specified date range since
data is already stored in such partitions
▪ Compaction includes two areas:
• Merging of multiple smaller file chunks into a predefined file size to avoid multiple read
operations
• Converting text format into Parquet/ZIP format to achieve file size reduction
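A minimal Hive/Impala sketch of the partitioning scheme described above, with hypothetical table and column names; SHOW PARTITIONS then lists the year-wise partitions (HDFS folders) the loads should have created.

-- Encounter table partitioned by the year derived from Encounter_Create_Date
CREATE TABLE IF NOT EXISTS lake.encounter_part (
  encounter_id          BIGINT,
  encounter_create_date TIMESTAMP
)
PARTITIONED BY (create_year INT)
STORED AS PARQUET;

-- Each year should appear as its own partition (i.e., its own HDFS folder)
SHOW PARTITIONS lake.encounter_part;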
Structured Data (9/9)
Data Partitioning and Compaction
▪ Validate the size of the files being published on HDFS by ELT jobs by logging into Impala
• For e.g., ELT jobs produced .txt files on the cloud with a total size of 3.19 GB, which was
reduced to 516 MB after Parquet conversion
▪ Validate merging of multiple smaller files into one or more large files
• For e.g., the DQ/DS tool might produce multiple small files (based on storage availability
on the underlying data nodes), which can be seen at the output location. A utility can be written
to merge all these files into a single Parquet file
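In Impala, both file-level checks can be done without leaving SQL; the statements below are a sketch with a hypothetical table name.

-- List the individual files (with sizes) backing the published table
SHOW FILES IN lake.encounter_part;

-- Row count, file count and total size per partition, useful before/after compaction
SHOW TABLE STATS lake.encounter_part;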
Structured Data – Challenges

Test Strategy
▪ ELT jobs are unable to manage continuously varying data with special characters
▪ Delimiters arriving as part of incoming data
▪ PHI data in a format different from test data may cause masking logic to fail
▪ Backdated inserts not captured in incremental runs

Tools and Technology
▪ Limitations on data types, reserved keywords and special characters handled by cloud applications
▪ Date format conversion based on the time zone selected during installation
▪ Data retrieval challenges and cost involved in downloading data for analysis

Configuration / Environment
▪ Late-arriving flat files
▪ Any service breakdown in production causes parts of the end-to-end workflow to break

Others
▪ Project timelines need to accommodate any unknowns in the data during production deployment and/or a few weeks after deployment
Semi-structured Data (1/6)
Flow Diagram of Semi-Structured (JSON) Message Ingestion into Data Lake
Semi-structured Data (2/6)
Testing Gates
▪ Data lake creation and testing for semi-structured data can cover the following areas:
• JSON Message Validation
• Data Reconciliation
• ELT Framework (Extract, Load and Transform)
• Data Quality and Standardization Validations
Semi-structured Data (3/6)
JSON Message Validation
▪ The data lake can be integrated with a Kafka messaging system that delivers JSON messages in
semi-structured format
▪ As a part of JSON message validation, QA team can cover the following pointers:
• Compare the JSON Message with the JSON schema provided as part of requirement
• Data Type Check
• Null / Not Null constraints Check
▪ For instance:
JSON Schema:
{
  "ServiceLevel": { "type": ["string", "null"] },
  "ServiceType": { "type": ["string"] }
}

JSON Message:
{
  "ServiceLevel": "One",
  "ServiceType": "Skilled Nurse"
}
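If the raw JSON payloads are staged in SQL Server, the well-formedness and null/not-null checks can be scripted as below; the staging table name is hypothetical, and ISJSON/JSON_VALUE require SQL Server 2016 or later.

-- Messages that are not well-formed JSON, or that violate the non-nullable ServiceType constraint
SELECT msg_id, payload
FROM   staging_json_messages
WHERE  ISJSON(payload) = 0
   OR  JSON_VALUE(payload, '$.ServiceType') IS NULL;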
Semi-structured Data (4/6)
Data Reconciliation
▪ Data Reconciliation is a testing gate wherein the data loaded in target tables is compared against
the data in the source JSON messages to ensure no data is trimmed/corrupted/missed in the
migration process. One can use the following strategies to ensure the same:
▪ Record count for each table between source and data lake
• Simple JSON: Single JSON message ingested in lake is loaded as single row in target table
• Complex JSON: More than one row is loaded in the target table depending on the level of
hierarchy and nesting present in the JSON message
▪ Data validation for each table between source and data lake: data in each row and column of the
JSON message is to be compared with the data lake
• Use the OPENJSON function in SQL Server to parse the JSON messages and convert them into
structured format (see the sketch below)
• Compare the parsed output of OPENJSON with the data loaded in the target tables using Python
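A minimal OPENJSON sketch (SQL Server 2016+) that shreds the sample message from the previous slide into a row; the parsed output can then be compared with the corresponding row loaded in the target table.

DECLARE @msg NVARCHAR(MAX) =
    N'{"ServiceLevel":"One","ServiceType":"Skilled Nurse"}';

-- Parse the JSON into columns using an explicit schema
SELECT ServiceLevel, ServiceType
FROM OPENJSON(@msg)
WITH (
    ServiceLevel NVARCHAR(50) '$.ServiceLevel',
    ServiceType  NVARCHAR(50) '$.ServiceType'
);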
Semi-structured Data (5/6)
ELT Framework Validation
▪ Raw layer validation on HDFS: Indicates whether the source JSON messages ingested through
Kafka are loaded in HDFS in raw form, before the processing tool selects the messages and loads
them into the target tables
▪ Logging of Kafka details: Informs about Kafka topic, partition, offset, and hostname
▪ Generation of Data lake UIDs helps in identifying JSON messages
▪ Logging of records processed, loaded and rejected as part of each JSON ingestion shows the
number of records ingested through Kafka, processed, and failed in JSON schema validation, with
error logging
▪ Email notification: Shows the JSONs ingested through Kafka on an hourly basis, the JSON count loaded
in the raw layer, and a daily report with JSON-wise counts, failures and successful ingestions
Semi-structured Data (6/6)
Data Quality and Standardization Validation
▪ The data in the data lake needs to be cleansed and standardized for better analysis
▪ In this case, the actual data need not be removed or updated but both valid and erroneous data
are logged into the audit tables
▪ Validate JSON messages against JSON schema
• For e.g., Null Check: If non-nullable attribute is assigned null value in the input JSON
message, then ingestion of such message fails, and the error gets logged in audit table with
non-nullable attribute details
Semi-structured Data – Challenges

Test Strategy
▪ ELT jobs are unable to manage continuously varying data with special characters
▪ Test data preparation as per test scenarios
▪ Manual validation of a single JSON loaded into multiple tables
▪ Live reconciliation of messages produced through Kafka due to continuous streaming

Tools and Technology
▪ Limitations on reserved keywords and special character handling

Configuration / Environment
▪ Multiple services running simultaneously on the cluster result in choking of JSON messages in the QA environment
▪ Any service breakdown in production causes parts of the end-to-end workflow to break

Others
▪ Project timelines need to accommodate any unknowns in the data during production deployment and/or a few weeks after deployment
References
▪ https://www.cloudmoyo.com/blog/difference-between-a-data-warehouse-and-a-data-lake/
▪ https://utf8-chartable.de/unicode-utf8-table.pl
▪ https://www.cigniti.com/blog/5-big-data-testing-challenges/
▪ https://www.confluent.io/blog/the-value-of-apache-kafka-in-big-data-ecosystem/
▪ https://kafka.apache.org/
▪ https://dzone.com/articles/json-drivers-parsing-hierarchical-data
About CitiusTech
3,500+
Healthcare IT professionals worldwide
1,500+
Healthcare software engineering
800+
HL7 certified professionals
30%+
CAGR over last 5 years
110+
Healthcare customers
▪ Healthcare technology companies
▪ Hospitals, IDNs & medical groups
▪ Payers and health plans
▪ ACO, MCO, HIE, HIX, NHIN and RHIO
▪ Pharma & Life Sciences companies
Thank You
Authors:
Vaibhav Shahane
Vaibhavi Indap
Technical Lead
thoughtleaders@citiustech.com