ChakraView – A 360° Approach to Data Quality

Shankar Manian
Keerthika Thiyagarajan

Background
● ~15 years in Big Data...
● ...as Data Janitors
● Can we do better?

Data Quality - Missing Focus
● Afterthought
● Needle in a haystack
● Huge cost

Detection - Missing Dimensions
● Completeness
● Consistency
● Auditability

Cleansing - The Hidden Cost
● Trace the issue to source
● No SOP on how to fix
● Hard to Automate

Visibility - Or the lack of it
● Impact - Cost of bad data
● Breakdown and Prioritization
● Push quality upstream

State before
● Stakeholder driven
● Reactive process
● Business metrics
● Huge monetary impact
● Iterative Discovery

Validations Framework
● Granular Validations -> Business metrics
● Self serve onboarding
● Trigger on data refresh
● System health dashboard

PaymentGateway * BankStatement * Ledger

TransactionId  OrderId  Amount  B.Amount  InvoiceId  L.Amount
TX1            OD1      100     100       I1         10
TX2            OD2      50      50        I2         50
TX3            OD3      75      75        I3         75
TX4            OD4      200     200       -          -
TX5            OD5      50      -         I5         50

Bad Records
● TX1: Amount mismatch
● TX4: Entry missing in Ledger
● TX5: Entry missing in Bank statement
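
The three-way reconciliation in the table above can be sketched in plain Python. This is a minimal illustration, not the ChakraView implementation: the dict-per-source layout and the names `classify` and `reconcile` are assumptions, and the amounts are copied from the table.

```python
from typing import Dict, Optional

# Toy data from the table; keying each source by transaction id is an assumption.
payment_gateway = {"TX1": 100, "TX2": 50, "TX3": 75, "TX4": 200, "TX5": 50}
bank_statement = {"TX1": 100, "TX2": 50, "TX3": 75, "TX4": 200}
ledger = {"TX1": 10, "TX2": 50, "TX3": 75, "TX5": 50}


def classify(pg: Optional[int], bank: Optional[int], led: Optional[int]) -> Optional[str]:
    """Return a failure label for one joined row, or None if the row reconciles."""
    if bank is None:
        return "Entry missing in Bank statement"
    if led is None:
        return "Entry missing in Ledger"
    # A missing gateway amount has no label on the slide; it falls through to a mismatch here.
    if not (pg == bank == led):
        return "Amount Mismatch"
    return None


def reconcile(pg: Dict[str, int], bank: Dict[str, int], led: Dict[str, int]) -> Dict[str, str]:
    """Full outer join on transaction id, keeping only the rows that fail."""
    bad = {}
    for tx in sorted(set(pg) | set(bank) | set(led)):
        label = classify(pg.get(tx), bank.get(tx), led.get(tx))
        if label is not None:
            bad[tx] = label
    return bad


bad_records = reconcile(payment_gateway, bank_statement, ledger)
print(bad_records)
```

Taking the union of keys is what makes this a full outer join: a transaction present in only one source still produces a row, so missing entries surface as bad records instead of silently dropping out.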

Salient features
● Abstract templates
○ Null check
○ Datatype compliance
○ Aggregated check
○ Range check
○ Cross comparison check
● Filter and transformation support
○ Exclude a few records
○ Case-insensitive conversion
● Construct target dataframe
● Row level results
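
A few of the templates listed above (null, datatype and range checks, plus a filter) could look roughly like the sketch below. All function names, the row layout and the sample data are hypothetical; aggregated and cross-comparison checks are left out for brevity.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Check = Callable[[Row], bool]


def null_check(column: str) -> Check:
    """Pass when the column holds a value."""
    return lambda row: row.get(column) is not None


def datatype_check(column: str, expected: type) -> Check:
    """Pass when the column value is an instance of the expected type."""
    return lambda row: isinstance(row.get(column), expected)


def range_check(column: str, low: float, high: float) -> Check:
    """Pass when the column is numeric and lies within [low, high]."""
    def check(row: Row) -> bool:
        value = row.get(column)
        return isinstance(value, (int, float)) and low <= value <= high
    return check


def run_checks(rows: List[Row], checks: Dict[str, Check],
               row_filter: Callable[[Row], bool] = lambda row: True) -> List[Row]:
    """Apply every named check to each row that passes the filter; return row-level failures."""
    failures = []
    for row in rows:
        if not row_filter(row):
            continue
        for name, check in checks.items():
            if not check(row):
                failures.append({"transaction_id": row["transaction_id"], "failed": name})
    return failures


rows = [
    {"transaction_id": "TX1", "amount": 100, "status": "PAID"},
    {"transaction_id": "TX2", "amount": None, "status": "paid"},
    {"transaction_id": "TX3", "amount": -5, "status": "TEST"},
    {"transaction_id": "TX4", "amount": 1000000, "status": "Paid"},
]
checks = {
    "amount_not_null": null_check("amount"),
    "amount_in_range": range_check("amount", 0, 10000),
}
# Filter with a case-insensitive comparison: skip test records.
failures = run_checks(rows, checks, row_filter=lambda r: r["status"].lower() != "test")
print(failures)
```

Returning one entry per failed row and check, not a single pass/fail flag, is what keeps the results at row level.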

Sample Validation
{
  "fact": [{
    "fact_1": "payment_gateway",
    "fact_2": "ledger",
    "join_type": "full_outer_join",
    "join_columns": [{
      "fact_1_column": "transaction_id",
      "fact_2_column": "transaction_id",
      "operator": "equal"
    }]
  }],
  "group_by_columns": ["transaction_id"],
  "idempotency_columns": ["transaction_id"],
  "validation_configurations": [{
    "name": "amount_recon",
    "operator": "equal",
    "expression_list": [
      {
        "expression": {
          "operator": "amount",
          "terminal": "pg_amount"
        }
      },
      {
        "expression": {
          "operator": "l.amount",
          "terminal": "ledger_amount"
        }
      }
    ]
  }]
}
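
One way a configuration of this shape might be interpreted is sketched below. It uses a trimmed copy of the sample and treats each `terminal` as a column name in the joined row, which is an assumption about the framework's semantics. The helper names and fact data are hypothetical, and plain Python lists stand in for Spark dataframes.

```python
import json
import operator

# Trimmed copy of the sample validation; "terminal" is assumed to name a joined column.
CONFIG = json.loads("""
{
  "fact": [{
    "fact_1": "payment_gateway",
    "fact_2": "ledger",
    "join_type": "full_outer_join",
    "join_columns": [{
      "fact_1_column": "transaction_id",
      "fact_2_column": "transaction_id",
      "operator": "equal"
    }]
  }],
  "validation_configurations": [{
    "name": "amount_recon",
    "operator": "equal",
    "expression_list": [
      {"expression": {"terminal": "pg_amount"}},
      {"expression": {"terminal": "ledger_amount"}}
    ]
  }]
}
""")

OPERATORS = {"equal": operator.eq}  # only the operator used in the sample


def full_outer_join(left, right, left_key, right_key):
    """Merge rows that share a key; unmatched rows from either side are kept."""
    left_by_key = {row[left_key]: row for row in left}
    right_by_key = {row[right_key]: row for row in right}
    joined = []
    for key in sorted(set(left_by_key) | set(right_by_key)):
        merged = dict(right_by_key.get(key, {}))
        merged.update(left_by_key.get(key, {}))
        joined.append(merged)
    return joined


def validate(config, facts):
    """Return (validation name, transaction id) for every joined row that fails."""
    join = config["fact"][0]
    cols = join["join_columns"][0]
    rows = full_outer_join(facts[join["fact_1"]], facts[join["fact_2"]],
                           cols["fact_1_column"], cols["fact_2_column"])
    failures = []
    for rule in config["validation_configurations"]:
        compare = OPERATORS[rule["operator"]]
        lhs, rhs = (e["expression"]["terminal"] for e in rule["expression_list"])
        for row in rows:
            if not compare(row.get(lhs), row.get(rhs)):
                row_id = row.get(cols["fact_1_column"], row.get(cols["fact_2_column"]))
                failures.append((rule["name"], row_id))
    return failures


facts = {
    "payment_gateway": [
        {"transaction_id": "TX1", "pg_amount": 100},
        {"transaction_id": "TX2", "pg_amount": 50},
        {"transaction_id": "TX4", "pg_amount": 200},
    ],
    "ledger": [
        {"transaction_id": "TX1", "ledger_amount": 10},
        {"transaction_id": "TX2", "ledger_amount": 50},
    ],
}
recon_failures = validate(CONFIG, facts)
print(recon_failures)
```

Because the join is a full outer join, a row missing on one side yields `None` for that terminal and fails the `equal` comparison, so missing entries and amount mismatches are caught by the same validation.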

Data Flow
● Fact refresh -> Trigger from Azkaban -> Run Spark job -> Publish validation failures
● Spark job inputs: Template Library, Validation Configuration
● Outputs: Datastore, Dashboard

Until now we were blissfully ignorant. Now we spend multiple man-hours
categorising the bad records.

Root Cause Analysis (RCA)
Bank Statement * Ledger

TransactionId  OrderId  B.Amount  InvoiceId  L.Amount  Category
TX1            OD1      100       I1         10        Amount wrong in Ledger entry
TX5            OD4      200       -          -         Upstream Failure - Payments
TX6            OD6      -         I6         50        File upload issue

Combinatorial explosion
● The cycle is longer for big data due to the complexity of the system
● Time consuming
● Error prone
● Humanly impossible

● Real-time systems have ELK-style tools
● No comparable tools are available for RCA on big data
How do we make this operation cheap?

Auto-RCA
● Enrich logs and data from main pipeline

Enrichments
{
  "commerce_activity": {
    "activityType": "create_ledger",
    "activityId": "TX12345",
    "payload": "{\"event\":\"create_ledger\",\"entity_id\":\"TX12345\"}",
    "eventStatus": "ERRORED",
    "retryCount": 0
  },
  "error_details": {
    "activityType": "create_ledger",
    "activityId": "TX12345",
    "errorCode": "503",
    "errorDescription": "Error: EnricherException{statusCode=503}",
    "sourceSystem": "IRN",
    "upstreamUriSignature": "/payment/<transaction>",
    "upstreamUrl": "/payment/TX12345",
    "upstreamHttpMethod": "GET",
    "upstreamHeader": null,
    "upstreamPayload": null,
    "errorStatus": "OPEN",
    "failureCount": null
  }
}

Auto-RCA
● Perform 5 Why RCA
● Hierarchical categorisation
● Leaf category -> Unique issues
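
The hierarchical categorisation can be pictured as a tree walk in which each level answers one "why". The sketch below is an illustration only: the tree shape, the predicates and the record fields (`bank_amount`, `ledger_amount`) are assumptions, while the category names and `eventStatus` come from the slides.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]


@dataclass
class Category:
    name: str
    matches: Callable[[Record], bool]
    children: List["Category"] = field(default_factory=list)


def classify(record: Record, node: Category, path: Tuple[str, ...] = ()) -> Tuple[str, ...]:
    """Descend into the first matching child at each level; return the path to the leaf."""
    path = path + (node.name,)
    for child in node.children:
        if child.matches(record):
            return classify(record, child, path)
    if node.children:
        # No deeper explanation was found, so the record needs manual analysis.
        return path + ("Unclassified",)
    return path


# Hypothetical tree; the real hierarchy is defined per flow.
ROOT = Category("Bad record", lambda r: True, [
    Category("Missing entries in ledger", lambda r: r.get("ledger_amount") is None, [
        Category("Event processing failure", lambda r: r.get("eventStatus") == "ERRORED"),
        Category("Event not arrived", lambda r: r.get("eventStatus") is None),
    ]),
    Category("Amount mismatch", lambda r: r.get("ledger_amount") != r.get("bank_amount")),
])

errored = {"transaction_id": "TX12345", "bank_amount": 200, "ledger_amount": None,
           "eventStatus": "ERRORED", "errorCode": "503"}
mismatch = {"transaction_id": "TX1", "bank_amount": 100, "ledger_amount": 10}
unknown = {"transaction_id": "TX9", "bank_amount": 200, "ledger_amount": None,
           "eventStatus": "PROCESSED"}

for rec in (errored, mismatch, unknown):
    print(rec["transaction_id"], " -> ".join(classify(rec, ROOT)))
```

Records that stop at "Unclassified" show where the tree needs another rule, which is how the set of leaf categories can grow over time.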

Category hierarchy (levels from top to bottom)
● Unclassified
● Amount mismatch, Missing entries
● Missing entries in Bank statement, Missing entries in ledger
● Issue in invoice creation, Issue in Bank statement
● Event processing failure, Event not arrived, Wrong value in file, File upload issue, Data not pushed to analytical store, Unclassified

Fixture
● Can we automate cleaning the data?

Fixture
● Leaf categories: Event processing failure, Event not arrived, Wrong value in file, File upload issue, Data not pushed to analytical store
● Recipes: reprocess_event, replay_event, reprocess_file, republish_ledger_entry

Fixture
{
"flowName": "debtor_flow",
"categoryName": "Event processing failure",
"recipeName": "reprocess_event"
}

Fixture
● Recipes - Library of functions that automate the cleansing
● Leaf Category -> Recipe
● Sample Recipes
○ Reverse
○ Retry
○ Restore

● Man-days reduced to a few hours
● Reactive to proactive
● Dev-friendly
● People independent
● Complete visibility

Next Steps
● Open source
● Data observability
● Performance optimisation
