Auditing data
Five finger mnemonic
All of those examples are real.
And they happened at big, respectable companies full of very smart people.

Industries have been swapped to protect the guilty.

I wanted to use the music from Benny Hill, but I don’t know how to make that work
with Keynote, so you get Blackadder. I’ve tried to mask which company was
responsible for which discrepancy, but everywhere I have looked, I have stumbled
on confusing results.

Most of the time, it was because I, or the analyst responsible for building the data
pipeline, overlooked an edge case or misunderstood the logging. Sometimes, it was
a real problem with actual or attempted fraud, exploiting race conditions, unexpected
patterns, etc. The point here is less to audit the website than to audit both the
analyst’s understanding of the data process and the process itself.
Untitled - 18 September 2018
THIS ONE IS EASY, YOU’D THINK
ADD UP TOTALS
SELECT
    SUM(value) AS sum_value
  , COUNT(1) AS count_total
FROM denormalised_table;

SELECT
    SUM(value) AS sum_value
  , COUNT(1) AS count_total
FROM original_table;

-- How much revenue and how many orders were generated?
SELECT
    SUM(subtotal) AS sum_value
  , COUNT(1) AS count_total
FROM denormalised_orders;

SELECT
    SUM(subtotal) AS sum_value
  , COUNT(1) AS count_total
FROM orders;

-- How much … per day?
SELECT
    SUM(subtotal) AS sum_value
  , COUNT(1) AS count_total
  , order_date
FROM denormalised_orders
GROUP BY order_date;

SELECT
    SUM(subtotal) AS sum_value
  , COUNT(1) AS count_total
  , order_date
FROM orders
GROUP BY order_date;

You would be surprised how often even basic numbers, like raw revenue or order
counts, do not add up.

This is usually because processing involves handling duplicates, cancellations, etc.

Often, a fact table was only started later, so the discrepancy comes from values
missing before that date, dropped by a strict (inner) join.
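
The strict-join case can be reproduced in a few lines. This is only a sketch using Python's built-in `sqlite3`: the `customers` table, its `segment` column and the sample rows are invented for illustration, and the join is one plausible way a `denormalised_orders` table could lose rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER,
                         order_date TEXT, subtotal INTEGER);
    CREATE TABLE customers (customer_id INTEGER, segment TEXT);

    INSERT INTO orders VALUES
        (1, 10, '2018-09-01', 20),
        (2, 11, '2018-09-01', 35),
        (3, 12, '2018-09-02', 50);
    -- Hypothetical gap: customer 12 predates the customers table.
    INSERT INTO customers VALUES (10, 'new'), (11, 'returning');

    -- The strict (inner) join silently drops order 3.
    CREATE TABLE denormalised_orders AS
    SELECT o.order_id, o.order_date, o.subtotal, c.segment
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id;
""")

# Compare daily totals and keep only the days that disagree.
mismatched_days = conn.execute("""
    WITH source AS (
        SELECT order_date, SUM(subtotal) AS sum_value, COUNT(1) AS count_total
        FROM orders GROUP BY order_date
    ), derived AS (
        SELECT order_date, SUM(subtotal) AS sum_value, COUNT(1) AS count_total
        FROM denormalised_orders GROUP BY order_date
    )
    SELECT s.order_date
         , s.sum_value, COALESCE(d.sum_value, 0)
         , s.count_total, COALESCE(d.count_total, 0)
    FROM source AS s
    LEFT JOIN derived AS d ON d.order_date = s.order_date
    WHERE d.order_date IS NULL
       OR d.sum_value <> s.sum_value
       OR d.count_total <> s.count_total
    ORDER BY s.order_date
""").fetchall()

print(mismatched_days)  # [('2018-09-02', 50, 0, 1, 0)]
```

Returning only the disagreeing days points straight at where to look. The left join starts from the source table, so days that exist only in the derived table would need the mirrored query.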

With an international audience, the date of an activity requires a convention: should
we assign an event to its UTC date, or to its perceived local date (typically with
the day changing around 4 AM local time)?
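
To make the two conventions concrete, here is a small sketch in plain Python. The function names, the fixed UTC offset and the exact 4 AM cutoff are assumptions for illustration; a real pipeline would use proper time zones rather than a fixed offset.

```python
from datetime import datetime, timedelta, timezone

# Assumed convention: the perceived local "day" starts at 4 AM.
DAY_CUTOFF = timedelta(hours=4)


def utc_date(event_utc):
    """Date of the event under the UTC convention."""
    return event_utc.astimezone(timezone.utc).date()


def perceived_local_date(event_utc, utc_offset_hours):
    """Date of the event as the user perceives it, with days changing at 4 AM."""
    local = event_utc.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    return (local - DAY_CUTOFF).date()


# 08:30 UTC is 01:30 for a user at UTC-7: still "last night" for them.
event = datetime(2018, 9, 18, 8, 30, tzinfo=timezone.utc)

print(utc_date(event))                  # 2018-09-18
print(perceived_local_date(event, -7))  # 2018-09-17
```

The same event lands on two different days, so two tables built with different conventions will disagree on every daily total even when the overall totals match.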
CLASSIC
UNICITY
SELECT date
  , contact_name
  , contact_id
  , COUNT(1)
  , COUNT(DISTINCT zone_id) AS zones
FROM {schema}.table_name
GROUP BY 1, 2, 3
HAVING COUNT(1) > 1
ORDER BY 4 DESC, 1 DESC, 3;

-- Unique orders & deliveries
SELECT order_id
  , delivery_id
  , COUNT(1)
FROM denormalised_deliveries
GROUP BY 1, 2
HAVING COUNT(1) > 1;

-- Unique credits
SELECT credit_id
  , order_id
  , COUNT(1)
FROM denormalised_credit
GROUP BY 1, 2
HAVING COUNT(1) > 1;

The aspect of database management that analysts most commonly overlook is
unicity (uniqueness).

Source fact tables, and even logs, typically have extensive unique indexes because
proper engineering recommends it. When tables are copied and joined along 1::n and
n::n relations, those unicity conditions are commonly broken.

I strongly recommend against comparing a `count` to a `unique_count`: those are
expensive, often silently default to approximations on large datasets, and only
tell you that there is a discrepancy. Using `having count(1) > 1` gives you the
exact values that are problematic and, more often than not, lets you diagnose the
problem on the spot.
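
A minimal sketch of how a 1::n join breaks unicity, and how the `HAVING` query names the offending key. It uses Python's built-in `sqlite3`; the source tables and sample rows are invented, and the join is only one plausible way `denormalised_deliveries` might be built.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, subtotal INTEGER);
    CREATE TABLE deliveries (delivery_id INTEGER PRIMARY KEY, order_id INTEGER);
    CREATE TABLE credits (credit_id INTEGER PRIMARY KEY, order_id INTEGER);

    INSERT INTO orders VALUES (1, 20), (2, 35);
    INSERT INTO deliveries VALUES (100, 1), (101, 2);
    -- Order 2 received two credits: a 1::n relation.
    INSERT INTO credits VALUES (500, 2), (501, 2);

    -- Joining the credits fans delivery 101 out into two rows.
    CREATE TABLE denormalised_deliveries AS
    SELECT d.delivery_id, d.order_id, o.subtotal, c.credit_id
    FROM deliveries AS d
    JOIN orders AS o ON o.order_id = d.order_id
    LEFT JOIN credits AS c ON c.order_id = d.order_id;
""")

# The query returns the exact keys that are duplicated, not just a mismatch.
duplicates = conn.execute("""
    SELECT order_id, delivery_id, COUNT(1) AS row_count
    FROM denormalised_deliveries
    GROUP BY order_id, delivery_id
    HAVING COUNT(1) > 1
    ORDER BY row_count DESC
""").fetchall()

print(duplicates)  # [(2, 101, 2)]
```

With the key in hand (order 2, delivery 101), a quick look at the source rows shows the two credits, which is usually enough to identify the join that caused the fan-out.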
YOU’D BE SURPRISED
MIN, MAX, AVG, NULL


SELECT
    MAX(computed_value) AS max_computed_value
  , MIN(computed_value) AS min_computed_value
  , AVG(computed_value) AS avg_computed_value
  , COUNT(CASE WHEN computed_value IS NULL THEN 1 END) AS computed_value_is_null
  -- , … More values
  , COUNT(1) AS count_total
FROM outcome_being_verified;

-- Customer agent hours worked
SELECT
    MAX(sum_hrs_out_of_service) AS max_sum_hrs_out_of_service
  , MIN(hrs_worked_paid) AS min_hrs_worked_paid
  , AVG(log_entry_count) AS avg_log_entry_count
  , COUNT(CASE WHEN hrs_worked_paid IS NULL THEN 1 END) AS computed_value_is_null
  , COUNT(1) AS count_total
FROM denormalised_cs_agent_hours_worked;

-- max_sum_hrs_out_of_service  min_hrs_worked_paid  avg_log_entry_count  computed_value_is_null  count_total
-- 15.7425                     -22.75305            1155                 0                       3247936

Some failures are less obvious and only appear when you look at representative
statistics, typically extremes and null counts.

I have seen tables where the sum of hours worked was negative, or larger than 24
hours, because of edge cases around the clock.

Same thing for non-matching joins:

- you generally want to keep at least one side (left or right join);

- but that means you might get nulls;

- one or a handful is fine, but thousands of mismatching instances can flag a
bigger problem.
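
Both failure modes fit in a toy example. This sketch uses Python's built-in `sqlite3`; the `agents` and `shifts` tables, their columns and the naive `end_hour - start_hour` computation are assumptions made up to show a shift crossing midnight and an agent without a matching row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE agents (agent_id INTEGER);
    CREATE TABLE shifts (agent_id INTEGER, start_hour REAL, end_hour REAL);

    INSERT INTO agents VALUES (1), (2), (3);
    -- Agent 2 works from 22:00 to 06:00; agent 3 has no shift logged.
    INSERT INTO shifts VALUES (1, 9, 17), (2, 22, 6);

    -- Naive computation: ignores shifts that cross midnight.
    CREATE TABLE denormalised_cs_agent_hours_worked AS
    SELECT a.agent_id
         , s.end_hour - s.start_hour AS hrs_worked_paid
    FROM agents AS a
    LEFT JOIN shifts AS s ON s.agent_id = a.agent_id;
""")

profile = conn.execute("""
    SELECT
        MAX(hrs_worked_paid) AS max_hrs_worked_paid
      , MIN(hrs_worked_paid) AS min_hrs_worked_paid
      , AVG(hrs_worked_paid) AS avg_hrs_worked_paid
      , COUNT(CASE WHEN hrs_worked_paid IS NULL THEN 1 END) AS hrs_worked_paid_is_null
      , COUNT(1) AS count_total
    FROM denormalised_cs_agent_hours_worked
""").fetchone()

print(profile)  # (8.0, -16.0, -4.0, 1, 3)
```

The negative minimum exposes the midnight bug and the null count exposes the unmatched agent. Note that `AVG` skips nulls, so the average alone would hide the missing row.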
YOU’D BE SURPRISED
DISTRIBUTION


SELECT
    ROUND(computed_value) AS computed_value
  , computed_boolean
  , COUNT(CASE WHEN hrs_worked_raw IS NULL THEN 1 END) AS computed_value_is_null
  , COUNT(1) AS count_total
FROM outcome_being_verified
GROUP BY 1, 2;

-- Redeliveries
SELECT is_fraudulent
  , COUNT(1)
FROM denormalised_actions
GROUP BY 1;

-- is_fraudulent  count
-- false          82 206
-- null           1 418 487
-- true           21 477 408

Data distributions also come without a clear boolean outcome: you want certain
values to be very low, but you might not have a hard limit for how much is too much.

Booleans that flag problems are a good example of an inferred feature. You really
want one of those for every possible edge case, type of abuse, etc.
`is_fraudulent` was one of those. Can you see why this one was problematic?

- First surprise: the number of nulls. Some nulls were expected (the estimate
was run once a day, so any transaction from the same day would have a null
estimate), but that number was way too high.

- Second problem, even more obvious: is it likely that fraudulent transactions
outnumber legitimate ones by more than two orders of magnitude? Not in this case:
the analyst flipped the branches of the `if` by mistake.
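
The flipped-branch mistake is easy to reproduce and to catch with a soft limit. This is a sketch with Python's built-in `sqlite3`; the `actions` table, the `risk_score` column, the 0.9 threshold and the `MAX_PLAUSIBLE_SHARE` limit are all invented for illustration.

```python
import sqlite3

# Assumed soft limit: more than half of all actions being fraud is implausible.
MAX_PLAUSIBLE_SHARE = 0.5

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE actions (action_id INTEGER, risk_score REAL);
    -- Action 5 has not been scored yet (same-day transaction).
    INSERT INTO actions VALUES (1, 0.1), (2, 0.2), (3, 0.05), (4, 0.95), (5, NULL);

    -- Buggy flag: the two branches are the wrong way round.
    CREATE TABLE denormalised_actions AS
    SELECT action_id
         , CASE WHEN risk_score IS NULL THEN NULL
                WHEN risk_score > 0.9 THEN 0
                ELSE 1
           END AS is_fraudulent
    FROM actions;
""")

distribution = dict(conn.execute("""
    SELECT is_fraudulent, COUNT(1)
    FROM denormalised_actions
    GROUP BY 1
""").fetchall())

flagged = distribution.get(1, 0)
cleared = distribution.get(0, 0)
flagged_share = flagged / (flagged + cleared)

print(distribution)   # {None: 1, 0: 1, 1: 3}
print(flagged_share)  # 0.75
print(flagged_share > MAX_PLAUSIBLE_SHARE)  # True: worth investigating
```

The limit does not need to be precise. Any plausible upper bound written down in advance would have caught a flag where most rows are marked as fraud.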
TO THE TIME MACHINE, MORTY!
SUCCESSIVE TIMESTAMPS
-- Timestamp succession
SELECT
    (datetime_1 < datetime_2)
  , (datetime_2 < datetime_3)
  , COUNT(1)
FROM {schema}.table_name
GROUP BY 1, 2;

-- Checking the timeframe of ZenDesk tickets
SELECT
    (ticket_created_at < ticket_updated_at)
  , (ticket_received_at < org_created)
  , COUNT(1)
FROM denormalised_zendesk_tickets
GROUP BY 1, 2;

-- Checking the pick-up timing makes sense
SELECT
    (order_created_at < delivered_at)
  , (order_received_at < driver_confirmed_at)
  , COUNT(1)
FROM denormalised_orders
GROUP BY 1, 2;

One common way to detect issues with data processes in production is to check
successions. Certain actions, like creating an account, have to happen before
others, like an action with that account. Not all timestamps have those constraints,
but there must be some. I have yet to work for a company without a very odd issue
along those lines. Typically, those are race conditions, overwritten values or
misunderstood event names.

I have seen those both in physical logistics, where a parcel was delivered before it
was sent, and in paperwork, with ZenDesk tickets resolved before they were
created, or even before the ticket type or the organisation handling them
was created. Having a detailed process for all of these tends to reveal edge cases.
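
A runnable sketch of the succession check, using Python's built-in `sqlite3`. The three sample orders are invented, and timestamps are stored as ISO-8601 text so that string comparison matches chronological order (an assumption that holds only when every value uses the same format and time zone).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE denormalised_orders (
        order_id INTEGER,
        order_created_at TEXT,
        driver_confirmed_at TEXT,
        delivered_at TEXT
    );
    INSERT INTO denormalised_orders VALUES
        (1, '2018-09-18 10:00:00', '2018-09-18 10:05:00', '2018-09-18 10:40:00'),
        -- Delivered before the driver confirmed: impossible in real life.
        (2, '2018-09-18 11:00:00', '2018-09-18 11:02:00', '2018-09-18 10:55:00'),
        -- Confirmation never logged.
        (3, '2018-09-18 12:00:00', NULL, '2018-09-18 12:30:00');
""")

# Count each combination of outcomes; NULL means a timestamp is missing.
succession = conn.execute("""
    SELECT
        (order_created_at < driver_confirmed_at) AS created_before_confirmed
      , (driver_confirmed_at < delivered_at) AS confirmed_before_delivered
      , COUNT(1)
    FROM denormalised_orders
    GROUP BY 1, 2
    ORDER BY 1, 2
""").fetchall()

# List the rows that break the expected order so they can be inspected.
out_of_order = conn.execute("""
    SELECT order_id
    FROM denormalised_orders
    WHERE driver_confirmed_at >= delivered_at
""").fetchall()

print(succession)    # [(None, None, 1), (1, 0, 1), (1, 1, 1)]
print(out_of_order)  # [(2,)]
```

Only the `(1, 1)` combination is healthy here. The null row is a separate question from the out-of-order row: one is a missing event, the other a contradiction.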
YOU CAN TAKE NOTES
WHAT TO CHECK
▸ Everything you can think of:
▸ Totals need to add up
▸ Some things have to be unique
▸ Min, Max, Avg, Null, aka basic distribution
▸ Distributions make sense
▸ A coherent timeline
▸ Keep a record of what you checked & what you expected

This is not an exhaustive list, more a list of suggestions to start your checks.
Expect to write a dozen tests per key table and to have most of them fail initially.
Other ways of auditing your data include:

- dashboards that break down key metrics;

- anomaly detection on key averages and ratios;

- having one or several small datasets with the same schema as production,
filled with edge cases; process those through the ETL and compare the result to an
expected output.
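
The last suggestion can be sketched as follows, using Python's built-in `sqlite3`. The `run_etl` function is a hypothetical stand-in for a real pipeline step (one row per order, cancellations excluded), and the fixture rows and `status` column are invented edge cases.

```python
import sqlite3


def run_etl(conn):
    """Hypothetical ETL step: one row per order, cancelled orders excluded."""
    conn.executescript("""
        DROP TABLE IF EXISTS denormalised_orders;
        CREATE TABLE denormalised_orders AS
        SELECT DISTINCT order_id, subtotal
        FROM orders
        WHERE status <> 'cancelled';
    """)


# Small fixture with the production schema, filled with edge cases.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, subtotal INTEGER, status TEXT);
    INSERT INTO orders VALUES
        (1, 20, 'delivered'),
        (1, 20, 'delivered'),   -- duplicated log line
        (2, 35, 'cancelled'),   -- cancellation
        (3, 0,  'delivered');   -- free order
""")

run_etl(conn)

actual = conn.execute("""
    SELECT order_id, subtotal
    FROM denormalised_orders
    ORDER BY order_id
""").fetchall()

# Written down before running the ETL: what we expect to come out.
expected = [(1, 20), (3, 0)]

print(actual == expected)  # True
```

Because the fixture is tiny, the expected output can be written by hand and reviewed by someone who knows the business rules. Each new incident found in production becomes one more row in the fixture.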

Not sure if you found that part as funny as I did, but I can promise you it’s great
to ask an over-confident analyst about any of those: the 20-something who swears
there is no way they ever did anything wrong makes this game really fun.