Five finger audit
1. Auditing data
Five finger mnemonic
1 All of those examples are real, and they happened at big, respectable companies full of very smart people. Industries have been swapped to protect the guilty.
2 I wanted to use the music from Benny Hill but I don’t know how to make that work
with Keynote, so you get Blackadder. I’ve tried to mask which company was
responsible for which discrepancy, but everywhere I have looked, I have stumbled
on confusing results.
Most of the time, it was because I, or the analyst responsible for building the data
pipeline, overlooked an edge case or misunderstood the logging. Sometimes, it was
a real problem with actual or attempted fraud, using race conditions, unexpected
patterns, etc. The point here is less to audit the website than to audit both the
analyst's understanding of the data process and the process itself.
2. THIS ONE IS EASY, YOU’D THINK
ADD UP TOTALS
SELECT
    SUM(value) AS sum_value
  , COUNT(1)   AS count_total
FROM denormalised_table;

SELECT
    SUM(value) AS sum_value
  , COUNT(1)   AS count_total
FROM original_table;
-- How much revenue & how many orders were generated?
SELECT
    SUM(subtotal) AS sum_value
  , COUNT(1)      AS count_total
FROM denormalised_orders;

SELECT
    SUM(subtotal) AS sum_value
  , COUNT(1)      AS count_total
FROM orders;
-- How much … per day?
SELECT
    SUM(subtotal) AS sum_value
  , COUNT(1)      AS count_total
  , order_date
FROM denormalised_orders
GROUP BY order_date;

SELECT
    SUM(subtotal) AS sum_value
  , COUNT(1)      AS count_total
  , order_date
FROM orders
GROUP BY order_date;
3 You would be surprised how often even basic numbers like the raw revenue or
order counts do not add up.
This is because processing involved handling duplicates, cancellations, etc.
Often, a fact table was only started later, so the discrepancy comes from missing
values before that date, because of a strict join.
With an international audience, the date of an activity requires a convention: should
we associate an event with its UTC date, or should we associate behaviour with
perceived local time (typically cutting the day at 4AM local rather than midnight)?
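One quick way to see how much the convention matters is to compute the daily totals under both conventions side by side. A minimal sketch, assuming the orders table carries a UTC timestamp and a per-row timezone (`created_at_utc` and `customer_timezone` are hypothetical names, and `AT TIME ZONE` is Postgres-flavoured):
SELECT
    CAST(created_at_utc AS DATE)                                  AS utc_date
  , CAST((created_at_utc AT TIME ZONE customer_timezone) AS DATE) AS local_date
  , SUM(subtotal) AS sum_value
  , COUNT(1)      AS count_total
FROM orders
GROUP BY 1, 2;
-- To cut the local day at 4AM rather than midnight, subtract 4 hours first:
-- CAST(((created_at_utc AT TIME ZONE customer_timezone) - INTERVAL '4' HOUR) AS DATE)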
CLASSIC
UNICITY
SELECT date
     , contact_name
     , contact_id
     , COUNT(1)
     , COUNT(DISTINCT zone_id) AS zones
FROM {schema}.table_name
GROUP BY 1, 2, 3
HAVING COUNT(1) > 1
ORDER BY 4 DESC, 1 DESC, 3;

-- Unique orders & deliveries
SELECT order_id
     , delivery_id
     , COUNT(1)
FROM denormalised_deliveries
GROUP BY 1, 2
HAVING COUNT(1) > 1;

-- Unique credits
SELECT credit_id
     , order_id
     , COUNT(1)
FROM denormalised_credit
GROUP BY 1, 2
HAVING COUNT(1) > 1;
4 The aspect of database management most commonly overlooked by analysts is
unicity.
Source fact tables, and even logs, typically have extensive unique indexes because
proper engineering recommends it. When they are copied and joined along 1::n and
n::n relations, those unicity conditions are commonly challenged.
I strongly recommend not relying on a `count` vs. `unique_count` comparison: those
are expensive, often silently default to approximations on large datasets, and only
tell you that there is a discrepancy. Using a `HAVING COUNT(1) > 1` clause gives
you the exact rows that are problematic and, more often than not, lets you diagnose
the problem on the spot.
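For contrast, here is the aggregate-only check I advise against, reusing the deliveries table from above: it only tells you whether the counts diverge, not which rows are duplicated, and `COUNT(DISTINCT …)` can be slow or approximated on very large tables.
-- The discouraged check: a single pair of totals, no offending rows returned
SELECT COUNT(1)                    AS count_total
     , COUNT(DISTINCT delivery_id) AS distinct_deliveries
FROM denormalised_deliveries;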
3. YOU’D BE SURPRISED
MIN, MAX, AVG, NULL
SELECT
    MAX(computed_value) AS max_computed_value
  , MIN(computed_value) AS min_computed_value
  , AVG(computed_value) AS avg_computed_value
  , COUNT(CASE
        WHEN computed_value IS NULL THEN 1
    END) AS computed_value_is_null
  -- , … More values
  , COUNT(1) AS count_total
FROM outcome_being_verified;
-- Customer agent hours worked
SELECT
    MAX(sum_hrs_out_of_service) AS max_sum_hrs_out_of_service
  , MIN(hrs_worked_paid)        AS min_hrs_worked_paid
  , AVG(log_entry_count)        AS avg_log_entry_count
  , COUNT(CASE
        WHEN hrs_worked_paid IS NULL THEN 1
    END) AS computed_value_is_null
  , COUNT(1) AS count_total
FROM denormalised_cs_agent_hours_worked;
-- max_sum_hrs_out_of_service | min_hrs_worked_paid | avg_log_entry_count | computed_value_is_null | count_total
-- 15.7425                    | -22.75305           | 1155                | 0                      | 3247936
5 Failures are less obvious when you look at representative statistics, typically
extremes and null counts.
I have seen tables where the sum of hours worked was negative, or larger than 24
hours, because of edge cases around the clock.
The same goes for non-matching joins:
- you generally want to keep at least one side (a left or right join);
- but that means you might get nulls;
- one or a handful is fine, but thousands of mismatching rows can flag a bigger
problem (see the sketch just below).
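A minimal sketch of that check, assuming the denormalised orders table was built with a LEFT JOIN to a drivers table (`driver_id` is a hypothetical column name): count how many rows ended up with nothing on the joined side.
-- How many rows ended up with nothing on the joined side?
SELECT COUNT(CASE WHEN driver_id IS NULL THEN 1 END) AS missing_driver
     , COUNT(1)                                      AS count_total
FROM denormalised_orders;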
YOU’D BE SURPRISED
DISTRIBUTION
SELECT
    ROUND(computed_value) AS computed_value
  , computed_boolean
  , COUNT(CASE
        WHEN computed_value IS NULL THEN 1
    END) AS computed_value_is_null
  , COUNT(1) AS count_total
FROM outcome_being_verified
GROUP BY 1, 2;
-- Redeliveries
SELECT is_fraudulent
     , COUNT(1)
FROM denormalised_actions
GROUP BY 1;
-- is_fraudulent | count
-- false         |     82 206
-- null          |  1 418 487
-- true          | 21 477 408
6 Data distributions are another check without a clear boolean outcome: you want
certain values to be very low, but you might not have a clear hard limit on how
much is too much.
A good example of an inferred feature is a boolean that flags problems. You really
want to have one of those for every possible edge case, type of abuse, etc.
`is_fraudulent` was one of those. Can you see why this one was problematic?
- First surprise, the amount of nulls. It was meant to have some nulls (the estimate
was run once a day, so any transaction from the same day would have a null
estimate), but still, that number was way too high.
- Second problem, even more obvious: is it likely that transactions flagged as
fraudulent outnumber legitimate ones by almost three orders of magnitude? Not in
this case: the analyst had flipped the branches of the `if` by mistake.
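One way to catch that kind of flipped branch is to cross-check the derived flag against the raw signal it was computed from. A minimal sketch, assuming the flag was derived from a `fraud_score` column (a hypothetical name):
-- If the IF branches were flipped, the score ranges per flag value look inverted
SELECT is_fraudulent
     , MIN(fraud_score) AS min_score
     , MAX(fraud_score) AS max_score
     , COUNT(1)         AS count_total
FROM denormalised_actions
GROUP BY 1;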
4. TO THE TIME MACHINE, MORTY!
SUCCESSIVE TIMESTAMPS
-- Timestamp succession
SELECT
    (datetime_1 < datetime_2)
  , (datetime_2 < datetime_3)
  , COUNT(1)
FROM {schema}.table_name
GROUP BY 1, 2;

-- Checking the timeframe of ZenDesk tickets
SELECT
    ticket_created_at < ticket_updated_at
  , ticket_received_at < org_created
  , COUNT(1)
FROM denormalised_zendesk_tickets
GROUP BY 1, 2;

-- Checking the pick-up timing makes sense
SELECT
    (order_created_at < delivered_at)
  , (order_received_at < driver_confirmed_at)
  , COUNT(1)
FROM denormalised_orders
GROUP BY 1, 2;
7 One common way to detect issues with data processes in production is to check
successions: certain actions, like creating an account, have to happen before
others, like an action taken with that account. Not all timestamps have those
constraints, but there must be some. I have yet to work for a company without a
very odd issue along those lines. Typically, those are race conditions, overwritten
values or misunderstood event names.
I have seen those both for physical logistics, where a parcel was delivered before it
was sent, and for paperwork, with ZenDesk tickets resolved before they were
created, or even before the type of the ticket or the organisation handling them
was created. Having a detailed process for all of these tends to reveal edge cases.
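Once the grouped counts above show that some orderings are violated, a follow-up query that returns the offending rows themselves makes the race condition or overwritten value much easier to diagnose. A minimal sketch, reusing the orders columns from above (the `order_id` column and the `LIMIT` syntax are assumptions about the table and dialect):
-- The actual rows where delivery precedes creation
SELECT order_id
     , order_created_at
     , delivered_at
FROM denormalised_orders
WHERE delivered_at < order_created_at
ORDER BY order_created_at DESC
LIMIT 100;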
YOU CAN TAKE NOTES
WHAT TO CHECK
▸ Everything you can think of:
▸ Totals need to add up
▸ Some things have to be unique
▸ Min, Max, Avg, Null aka basic distribution
▸ Distributions make sense
▸ A coherent timeline
▸ Keep a record of what you checked & expect
8 This is not an exhaustive list, more a list of suggestions to start checks. Expect to
write a dozen tests per key table and have most of them fail initially. Other ways of
auditing your data include:
- dashboards that break down key metrics; anomaly detection on key averages
and ratios;
- keeping one or several small datasets with the same schema as production,
filled with edge cases; processing those through the ETL and comparing to an
expected output (see the sketch after this list).
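A minimal sketch of that last comparison, assuming the ETL writes to `etl_output_orders` and the hand-built edge cases live in `expected_orders` (both names hypothetical): any row returned by either query points at a processing bug.
-- Rows the ETL produced that were not expected
SELECT * FROM etl_output_orders
EXCEPT
SELECT * FROM expected_orders;

-- Expected rows the ETL failed to produce
SELECT * FROM expected_orders
EXCEPT
SELECT * FROM etl_output_orders;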
Not sure if you found that part as funny as I did, but I can promise you: it is great to
ask that over-confident analyst about any of these, because the 20-something who
swears there is no way they ever did anything wrong makes this game really fun.