Five finger audit
1. Auditing data
Five finger mnemonic
1 All of those examples are real, and they happened at big, respectable companies full of very smart people. Industries have been swapped to protect the guilty.
2 I wanted to use the music from Benny Hill but I don’t know how to make that work
with Keynote, so you get Blackadder. I’ve tried to mask which company was
responsible for which discrepancy, but everywhere I have looked, I have stumbled
on confusing results.
Most of the time, it was because I, or the analyst responsible for building the data
pipeline, overlooked an edge case or misunderstood the logging. Sometimes, it was
a real problem with actual or attempted fraud, using race conditions, unexpected
patterns, etc. The point here is less to audit the website than to audit both the
analyst's understanding of the data process and the process itself.
2. THIS ONE IS EASY, YOU’D THINK
ADD UP TOTALS
SELECT
    SUM(value) AS sum_value
  , COUNT(1)   AS count_total
FROM denormalised_table;

SELECT
    SUM(value) AS sum_value
  , COUNT(1)   AS count_total
FROM original_table;
-- How much revenue & how many orders were generated?
SELECT
    SUM(subtotal) AS sum_value
  , COUNT(1)      AS count_total
FROM denormalised_orders;

SELECT
    SUM(subtotal) AS sum_value
  , COUNT(1)      AS count_total
FROM orders;
-- How much … per day?
SELECT
    SUM(subtotal) AS sum_value
  , COUNT(1)      AS count_total
  , order_date
FROM denormalised_orders
GROUP BY order_date;

SELECT
    SUM(subtotal) AS sum_value
  , COUNT(1)      AS count_total
  , order_date
FROM orders
GROUP BY order_date;
3 You would be surprised how often even basic numbers like the raw revenue or
order counts do not add up.
This is because processing involved handling duplicates, cancellations, etc.
Often, a fact table was only started later, so the discrepancy comes from missing
values before that date, because of a strict join.
With an international audience, the date of an activity requires a convention: should
we associate an event with its UTC date, or should we associate behaviour with
perceived local time (typically cutting the day at 4AM local rather than midnight)?
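One quick way to see how much the convention matters is to compute the daily totals under both conventions side by side. A minimal sketch, assuming the orders table carries a UTC timestamp and a per-row timezone (`created_at_utc` and `customer_timezone` are hypothetical names, and `AT TIME ZONE` is Postgres-flavoured):
SELECT
    CAST(created_at_utc AS DATE)                                  AS utc_date
  , CAST((created_at_utc AT TIME ZONE customer_timezone) AS DATE) AS local_date
  , SUM(subtotal) AS sum_value
  , COUNT(1)      AS count_total
FROM orders
GROUP BY 1, 2;
-- To cut the local day at 4AM rather than midnight, subtract 4 hours first:
-- CAST(((created_at_utc AT TIME ZONE customer_timezone) - INTERVAL '4' HOUR) AS DATE)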
CLASSIC
UNICITY
SELECT date
     , contact_name
     , contact_id
     , COUNT(1)
     , COUNT(DISTINCT zone_id) AS zones
FROM {schema}.table_name
GROUP BY 1, 2, 3
HAVING COUNT(1) > 1
ORDER BY 4 DESC, 1 DESC, 3;

-- Unique orders & deliveries
SELECT order_id
     , delivery_id
     , COUNT(1)
FROM denormalised_deliveries
GROUP BY 1, 2
HAVING COUNT(1) > 1;

-- Unique credits
SELECT credit_id
     , order_id
     , COUNT(1)
FROM denormalised_credit
GROUP BY 1, 2
HAVING COUNT(1) > 1;
4 The aspect of database management most commonly overlooked by analysts is
unicity.
Source fact tables, and even logs, typically have extensive unique indexes because
proper engineering recommends it. When they are copied and joined along 1::n and
n::n relations, those unicity conditions are commonly challenged.
I strongly recommend not relying on a `count` vs. `unique_count` comparison: those
are expensive, often silently default to approximations on large datasets, and only
tell you that there is a discrepancy. Using a `HAVING COUNT(1) > 1` clause gives
you the exact rows that are problematic and, more often than not, lets you diagnose
the problem on the spot.
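For contrast, here is the aggregate-only check I advise against, reusing the deliveries table from above: it only tells you whether the counts diverge, not which rows are duplicated, and `COUNT(DISTINCT …)` can be slow or approximated on very large tables.
-- The discouraged check: a single pair of totals, no offending rows returned
SELECT COUNT(1)                    AS count_total
     , COUNT(DISTINCT delivery_id) AS distinct_deliveries
FROM denormalised_deliveries;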
3. YOU’D BE SURPRISED
MIN, MAX, AVG, NULL
SELECT
    MAX(computed_value) AS max_computed_value
  , MIN(computed_value) AS min_computed_value
  , AVG(computed_value) AS avg_computed_value
  , COUNT(CASE
        WHEN computed_value IS NULL THEN 1
    END) AS computed_value_is_null
  -- , … More values
  , COUNT(1) AS count_total
FROM outcome_being_verified;
-- Customer agent hours worked
SELECT
    MAX(sum_hrs_out_of_service) AS max_sum_hrs_out_of_service
  , MIN(hrs_worked_paid)        AS min_hrs_worked_paid
  , AVG(log_entry_count)        AS avg_log_entry_count
  , COUNT(CASE
        WHEN hrs_worked_paid IS NULL THEN 1
    END) AS computed_value_is_null
  , COUNT(1) AS count_total
FROM denormalised_cs_agent_hours_worked;
-- max_sum_hrs_out_of_service | min_hrs_worked_paid | avg_log_entry_count | computed_value_is_null | count_total
-- 15.7425                    | -22.75305           | 1155                | 0                      | 3247936
5 Failures are less obvious when you look at representative statistics, typically
extremes and null counts.
I have seen tables where the sum of hours worked was negative, or larger than 24
hours, because of edge cases around the clock.
The same goes for non-matching joins:
- you generally want to keep at least one side (a left or right join);
- but that means you might get nulls;
- one or a handful is fine, but thousands of mismatching rows can flag a bigger
problem (see the sketch just below).
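A minimal sketch of that check, assuming the denormalised orders table was built with a LEFT JOIN to a drivers table (`driver_id` is a hypothetical column name): count how many rows ended up with nothing on the joined side.
-- How many rows ended up with nothing on the joined side?
SELECT COUNT(CASE WHEN driver_id IS NULL THEN 1 END) AS missing_driver
     , COUNT(1)                                      AS count_total
FROM denormalised_orders;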
YOU’D BE SURPRISED
DISTRIBUTION
SELECT
    ROUND(computed_value) AS computed_value
  , computed_boolean
  , COUNT(CASE
        WHEN computed_value IS NULL THEN 1
    END) AS computed_value_is_null
  , COUNT(1) AS count_total
FROM outcome_being_verified
GROUP BY 1, 2;
-- Redeliveries
SELECT is_fraudulent
     , COUNT(1)
FROM denormalised_actions
GROUP BY 1;
-- is_fraudulent | count
-- false         |     82 206
-- null          |  1 418 487
-- true          | 21 477 408
6 Data distributions are another check without a clear boolean outcome: you want
certain values to be very low, but you might not have a clear hard limit on how
much is too much.
A good example of an inferred feature is a boolean that flags problems. You really
want to have one of those for every possible edge case, type of abuse, etc.
`is_fraudulent` was one of those. Can you see why this one was problematic?
- First surprise, the amount of nulls. It was meant to have some nulls (the estimate
was run once a day, so any transaction from the same day would have a null
estimate), but still, that number was way too high.
- Second problem, even more obvious: is it likely that transactions flagged as
fraudulent outnumber legitimate ones by almost three orders of magnitude? Not in
this case: the analyst had flipped the branches of the `if` by mistake.
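One way to catch that kind of flipped branch is to cross-check the derived flag against the raw signal it was computed from. A minimal sketch, assuming the flag was derived from a `fraud_score` column (a hypothetical name):
-- If the IF branches were flipped, the score ranges per flag value look inverted
SELECT is_fraudulent
     , MIN(fraud_score) AS min_score
     , MAX(fraud_score) AS max_score
     , COUNT(1)         AS count_total
FROM denormalised_actions
GROUP BY 1;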
4. TO THE TIME MACHINE, MORTY!
SUCCESSIVE TIMESTAMPS
-- Timestamp succession
SELECT
    (datetime_1 < datetime_2)
  , (datetime_2 < datetime_3)
  , COUNT(1)
FROM {schema}.table_name
GROUP BY 1, 2;

-- Checking the timeframe of ZenDesk tickets
SELECT
    ticket_created_at < ticket_updated_at
  , ticket_received_at < org_created
  , COUNT(1)
FROM denormalised_zendesk_tickets
GROUP BY 1, 2;

-- Checking the pick-up timing makes sense
SELECT
    (order_created_at < delivered_at)
  , (order_received_at < driver_confirmed_at)
  , COUNT(1)
FROM denormalised_orders
GROUP BY 1, 2;
7 One common way to detect issues with data processes in production is to check
successions: certain actions, like creating an account, have to happen before
others, like an action taken with that account. Not all timestamps have those
constraints, but there must be some. I have yet to work for a company without a
very odd issue along those lines. Typically, those are race conditions, overwritten
values or misunderstood event names.
I have seen those both for physical logistics, where a parcel was delivered before it
was sent, and for paperwork, with ZenDesk tickets resolved before they were
created, or even before the type of the ticket or the organisation handling them
was created. Having a detailed process for all of these tends to reveal edge cases.
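Once the grouped counts above show that some orderings are violated, a follow-up query that returns the offending rows themselves makes the race condition or overwritten value much easier to diagnose. A minimal sketch, reusing the orders columns from above (the `order_id` column and the `LIMIT` syntax are assumptions about the table and dialect):
-- The actual rows where delivery precedes creation
SELECT order_id
     , order_created_at
     , delivered_at
FROM denormalised_orders
WHERE delivered_at < order_created_at
ORDER BY order_created_at DESC
LIMIT 100;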
YOU CAN TAKE NOTES
WHAT TO CHECK
▸ Everything you can think of:
▸ Totals need to add up
▸ Some things have to be unique
▸ Min, Max, Avg, Null aka basic distribution
▸ Distributions make sense
▸ A coherent timeline
▸ Keep a record of what you checked & expect
8 This is not an exhaustive list, more a list of suggestions to start checks. Expect to
write a dozen tests per key table and have most of them fail initially. Other ways of
auditing your data include:
- dashboards that break down key metrics; anomaly detection on key averages
and ratios;
- keeping one or several small datasets with the same schema as production,
filled with edge cases; processing those through the ETL and comparing to an
expected output (see the sketch after this list).
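A minimal sketch of that last comparison, assuming the ETL writes to `etl_output_orders` and the hand-built edge cases live in `expected_orders` (both names hypothetical): any row returned by either query points at a processing bug.
-- Rows the ETL produced that were not expected
SELECT * FROM etl_output_orders
EXCEPT
SELECT * FROM expected_orders;

-- Expected rows the ETL failed to produce
SELECT * FROM expected_orders
EXCEPT
SELECT * FROM etl_output_orders;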
Not sure if you found that part as funny as I did, but I can promise you: it is great to
ask that over-confident analyst about any of these, because the 20-something who
swears there is no way they ever did anything wrong makes this game really fun.