The document discusses how product managers can improve a company's data model to make queries and analysis easier. It explains that as questions get more complex, queries become prone to errors. The presentation recommends modeling business logic directly in the data by adding calculated fields, standardizing values, and cleaning up columns. This improves the data quality and allows simpler queries that are less error-prone. Product managers can contribute by suggesting improvements to the data model.
6. Improving a Data Model:
How PMs can contribute
WEBINAR
Matt David
Growth PM @ CHARTIO
7. Speaker
Matt David
Product Lead, Chartio
PM for 7 years:
● Adecco - built apps to help people get jobs
● Udacity - built courses to help people increase their career trajectory
● Chartio - building a product to help everyone make better decisions
● General Assembly - part-time instructor on data
12. Typical PM Question
What is the number of active users?
Define "active" and define "users":
SELECT
  COUNT(DISTINCT Users.id)               -- define users: each user counted once
FROM
  Users
JOIN
  Actions
  ON Users.id = Actions.user_id
WHERE
  Actions.date > CURRENT_DATE - 30 AND   -- define active: an action in the last 30 days
  Actions.date < CURRENT_DATE
1,000 Active Users
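The active-users count can be tried on a toy dataset. The following is a minimal sketch, assuming SQLite via Python's standard `sqlite3` module in place of the warehouse; the action dates, the `count_active_users` helper, and the fixed `today` parameter (used instead of `CURRENT_DATE` so the result is reproducible) are illustrative assumptions, not part of the original slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Actions (user_id INTEGER, date TEXT);  -- ISO dates compare correctly as text
INSERT INTO Users VALUES (1, 'Matt'), (2, 'Dave'), (3, 'Tra');
INSERT INTO Actions VALUES
  (1, '2020-01-20'), (1, '2020-01-25'),  -- two recent actions, still one user
  (2, '2019-11-01'),                     -- outside the 30-day window
  (3, '2020-01-10');
""")

def count_active_users(conn, today):
    """Count distinct users with an action in the 30 days before `today`."""
    row = conn.execute(
        """
        SELECT COUNT(DISTINCT Users.id)
        FROM Users
        JOIN Actions ON Users.id = Actions.user_id
        WHERE Actions.date > date(:today, '-30 days')
          AND Actions.date < :today
        """,
        {"today": today},
    ).fetchone()
    return row[0]

active = count_active_users(conn, "2020-01-31")
print(active)  # 2: Matt and Tra acted inside the window, Dave did not
```

COUNT(DISTINCT ...) matters here: Matt has two actions in the window, and a plain COUNT(*) would count him twice.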
15. How Questions Evolve
Active Users
What emails did non-active users receive?
What actions did they do the most?
A month ago
17. Complex Queries are Error Prone
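SELECT
  Users.id,
  CASE WHEN
    Actions.date > CURRENT_DATE - 30 AND
    Actions.date < CURRENT_DATE THEN 'active'
  ELSE
    'not active'
  END AS active_user
FROM
  Users
JOIN
  Actions
  ON Users.id = Actions.user_id
-- Caution: the join returns one row per action, so a user with both
-- old and recent actions appears under both labels.

id  active_user
1   active
2   not active
3   active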
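One concrete way the labelling query above can go wrong is join fan-out. The sketch below, again assuming SQLite through Python's `sqlite3` with made-up action dates and a fixed `TODAY` in place of `CURRENT_DATE`, compares the per-action labelling with a per-user variant. The per-user variant (LEFT JOIN plus GROUP BY) is one possible fix offered for illustration, not the query from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Users (id INTEGER PRIMARY KEY);
CREATE TABLE Actions (user_id INTEGER, date TEXT);
INSERT INTO Users VALUES (1), (2), (3);
INSERT INTO Actions VALUES
  (1, '2019-10-01'), (1, '2020-01-20'),  -- user 1: one old action, one recent
  (2, '2019-11-01');                     -- user 2: old action only; user 3: no actions
""")
TODAY = "2020-01-31"  # fixed reference date so the result is reproducible

# Per-action labelling: the inner join yields one row per action.
per_action = conn.execute("""
    SELECT Users.id,
           CASE WHEN Actions.date > date(:today, '-30 days')
                 AND Actions.date < :today
                THEN 'active' ELSE 'not active' END AS active_user
    FROM Users
    JOIN Actions ON Users.id = Actions.user_id
    ORDER BY Users.id, Actions.date
""", {"today": TODAY}).fetchall()

# Per-user labelling: LEFT JOIN keeps users without actions,
# GROUP BY collapses the result to one row per user.
per_user = conn.execute("""
    SELECT Users.id,
           CASE WHEN MAX(CASE WHEN Actions.date > date(:today, '-30 days')
                               AND Actions.date < :today
                              THEN 1 ELSE 0 END) = 1
                THEN 'active' ELSE 'not active' END AS active_user
    FROM Users
    LEFT JOIN Actions ON Users.id = Actions.user_id
    GROUP BY Users.id
    ORDER BY Users.id
""", {"today": TODAY}).fetchall()

print(per_action)  # user 1 appears twice with conflicting labels, user 3 is missing
print(per_user)    # exactly one label per user
```

Every downstream query that joins against the per-action result inherits the duplicate rows, which is why the definition is worth fixing once in the data model rather than in each query.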
19. Complex Queries are Error Prone
SELECT
  Subject,
  COUNT(*)
FROM
  Emails
JOIN (
  SELECT
    Users.id AS user_id,
    CASE WHEN
      Actions.date > CURRENT_DATE - 30 AND
      Actions.date < CURRENT_DATE THEN 'active'
    ELSE
      'not active'
    END AS active_user
  FROM
    Users
  JOIN
    Actions
    ON Users.id = Actions.user_id
) AS a
  ON a.user_id = Emails.user_id
WHERE
  a.active_user = 'not active'
GROUP BY
  Subject

Subject                 Count
Welcome to Chartio      1000
Connect A Database      900
Check out our webinar   600
20. Improve the Data Model
Users
id  name  date      active_user
1   Matt  1-1-2020  active
2   Dave  1-3-2020  not active
3   Tra   1-5-2020  active
21. Improve the Data Model
SELECT
  Subject,
  COUNT(*)
FROM
  Emails
JOIN
  Users
  ON Users.id = Emails.user_id
WHERE
  Users.active_user = 'not active'
GROUP BY
  Subject

Subject                 Count
Welcome to Chartio      1000
Connect A Database      900
Check out our webinar   600
23. How do we make this happen?
Users
id  name  date      active_user
1   Matt  1-1-2020  active
2   Dave  1-3-2020  not active
3   Tra   1-5-2020  active
27. Write the Query
SELECT
  Users.*,   -- only the Users columns; a bare * would also return every Actions column
  CASE WHEN
    Actions.date > CURRENT_DATE - 30 AND
    Actions.date < CURRENT_DATE THEN
    'active'
  ELSE
    'not active'
  END AS active_user
FROM
  Users
JOIN
  Actions
  ON
  Users.id = Actions.user_id

Users
id  name  date      active_user
1   Matt  1-1-2020  active
2   Dave  1-3-2020  not active
3   Tra   1-5-2020  active
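To see the modeled field in action, the following sketch stores `active_user` on the Users table and then answers the original question with a one-line filter. It assumes SQLite via Python's `sqlite3`; the action dates, the fixed reference date, and the `refresh_active_user` helper are hypothetical, and an UPDATE with EXISTS is just one way to populate the column (a view or a scheduled transformation would serve the same purpose).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Users (id INTEGER PRIMARY KEY, name TEXT, date TEXT, active_user TEXT);
CREATE TABLE Actions (user_id INTEGER, date TEXT);
INSERT INTO Users (id, name, date) VALUES
  (1, 'Matt', '2020-01-01'), (2, 'Dave', '2020-01-03'), (3, 'Tra', '2020-01-05');
INSERT INTO Actions VALUES (1, '2020-02-01'), (2, '2020-01-04'), (3, '2020-02-10');
""")

def refresh_active_user(conn, today):
    """Recompute Users.active_user from Actions for the 30 days before `today`."""
    conn.execute(
        """
        UPDATE Users
        SET active_user = CASE
          WHEN EXISTS (
            SELECT 1 FROM Actions
            WHERE Actions.user_id = Users.id
              AND Actions.date > date(:today, '-30 days')
              AND Actions.date < :today
          ) THEN 'active'
          ELSE 'not active'
        END
        """,
        {"today": today},
    )

refresh_active_user(conn, "2020-02-15")

users = conn.execute(
    "SELECT id, name, active_user FROM Users ORDER BY id"
).fetchall()
active_count = conn.execute(
    "SELECT COUNT(*) FROM Users WHERE active_user = 'active'"
).fetchone()[0]
print(users)
print(active_count)
```

Because "active" depends on the current date, a stored field like this needs to be refreshed on a schedule; the business logic still lives in one place instead of in every query.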
31. Query the improved data
With the improved data model:
SELECT
  COUNT(*)
FROM
  Users
WHERE
  active_user = 'active'

Against the original data model:
SELECT
  COUNT(DISTINCT Users.id)
FROM
  Users
JOIN
  Actions
  ON
  Users.id = Actions.user_id
WHERE
  Actions.date > CURRENT_DATE - 30 AND
  Actions.date < CURRENT_DATE
32. Query the improved data
With the improved data model:
SELECT
  Subject,
  COUNT(*)
FROM
  Emails
JOIN
  Users
  ON Users.id = Emails.user_id
WHERE
  Users.active_user = 'not active'
GROUP BY
  Subject

Against the original data model:
SELECT
  Subject,
  COUNT(*)
FROM
  Emails
JOIN (
  SELECT
    Users.id AS user_id,
    CASE WHEN
      Actions.date > CURRENT_DATE - 30 AND
      Actions.date < CURRENT_DATE THEN 'active'
    ELSE
      'not active'
    END AS active_user
  FROM
    Users
  JOIN
    Actions
    ON Users.id = Actions.user_id
) AS a
  ON a.user_id = Emails.user_id
WHERE
  a.active_user = 'not active'
GROUP BY
  Subject
34. Improvements to look out for
● Adding fields that contain business logic
● Confusing Columns
● Inconsistent naming
● Non-descriptive Columns
● Non-descriptive Values
● JSON
● Deprecated Data
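As an illustration of the JSON item in the list above, the sketch below flattens a JSON blob into plain, descriptive columns so analysts no longer parse it in every query. The `raw_users` and `clean_users` tables, the `info` column, and the `status` and `plan` keys are all hypothetical names chosen for the example; parsing is done with Python's standard `json` module rather than any database-specific JSON function.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_users (id INTEGER PRIMARY KEY, info TEXT);  -- info holds a JSON blob
INSERT INTO raw_users VALUES
  (1, '{"status": "active", "plan": "pro"}'),
  (2, '{"status": "inactive"}');
CREATE TABLE clean_users (id INTEGER PRIMARY KEY, status TEXT, plan TEXT);
""")

# Parse each blob once and store the relevant keys as ordinary columns.
for user_id, info in conn.execute("SELECT id, info FROM raw_users").fetchall():
    fields = json.loads(info)
    conn.execute(
        "INSERT INTO clean_users VALUES (?, ?, ?)",
        (user_id, fields.get("status"), fields.get("plan")),  # missing keys become NULL
    )

rows = conn.execute(
    "SELECT id, status, plan FROM clean_users ORDER BY id"
).fetchall()
print(rows)
```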
37. Cleaned View SQL
-- Create view for Data Warehouse (the WITH clause sits inside the view definition)
CREATE VIEW dw_table AS
-- Drop unused column External_id
WITH t1 AS (
  SELECT Id, Name, "Display Name", Email,
         Location, Type, Info, Status
  FROM dl_table
),
-- Add consistent column Email
-- (assumes dl_email supplies the consistent Email; is_deleted also comes from dl_email)
t2 AS (
  SELECT t1.Id, Name, "Display Name", dl_email.Email,
         Location, Type, Info, Status, is_deleted
  FROM t1
  JOIN dl_email
    ON t1.Id = dl_email.Id
),
-- Standardize Location column
t3 AS (
  SELECT Id, Name, "Display Name", Email,
         CASE WHEN Location = 'US' THEN 'USA'
              WHEN Location = 'Texas' THEN 'USA'
              WHEN Location = 'Sao Paulo' THEN 'Brazil'
              ELSE Location
         END AS Location,
         Type, Info, Status, is_deleted
  FROM t2
),
-- Make column names and values descriptive for Type
t4 AS (
  SELECT Id, Name, "Display Name", Email,
         Location,
         CASE WHEN Type = '1' THEN 'Can view'
              WHEN Type = '2' THEN 'Can edit'
              WHEN Type = '3' THEN 'Can admin'
         END AS "Access Level",
         Info, Status, is_deleted
  FROM t3
),
-- Parse relevant fields, drop original column for Info
-- ('%inactive' is tested first because '%active' also matches 'inactive')
t5 AS (
  SELECT Id, Name, "Display Name", Email,
         Location, "Access Level",
         CASE WHEN Info LIKE '%inactive' THEN 'inactive'
              WHEN Info LIKE '%active' THEN 'active'
         END AS Status,
         is_deleted
  FROM t4
),
-- Filter rows that were deprecated via is_deleted, and drop the column
t6 AS (
  SELECT Id, Name, "Display Name",
         Email, Location, "Access Level", Status
  FROM t5
  WHERE is_deleted != TRUE
)
SELECT *
FROM t6
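A condensed version of the cleaning steps can be run end to end on toy data. This is a sketch under stated assumptions: SQLite via Python's `sqlite3` stands in for the warehouse, the sample rows and the `acct:` prefix in `Info` are invented, `is_deleted` is stored as 0/1, and the six CTEs are collapsed into one for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dl_table (
  Id INTEGER, Name TEXT, "Display Name" TEXT, Location TEXT,
  Type TEXT, Info TEXT, External_id TEXT
);
CREATE TABLE dl_email (Id INTEGER, Email TEXT, is_deleted INTEGER);
INSERT INTO dl_table VALUES
  (1, 'matt', 'Matt', 'US',        '1', 'acct:active',   'x1'),
  (2, 'dave', 'Dave', 'Texas',     '2', 'acct:inactive', 'x2'),
  (3, 'tra',  'Tra',  'Sao Paulo', '3', 'acct:active',   'x3'),
  (4, 'old',  'Old',  'Brazil',    '1', 'acct:active',   'x4');
INSERT INTO dl_email VALUES
  (1, 'matt@example.com', 0), (2, 'dave@example.com', 0),
  (3, 'tra@example.com', 0),  (4, 'old@example.com', 1);

CREATE VIEW dw_table AS
WITH joined AS (
  -- External_id is simply not selected, which drops it from the view
  SELECT t.Id, t.Name, t."Display Name", e.Email,
         t.Location, t.Type, t.Info, e.is_deleted
  FROM dl_table AS t
  JOIN dl_email AS e ON t.Id = e.Id
)
SELECT
  Id, Name, "Display Name", Email,
  CASE WHEN Location IN ('US', 'Texas') THEN 'USA'
       WHEN Location = 'Sao Paulo' THEN 'Brazil'
       ELSE Location
  END AS Location,
  CASE Type WHEN '1' THEN 'Can view'
            WHEN '2' THEN 'Can edit'
            WHEN '3' THEN 'Can admin'
  END AS "Access Level",
  -- '%inactive' is tested first because '%active' also matches 'inactive'
  CASE WHEN Info LIKE '%inactive' THEN 'inactive'
       WHEN Info LIKE '%active' THEN 'active'
  END AS Status
FROM joined
WHERE is_deleted = 0;
""")

rows = conn.execute(
    'SELECT Id, Location, "Access Level", Status FROM dw_table ORDER BY Id'
).fetchall()
print(rows)
```

Querying `dw_table` now returns standardized locations, readable access levels, and a parsed status, with the deleted row filtered out, so downstream queries need none of that logic themselves.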