In order to find value in your organization’s data assets, heroic data stewards are tasked with saving the day — every single day! Adhering to the organizational Data Governance (DG) framework, they work to ensure that data is captured right the first time, validated through appropriately automated means, and integrated into business processing. Whether it’s data profiling or in-depth root cause analysis, data stewards ensure the organization’s mission-critical data is reliably coordinated. This program will approach this framework and punctuate important facets of a data steward’s role.
49. 8/10/21, 3:31 PM
You Probably Don't Need a Data Dictionary - Locally Optimistic
Page 1 of 9
https://locallyoptimistic.com/post/data_dictionaries/
Home About Getting Involved LO Events
In Tools Tags Analytics, Business Intelligence, Data Dictionary, Data Engineering June 17, 2019 Michael Kaminsky & Alex J
You Probably Don’t Need a Data Dictionary
While efforts to build a data dictionary are often undertaken out of a zeal for documentation
applaud, in practice data dictionaries and data catalogs end up being a large maintenance
value, and tend to very quickly become out of date.
Instead of investing in building out traditional data dictionaries, we recommend a few differe
50. 8/10/21, 3:31 PM
You Probably Don't Need a Data Dictionary - Locally Optimistic
Page 2 of 9
https://locallyoptimistic.com/post/data_dictionaries/
achieving the same goals in ways that are less burdensome to maintain and better serve th
well.
What is a data dictionary?
In this post we’ll use the term “data dictionary” to describe a range of products that include
and “data catalogs”. In general, we’re referring to documents that contain metadata about
warehouse – most commonly, they have one section per table with one row per column inc
column-name, the column-type, any relevant foreign-keys, and a brief description of the co
will be additional metadata about the source of the table (a source system or potentially a p
A screenshot from an unhelpful data dictionary.
Why are data dictionaries created?
Listen. There’s no activity the two of us like better for enjoying a beautiful spring Saturday t
desk, cracking open a cold Kombucha, and writing some damn documentation. While we d
desire to document-all-the-things, we also believe that it is important to find the right docum
the goals that you want to achieve. People tend to embark on the data dictionary journey w
vague goals in mind. They want to 1) enable more data exploration, 2) codify a single sourc
definitions, and 3) document the provenance of particular pieces of business logic.
51. 8/10/21, 3:31 PM
You Probably Don't Need a Data Dictionary - Locally Optimistic
Page 3 of 9
https://locallyoptimistic.com/post/data_dictionaries/
These are all great goals! However, we do not believe that data dictionaries are the best too
these objectives. We will talk a bit about why we do not believe that data dictionaries, as tra
work very well before diving into alternative ways to achieve those goals.
What goes wrong with data dictionaries?
With your traditional data dictionary, you start small and simple. Maybe just documenting th
tables, or the definitions and business logic behind your top three KPIs.
But new use cases are always coming up, new ways of looking at data and measuring succ
to standardize) – and so the list grows.
If your KPIs aren’t just pulling from a single table or a single system, now you need to start
different sources join together in a wild mess of foreign key relationships. Maybe you even c
pictures speak a thousand words after all.
And everything is great…until you run into the issue of different audiences:
Non-technical business users: Just want the data, or at most, a high-level summary
they’re important.
Business analysts: Want the data, but also the business logic behind it, and probably
populate the source and/or the relevant context.
Data scientists / Engineers: Just want the column-level metadata, relationships, and
raw data correctly – “I can write my own SQL/Python, thanks.”
Which means that realistically, to solve everyone’s use case, you’ll need to write three versi
one, very large version) for any given piece of data. Not fun.
But even if you’re willing to go down that path, eventually something will change. Maybe yo
Conversion just shifted a little bit (as we all know, business logic can be extremely slippery)
52. 8/10/21, 3:31 PM
You Probably Don't Need a Data Dictionary - Locally Optimistic
Page 4 of 9
https://locallyoptimistic.com/post/data_dictionaries/
have to update not only your data model and dashboards + reports, but also 3 documents
The maintenance burden very quickly grows into a scalability issue, and a serious bloc
iteration speed of the entire data team. Not only does maintaining documentation slow you
documentation starts to slip (as it inevitably will) the consumers of the data dictionary will b
you will be investing time in maintaining this document that people do not even trust (and th
As a result, the heroes who start down the path of the perfect data dictionary often end up
1. Is hard to maintain
2. Is hard to use, and doesn’t answer everyone’s questions
3. Is generally out-of-date (in one way or another)
4. Ends up getting deprecated
Fortunately, we have better ways of achieving the same goals with less maintenance burde
What are the other options?
We believe the best way to approach the problem is by using the right tools for the right job
solutions for users at each level of the data literacy curve.
For the non-technical business user:
They’re really looking for an actionable, easy-to-use, single source of truth. Data – signed, s
our opinion, the best solution involves showing rather than telling: a well-architected and w
general, we believe that non-technical business users should not be writing SQL and theref
dictionary as traditionally constituted.
We do not want to go into a full-fledged comparison between BI platforms here, but whatev
here are the key features you’ll want to keep an eye out for:
53. 8/10/21, 3:31 PM
You Probably Don't Need a Data Dictionary - Locally Optimistic
Page 5 of 9
https://locallyoptimistic.com/post/data_dictionaries/
1. Single Source of Truth
Data used to drive business decisions comes from a single system/platform
Business logic is centrally defined, curated, and easily reproducible
KPIs are clearly defined, well-organized, and shared across the organization
2. Ad-hoc Exploration
High-level KPIs are rarely actionable – so allowing users to take the next step in s
discovering, and drilling down into the data is critical
3. Right Level of Abstraction
Naively exposing every piece of data will create a difficult, confusing, and intimida
Develop a curated, properly labeled, easy-to-navigate dataset (with complex joins
abstracted away from the end user)
4. Balance between governance and democratization
Do you allow people to save their own reports/dashboards? Define their own dim
transformations?
5. Self Documenting Code
In a good BI tool, it should be easy for business users to find existing reports/dashboards t
modify (or leverage as a starting point for additional exploration). The tool should provide g
avoid making common mistakes, and provide enough metadata (either through tool-tips or
business users both find what they are looking for and avoid common pitfalls. The differenc
and a data dictionary is that the tool tips live with the data as they are actually used by bus
the post-business-logic description of the entity which is generally much more useful to end
If your BI tool has these qualities, then your business users will not need (or wouldn’t get an
dictionary. Since the BI tool should help business users answer business questions, your bu
have to know anything about the underlying data structures in order to achieve their goals.
users asking for a data dictionary, you should be thinking about building them better tools r
documenting the current suboptimal system.
Depending on the size of your team, the structure of your organization, and the complexity
54. 8/10/21, 3:31 PM
You Probably Don't Need a Data Dictionary - Locally Optimistic
Page 6 of 9
https://locallyoptimistic.com/post/data_dictionaries/
may need to augment your BI tool with some additional tooling in the form of a wiki or know
below). For your non-technical users this is best used as a reference for “who is the expert”
to if they want to get started in a new domain area.
For the business / data analyst: For the people in your organization who are rolling up the
the data, there should be two sources of documentation that are most relevant:
A self-documenting and searchable code-base
A knowledge-base or wiki of high-level descriptions
The first place an analyst should look when trying to understand a table or column is in the
and manages that table or column. If, for example, you are using Looker as your BI tool on
analysts should be able to quickly and easily search those code bases to find the correct se
code base is well-architected and you follow best practices for programming style and read
straightforward for the analyst to review and comprehend the relative logic. If your analyst c
comprehend the logic, you should consider refactoring your code base or implementing mo
practices.
For the types of knowledge that are not easy to see in the code itself, it can be helpful to ha
that people can review for high-level information about a given concept. This documentatio
catalog of “what is in the data warehouse” but rather should include information about the
behind broad concepts in the business:
What is (NPS, ARR, Conversion, insert your concept here), how do we define it, how is
business logic), why is it important, and what are the next steps?
Important context and history (“this is the story of why it takes 5 joins to connect our h
coupon code system”).
Any potential “gotchas” (“we know that this logic is incorrect for some number of user
checked it impacted less than 1% of users and so we ignored it”).
Between these two sources of information, analysts should have everything they need to un
55. 8/10/21, 3:31 PM
You Probably Don't Need a Data Dictionary - Locally Optimistic
Page 7 of 9
https://locallyoptimistic.com/post/data_dictionaries/
of the data they have.
For the data scientist / engineer:
For the most technical folks on your team, you can add some additional tooling that can he
that will be most useful to them. In addition to the self-documenting code-base (which is
concept in this blog post), you can think about adding:
Knowledge-repo: This tool is used to keep track of analyses that have happened in the
resource for both analysts and data scientists to see 1) example uses of certain data s
are related to the topic they are working on. Seeing how someone else used a given ta
for someone to understand its intricacies.
Programmatically-generated ERDs: rather than maintaining these types of documenta
programmatically-generated ERDs or data-heritage DAGs (from a tool like dbt or airflo
ensure that your documentation actually matches your code.
Architecture Design Documentation: this is often an artifact of the development proces
depth description of requirements, technical decisions made, architecture, etc.
Other important things to keep in mind:
You need a strong data organization to maintain these different avenues of documenta
You need a data-driven culture of cross-functional collaboration to make all of the piec
What are data dictionaries useful for?
There are two use-cases where having something like a data dictionary can be valuable:
1. If you are a data-as-a-service (DaaS) provider, then your data dictionary is effectively d
will be critical to your clients
2. For compliance / security reasons, you have complex restrictions on who can access
56. 8/10/21, 3:31 PM
You Probably Don't Need a Data Dictionary - Locally Optimistic
Page 8 of 9
https://locallyoptimistic.com/post/data_dictionaries/
!
warehouse
These two use-cases are pretty niche, and we do not believe they apply to most internal-fa
second use-case is a bit tricky because what you want to build is not a “data dictionary” as
“helpful” metadata about tables and columns) but rather an access-spec which can hierarc
different parts of a warehouse (at the schema, table, column, and row level). We have some
build an effective tool for this, so if you’re working on something like that, please reach out!
Conclusion
Overall, we think that data dictionaries tend not to be very useful, and a data team’s time w
maintaining a well-organized code-base that is self-documenting and providing well-design
that do not require much documentation. Where possible, teams should make use of tools
generating documentation directly from code in order to prevent concept drift.
Michael Kaminsky & Alex Jia
Check out the Author pages for Alex and Michael
Related Posts:
" # $
Previous Post:
KPI Principles
%
%
57. 8/10/21, 3:31 PM
You Probably Don't Need a Data Dictionary - Locally Optimistic
Page 9 of 9
https://locallyoptimistic.com/post/data_dictionaries/
The Attribution Dilemma Adding Context to Your Analysis with
Annotations
The Data S