6. Discover past work
Discover trusted data
Explore & validate data
Consume: Looker, Tableau, ML modeling, etc.
Ingest and Store
● Ingest: Stitch
● Store: Redshift, Snowflake, BQ
● Process: Airflow, DBT, Spark
Under-invested. Some companies use Alation or in-house solutions, but many use Slack, company wikis, or spreadsheets.
How did this become a problem?
8. Goals for evaluation
● Automatically captures everything related to data endeavors (tables, dashboards,
ETL DAGs, HR systems and their relationships).
● Exposes it in user-friendly ways (search, lineage, and API)
● Easy to extend to new sources and new classes of sources
It is the source of truth for where, what and how data is being stored and used.
9. Search based / Lineage based / Network based / Programmatic
Search based
● Where is the table/dashboard for X? What does it contain?
● Does this analysis already exist?
Lineage based
● I am changing a data model, who are the owner and most common users?
● This table’s delivery was delayed today, I want to notify everyone downstream.
Network based
● I want to follow a power user in my team.
● I want to bookmark tables of interest and get a feed of data delay, schema change, incidents.
Programmatic
● Access metadata programmatically
● Put (pull / push) metadata programmatically
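The lineage-based question about notifying everyone downstream of a delayed table can be sketched as a graph traversal. This is a minimal illustration only: the `LINEAGE` and `OWNERS` dictionaries, the table and dashboard names, and both function names are hypothetical and not taken from any of the tools in this deck.

```python
from collections import deque

# Hypothetical lineage graph: each key feeds the tables/dashboards in its list.
LINEAGE = {
    "raw.rides": ["core.rides"],
    "core.rides": ["mart.daily_rides", "dash.ops_overview"],
    "mart.daily_rides": ["dash.exec_summary"],
}

# Hypothetical ownership metadata.
OWNERS = {
    "core.rides": "data-eng",
    "mart.daily_rides": "analytics",
    "dash.ops_overview": "ops",
    "dash.exec_summary": "analytics",
}


def downstream(graph, start):
    """Return every node reachable from `start`, in breadth-first order."""
    seen, order, pending = {start}, [], deque([start])
    while pending:
        node = pending.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                pending.append(child)
    return order


def owners_to_notify(graph, owners, delayed):
    """Collect the distinct owners of everything downstream of `delayed`."""
    return sorted({owners[n] for n in downstream(graph, delayed) if n in owners})
```

Calling `owners_to_notify(LINEAGE, OWNERS, "core.rides")` walks the two dashboards and one mart table below `core.rides` and returns the teams that own them.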
Other requirements
● Leverage as much data automatically as possible
● Preferably, open source and healthy community
● Preferably, Cloud agnostic
● Easy to set up
14. Criteria / Products
Products: Alation, WhereHows, Airbnb Data Portal, Cloudera Navigator, Apache Atlas
Criteria:
● Search based
● Lineage based
● Network based
● Hive/Presto support
● Redshift support
● Open source (pref.)
15. First person to reach the South Pole: Norwegian explorer Roald Amundsen
27. Architecture (diagram)
Metadata Sources: Postgres, Hive, Redshift, ..., Presto, GitHub source file
Ingestion: Databuilder Crawler
Stores: Graph DB, Elasticsearch
Services: Metadata Service, Search Service, Frontend Service
Other Microservices: ML Feature Service, Security Service
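One plausible reading of the architecture diagram is that a crawler extracts metadata from each source, writes it to both a graph store and a search index, and the two services each read from their own store. The sketch below uses plain Python containers as stand-ins; every function name and the record shape are assumptions made for illustration, not the actual Databuilder or service APIs.

```python
def extract(source):
    """Crawler step: yield one metadata record per table in a source."""
    for table, columns in source["tables"].items():
        yield {"key": f'{source["name"]}.{table}', "columns": columns}


def load(records, graph_db, search_index):
    """Write each record to both stores: full detail to the graph, keys to search."""
    for record in records:
        graph_db[record["key"]] = record
        search_index.append(record["key"])


def metadata_service(graph_db, key):
    """Stand-in for the metadata service: look up one entity in the graph store."""
    return graph_db.get(key)


def search_service(search_index, term):
    """Stand-in for the search service: naive substring match over indexed keys."""
    return [key for key in search_index if term in key]
```

A frontend would then call `search_service` to list matching tables and `metadata_service` to render the detail page for the one the user picks.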
28. Pull Model vs. Push Model
Pull Model
● Periodically update the index by pulling from the system (e.g. database) via crawlers.
● Onus of integration lies on the data graph
● No interface to prescribe, hard to maintain crawlers
Diagram: Database, Crawler (with Scheduler), Data graph
Preferred if
● Waiting for indexing is ok
● Working with “strapped” teams
● There’s already an interface
Push Model
● The system (e.g. DB) pushes to a message bus which downstream subscribes to.
● Onus of integration lies on the database
● Message format serves as the interface
● Allows for near-real time indexing
Diagram: Database, Message queue, Data graph
Preferred if
● Near-real time indexing is important
● Clean interface doesn’t exist
● Other tools like WhereHows are moving towards Push Model
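The difference between the two models can be sketched with an in-memory queue standing in for the message bus. This is a toy illustration under stated assumptions: `DataGraph`, `crawl`, `publish`, and `consume` are hypothetical names, and a real deployment would use an actual scheduler and message broker.

```python
import queue


class DataGraph:
    """Toy stand-in for the metadata index."""

    def __init__(self):
        self.tables = {}

    def upsert(self, name, schema):
        self.tables[name] = schema


def crawl(database, graph):
    """Pull model: a scheduler calls this periodically; the crawler reads the source."""
    for name, schema in database.items():
        graph.upsert(name, schema)


def publish(bus, name, schema):
    """Push model, producer side: the database emits a message on every change."""
    bus.put({"table": name, "schema": schema})


def consume(bus, graph):
    """Push model, consumer side: drain pending messages into the graph."""
    while True:
        try:
            message = bus.get_nowait()
        except queue.Empty:
            return
        graph.upsert(message["table"], message["schema"])
```

In the pull version the graph only learns about a new table on the next `crawl` run, while in the push version the message dictionary is the interface the database team has to honor.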
32. Relevance and Popularity
Relevance
Tables:
● Descriptions
● Table names, column names
● Tags
Dashboards:
● Description
● Chart names
Popularity
Tables:
● Querying activity
● Different weights for automated vs ad hoc querying
Dashboards:
● Number of views
● Number of edits
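A rough sketch of how relevance and popularity signals for tables could be combined into one ranking. The weights, the scoring formula, and the field names are all assumptions chosen for illustration; the slides only say that automated and ad hoc querying are weighted differently, not by how much or in which direction.

```python
import math

# Hypothetical weights: here ad hoc queries count more than automated ones.
ADHOC_WEIGHT = 1.0
AUTOMATED_WEIGHT = 0.1


def relevance(query, table):
    """Fraction of query terms found in the table's names, description, or tags."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    text = " ".join(
        [table["name"], table["description"], *table["columns"], *table["tags"]]
    ).lower()
    return sum(term in text for term in terms) / len(terms)


def popularity(table):
    """Log-damped query count, weighting ad hoc and automated activity differently."""
    weighted = (
        ADHOC_WEIGHT * table["adhoc_queries"]
        + AUTOMATED_WEIGHT * table["automated_queries"]
    )
    return math.log1p(weighted)


def search(query, tables):
    """Return names of matching tables, best combined score first."""
    hits = [t for t in tables if relevance(query, t) > 0]
    hits.sort(key=lambda t: relevance(query, t) * (1.0 + popularity(t)), reverse=True)
    return [t["name"] for t in hits]
```

The log damping keeps a heavily scheduled table from drowning out one that analysts query by hand; a dashboard ranker could follow the same shape with views and edits as the popularity inputs.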
36. “This is God’s work” - George X, ex-head of Analytics, Lyft
“I was on call and I’m confident 50% of the questions could have been answered by a simple search in Amundsen” - Bomee P, DS, Lyft