This document provides an overview of Amundsen, a data catalog created to improve the productivity of data users. Amundsen indexes data resources and powers search based on usage patterns to help users discover, understand, and analyze data. It aims to reduce the time data scientists spend on data discovery, currently about one-third of their time, and so increase their productivity. The tool searches metadata from various data sources and displays table details, column metadata statistics, and people profiles to help users find and understand corporate data.
2. |
HRS |
Increase the productivity of data users:
● data scientists
● data analysts
● BI engineers
Why do we need it
Title of presentation 2
3. |
Step 1: Search and find the data
Step 2: Understand the data
Step 3: Perform analysis and visualization
Step 4: Make a decision and/or share insights
Data-Driven Decision Making Process
Data Discovery
4. |
1. Ask coworkers
2. Ask in wider Zoom channel
3. Search over Confluence
4. Search over Repositories
5. Explore using * SQL queries
Challenge: Search and find the data
5. |
● Multiple results: which one is correct or up to date?
● What do different columns mean?
Challenge: Understand the data
7. |
1. Discover new data sources
2. Identify end users to notify them of changes
3. Understand the popularity and trustworthiness of data
4. Investigate/monitor the magnitude of protected data exposure
5. Know what your boss or colleagues are using
6. Talk to upstream producers
7. +30% productivity for data users
Metadata is the key to the next big data wave
8. |
What type of questions we want to answer
10. |
● First person to explore both the North and South Poles
● Norwegian explorer, Roald Amundsen
Amundsen: Person
11. |
• Amundsen is a data discovery and metadata engine for improving the productivity of data users
• It does that today by indexing data resources (tables, dashboards, streams, etc.) and powering a page-rank style search based on usage patterns (e.g. highly queried tables show up earlier than less queried tables)
• Think of it as Google search for data
Amundsen: The tool
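The usage-based ordering described on this slide can be illustrated with a minimal sketch. It is not Amundsen's ranking code: the `TableDoc` record, the `query_count` usage signal, and the logarithmic boost are all assumptions chosen only to show how highly queried tables can surface ahead of less queried ones with similar text relevance.

```python
import math
from dataclasses import dataclass


@dataclass
class TableDoc:
    name: str
    description: str
    query_count: int  # hypothetical usage signal, e.g. queries over a recent window


def text_score(doc, term):
    """Crude relevance: occurrences of the term, with the name weighted double."""
    term = term.lower()
    return 2 * doc.name.lower().count(term) + doc.description.lower().count(term)


def search(docs, term):
    """Return names of matching tables, boosted by how often each is queried."""
    scored = []
    for doc in docs:
        base = text_score(doc, term)
        if base == 0:
            continue  # no textual match, so usage alone does not surface it
        # Assumed boost: log-damped so very popular tables do not drown out relevance.
        scored.append((base * (1 + math.log1p(doc.query_count)), doc.name))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored]


docs = [
    TableDoc("rides_raw", "raw rides events", 3),
    TableDoc("rides", "cleaned rides fact table", 5000),
    TableDoc("drivers", "driver dimension", 900),
]
print(search(docs, "rides"))
```

Both `rides` tables match the term equally well on text, so the heavily queried one is listed first; `drivers` is popular but does not match and is left out.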
12. |
Architecture: Key components
Diagram components:
● Metadata Sources: Athena, MSSql, Exasol, ..., Glue
● CI/CD, Source File
● Databuilder Crawler
● Neo4j, Elasticsearch
● Metadata Service, Search Service
● Frontend Service
● Other Microservices: ML Feature Service, Security Service
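To make the relationship between these components concrete, here is a toy sketch of the data flow: a Databuilder-style job writes metadata into a graph store and a search index, and the frontend combines a search service with a metadata service. Everything here is an in-memory stand-in with hypothetical names; none of it is the real Amundsen, Neo4j, or Elasticsearch API.

```python
# In-memory stand-ins for the two stores (assumptions, not real clients).
graph_store = {}   # plays the role of Neo4j: table name -> full metadata record
search_index = {}  # plays the role of Elasticsearch: token -> set of table names


def databuilder_job(source_rows):
    """Crawl rows from a metadata source and publish them to both stores."""
    for row in source_rows:
        graph_store[row["table"]] = row
        for token in row["table"].lower().split("_"):
            search_index.setdefault(token, set()).add(row["table"])


def search_service(term):
    """Return names of tables indexed under the given token."""
    return sorted(search_index.get(term.lower(), set()))


def metadata_service(table):
    """Return the stored metadata record for one table."""
    return graph_store[table]


def frontend(term):
    """Search first, then fetch details for each hit, as a results page would."""
    return [metadata_service(name) for name in search_service(term)]


databuilder_job([
    {"table": "ride_events", "owner": "alice"},
    {"table": "ride_costs", "owner": "bob"},
    {"table": "drivers", "owner": "carol"},
])
print(frontend("ride"))
```

Keeping the search index separate from the metadata store mirrors the split between the Search Service and the Metadata Service in the diagram: one answers "which resources match", the other answers "what do we know about this resource".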
19. |
Elasticsearch for search and relevance
● Normal search: match records based on relevancy
● Category search: match records first based on data type, then relevancy
○ column: warehouse_cost
● Wildcard search:
○ event_*
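The three search modes above could map onto Elasticsearch query DSL roughly as follows. This is a hedged sketch rather than the Search Service's actual implementation: the `build_query` helper and the `name` and `description` field names are assumptions, and the category prefix (such as `column`) is assumed to correspond directly to an indexed field. The `multi_match`, `match`, and `wildcard` query types themselves are standard Elasticsearch DSL.

```python
def build_query(raw, default_fields=("name", "description")):
    """Translate a raw search string into an Elasticsearch query body (a dict)."""
    raw = raw.strip()
    if ":" in raw:
        # Category search, e.g. "column: warehouse_cost": restrict to one field.
        category, _, term = raw.partition(":")
        return {"query": {"match": {category.strip(): term.strip()}}}
    if "*" in raw:
        # Wildcard search, e.g. "event_*": pattern match on the resource name.
        return {"query": {"wildcard": {"name": {"value": raw}}}}
    # Normal search: relevance-ranked match across the default fields.
    return {"query": {"multi_match": {"query": raw, "fields": list(default_fields)}}}


print(build_query("column: warehouse_cost"))
print(build_query("event_*"))
print(build_query("ride costs"))
```

The returned dict could be passed as the request body of a search call; which client and index to use is left open here.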
20. |
Amundsen uses Apache Airflow to orchestrate Databuilder jobs