From discovering to trusting data


Presentation at SF Big Analytics meetup on Jan 12, 2021. https://www.meetup.com/SF-Big-Analytics/events/275217663/


SELECT *
FROM default.my_table
WHERE ds = '2018-01-01'
LIMIT 100;

Ingest and Store
● Ingest: Stitch
● Store: Redshift, Snowflake, BQ
● Process: Airflow, DBT, Spark
Discover past work, discover trusted data, explore & validate data
● Under-invested. Some companies use Alation or in-house solutions, but many use Slack, company wikis, or spreadsheets.
Consume
● Looker, Tableau, ML modeling, etc.
How did this become a problem?

Goals for evaluation
● Automatically captures everything related to data endeavors (tables, dashboards, ETL DAGs, HR systems) and the relationships between them.
● Exposes it in user-friendly ways (search, lineage, and API).
● Easy to extend to new sources and new classes of sources.
It is the source of truth for where, what, and how data is being stored and used.

Search based
● Where is the table/dashboard for X? What does it contain?
● Does this analysis already exist?
Lineage based
● I am changing a data model. Who are the owners and most common users?
● This table's delivery was delayed today. I want to notify everyone downstream.
Network based
● I want to follow a power user in my team.
● I want to bookmark tables of interest and get a feed of data delays, schema changes, and incidents.
Programmatic
● Access metadata programmatically
● Put (pull / push) metadata programmatically
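The lineage-based use case above ("notify everyone downstream") can be sketched as a graph traversal. This is a minimal illustration, not Amundsen's API: the `LINEAGE` and `OWNERS` dictionaries, the table names, and the e-mail addresses are all hypothetical stand-ins for what a metadata service would return.

```python
from collections import deque

# Hypothetical lineage edges: each table maps to the assets that read from it.
LINEAGE = {
    "raw.rides": ["core.rides_clean"],
    "core.rides_clean": ["mart.daily_rides", "mart.driver_stats"],
    "mart.daily_rides": ["dash.exec_overview"],
}

# Hypothetical owner registry (asset -> contact).
OWNERS = {
    "core.rides_clean": "data-eng@example.com",
    "mart.daily_rides": "analytics@example.com",
    "mart.driver_stats": "analytics@example.com",
    "dash.exec_overview": "bi@example.com",
}


def downstream(start, lineage):
    """Return every asset transitively downstream of `start`, in BFS order."""
    seen = {start}  # guards against cycles in the lineage graph
    order = []
    pending = deque([start])
    while pending:
        node = pending.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                pending.append(child)
    return order


def owners_to_notify(start, lineage, owners):
    """Collect the distinct owners of all downstream assets."""
    return sorted({owners[a] for a in downstream(start, lineage) if a in owners})
```

Calling `owners_to_notify("raw.rides", LINEAGE, OWNERS)` yields one contact per affected team rather than one per table, which is usually what a delay notification needs.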
Other requirements
● Leverage as much data automatically as possible
● Preferably, open source and healthy community
● Preferably, Cloud agnostic
● Easy to set up

metadata
noun /ˈmedəˌdādə, ˈmedəˌdadə/
1. What kind of information?
2. About what data?

Terminology borrowed from Ground paper

● Data stores
● Dashboards / Reports
● Schema registry
● Events / Schemas
● Streams
● People
● Employees
● Notebooks

Criteria / Products
Products: Alation, WhereHows, Airbnb Data Portal, Cloudera Navigator, Apache Atlas
Criteria: Search based, Lineage based, Network based, Hive/Presto support, Redshift support, Open source (pref.)

First person to discover the South Pole -
Norwegian explorer, Roald Amundsen

Metadata Sources: Postgres, Hive, Redshift, ..., Presto, Github Source File
Databuilder Crawler
Graph DB, Elastic Search
Metadata Service, Search Service
Frontend Service
Other Microservices: ML Feature Service, Security Service
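One way to read the architecture above is that each crawled table is written twice: as nodes and edges for the graph store, and as a flat document for the search index. The sketch below illustrates that split under assumed names. The record fields, the node labels (`Table`, `Column`, `User`), and the edge types (`HAS_COLUMN`, `OWNS`) are hypothetical and are not taken from Amundsen's actual schema.

```python
def table_key(table):
    """Build a stable identifier for a table record."""
    return f'{table["schema"]}.{table["name"]}'


def to_graph(table):
    """Turn one crawled table record into graph nodes and edges."""
    key = table_key(table)
    nodes = [("Table", key)]
    edges = []
    for column in table["columns"]:
        column_key = f"{key}.{column}"
        nodes.append(("Column", column_key))
        edges.append((key, "HAS_COLUMN", column_key))
    owner = table.get("owner")
    if owner:
        nodes.append(("User", owner))
        edges.append((owner, "OWNS", key))
    return nodes, edges


def to_search_doc(table):
    """Turn the same record into a flat document for full-text search."""
    return {
        "key": table_key(table),
        "name": table["name"],
        "columns": list(table["columns"]),
        "description": table.get("description", ""),
    }


if __name__ == "__main__":
    record = {
        "schema": "default",
        "name": "my_table",
        "columns": ["ds", "ride_id"],
        "description": "Example table",
        "owner": "someone@example.com",
    }
    print(to_graph(record))
    print(to_search_doc(record))
```

Keeping relationships in the graph and text in the index lets the metadata service answer "who owns this?" while the search service answers "where is the table for X?".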

Pull Model
● Periodically update the index by pulling from the system (e.g. database) via crawlers.
● Onus of integration lies on the data graph
● No interface to prescribe, hard to maintain crawlers
Diagram components: Database, Crawler, Scheduler, Data graph
Preferred if
● Waiting for indexing is ok
● Working with “strapped” teams
● There’s already an interface

Push Model
● The system (e.g. DB) pushes to a message bus which downstream subscribes to.
● Onus of integration lies on the database
● Message format serves as the interface
● Allows for near-real time indexing
Diagram components: Database, Message queue, Data graph
Preferred if
● Near-real time indexing is important
● Clean interface doesn’t exist
● Other tools like WhereHows are moving towards Push Model
Low relevance High relevance

Low popularity High popularity

Relevance
Tables:
● Descriptions
● Table names, column names
● Tags
Dashboards:
● Description
● Chart names

Popularity
Tables:
● Querying activity
● Different weights for automated vs ad hoc querying
Dashboards:
● Number of views
● Number of edits
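A rough sketch of how relevance and popularity signals like these might be combined for table search follows. The field weights, the `AUTOMATED_WEIGHT` and `ADHOC_WEIGHT` values, the log scaling, and the sample tables are all assumptions made for illustration. The talk does not give the actual formula.

```python
import math

# Assumed weights: ad hoc queries signal human interest more than scheduled jobs.
ADHOC_WEIGHT = 1.0
AUTOMATED_WEIGHT = 0.2


def relevance(query, table):
    """Score text matches, weighting the table name above other fields."""
    fields = [
        (table["name"], 3.0),
        (" ".join(table["columns"]), 2.0),
        (table["description"], 1.0),
        (" ".join(table["tags"]), 1.0),
    ]
    score = 0.0
    for term in query.lower().split():
        for text, weight in fields:
            if term in text.lower():
                score += weight
    return score


def popularity(table):
    """Log-scaled query count, discounting automated querying."""
    weighted = (ADHOC_WEIGHT * table["adhoc_queries"]
                + AUTOMATED_WEIGHT * table["automated_queries"])
    return math.log1p(weighted)


def rank(query, tables):
    """Return names of matching tables, best first."""
    scored = [(relevance(query, t) * (1.0 + popularity(t)), t["name"]) for t in tables]
    matches = [(s, n) for s, n in scored if s > 0]
    return [n for s, n in sorted(matches, key=lambda pair: -pair[0])]


TABLES = [
    {"name": "rides_tmp", "columns": ["ride_id"], "description": "scratch copy",
     "tags": [], "adhoc_queries": 2, "automated_queries": 1000},
    {"name": "rides", "columns": ["ride_id", "ds"], "description": "one row per ride",
     "tags": ["core"], "adhoc_queries": 900, "automated_queries": 100},
    {"name": "drivers", "columns": ["driver_id"], "description": "one row per driver",
     "tags": ["core"], "adhoc_queries": 500, "automated_queries": 0},
]
```

With these sample numbers, `rides_tmp` has slightly more raw queries than `rides` (1002 vs 1000), but almost all of them are automated, so the weighting ranks `rides` first. Tables with no text match are dropped regardless of popularity.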

“This is God’s work” - George X, ex-head of Analytics, Lyft

“I was on call and I’m confident 50% of the questions could have been answered by a simple search in Amundsen” - Bomee P, DS, Lyft

Penetration rate:
DS (aka analyst): 81%
RS (aka DS): 71%
PM: 22%
SWE: 17%
Cust Serv: 7% (12/390)
Sp. Ops: 67% (10/15)
Sp. Op Leads: 53% (9/17)
Economist: 100% (7/7)
Cust. Quality: 78% (7/9)
Growth Mktg: 25% (6/24)

Metadata
● Data Discovery & Trust
● Compliance (GDPR / CCPA / Financial)
● Security / Privacy
● Data Monitoring / Ops
● Data Quality
● Cost Mgmt
● Data Maintenance

Icons under Creative Commons License from https://thenounproject.com/


Advantages of Hiring UIUX Design Service Providers for Your Business

Advantages of Hiring UIUX Design Service Providers for Your Business

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments

Boost Fertility New Invention Ups Success Rates.pdf

Boost Fertility New Invention Ups Success Rates.pdf

Boost Fertility New Invention Ups Success Rates.pdf

🐬 The future of MySQL is Postgres 🐘

🐬 The future of MySQL is Postgres 🐘

🐬 The future of MySQL is Postgres 🐘

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...

08448380779 Call Girls In Friends Colony Women Seeking Men

08448380779 Call Girls In Friends Colony Women Seeking Men

08448380779 Call Girls In Friends Colony Women Seeking Men

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...

GenCyber Cyber Security Day Presentation

GenCyber Cyber Security Day Presentation

GenCyber Cyber Security Day Presentation

How to Troubleshoot Apps for the Modern Connected Worker

How to Troubleshoot Apps for the Modern Connected Worker

How to Troubleshoot Apps for the Modern Connected Worker

Boost PC performance: How more available memory can improve productivity

Boost PC performance: How more available memory can improve productivity

Boost PC performance: How more available memory can improve productivity

From Event to Action: Accelerate Your Decision Making with Real-Time Automation

From Event to Action: Accelerate Your Decision Making with Real-Time Automation

From Event to Action: Accelerate Your Decision Making with Real-Time Automation

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024

Powerful Google developer tools for immediate impact! (2023-24 C)

Powerful Google developer tools for immediate impact! (2023-24 C)

Powerful Google developer tools for immediate impact! (2023-24 C)

From discovering to trusting data

1. From discovering to trusting data

5. SELECT * FROM default.my_table WHERE ds='2018-01-01' LIMIT 100;

6. How did this become a problem?
● Ingest and Store. Ingest: Stitch, Store: Redshift, Snowflake, BQ. Process: Airflow, DBT, Spark.
● Discover past work, discover trusted data, explore & validate data. Under-invested. Some companies use Alation or in-house solutions, but many use Slack, company wikis, or spreadsheets.
● Consume. Looker, Tableau, ML modeling, etc.

8. Goals for evaluation
● Automatically captures everything related to data endeavors (tables, dashboards, ETL DAGs, HR systems and their relationships).
● Exposes it in user-friendly ways (search, lineage, and API).
● Easy to extend to new sources and new classes of sources.
It is the source of truth for where, what, and how data is being stored and used.

9. Use cases
Search based:
● Where is the table/dashboard for X? What does it contain?
● Does this analysis already exist?
Lineage based:
● I am changing a data model. Who are the owners and most common users?
● This table’s delivery was delayed today. I want to notify everyone downstream.
Network based:
● I want to follow a power user in my team.
● I want to bookmark tables of interest and get a feed of data delays, schema changes, and incidents.
Programmatic:
● Access metadata programmatically
● Put (pull / push) metadata programmatically
Other requirements
● Leverage as much data automatically as possible
● Preferably, open source with a healthy community
● Preferably, cloud agnostic
● Easy to set up

10. metadata noun /ˈmedəˌdādə,ˈmedəˌdadə/ 1. What kind of information? 2. About what data?

11. Terminology borrowed from Ground paper

12. Data stores, Dashboard / Reports, Schema registry, Events / Schemas, Streams, People, Employees, Notebooks

14. Criteria / Products
Products: Alation, WhereHows, Airbnb Data Portal, Cloudera Navigator, Apache Atlas
Criteria: Search based, Lineage based, Network based, Hive/Presto support, Redshift support, Open source (pref.)

15. First person to discover the South Pole - Norwegian explorer, Roald Amundsen

27. Architecture
Metadata Sources: Postgres, Hive, Redshift, ..., Presto, Github Source File
Databuilder Crawler
Graph DB, Elastic Search
Metadata Service, Search Service, Frontend Service
Other Microservices: ML Feature Service, Security Service

28. Pull Model vs. Push Model
Pull Model (diagram: Crawler, Database, Data graph, Scheduler)
● Periodically update the index by pulling from the system (e.g. database) via crawlers.
● Onus of integration lies on the data graph.
● No interface to prescribe; crawlers are hard to maintain.
Preferred if
● Waiting for indexing is ok
● Working with “strapped” teams
● There’s already an interface
Push Model (diagram: Database, Message queue, Data graph)
● The system (e.g. DB) pushes to a message bus, which downstream consumers subscribe to.
● Onus of integration lies on the database.
● Message format serves as the interface.
● Allows for near-real-time indexing.
Preferred if
● Near-real-time indexing is important
● Clean interface doesn’t exist
● Other tools like WhereHows are moving towards the Push Model
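
The difference between the two models can be sketched in a few lines of Python. This is only an illustrative toy, not Amundsen's Databuilder code: the `DataGraph` class, the function names, the in-memory `queue.Queue` standing in for a message bus, and the `value` column are all assumptions made for the example.

```python
import queue


class DataGraph:
    """Toy metadata index keyed by table name (hypothetical stand-in for the data graph)."""

    def __init__(self):
        self.index = {}

    def upsert(self, table, metadata):
        self.index[table] = metadata


def pull_once(database, graph):
    """Pull model: a crawler reads the source catalog and updates the index.

    A scheduler would call this periodically; the data graph owns the integration.
    """
    for table, metadata in database.items():
        graph.upsert(table, metadata)


def publish_change(bus, table, metadata):
    """Push model, producer side: the source system emits one message per change."""
    bus.put({"table": table, "metadata": metadata})


def drain(bus, graph):
    """Push model, consumer side: apply every queued message to the data graph."""
    applied = 0
    while True:
        try:
            message = bus.get_nowait()
        except queue.Empty:
            return applied
        graph.upsert(message["table"], message["metadata"])
        applied += 1


# Hypothetical catalog contents, used for both models.
database = {"default.my_table": {"columns": ["ds", "value"]}}

pulled = DataGraph()
pull_once(database, pulled)

bus = queue.Queue()
pushed = DataGraph()
publish_change(bus, "default.my_table", {"columns": ["ds", "value"]})
applied_count = drain(bus, pushed)
```

In the pull version the crawler has to know how to read each source, while in the push version the message dictionary is the only contract between the source and the index.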

30. Low relevance High relevance

31. Low popularity High popularity

32. Relevance and Popularity
Relevance
Tables:
● Descriptions
● Table names, column names
● Tags
Dashboards:
● Description
● Chart names
Popularity
Tables:
● Querying activity
● Different weights for automated vs. ad hoc querying
Dashboards:
● Number of views
● Number of edits
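
One way to combine the two signals for tables is sketched below. The slides do not give a formula, so everything here is an assumption for illustration: the `Table` fields, the term-matching relevance, the weights (`adhoc_weight=1.0`, `automated_weight=0.1`), the log damping, and the multiplicative score.

```python
import math
from dataclasses import dataclass


@dataclass
class Table:
    name: str
    description: str
    tags: tuple
    adhoc_queries: int
    automated_queries: int


def relevance(table, query):
    """Count query terms that appear in the table name, description, or tags."""
    terms = query.lower().split()
    text = " ".join([table.name, table.description, *table.tags]).lower()
    return sum(1 for term in terms if term in text)


def popularity(table, adhoc_weight=1.0, automated_weight=0.1):
    """Weight ad hoc queries more heavily than automated ones, then damp with a log."""
    weighted = (adhoc_weight * table.adhoc_queries
                + automated_weight * table.automated_queries)
    return math.log1p(weighted)


def search(tables, query):
    """Return names of matching tables, highest combined score first."""
    scored = [(relevance(t, query) * (1.0 + popularity(t)), t) for t in tables]
    hits = [(score, t) for score, t in scored if score > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [t.name for _, t in hits]


# Hypothetical catalog: rides_raw is hit mostly by automated jobs,
# rides_daily mostly by people running ad hoc queries.
tables = [
    Table("rides_raw", "raw ride events", ("rides",), adhoc_queries=2, automated_queries=500),
    Table("rides_daily", "daily ride summary", ("rides",), adhoc_queries=300, automated_queries=10),
    Table("drivers", "driver profiles", ("people",), adhoc_queries=50, automated_queries=0),
]
results = search(tables, "rides")
```

With these made-up numbers, both ride tables match the query equally well, and the one with more ad hoc usage ranks first even though the other receives more queries overall.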

36. “This is God’s work” - George X, ex-head of Analytics, Lyft
“I was on call and I’m confident 50% of the questions could have been answered by a simple search in Amundsen” - Bomee P, DS, Lyft

37. Penetration rate:
● DS (aka analyst): 81%
● RS (aka DS): 71%
● PM: 22%
● SWE: 17%
● Cust Serv: 7% (12/390)
● Sp. Ops: 67% (10/15)
● Sp. Op Leads: 53% (9/17)
● Economist: 100% (7/7)
● Cust. Quality: 78% (7/9)
● Growth Mktg: 25% (6/24)

46. 1. Metadata: Data Discovery & Trust, Compliance (GDPR / CCPA / Financial), Security / Privacy, Data Monitoring / Ops, Data Quality, Cost Mgmt, Data Maintenance

50. Icons under Creative Commons License from https://thenounproject.com/