Dunning - SIGMOD - Data Economy.pptx

FROM ROOTS TO FRUITS: EXPLORING LINEAGE FOR DATASET RECOMMENDATIONS
Ted Dunning, Fellow, HPE
18 June, 2023
the meaning of words lies in their use
the meaning of data lies in its use
(apologies to Dr. Wittgenstein)
A meteorologist’s data
- rainfall
- windspeed
- temperature
A business uses the data to predict umbrella sales.
What does the data actually mean?
the meaning of data lies in its use
TRAINING PROCESS
[Diagram labels: README, URL, History, Datasets + Models, Metadata]
We start with explicit metadata. Examples: column and table names, documentation, common values, and others.
This is encoded as a large artifact × characters incidence table.
At this point, direct metadata search is possible.
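A minimal sketch of what such an incidence table and a direct metadata search could look like. The artifact names, the metadata tokens, and the choice to treat each token as one column are assumptions made for illustration; the slides do not specify the encoding.

```python
# Hypothetical artifacts and their explicit metadata tokens (illustrative only).
ARTIFACT_METADATA = {
    "weather.csv": {"rainfall", "windspeed", "temperature"},
    "umbrella_sales.csv": {"umbrella", "sales", "store"},
    "README.md": {"rainfall", "umbrella", "forecast"},
}


def build_incidence(metadata):
    """Return (artifacts, tokens, rows); rows[i][j] is 1 if artifact i has token j."""
    artifacts = sorted(metadata)
    tokens = sorted(set().union(*metadata.values()))
    rows = [[1 if t in metadata[a] else 0 for t in tokens] for a in artifacts]
    return artifacts, tokens, rows


def search(artifacts, tokens, rows, query_tokens):
    """Rank artifacts by the number of query tokens present in their row."""
    cols = [tokens.index(t) for t in query_tokens if t in tokens]
    scores = {a: sum(row[j] for j in cols) for a, row in zip(artifacts, rows)}
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [a for a, score in ranked if score > 0]


artifacts, tokens, rows = build_incidence(ARTIFACT_METADATA)
hits = search(artifacts, tokens, rows, ["rainfall", "umbrella"])
print(hits)
```

At realistic scale the table would be stored sparsely; the dense list-of-lists form here is only for readability.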
We augment with metadata from all ancestors and descendants in the global data lineage graph.
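The augmentation step might be sketched as a graph traversal: each artifact inherits the metadata of everything upstream and downstream of it. The edge list, artifact names, and tokens below are hypothetical, and representing lineage as parent-to-child pairs is an assumption.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges (parent -> child); names are illustrative only.
EDGES = [
    ("weather.csv", "features.csv"),
    ("umbrella_sales.csv", "features.csv"),
    ("features.csv", "sales_model"),
]
OWN_METADATA = {
    "weather.csv": {"rainfall", "windspeed", "temperature"},
    "umbrella_sales.csv": {"umbrella", "sales"},
    "features.csv": {"features"},
    "sales_model": {"model"},
}


def reachable(start, neighbors):
    """Breadth-first search: every node reachable from start via neighbors."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def augment(own, edges):
    """Union each artifact's metadata with that of its ancestors and descendants."""
    children, parents = defaultdict(set), defaultdict(set)
    for parent, child in edges:
        children[parent].add(child)
        parents[child].add(parent)
    augmented = {}
    for artifact, tokens in own.items():
        relatives = reachable(artifact, parents) | reachable(artifact, children)
        augmented[artifact] = set(tokens).union(*(own.get(r, set()) for r in relatives))
    return augmented


AUGMENTED = augment(OWN_METADATA, EDGES)
```

In this toy graph, `sales_model` picks up "rainfall" and "umbrella" from its ancestors, while `weather.csv` does not pick up "umbrella" because `umbrella_sales.csv` is a sibling input, not an ancestor or descendant.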
Finally, we reduce the characteristic cooccurrences using indicator-based recommendation methods.
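One way such a reduction could work is to keep only the cooccurring pairs that a log-likelihood ratio (LLR) test marks as anomalously frequent. The slides do not name the test, so the use of LLR here is an assumption, as are the toy data and the threshold value.

```python
import math
from itertools import combinations


def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio score for a 2x2 contingency table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))


def indicators(artifact_tokens, threshold):
    """Map each token to the tokens it cooccurs with anomalously often."""
    n = len(artifact_tokens)
    vocab = sorted(set().union(*artifact_tokens))
    result = {t: set() for t in vocab}
    for a, b in combinations(vocab, 2):
        k11 = sum(1 for s in artifact_tokens if a in s and b in s)
        k12 = sum(1 for s in artifact_tokens if a in s and b not in s)
        k21 = sum(1 for s in artifact_tokens if a not in s and b in s)
        k22 = n - k11 - k12 - k21
        # Keep only pairs that cooccur more often than independence predicts.
        if k11 * k22 > k12 * k21 and llr(k11, k12, k21, k22) >= threshold:
            result[a].add(b)
            result[b].add(a)
    return result


# Hypothetical augmented metadata, one token set per artifact.
TOY = (
    [{"umbrella", "rainfall"}] * 3
    + [{"mosquito", "temperature", "windspeed"}] * 3
    + [{"rainfall", "temperature"}]
)
IND = indicators(TOY, threshold=3.0)
```

The result is a sparse token-to-token table: most cooccurring pairs are discarded and only the statistically interesting ones remain as indicators.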
A NOTE ON IMPLICATIONS
The characteristic indicator matrix is what connects “umbrella” with “rainfall” or “mosquito” with “temperature” + “windspeed”.
QUERY PROCESS
The original query is often textual, possibly a README, augmented by recent project behavior (queries, references).
The query is expanded based on indicators (when they say “umbrellas” they also mean “rainfall”) as well as semantic token embedding using BERT.
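The indicator half of this expansion can be sketched as a lookup that adds related tokens at a reduced weight. The indicator table, the weights, and the crude plural handling are assumptions for illustration; the embedding-based half is omitted because it depends on a pretrained model.

```python
import re

# Hypothetical indicator table, e.g. the output of the training process.
INDICATORS = {
    "umbrella": {"rainfall"},
    "mosquito": {"temperature", "windspeed"},
}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def normalize(token):
    # Crude plural stripping so "umbrellas" matches "umbrella";
    # a real system would need proper stemming or lemmatization.
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def expand_query(text, indicators, expansion_weight=0.5):
    """Return {token: weight}: original tokens at 1.0, indicator expansions lower."""
    weights = {normalize(tok): 1.0 for tok in tokenize(text)}
    for tok in list(weights):
        for extra in indicators.get(tok, ()):
            weights.setdefault(extra, expansion_weight)
    return weights


EXPANDED = expand_query("Predict sales of umbrellas", INDICATORS)
```

Giving expanded tokens a lower weight than the original ones keeps literal matches ranked above inferred ones; the value 0.5 is arbitrary.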
The final results include an explanation of why files or programs are included.
Recommendations       Explanation
positives.csv         “dengue” ancestor
notpositives.csv      “dengue” ancestor
SARIMA_model          “dengue” ancestor
dengue_monthly.csv    “dengue”
climate_monthly.csv   “wind speed”
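Explanations of this kind could be produced by checking whether a query term matches an artifact directly or only through its lineage. The sketch below assumes that "ancestor" means the file is upstream of an artifact that matches the term; the lineage edges and metadata are hypothetical, and SARIMA_model is left out because the slides do not show where it sits in the lineage.

```python
# Hypothetical lineage (parent -> children) and direct metadata; illustrative only.
CHILDREN = {
    "positives.csv": ["dengue_monthly.csv"],
    "notpositives.csv": ["dengue_monthly.csv"],
    "dengue_monthly.csv": [],
    "climate_monthly.csv": [],
}
DIRECT = {
    "positives.csv": set(),
    "notpositives.csv": set(),
    "dengue_monthly.csv": {"dengue"},
    "climate_monthly.csv": {"wind speed"},
}


def descendants(node, children):
    """Depth-first search for everything downstream of node."""
    seen, stack = set(), list(children.get(node, ()))
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(children.get(cur, ()))
    return seen


def explain(artifact, query_terms, direct, children):
    """Prefer a direct match; otherwise report the term matched by a descendant."""
    for term in query_terms:
        if term in direct.get(artifact, ()):
            return f'"{term}"'
    downstream = descendants(artifact, children)
    for term in query_terms:
        if any(term in direct.get(d, ()) for d in downstream):
            return f'"{term}" ancestor'
    return None


QUERY = ["dengue", "wind speed"]
EXPLANATIONS = {a: explain(a, QUERY, DIRECT, CHILDREN) for a in DIRECT}
```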
EVALUATION
• Evaluation is difficult due to a lack of public datasets
  • Most machine learning examples are truncated to the final steps
  • Very few non-machine-learning pipelines exist outside of toy examples
  • Private datasets generally cannot be shared
  • Still important to use them when possible due to their scale
• Evaluation of recommendation engines is a subtle art
  • Their purpose is to change behaviors
  • Today’s recommendations select tomorrow’s training data
  • We aren’t at this point yet; reaching it would be a symptom of success
THANK YOU
ted.dunning@hpe.com
@ted_dunning
@ted_dunning@mastodon.social