SlideShare ist ein Scribd-Unternehmen logo
1 von 57
Downloaden Sie, um offline zu lesen
Oscar Castañeda, Xoom a PayPal Service
Learning to Rank Datasets
for Search
#SAISDS8
#SAISDS8
About
• Data Scientist at Xoom a PayPal service.
• Interests:
• Data Management,
• Dataset Search,
• Learning to Rank.
2
Spark cluster with Elasticsearch
http://bit.ly/2em6RUKhttp://bit.ly/2ebM9HO
And Indexed RDatasets
Spark cluster with Elasticsearch Inside
5
Learning to Rank Datasets
#SAISDS8
6
Learning to Rank Datasets
for Search!
#SAISDS8
7
Agenda
• Problem Statement and Motivation
• Elasticsearch Learning to Rank
• Data Pipeline: metadata extraction, judgement list extraction
• Demo: Beginnings of a Dataset Search Engine with Machine-
learned relevance ranking.
• Q&A
#SAISDS8
8
Problem Statement (1)
• Despite datasets being a key corporate
asset they are generally not given the
importance they deserve and as a result
they are hard to find.
#SAISDS8
9
Problem Statement (2)
• Specifically, teams within organizations
have a hard time finding datasets
relevant to their function.
#SAISDS8
10
Topics
• Indexing
#SAISDS8
11
Topics
• Indexing (Spark Summit East 2017).
#SAISDS8
12
Topics
• Indexing (Spark Summit East 2017).
• Ranking
#SAISDS8
13
Topics
• Indexing (Spark Summit East 2017).
• Ranking => today’s topic!
#SAISDS8
14
Questions
• How are datasets ranked?
• Can judgement lists (useful for ranking)
be generated at dataset production
time?
#SAISDS8
15
Overview
Rdatasets
Take ES
snapshot
Restore ES snapshot
http://bit.ly/2e5H1jL
#SAISDS8
16
Overview
Rdatasets
Data Pipelines:
•Extract Filename, Format, Field description.
•Extract Type information.
•Index CSV files.
Extract Filename
Extract Type information.
Index CSV files.
Extract Format
Extract Field description
#SAISDS8
17
Overview
Data Lake
#SAISDS8
18
Overview
Data Lake
#SAISDS8
19
Motivation (1)
• Organizing, indexing and ranking Datasets:
#SAISDS8
20
Motivation (1)
• Organizing, indexing and ranking Datasets:
• Produced by individual data pipelines
• On Data Lake(s)
#SAISDS8
21
Motivation (2)
• Produce a ranking function for datasets that are generated as
part of running data pipelines.
#SAISDS8
22
Motivation (2)
• Produce a ranking function for datasets that are generated as
part of running data pipelines.
• Extract “relevance judgements” and use them to bootstrap a
dataset rank model (a posteriori vs. post hoc (Halevy et al.,
2016)).
#SAISDS8
23
Motivation (2)
• Produce a ranking function for datasets that are generated as
part of running data pipelines.
• Extract “relevance judgements” and use them to bootstrap a
dataset rank model (a posteriori vs. post hoc (Halevy et al.,
2016)).
• In a feedback loop leveraging click-through data on dataset
profile pages.
#SAISDS8
24
Organizing, Indexing and Ranking Datasets
Index dataset
features
Input data
Data Pipeline
Create
dataset
profile
pages.
Using Tableau Javascript API
ES Cluster
Fetch dataset from Datalake and run Data Pipeline on demand.
Search
Fetch dataset
from Datalake.
Dataset profile Team Dashboard
Feedback
logging
Relevance
judgements
#SAISDS8
25
Organizing, Indexing and Ranking Datasets
Input data
Data Pipeline
#SAISDS8
Search
26
Organizing, Indexing and Ranking Datasets
Input data
Data Pipeline
ES Cluster
Index dataset
features
#SAISDS8
Index dataset
features
Search
27
Organizing, Indexing and Ranking Datasets
Input data
Data Pipeline
ES Cluster
Feedback
logging
Relevance
judgements
#SAISDS8
Index dataset
features
Search
28
Organizing, Indexing and Ranking Datasets
Input data
Data Pipeline
ES Cluster
Feedback
logging
Relevance
judgements
#SAISDS8
New
Index dataset
features
Search
29
Organizing, Indexing and Ranking Datasets
Input data
Data Pipeline
ES Cluster
Feedback
logging
Relevance
judgements
#SAISDS8
New
New
Search
30
Organizing, Indexing and Ranking Datasets
Input data
Data Pipeline
ES Cluster
Index dataset
features
Feedback
logging
Relevance
judgements
Create
dataset
profile
pages.
Dataset profile
#SAISDS8
Search
31
Organizing, Indexing and Ranking Datasets
Input data
Data Pipeline
ES Cluster
Index dataset
features
Feedback
logging
Relevance
judgements
Create
dataset
profile
pages.
Dataset profile Team Dashboard
#SAISDS8
Search
32
Organizing, Indexing and Ranking Datasets
Input data
Data Pipeline
ES Cluster
Index dataset
features
Feedback
logging
Relevance
judgements
Create
dataset
profile
pages.
Dataset profile Team Dashboard
Fetch dataset from Datalake and run Data Pipeline on demand.
Fetch dataset
from Datalake.
#SAISDS8
Search
33
Organizing, Indexing and Ranking Datasets
Input data
Data Pipeline
ES Cluster
Index dataset
features
Feedback
logging
Relevance
judgements
Create
dataset
profile
pages.
Dataset profile Team Dashboard
Fetch dataset from Datalake and run Data Pipeline on demand.
Fetch dataset
from Datalake.
#SAISDS8
34
How do you rank datasets?
#SAISDS8
Ranking datasets
35
• Extraction of “relevance judgements” can be built into data pipeline
for specific datasets immediately after they are generated.
#SAISDS8
Ranking datasets
36
• Extraction of “relevance judgements” can be built into data pipeline
for specific datasets immediately after they are generated.
• Used to produce a ranking function for datasets.
#SAISDS8
Ranking datasets
37
• Extraction of “relevance judgements” can be built into data pipeline
for specific datasets immediately after they are generated.
• Used to produce a ranking function for datasets.
• Leveraged for training
#SAISDS8
Ranking datasets
38
• Extraction of “relevance judgements” can be built into data pipeline
for specific datasets immediately after they are generated.
• Used to produce a ranking function for datasets.
• Leveraged for training
• And to bootstrap a dataset rank model
#SAISDS8
Ranking datasets
39
• Extraction of “relevance judgements” can be built into data pipeline
for specific datasets immediately after they are generated.
• Used to produce a ranking function for datasets.
• Leveraged for training
• And to bootstrap a dataset rank model
• (a posteriori vs. post hoc (Halevy et al., 2016).)
#SAISDS8
Ranking datasets
40
• Click-through data provides implicit feedback useful to adjust initial
relevance judgements.
#SAISDS8
Ranking datasets
41
• Click-through data provides implicit feedback useful to adjust initial
relevance judgements.
• Leveraging click-through data on dataset profile pages.
#SAISDS8
Ranking datasets
42
• Click-through data provides implicit feedback useful to adjust initial
relevance judgements.
• Leveraging click-through data on dataset profile pages.
#SAISDS8
Create
dataset
profile
pages.
Dataset profile
43
• Alon et al (2016) advocate finding data in a post-
hoc manner by collecting and aggregating
metadata after datasets are created or updated.
• We propose a so-called “a posteriori” approach
where metadata is generated as part of running
pipelines using Spark.
A posteriori vs. Post-hoc
#SAISDS8
44
• Alon et al (2016) advocate indexing
• We prefer indexing
A posteriori vs. Post-hoc
#SAISDS8
after the fact
immediately after the fact
45
Pros
• “Relevance judgements” can be extracted and
leveraged to bootstrap a ranked dataset index.
• In a feedback loop leveraging click-through data on
Dataset profile pages.
• More granular metrics available to evaluate
metadata regeneration.
#SAISDS8
immediately after the fact
46
Cons
• Offline model development is disconnected and only
indirectly part of feedback using click-through data.
• Looking at trees instead of the forest.
• Need to replay indexing pipeline when things change
(per data pipeline).
#SAISDS8
immediately after the fact
47#SAISDS8
Demo!
48#SAISDS8
Demo Scenario
• Movies represent datasets
• TMDB movie pages represent dataset profile
pages.
• Marketing team (also called WAR team) interested
in War movies (datasets).
49#SAISDS8
Movie 1
https://www.themoviedb.org/movie/7555-rambo
50#SAISDS8
Movie 2
https://www.themoviedb.org/movie/1370-rambo-iii
51#SAISDS8
Movie 3
https://www.themoviedb.org/movie/1369-rambo-first-blood-part-ii
52#SAISDS8
judgement file
Rambo
Rambo III
Rambo: First Blood Part II
53
What have we seen?
• How to rank datasets on Elasticsearch using LTR.
• Extract relevance judgements immediately after
datasets are generated in Spark.
• Demo: Dataset Search with Spark and
Elasticsearch LTR.
#SAISDS8
54
Next Steps (1)
• Describe Datasets in a structured schema.org way
using Data Catalog Vocabulary [2].
• Build a knowledge graph and use GraphX to extract insights.
(Useful e.g. for column concept determination (Deng et al. 2013)).
• Build topic models based on structured Datasets using
Glint to perform scalable topic model extraction in Spark
(Jagerman and Eickhoff, 2016) [1].
[1] https://spark-summit.org/eu-2016/events/glint-an-asynchronous-parameter-server-for-spark/
#SAISDS8
55
References
• Alon Y. Halevy, Flip Korn, Natalya Fridman Noy, Christopher Olston, Neoklis Polyzotis, Sudip Roy, and Steven Euijong Whang.
Goods: Organizing google’s datasets. In Fatmañzcan, Georgia Koutrika, and Sam Madden, editors, Proceedings of the 2016
International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01,
2016, pages 795–806. ACM, 2016. ISBN 978-1-4503-3531-7. doi: http://doi.acm.org/10.1145/2882903.2903730.
• Katja Hofmann. Fast and Reliable Online Learning to Rank for Information Retrieval. PhD thesis, Informatics Institute,
University of Amsterdam, May 2013.
• Rolf Jagerman and Carsten Eickhoff. Web-scale topic models in spark: An asynchronous parameter server. CoRR, abs/
1605.07422, 2016. URL http://arxiv.org/abs/1605. 07422.
• Dong Deng, Yu Jiang, Guoliang Li, Jian Li, and Cong Yu. Scalable column concept de- termination for web tables using large
knowledge bases. PVLDB, 6(13):1606–1617, 2013. doi: http://www.vldb.org/pvldb/vol6/p1606-li.pdf.
• Anne Schuth, Harrie Oosterhuis, Shimon Whiteson, and Maarten de Rijke. Multileave gradient descent for fast online learning to
rank. In WSDM 2016: The 9th International Conference on Web Search and Data Mining, pages 457-466. ACM, February 2016.
• Sreeram Balakrishnan, Alon Y. Halevy, Boulos Harb, Hongrae Lee, Jayant Madhavan, Afshin Rostamizadeh, Warren Shen,
Kenneth Wilder, Fei Wu 0003, and Cong Yu. Ap- plying webtables in practice. In CIDR 2015, Seventh Biennial Conference on
Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2015, Online Proceedings. www.cidrdb.org, 2015.
#SAISDS8
56
Q&A
#SAISDS8
Thank You.
Email: ocastaneda@paypal.com
Twitter: @oscar_castaneda
#SAISDS8

Weitere ähnliche Inhalte

Was ist angesagt?

Learning to Rank: From Theory to Production - Malvina Josephidou & Diego Cecc...
Learning to Rank: From Theory to Production - Malvina Josephidou & Diego Cecc...Learning to Rank: From Theory to Production - Malvina Josephidou & Diego Cecc...
Learning to Rank: From Theory to Production - Malvina Josephidou & Diego Cecc...Lucidworks
 
Learning to Rank Presentation (v2) at LexisNexis Search Guild
Learning to Rank Presentation (v2) at LexisNexis Search GuildLearning to Rank Presentation (v2) at LexisNexis Search Guild
Learning to Rank Presentation (v2) at LexisNexis Search GuildSujit Pal
 
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...Sease
 
(Paper seminar)real-time personalization using embedding for search ranking a...
(Paper seminar)real-time personalization using embedding for search ranking a...(Paper seminar)real-time personalization using embedding for search ranking a...
(Paper seminar)real-time personalization using embedding for search ranking a...hyunyoung Lee
 
백억개의 로그를 모아 검색하고 분석하고 학습도 시켜보자 : 로기스
백억개의 로그를 모아 검색하고 분석하고 학습도 시켜보자 : 로기스백억개의 로그를 모아 검색하고 분석하고 학습도 시켜보자 : 로기스
백억개의 로그를 모아 검색하고 분석하고 학습도 시켜보자 : 로기스NAVER D2
 
Deep Learning for Personalized Search and Recommender Systems
Deep Learning for Personalized Search and Recommender SystemsDeep Learning for Personalized Search and Recommender Systems
Deep Learning for Personalized Search and Recommender SystemsBenjamin Le
 
Practical End-to-End Learning to Rank Using Fusion - Andy Liu, Lucidworks
Practical End-to-End Learning to Rank Using Fusion - Andy Liu, Lucidworks Practical End-to-End Learning to Rank Using Fusion - Andy Liu, Lucidworks
Practical End-to-End Learning to Rank Using Fusion - Andy Liu, Lucidworks Lucidworks
 
Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...
Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...
Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...Databricks
 
Music Personalization At Spotify
Music Personalization At SpotifyMusic Personalization At Spotify
Music Personalization At SpotifyVidhya Murali
 
BigQuery의 모든 것(기획자, 마케터, 신입 데이터 분석가를 위한) 입문편
BigQuery의 모든 것(기획자, 마케터, 신입 데이터 분석가를 위한) 입문편BigQuery의 모든 것(기획자, 마케터, 신입 데이터 분석가를 위한) 입문편
BigQuery의 모든 것(기획자, 마케터, 신입 데이터 분석가를 위한) 입문편Seongyun Byeon
 
Attention is All You Need (Transformer)
Attention is All You Need (Transformer)Attention is All You Need (Transformer)
Attention is All You Need (Transformer)Jeong-Gwan Lee
 
Deep learning for NLP and Transformer
 Deep learning for NLP  and Transformer Deep learning for NLP  and Transformer
Deep learning for NLP and TransformerArvind Devaraj
 
Engagement, Metrics & Personalisation at Scale
Engagement, Metrics &  Personalisation at ScaleEngagement, Metrics &  Personalisation at Scale
Engagement, Metrics & Personalisation at ScaleMounia Lalmas-Roelleke
 
Learning to rank
Learning to rankLearning to rank
Learning to rankBruce Kuo
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Bhaskar Mitra
 
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...MLconf
 
Recap: Designing a more Efficient Estimator for Off-policy Evaluation in Band...
Recap: Designing a more Efficient Estimator for Off-policy Evaluation in Band...Recap: Designing a more Efficient Estimator for Off-policy Evaluation in Band...
Recap: Designing a more Efficient Estimator for Off-policy Evaluation in Band...Justin Basilico
 
How to Build a Recommendation Engine with Neo4j
How to Build a Recommendation Engine with Neo4jHow to Build a Recommendation Engine with Neo4j
How to Build a Recommendation Engine with Neo4jNeo4j
 

Was ist angesagt? (20)

Learning to Rank: From Theory to Production - Malvina Josephidou & Diego Cecc...
Learning to Rank: From Theory to Production - Malvina Josephidou & Diego Cecc...Learning to Rank: From Theory to Production - Malvina Josephidou & Diego Cecc...
Learning to Rank: From Theory to Production - Malvina Josephidou & Diego Cecc...
 
Learning to Rank Presentation (v2) at LexisNexis Search Guild
Learning to Rank Presentation (v2) at LexisNexis Search GuildLearning to Rank Presentation (v2) at LexisNexis Search Guild
Learning to Rank Presentation (v2) at LexisNexis Search Guild
 
Word embedding
Word embedding Word embedding
Word embedding
 
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
 
(Paper seminar)real-time personalization using embedding for search ranking a...
(Paper seminar)real-time personalization using embedding for search ranking a...(Paper seminar)real-time personalization using embedding for search ranking a...
(Paper seminar)real-time personalization using embedding for search ranking a...
 
백억개의 로그를 모아 검색하고 분석하고 학습도 시켜보자 : 로기스
백억개의 로그를 모아 검색하고 분석하고 학습도 시켜보자 : 로기스백억개의 로그를 모아 검색하고 분석하고 학습도 시켜보자 : 로기스
백억개의 로그를 모아 검색하고 분석하고 학습도 시켜보자 : 로기스
 
Deep Learning for Personalized Search and Recommender Systems
Deep Learning for Personalized Search and Recommender SystemsDeep Learning for Personalized Search and Recommender Systems
Deep Learning for Personalized Search and Recommender Systems
 
Practical End-to-End Learning to Rank Using Fusion - Andy Liu, Lucidworks
Practical End-to-End Learning to Rank Using Fusion - Andy Liu, Lucidworks Practical End-to-End Learning to Rank Using Fusion - Andy Liu, Lucidworks
Practical End-to-End Learning to Rank Using Fusion - Andy Liu, Lucidworks
 
Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...
Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...
Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...
 
Music Personalization At Spotify
Music Personalization At SpotifyMusic Personalization At Spotify
Music Personalization At Spotify
 
BigQuery의 모든 것(기획자, 마케터, 신입 데이터 분석가를 위한) 입문편
BigQuery의 모든 것(기획자, 마케터, 신입 데이터 분석가를 위한) 입문편BigQuery의 모든 것(기획자, 마케터, 신입 데이터 분석가를 위한) 입문편
BigQuery의 모든 것(기획자, 마케터, 신입 데이터 분석가를 위한) 입문편
 
Attention is All You Need (Transformer)
Attention is All You Need (Transformer)Attention is All You Need (Transformer)
Attention is All You Need (Transformer)
 
Deep learning for NLP and Transformer
 Deep learning for NLP  and Transformer Deep learning for NLP  and Transformer
Deep learning for NLP and Transformer
 
Engagement, Metrics & Personalisation at Scale
Engagement, Metrics &  Personalisation at ScaleEngagement, Metrics &  Personalisation at Scale
Engagement, Metrics & Personalisation at Scale
 
Learning to rank
Learning to rankLearning to rank
Learning to rank
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)
 
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
 
Recap: Designing a more Efficient Estimator for Off-policy Evaluation in Band...
Recap: Designing a more Efficient Estimator for Off-policy Evaluation in Band...Recap: Designing a more Efficient Estimator for Off-policy Evaluation in Band...
Recap: Designing a more Efficient Estimator for Off-policy Evaluation in Band...
 
Faster rcnn
Faster rcnnFaster rcnn
Faster rcnn
 
How to Build a Recommendation Engine with Neo4j
How to Build a Recommendation Engine with Neo4jHow to Build a Recommendation Engine with Neo4j
How to Build a Recommendation Engine with Neo4j
 

Ähnlich wie Learning to Rank Datasets for Search with Oscar Castaneda

Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Spark Summit
 
DBtrends Semantics 2016
DBtrends Semantics 2016DBtrends Semantics 2016
DBtrends Semantics 2016Edgard Marx
 
Edgard Marx, Amrapali Zaveri, Diego Moussallem and Sandro Rautenberg | DBtren...
Edgard Marx, Amrapali Zaveri, Diego Moussallem and Sandro Rautenberg | DBtren...Edgard Marx, Amrapali Zaveri, Diego Moussallem and Sandro Rautenberg | DBtren...
Edgard Marx, Amrapali Zaveri, Diego Moussallem and Sandro Rautenberg | DBtren...semanticsconference
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentationTao Feng
 
Structured data and metadata evaluation methodology for organizations looking...
Structured data and metadata evaluation methodology for organizations looking...Structured data and metadata evaluation methodology for organizations looking...
Structured data and metadata evaluation methodology for organizations looking...Emily Kolvitz
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentationTao Feng
 
Changes in Structured Data at Google (SEO Camp 'us in Paris)
Changes in Structured Data at Google (SEO Camp 'us in Paris)Changes in Structured Data at Google (SEO Camp 'us in Paris)
Changes in Structured Data at Google (SEO Camp 'us in Paris)Bill Slawski
 
Database novelty detection
Database novelty detectionDatabase novelty detection
Database novelty detectionMostafaAliAbbas
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discoverymarkgrover
 
Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise D...
Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise D...Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise D...
Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise D...semanticsconference
 
Uncover Your Data Journey: End-To-End Data Lineage For SAP BOBJ And SAP Data ...
Uncover Your Data Journey: End-To-End Data Lineage For SAP BOBJ And SAP Data ...Uncover Your Data Journey: End-To-End Data Lineage For SAP BOBJ And SAP Data ...
Uncover Your Data Journey: End-To-End Data Lineage For SAP BOBJ And SAP Data ...Wiiisdom
 
The power of polyglot searching
The power of polyglot searchingThe power of polyglot searching
The power of polyglot searchingGraphAware
 
Customer Feedback Analytics for Starbucks
Customer Feedback Analytics for Starbucks Customer Feedback Analytics for Starbucks
Customer Feedback Analytics for Starbucks Nishant Gandhi
 
Accelerating Data Lakes and Streams with Real-time Analytics
Accelerating Data Lakes and Streams with Real-time AnalyticsAccelerating Data Lakes and Streams with Real-time Analytics
Accelerating Data Lakes and Streams with Real-time AnalyticsArcadia Data
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadatamarkgrover
 
Big Data Management: What's New, What's Different, and What You Need To Know
Big Data Management: What's New, What's Different, and What You Need To KnowBig Data Management: What's New, What's Different, and What You Need To Know
Big Data Management: What's New, What's Different, and What You Need To KnowSnapLogic
 

Ähnlich wie Learning to Rank Datasets for Search with Oscar Castaneda (20)

Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
 
DBtrends Semantics 2016
DBtrends Semantics 2016DBtrends Semantics 2016
DBtrends Semantics 2016
 
Edgard Marx, Amrapali Zaveri, Diego Moussallem and Sandro Rautenberg | DBtren...
Edgard Marx, Amrapali Zaveri, Diego Moussallem and Sandro Rautenberg | DBtren...Edgard Marx, Amrapali Zaveri, Diego Moussallem and Sandro Rautenberg | DBtren...
Edgard Marx, Amrapali Zaveri, Diego Moussallem and Sandro Rautenberg | DBtren...
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentation
 
Structured data and metadata evaluation methodology for organizations looking...
Structured data and metadata evaluation methodology for organizations looking...Structured data and metadata evaluation methodology for organizations looking...
Structured data and metadata evaluation methodology for organizations looking...
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentation
 
Changes in Structured Data at Google (SEO Camp 'us in Paris)
Changes in Structured Data at Google (SEO Camp 'us in Paris)Changes in Structured Data at Google (SEO Camp 'us in Paris)
Changes in Structured Data at Google (SEO Camp 'us in Paris)
 
EpiServer find Macaw
EpiServer find MacawEpiServer find Macaw
EpiServer find Macaw
 
Meetup SF - Amundsen
Meetup SF  -  AmundsenMeetup SF  -  Amundsen
Meetup SF - Amundsen
 
Database novelty detection
Database novelty detectionDatabase novelty detection
Database novelty detection
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discovery
 
Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise D...
Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise D...Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise D...
Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise D...
 
Uncover Your Data Journey: End-To-End Data Lineage For SAP BOBJ And SAP Data ...
Uncover Your Data Journey: End-To-End Data Lineage For SAP BOBJ And SAP Data ...Uncover Your Data Journey: End-To-End Data Lineage For SAP BOBJ And SAP Data ...
Uncover Your Data Journey: End-To-End Data Lineage For SAP BOBJ And SAP Data ...
 
The power of polyglot searching
The power of polyglot searchingThe power of polyglot searching
The power of polyglot searching
 
Power of Polyglot Search
Power of Polyglot SearchPower of Polyglot Search
Power of Polyglot Search
 
Customer Feedback Analytics for Starbucks
Customer Feedback Analytics for Starbucks Customer Feedback Analytics for Starbucks
Customer Feedback Analytics for Starbucks
 
Accelerating Data Lakes and Streams with Real-time Analytics
Accelerating Data Lakes and Streams with Real-time AnalyticsAccelerating Data Lakes and Streams with Real-time Analytics
Accelerating Data Lakes and Streams with Real-time Analytics
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
 
Big Data Management: What's New, What's Different, and What You Need To Know
Big Data Management: What's New, What's Different, and What You Need To KnowBig Data Management: What's New, What's Different, and What You Need To Know
Big Data Management: What's New, What's Different, and What You Need To Know
 
The BI Sandbox
The BI SandboxThe BI Sandbox
The BI Sandbox
 

Mehr von Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Mehr von Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Kürzlich hochgeladen

代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 

Kürzlich hochgeladen (20)

代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 

Learning to Rank Datasets for Search with Oscar Castaneda

  • 1. Oscar Castañeda, Xoom a PayPal Service Learning to Rank Datasets for Search #SAISDS8
  • 2. #SAISDS8 About • Data Scientist at Xoom a PayPal service. • Interests: • Data Management, • Dataset Search, • Learning to Rank. 2
  • 3. Spark cluster with Elasticsearch http://bit.ly/2em6RUKhttp://bit.ly/2ebM9HO
  • 4. And Indexed RDatasets Spark cluster with Elasticsearch Inside
  • 5. 5 Learning to Rank Datasets #SAISDS8
  • 6. 6 Learning to Rank Datasets for Search! #SAISDS8
  • 7. 7 Agenda • Problem Statement and Motivation • Elasticsearch Learning to Rank • Data Pipeline: metadata extraction, judgement list extraction • Demo: Beginnings of a Dataset Search Engine with Machine- learned relevance ranking. • Q&A #SAISDS8
  • 8. 8 Problem Statement (1) • Despite datasets being a key corporate asset they are generally not given the importance they deserve and as a result they are hard to find. #SAISDS8
  • 9. 9 Problem Statement (2) • Specifically, teams within organizations have a hard time finding datasets relevant to their function. #SAISDS8
  • 11. 11 Topics • Indexing (Spark Summit East 2017). #SAISDS8
  • 12. 12 Topics • Indexing (Spark Summit East 2017). • Ranking #SAISDS8
  • 13. 13 Topics • Indexing (Spark Summit East 2017). • Ranking => today’s topic! #SAISDS8
  • 14. 14 Questions • How are datasets ranked? • Can judgement lists (useful for ranking) be generated at dataset production time? #SAISDS8
  • 15. 15 Overview Rdatasets Take ES snapshot Restore ES snapshot http://bit.ly/2e5H1jL #SAISDS8
  • 16. 16 Overview Rdatasets Data Pipelines: •Extract Filename, Format, Field description. •Extract Type information. •Index CSV files. Extract Filename Extract Type information. Index CSV files. Extract Format Extract Field description #SAISDS8
  • 19. 19 Motivation (1) • Organizing, indexing and ranking Datasets: #SAISDS8
  • 20. 20 Motivation (1) • Organizing, indexing and ranking Datasets: • Produced by individual data pipelines • On Data Lake(s) #SAISDS8
  • 21. 21 Motivation (2) • Produce a ranking function for datasets that are generated as part of running data pipelines. #SAISDS8
  • 22. 22 Motivation (2) • Produce a ranking function for datasets that are generated as part of running data pipelines. • Extract “relevance judgements” and use them to bootstrap a dataset rank model (a posteriori vs. post hoc (Halevy et al., 2016)). #SAISDS8
  • 23. 23 Motivation (2) • Produce a ranking function for datasets that are generated as part of running data pipelines. • Extract “relevance judgements” and use them to bootstrap a dataset rank model (a posteriori vs. post hoc (Halevy et al., 2016)). • In a feedback loop leveraging click-through data on dataset profile pages. #SAISDS8
  • 24. 24 Organizing, Indexing and Ranking Datasets Index dataset features Input data Data Pipeline Create dataset profile pages. Using Tableau Javascript API ES Cluster Fetch dataset from Datalake and run Data Pipeline on demand. Search Fetch dataset from Datalake. Dataset profile Team Dashboard Feedback logging Relevance judgements #SAISDS8
  • 25. 25 Organizing, Indexing and Ranking Datasets Input data Data Pipeline #SAISDS8
  • 26. Search 26 Organizing, Indexing and Ranking Datasets Input data Data Pipeline ES Cluster Index dataset features #SAISDS8
  • 27. Index dataset features Search 27 Organizing, Indexing and Ranking Datasets Input data Data Pipeline ES Cluster Feedback logging Relevance judgements #SAISDS8
  • 28. Index dataset features Search 28 Organizing, Indexing and Ranking Datasets Input data Data Pipeline ES Cluster Feedback logging Relevance judgements #SAISDS8 New
  • 29. Index dataset features Search 29 Organizing, Indexing and Ranking Datasets Input data Data Pipeline ES Cluster Feedback logging Relevance judgements #SAISDS8 New New
  • 30. Search 30 Organizing, Indexing and Ranking Datasets Input data Data Pipeline ES Cluster Index dataset features Feedback logging Relevance judgements Create dataset profile pages. Dataset profile #SAISDS8
  • 31. Search 31 Organizing, Indexing and Ranking Datasets Input data Data Pipeline ES Cluster Index dataset features Feedback logging Relevance judgements Create dataset profile pages. Dataset profile Team Dashboard #SAISDS8
  • 32. Search 32 Organizing, Indexing and Ranking Datasets Input data Data Pipeline ES Cluster Index dataset features Feedback logging Relevance judgements Create dataset profile pages. Dataset profile Team Dashboard Fetch dataset from Datalake and run Data Pipeline on demand. Fetch dataset from Datalake. #SAISDS8
  • 33. Search 33 Organizing, Indexing and Ranking Datasets Input data Data Pipeline ES Cluster Index dataset features Feedback logging Relevance judgements Create dataset profile pages. Dataset profile Team Dashboard Fetch dataset from Datalake and run Data Pipeline on demand. Fetch dataset from Datalake. #SAISDS8
  • 34. 34 How do you rank datasets? #SAISDS8
  • 35. Ranking datasets 35 • Extraction of “relevance judgements” can be built into data pipeline for specific datasets immediately after they are generated. #SAISDS8
  • 36. Ranking datasets 36 • Extraction of “relevance judgements” can be built into data pipeline for specific datasets immediately after they are generated. • Used to produce a ranking function for datasets. #SAISDS8
  • 37. Ranking datasets 37 • Extraction of “relevance judgements” can be built into data pipeline for specific datasets immediately after they are generated. • Used to produce a ranking function for datasets. • Leveraged for training #SAISDS8
  • 38. Ranking datasets 38 • Extraction of “relevance judgements” can be built into data pipeline for specific datasets immediately after they are generated. • Used to produce a ranking function for datasets. • Leveraged for training • And to bootstrap a dataset rank model #SAISDS8
  • 39. Ranking datasets 39 • Extraction of “relevance judgements” can be built into data pipeline for specific datasets immediately after they are generated. • Used to produce a ranking function for datasets. • Leveraged for training • And to bootstrap a dataset rank model • (a posteriori vs. post hoc (Halevy et al., 2016).) #SAISDS8
  • 40. Ranking datasets 40 • Click-through data provides implicit feedback useful to adjust initial relevance judgements. #SAISDS8
  • 41. Ranking datasets 41 • Click-through data provides implicit feedback useful to adjust initial relevance judgements. • Leveraging click-through data on dataset profile pages. #SAISDS8
  • 42. Ranking datasets 42 • Click-through data provides implicit feedback useful to adjust initial relevance judgements. • Leveraging click-through data on dataset profile pages. #SAISDS8 Create dataset profile pages. Dataset profile
  • 43. 43 • Alon et al (2016) advocate finding data in a post- hoc manner by collecting and aggregating metadata after datasets are created or updated. • We propose a so-called “a posteriori” approach where metadata is generated as part of running pipelines using Spark. A posteriori vs. Post-hoc #SAISDS8
  • 44. 44 • Alon et al (2016) advocate indexing • We prefer indexing A posteriori vs. Post-hoc #SAISDS8 after the fact immediately after the fact
  • 45. 45 Pros • “Relevance judgements” can be extracted and leveraged to bootstrap a ranked dataset index. • In a feedback loop leveraging click-through data on Dataset profile pages. • More granular metrics available to evaluate metadata regeneration. #SAISDS8 immediately after the fact
  • 46. 46 Cons • Offline model development is disconnected and only indirectly part of feedback using click-through data. • Looking at trees instead of the forest. • Need to replay indexing pipeline when things change (per data pipeline). #SAISDS8 immediately after the fact
  • 48. 48#SAISDS8 Demo Scenario • Movies represent datasets • TMDB movie pages represent dataset profile pages. • Marketing team (also called WAR team) interested in War movies (datasets).
  • 53. 53 What have we seen? • How to rank datasets on Elasticsearch using LTR. • Extract relevance judgements immediately after datasets are generated in Spark. • Demo: Dataset Search with Spark and Elasticsearch LTR. #SAISDS8
  • 54. 54 Next Steps (1) • Describe Datasets in a structured schema.org way using Data Catalog Vocabulary [2]. • Build a knowledge graph and use GraphX to extract insights. (Useful e.g. for column concept determination (Deng et al. 2013)). • Build topic models based on structured Datasets using Glint to perform scalable topic model extraction in Spark (Jagerman and Eickhoff, 2016) [1]. [1] https://spark-summit.org/eu-2016/events/glint-an-asynchronous-parameter-server-for-spark/ #SAISDS8
  • 55. 55 References • Alon Y. Halevy, Flip Korn, Natalya Fridman Noy, Christopher Olston, Neoklis Polyzotis, Sudip Roy, and Steven Euijong Whang. Goods: Organizing google’s datasets. In Fatmañzcan, Georgia Koutrika, and Sam Madden, editors, Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 795–806. ACM, 2016. ISBN 978-1-4503-3531-7. doi: http://doi.acm.org/10.1145/2882903.2903730. • Katja Hofmann. Fast and Reliable Online Learning to Rank for Information Retrieval. PhD thesis, Informatics Institute, University of Amsterdam, May 2013. • Rolf Jagerman and Carsten Eickhoff. Web-scale topic models in spark: An asynchronous parameter server. CoRR, abs/ 1605.07422, 2016. URL http://arxiv.org/abs/1605. 07422. • Dong Deng, Yu Jiang, Guoliang Li, Jian Li, and Cong Yu. Scalable column concept de- termination for web tables using large knowledge bases. PVLDB, 6(13):1606–1617, 2013. doi: http://www.vldb.org/pvldb/vol6/p1606-li.pdf. • Anne Schuth, Harrie Oosterhuis, Shimon Whiteson, and Maarten de Rijke. Multileave gradient descent for fast online learning to rank. In WSDM 2016: The 9th International Conference on Web Search and Data Mining, pages 457-466. ACM, February 2016. • Sreeram Balakrishnan, Alon Y. Halevy, Boulos Harb, Hongrae Lee, Jayant Madhavan, Afshin Rostamizadeh, Warren Shen, Kenneth Wilder, Fei Wu 0003, and Cong Yu. Ap- plying webtables in practice. In CIDR 2015, Seventh Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2015, Online Proceedings. www.cidrdb.org, 2015. #SAISDS8