Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...OpenSource Connections
HathiTrust is a shared digital repository containing over 17 million scanned books from over 140 member libraries, totaling around 5 billion pages. It faces challenges in providing large-scale full-text search across this multilingual collection, where document quality and structure vary. Initial approaches involved a two-tiered index, but relevance ranking must still balance the weight given to full text against the weight given to much shorter metadata fields. Algorithms like BM25 need further tuning to rank the collection's long documents properly against metadata.
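To make the length issue concrete, here is a minimal sketch of Okapi BM25 scoring over pre-tokenized documents. It is not HathiTrust's implementation; the function name `bm25_scores` and the defaults `k1=1.2` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` (hypothetical helper)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # b controls how strongly document length dampens term frequency.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# A one-word "metadata" field and a 100-word "full text" that both mention the term once.
docs = [["whale"], ["whale"] + ["filler"] * 99]
print(bm25_scores(["whale"], docs))         # the short field scores higher
print(bm25_scores(["whale"], docs, b=0.0))  # length normalization switched off
```

With `b` near 1 a single match in a short field outranks the same match in a long text; with `b=0` the two score the same, which is the kind of trade-off the talk describes.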
Vectors in Search - Towards More Semantic Matching - Simon Hughes
With the advent of deep learning and algorithms like word2vec and doc2vec, vector-based representations are increasingly being used in search to represent anything from documents to images and products. However, search engines work with documents made of tokens, not vectors, and are typically not designed for fast vector matching out of the box. In this talk, I will give an overview of how vectors can be derived from documents to produce a semantic representation of a document that can be used to implement semantic / conceptual search without hurting performance. I will then describe a few different techniques for efficiently searching vector-based representations in an inverted index, such as learning sparse representations of vectors, clustering, and learning binary vectors. Finally, I will discuss some of the pitfalls of vector-based search, and how to get the best of both worlds by combining vector-based scoring with traditional relevancy metrics such as BM25.
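One simple way to combine the two signals is a linear blend of a normalized BM25 score with cosine similarity. The sketch below assumes that approach; the talk does not specify a formula, and `hybrid_rank`, `alpha`, and the candidate tuple layout are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def hybrid_rank(query_vec, candidates, alpha=0.5):
    """Rank (doc_id, bm25_score, doc_vec) tuples by a blended score.

    alpha=1.0 uses only BM25; alpha=0.0 uses only vector similarity.
    """
    # Scale BM25 into [0, 1] so it is comparable with cosine similarity.
    max_bm25 = max((c[1] for c in candidates), default=0.0) or 1.0
    scored = []
    for doc_id, bm25, vec in candidates:
        blended = alpha * (bm25 / max_bm25) + (1 - alpha) * cosine(query_vec, vec)
        scored.append((blended, doc_id))
    return [doc_id for _, doc_id in sorted(scored, key=lambda p: p[0], reverse=True)]


candidates = [("a", 10.0, [1.0, 0.0]), ("b", 2.0, [0.0, 1.0])]
print(hybrid_rank([0.0, 1.0], candidates, alpha=1.0))  # keyword signal only
print(hybrid_rank([0.0, 1.0], candidates, alpha=0.0))  # vector signal only
```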
Introduction to natural language processing (NLP) - Alia Hamwi
The document provides an introduction to natural language processing (NLP). It defines NLP as a field of artificial intelligence devoted to creating computers that can use natural language as input and output. Some key NLP applications mentioned include data analysis of user-generated content, conversational agents, translation, classification, information retrieval, and summarization. The document also discusses various linguistic levels of analysis like phonology, morphology, syntax, and semantics that involve ambiguity challenges. Common NLP tasks like part-of-speech tagging, named entity recognition, parsing, and information extraction are described. Finally, the document outlines the typical steps in an NLP pipeline including data collection, text cleaning, preprocessing, feature engineering, modeling and evaluation.
This talk will feature some of my recent research into the alternative uses for Solr facets and facet metadata. I will develop the idea that facets can be used to discover similarities between items and attributes in a search index, and show some interesting applications of this idea. A common takeaway is that using facets and facet metadata in non-conventional ways enables the semantic context of a query to be automatically tuned. This has important implications for user-centric and semantically focused relevance.
This document provides an introduction to text mining, including defining key concepts such as structured vs. unstructured data, why text mining is useful, and some common challenges. It also outlines important text mining techniques like pre-processing text through normalization, tokenization, stemming, and removing stop words to prepare text for analysis. Text mining methods can be used for applications such as sentiment analysis, predicting markets or customer churn.
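The pre-processing steps named above can be sketched with the standard library alone. The stop-word list and the suffix-stripping `crude_stem` below are illustrative assumptions, not the deck's code; a real pipeline would use a proper stemmer such as Porter's.

```python
import re

# A deliberately tiny, illustrative stop-word list.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in"}


def crude_stem(token):
    """Strip a few common English suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Normalize, tokenize, drop stop words, then stem."""
    normalized = text.lower()                       # normalization
    tokens = re.findall(r"[a-z0-9]+", normalized)   # tokenization
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("The cats are chasing the mice."))
```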
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies - Max Irwin
Presentation as given to the Haystack Conference, which outlines research and techniques for automatic extraction of keywords, concepts, and vocabularies from text corpora.
The document provides an overview of the semantic web, which aims to extend the current web by giving information well-defined meaning. It discusses issues with traditional web searches and outlines the semantic web technology stack, including metadata, knowledge representation using XML and RDF, and ontologies with taxonomies and inference rules. The conclusion covers benefits of the semantic web, such as improved search accuracy and the ability to address complex questions, along with references for further reading.
This document summarizes Max De Marzi's presentation on ETL (extract, transform, load) processes for loading data into Neo4j. It discusses using the Neo4j REST API, Gremlin and Groovy, and the Neo4j Batch Importer for ETL. It also provides an example of ETL from a SQL database by identifying relationships between rows and importing the data without node IDs.
This document summarizes two Arabic question answering systems: QASAL and QARAB. It describes the main components of each system, including question analysis, passage retrieval, and answer extraction. It also discusses how each system handles yes/no questions in Arabic. The document concludes by comparing the performance of the two systems and different techniques for Arabic question answering.
Lazy man's learning: How To Build Your Own Text Summarizer - Sho Fola Soboyejo
This document discusses different approaches to text summarization, including extractive and abstractive summarization. It presents several naive extractive algorithms using word frequency, sentence intersection scores, and graph theory. It also discusses using neural networks with encoder-decoder models and attention mechanisms for abstractive summarization. The document provides resources for practicing summarization techniques and accessing text datasets.
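As an illustration of the naive word-frequency approach, the sketch below scores each sentence by the average corpus frequency of its words and keeps the top few in their original order. The function name `summarize` and the regex-based sentence splitter are assumptions, not the deck's code.

```python
import re
from collections import Counter


def summarize(text, max_sentences=2):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)


text = ("Search engines rank documents. Search engines index documents. "
        "Bananas are yellow.")
print(summarize(text, max_sentences=2))
```

A practical version would also drop stop words before counting, since very common words otherwise dominate the scores.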
JSON is a lightweight data format that is widely used for data interchange on the web. It stands for JavaScript Object Notation and uses human-readable text to transmit data objects consisting of attribute-value pairs and arrays. JSON syntax is derived from JavaScript object literal notation and is supported by many modern programming languages, making it ideal for data interchange. The document provides examples of JSON objects, arrays, and nested structures and explains how JSON is commonly used with web services to retrieve and display data in web pages.
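A short example of the structures described above, parsed with Python's standard `json` module. The record itself is made up for illustration.

```python
import json

# An object holding attribute-value pairs, an array, and a nested object.
payload = """
{
  "title": "Moby-Dick",
  "year": 1851,
  "tags": ["whale", "sea"],
  "author": {"name": "Herman Melville"}
}
"""

record = json.loads(payload)          # JSON text -> Python dict
print(record["author"]["name"])       # nested object access
print(record["tags"][0])              # array access

# Serialize back to JSON text, e.g. for a web service response body.
print(json.dumps(record, indent=2))
```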
A brief survey presentation on Arabic Question Answering (AQA), covering the Natural Language Processing and Information Retrieval approaches to question analysis, passage retrieval, and answer extraction. It also lists the NLP tools used in AQA and the challenges and future trends in this area.
To cite this paper, download it here:
http://www.acit2k.org/ACIT/2012Proceedings/13106.pdf
Presentation of Domain Specific Question Answering System Using N-gram Approach - Tasnim Ara Islam
Designed an application for a domain-specific question answering system. Built a solution that finds answers to factoid questions using an N-gram mining approach, and calculated a percentage for the answers related to each question. The application was built on the Java platform.
Question Answering - Application and Challenges - Jens Lehmann
This document provides an overview of question answering applications and challenges. It defines question answering as receiving natural language questions and providing concise answers. Recent developments in question answering systems are discussed, including IBM Watson. Challenges for question answering over semantic data are explored, such as lexical gaps, ambiguity, granularity, and alternative resources. Large-scale linguistic resources and machine learning approaches for question answering are also covered. Applications of question answering technologies are examined.
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...Lucidworks
The document discusses leveraging Lucene/Solr as a knowledge graph and intent engine. It describes building an intent engine that incorporates type-ahead prediction, spelling correction, entity and entity-type resolution, semantic query parsing, and query augmentation using a knowledge graph. The intent engine aims to understand the user's intent beyond the literal query string and help express their intent through an interactive search experience.
An overview of some core concepts in natural language processing, some example (experimental for now!) use cases, and a brief survey of some tools I have explored.
Natural Language Processing, Techniques, Current Trends and Applications in I...RajkiranVeluri
The document discusses natural language processing (NLP) techniques, current trends, and applications in industry. It covers common NLP techniques like morphology, syntax, semantics, and pragmatics. It also discusses word embeddings like Word2Vec and contextual embeddings like BERT. Finally, it discusses applications of NLP in healthcare like analyzing clinical notes and brand monitoring through sentiment analysis of user reviews.
Detecting and Describing Historical Periods in a Large Corpora - Traian Rebedea
Many historic periods (or events) are remembered by slogans, expressions or words that are strongly linked to them. Educated people are also able to determine whether a particular word or expression is related to a specific period in human history. The present paper aims to establish correlations between significant historic periods (or events) and the texts written in that period. In order to achieve this, we have developed a system that automatically links words (and topics discovered using Latent Dirichlet Allocation) to periods of time in the recent history. For this analysis to be relevant and conclusive, it must be undertaken on a representative set of texts written throughout history. To this end, instead of relying on manually selected texts, the Google Books Ngram corpus has been chosen as a basis for the analysis. Although it provides only word n-gram statistics for the texts written in a given year, the resulting time series can be used to provide insights about the most important periods and events in recent history, by automatically linking them with specific keywords or even LDA topics.
This document provides an outline and overview of a seminar on text mining. It discusses basics of text mining including definitions, similarities to data mining, preprocessing operations, document features, and representational models of documents. It also describes general architectures of text mining systems and provides examples of system architectures for generic, domain-oriented, and advanced text mining systems with background knowledge bases.
This document summarizes discovery service adoption rates among major library vendors. It reports that EBSCO has the largest number of subscribers to its discovery service (EDS) at 5,612 libraries. OCLC reports 1,717 libraries using WorldCat Local, and Ex Libris has licensed Primo to 1,407 libraries. The document also provides subscriber numbers for ProQuest Summon. It examines themes from user research on discovery services and outlines features and capabilities of EBSCO's EDS product.
Interleaving, Evaluation to Self-learning Search @904Labs - John T. Kane
Presented at the Open Source Connections Haystack Relevance Conference: 904Labs' "Interleaving: from Evaluation to Self-Learning". 904Labs is the first to commercialize "Online Learning to Rank", a state-of-the-art technique for self-learning search ranking that automatically takes your customers' behavior into account to personalize search results.
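For readers unfamiliar with interleaving, the sketch below shows team-draft interleaving, one commonly used variant: two rankers take turns contributing their best unseen result, and clicks are credited to the ranker that contributed the clicked item. Whether 904Labs uses this variant is not stated here; the function names are hypothetical.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, rng=None):
    """Merge two rankings; return the merged list and a doc -> team ("A"/"B") map."""
    if rng is None:
        rng = random.Random()
    interleaved, team = [], {}
    total = len(set(ranking_a) | set(ranking_b))
    count_a = count_b = 0
    while len(interleaved) < total:
        # The team with fewer picks goes next; ties are broken by a coin flip.
        a_turn = count_a < count_b or (count_a == count_b and rng.random() < 0.5)
        source, label = (ranking_a, "A") if a_turn else (ranking_b, "B")
        pick = next((d for d in source if d not in team), None)
        if pick is None:
            # This ranker has nothing new left, so the other one picks instead.
            source, label = (ranking_b, "B") if a_turn else (ranking_a, "A")
            pick = next(d for d in source if d not in team)
        team[pick] = label
        interleaved.append(pick)
        if label == "A":
            count_a += 1
        else:
            count_b += 1
    return interleaved, team


def credit(team, clicked):
    """Count clicks per team; the ranker with more credited clicks wins the comparison."""
    wins = {"A": 0, "B": 0}
    for doc in clicked:
        if doc in team:
            wins[team[doc]] += 1
    return wins


mixed, team = team_draft_interleave(["d1", "d2", "d3"], ["d3", "d1", "d4"], random.Random(0))
print(mixed, credit(team, clicked=[mixed[0]]))
```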
Lucene/Solr Revolution 2015: Where Search Meets Machine Learning - Joaquin Delgado PhD.
Search engines have focused on solving the document retrieval problem, so their scoring functions do not naturally handle non-traditional IR data types, such as numerical or categorical data. Therefore, in domains beyond traditional search, scores representing strengths of associations or matches may vary widely. As such, the original model doesn’t suffice, so relevance ranking is performed as a two-phase approach: 1) a regular search, then 2) an external model that re-ranks the filtered items. Metrics such as click-through and conversion rates are associated with the users’ response to items served. The predicted selection rates that arise in real-time can be critical for optimal matching. For example, in recommender systems, predicted performance of a recommended item in a given context, also called response prediction, is often used in determining a set of recommendations to serve in relation to a given serving opportunity. Similar techniques are used in the advertising domain. To address this issue the authors have created ML-Scoring, an open source framework that tightly integrates machine learning models into a popular search engine (SOLR/Elasticsearch), replacing the default IR-based ranking function. A custom model is trained through either Weka or Spark and is loaded as a plugin used at query time to compute custom scores.
Arabic is the 6th most wide-spread natural language in the world with more than 350 million native speakers. Arabic question answering systems are gaining great significance due to the increasing amounts of Arabic unstructured content on the Internet and the increasing demand for information that regular information retrieval techniques do not satisfy. Question answering systems generally, and Arabic systems are no exception, hit an upper bound of performance due to the propagation of error in their pipeline. This increases the significance of answer selection and validation systems as they enhance the certainty and accuracy of question answering systems. Very few works tackled the Arabic answer selection and validation problem, and they used the same question answering pipeline without any changes to satisfy the requirements of answer selection and validation. That is why they did not perform adequately well in this task. In this dissertation, a new approach to Arabic answer selection and validation is presented through “ALQASIM”, which is a QA4MRE (Question Answering for Machine Reading Evaluation) system. ALQASIM analyzes the reading test documents instead of the questions, utilizes sentence splitting, root expansion, and semantic expansion using an ontology built from the CLEF 2012 background collections. Our experiments have been conducted on the test-set provided by CLEF 2012 through the task of QA4MRE. This approach led to a promising performance of 0.36 Accuracy and 0.42 C@1, which is double the performance of the best performing Arabic QA4MRE system.
Publications:
http://scholar.google.com/citations?user=XGJiEioAAAAJ&hl=en
https://aast.academia.edu/AhmedMagdy
Vectors in Search – Towards More Semantic Matching - Simon Hughes, Dice.com Lucidworks
This document summarizes Simon Hughes' presentation on using vector representations for semantic matching in search. It discusses using word embeddings to learn vector representations of words that capture their semantic meaning based on context. Approaches for searching with word embeddings include expanding queries with related terms from the embedding model or clustering the embeddings and mapping queries to clusters. The document also covers techniques for indexing and searching vector representations in an inverted index, such as using locality-sensitive hashing or k-means trees to map vectors to discrete tokens that can be indexed.
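To show how a dense vector can become a token that an inverted index can store, here is a minimal sketch of random-hyperplane locality-sensitive hashing. It is one LSH scheme among several and not necessarily the one used in the presentation; `make_hyperplanes`, `lsh_token`, and the `lsh_` prefix are hypothetical names.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (as normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_token(vector, hyperplanes, prefix="lsh_"):
    """Map a vector to a discrete token: one bit per hyperplane, set by the side it falls on."""
    bits = "".join(
        "1" if sum(h_i * v_i for h_i, v_i in zip(h, vector)) >= 0 else "0"
        for h in hyperplanes
    )
    return prefix + bits


planes = make_hyperplanes(dim=3, n_bits=8)
doc_vec = [0.2, -0.5, 0.9]
# The token can be indexed like any other term; vectors pointing in similar
# directions tend to share it (or most of its bits).
print(lsh_token(doc_vec, planes))
```

Because only the sign of each projection matters, the token depends on a vector's direction and not its length, which matches cosine-style similarity.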
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Lucidworks
The document discusses implementing conceptual search in Solr. It describes how conceptual search aims to improve recall without reducing precision by matching documents based on concepts rather than keywords alone. It explains how Word2Vec can be used to learn related concepts from documents and represent words as vectors, which can then be embedded in Solr through synonym filters and payloads to enable conceptual search queries. This allows retrieving more relevant documents that do not contain the exact search terms but are still conceptually related.
Improving Search in Workday Products using Natural Language Processing - DataWorks Summit
Workday is a leading provider of cloud-based enterprise software products such as Human Capital Management, Talent, Finance, Student, Planning etc. These products produce a wealth of natural language data. However, this data is unstructured and denormalized. Retrieving relevant information from such data is a challenging task. Using simple index-based search methods can only take us so far. The Data Science team at Workday is determined to apply Machine Learning and AI to make search better across Workday’s products.
In this session, we present how we use word embeddings to normalize the data and add structure to it. We will also talk about using word representations to make search intelligent. The specific use cases we will discuss are synonym detection and entity recommendation.
In this talk, we will focus on the word-embedding techniques explored, the metrics used to evaluate Natural Language Processing models, the tools built, and future work on improving search.
Speakers
Namrata Ghadi, Workday Inc, Software Development Engineer (Data Science)
Adam Baker, Workday Inc, Sr Software Engineer
This tutorial gives an overview of how search engines and machine learning techniques can be tightly coupled to address the need for building scalable recommender or other prediction-based systems. Most such systems architect retrieval and prediction in two phases. In Phase I, a search engine returns the top-k results based on constraints expressed as a query. In Phase II, the top-k results are re-ranked in another system according to an optimization function that uses a supervised trained model. However, this approach presents several issues, such as the possibility of returning sub-optimal results due to the top-k limits during query, as well as the presence of some inefficiencies in the system due to the decoupling of retrieval and ranking.
To address this issue the authors created ML-Scoring, an open source framework that tightly integrates machine learning models into Elasticsearch, a popular search engine. ML-Scoring replaces the default information retrieval ranking function with a custom supervised model that is trained through Spark, Weka, or R and loaded as a plugin in Elasticsearch. This tutorial will not only review basic methods in information retrieval and machine learning, but it will also walk through practical examples, from loading a dataset into Elasticsearch, to training a model in Spark, Weka, or R, to creating the ML-Scoring plugin for Elasticsearch. No prior experience is required in any system listed (Elasticsearch, Spark, Weka, R), though some programming experience is recommended.
In this session, we present to you, how we use word embeddings to normalize the data and add structure to it. We will also talk about using word representations to make search intelligent. The specific use cases we will discuss are adding synonyms detection and entity-recommendation.
In this talk, we will focus on the word-embeddings techniques explored, metrics used to evaluate Natural Language Processing Models, tools built, and future work as a part of improving search.
Speaker
Namrata Ghadi, Workday Inc, Software Development Engineer (Data Science)
Adam Baker, Workday Inc, Sr Software Engineer
This tutorial gives an overview of how search engines and machine learning techniques can be tightly coupled to address the need for building scalable recommender or other prediction based systems. Typically, most such systems architect retrieval and prediction in two phases. In Phase I, a search engine returns the top-k results based on constraints expressed as a query. In Phase II, the top-k results are re-ranked in another system according to an optimization function that uses a supervised trained model. However, this approach presents several issues, such as the possibility of returning sub-optimal results due to the top-k limit at query time, as well as inefficiencies in the system due to the decoupling of retrieval and ranking.
To address this issue the authors created ML-Scoring, an open source framework that tightly integrates machine learning models into Elasticsearch, a popular search engine. ML-Scoring replaces the default information retrieval ranking function with a custom supervised model that is trained through Spark, Weka, or R that is loaded as a plugin in Elasticsearch. This tutorial will not only review basic methods in information retrieval and machine learning, but it will also walk through practical examples from loading a dataset into Elasticsearch to training a model in Spark, Weka, or R, to creating the ML-Scoring plugin for Elasticsearch. No prior experience is required in any system listed (Elasticsearch, Spark, Weka, R), though some programming experience is recommended.
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...S. Diana Hu
An introduction to Metadata Application Profileskcoylenet
These slides are from a DCMI/ASIS&T webinar on metadata application profiles. It gives a high level introduction to profiles, provides examples of what they might look like, and shows some work being done through W3C and DCMI.
First Steps in Semantic Data Modelling and Search & Analytics in the CloudOntotext
This webinar will break the roadblocks that prevent many from reaping the benefits of heavyweight Semantic Technology in small scale projects. We will show you how to build Semantic Search & Analytics proof of concepts by using managed services in the Cloud.
Dice.com Bay Area Search - Beyond Learning to Rank TalkSimon Hughes
This talk describes how to implement conceptual search (semantic search) within a modern search engine using the word2vec algorithm to learn concepts. We also cover how to auto-tune the search engine parameters using black box optimization techniques, and the problems of feedback loops encountered when building machine learning systems that modify the user behavior used to train the system.
AI presentation and introduction - Retrieval Augmented Generation RAG 101vincent683379
Brief Introduction to Generative AI and LLM in particular.
Overview of the market, and usages of LLMs.
What's it like to train and build a model.
Retrieval Augmented Generation 101, explained for non savvies, and a perspective of what are the moving parts making it complex
The document discusses metadata standards and practices. It begins by asking questions about how digital information is organized and found. It then discusses challenges like having to do new tasks without full knowledge and learning from others. The document provides overviews of various metadata standards like MODS, MIX, PREMIS, METS, and TEI. It also discusses topics such as metadata schemas, subject metadata, indexing metadata, and search relevance. Throughout, it offers advice on evaluating and implementing metadata standards.
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.comSimon Hughes
In the talk I describe two approaches for improving the recall and precision of an enterprise search engine using machine learning techniques. The main focus is improving relevancy with ML while using your existing search stack, be that Lucene, Solr, Elastic Search, Endeca or something else.
Why do they call it Linked Data when they want to say...?Oscar Corcho
The four Linked Data publishing principles established in 2006 seem to be quite clear and well understood by people inside and outside the core Linked Data and Semantic Web community. However, not only when discussing with outsiders about the goodness of Linked Data but also when reviewing papers for the COLD workshop series, I find myself, in many occasions, going back again to the principles in order to see whether some approach for Web data publication and consumption is actually Linked Data or not. In this talk we will review some of the current approaches that we have for publishing data on the Web, and we will reflect on why it is sometimes so difficult to get into an agreement on what we understand by Linked Data. Furthermore, we will take the opportunity to describe yet another approach that we have been working on recently at the Center for Open Middleware, a joint technology center between Banco Santander and Universidad Politécnica de Madrid, in order to facilitate Linked Data consumption.
This document describes an approach for bridging the gap between natural language queries and linked data concepts using BabelNet. The approach uses BabelNet for word sense disambiguation, named entity recognition and disambiguation. It parses queries, matches terms to ontology concepts and properties, generates candidate triples, and integrates the triples to produce SPARQL queries. The approach was evaluated on test data from QALD-2, achieving a promising 76% of questions answered correctly.
This document provides an introduction and overview of Neo4j and graph databases. It begins with an explanation of the limitations of relational databases in modeling relationships and includes slides on Neo4j's native graph data model and architecture. Additional slides cover Neo4j use cases, modeling with graphs, the Neo4j platform and features like the cloud, drivers, and visualization tools. The document concludes with examples of recommender systems queries in Cypher.
Slides: Concurrent Inference of Topic Models and Distributed Vector Represent...Parang Saraf
Abstract: Topic modeling techniques have been widely used to uncover dominant themes hidden inside an unstructured document collection. Though these techniques first originated in the probabilistic analysis of word distributions, many deep learning approaches have been adopted recently. In this paper, we propose a novel neural network based architecture that produces distributed representation of topics to capture topical themes in a dataset. Unlike many state-of-the-art techniques for generating distributed representation of words and documents that directly use neighboring words for training, we leverage the outcome of a sophisticated deep neural network to estimate the topic labels of each document. The networks, for topic modeling and generation of distributed representations, are trained concurrently in a cascaded style with better runtime without sacrificing the quality of the topics. Empirical studies reported in the paper show that the distributed representations of topics represent intuitive themes using smaller dimensions than conventional topic modeling approaches.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
Red Hat Summit Connect 2023 - Redis Enterprise, the engine of Generative AILuigi Fugaro
Redis is known as a real-time database that can be used as a cache, to store user sessions, authentication tokens and JSON documents, to manage real-time inventories and geographic data, as a feature store in machine learning scenarios, and for queue management, brokers, streams and much more. But not everyone knows that Redis can store and index embedding vectors, the data structures that underpin applications such as ChatGPT. In this talk, we will explore how to use Redis as a vector database to implement modern use cases.
State of Search 2017 - Semantics and Science - Upasna GautamUpasna Gautam
What is latent semantic indexing, how does Google use it, and how understanding this core functionality of the Google algorithm will help you create better content.
Efficient Estimation of Word Representations in Vector Space, by T. Mikolov et al. (2013). Continuous vector representations of words by learning its context words.
Guus Schreiber gave a talk on knowledge engineering and the web. He discussed representing web data using standards like RDF and HTML5. He explained how categorization systems like SKOS, FOAF, and schema.org organize knowledge on the web. Schreiber also discussed aligning different category systems and using knowledge graphs for search and visualization, like locating artworks and finding relationships between artists. He emphasized modestly enriching and aligning existing vocabularies rather than creating new idiosyncratic ontologies.
Enriching Solr with Deep Learning for a Question Answering System - Sanket Sh...Lucidworks
The document discusses research into using deep learning to improve question answering systems. It describes using Solr to retrieve documents and then using machine learning models to rerank the results. The research compared various supervised and unsupervised models for question similarity and answer selection tasks. For question similarity, ensemble models using TFIDF and sentence embeddings performed best. For answer selection, deep learning models outperformed traditional models when sufficient training data was available.
2. Who Am I?
• Chief Data Scientist at DHI (owns Dice.com)
• Key Projects:
• Search and Match
• Dice Recommender Systems
• Dice Job Search
• Dice Talent Search 3.0 and 4.0
• Dice Skill Center
• Dice Career Advisory Pages
• Dice Salary Predictor
• Dice Career Paths
• PhD Candidate DePaul University
• Subject Area – Machine Learning and NLP
• Thesis – Extracting Causal Relations from Scientific Essays
• Contact Info:
• Email: simon.hughes@dhigroupinc.com
• Twitter: https://twitter.com/hughes_meister
3. Motivation
• Dice.com - leading US technology professional job board
• Jobs marketplace
• We connect technology talent with employers
• High quality searching and matching are critical to our value proposition, for both our customers and our clients
• Need – high quality content-based recommender engine
• Automatically determine how well a job seeker matches a particular position, and vice versa
• Requirements:
• A semantic matching engine – goes beyond keyword search, to extracting semantic information from job postings and resumes
• Deployed at scale using existing search infrastructure (Solr and ElasticSearch)
• Github Repository for Talk:
• https://github.com/DiceTechJobs/VectorsInSearch
4. Agenda
• Why a Vector Representation?
• Learning Vector Representations
• Vector Based Search in an Inverted Index
5. Understanding Textual Data
Key Challenges:
• Synonymy – Multiple Words with the Same Meaning
• Related – typos, misspellings, acronyms, metonyms
• E.g. QA, Quality Assurance, Tester
• Polysemy – Ambiguity, a word has multiple meanings
• E.g. Bank, Book, Ape
• Hypernyms/Hyponyms – ‘type of’ relationships
• E.g. a dog (hyponym) is a type of animal (hypernym)
• Meronyms/Holonyms – ‘part of’ relationships
• E.g. finger (meronym) is a ‘part of’ a hand (holonym)
• What Words / Phrases are More Important?
• Named Entity Extraction (NER), Controlled Vocabularies
• Collocation (phrase) detection – e.g. “data scientist” vs “scientist who works with data”
• Stop words
• Term weighting schemes - e.g. tf.idf
6. How to Solve these Problems?
• Map documents and queries to a semantic space
• “From Strings to Things”?
• Google KG marketing
• Map words into concepts / semantics
• From strings to concepts
• How to represent?
[Figure labels: Java, Technologies, Big Data Tools, Javascript, Frameworks]
8. Representations
• Distributed Representation
• Dense vector
• Components of the vector represent learned concepts / latent variables
• Similar items have similar representations
• Most existing approaches produce dense vectors
• Local representation
• Non distributed
• Sparse
• E.g. one-hot-vector
• One vector component per unique word
• Similar items have different representations
9. Agenda
• Why a Vector Representation?
• Learning Vector Representations
• Vector Based Search in an Inverted Index
10. The Importance of Context
How do we learn the meaning (semantics) of words?
• Distributional Hypothesis
• Words occurring in similar contexts have similar meanings
• Harris 1954
• “a word is characterized by the company it keeps”
• Firth 1957
• Ignores word order, grammar and syntax
• Latent Relation Hypothesis
• Pairs of words occurring in similar patterns have similar semantic relations
• Turney et al, 2003
• Patterns – X cuts Y, X works with Y, etc
• Word order and grammatical relations matter
• Further reading - Distributional approaches to word meanings
11. Learning Meaning from Context
Bag of Words Approaches – ignore word order
• Latent Models
• Context - Documents
• LSA
• LDA
• Semantic Vector Space Model
• Word Embeddings
• Context – word window
• Word2vec
• GloVe
• Simple linear language models
• History - http://blog.aylien.com/a-review-of-the-recent-history-of-natural-language-processing/
• For document embeddings
• Average or idf weighted average of word vectors
• Sentence / Document Embeddings
• Context – document + word window
• E.g. Doc2vec
• Context – surrounding sentences
• E.g. skip-thought vectors
12. Word2Vec
• By Aelu013 [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0) ], from Wikimedia Commons
13. Limitations of BOW Approaches:
• Shallow representation
• Word embeddings – limited to the word level
• Latent models – document level but doesn’t encode relational information
• Synonymy - learn relatedness, not true synonyms
• E.g. Antonyms have similar vectors
• Polysemy – cannot encode different meanings of same word
• Global model not a local model
14. Beyond BOW - Deep Language Models
• Deep Language Model Embeddings
• Derived from the internal state of a deep LM
• Learns deep representation of sequences of words in context
• Can adjust word vectors based on their current context
• “NLP’s imagenet moment”
• Achieved state of the art results on many NLP tasks
• Consistently out-perform word embedding models
• Example models - ELMo, BERT, ULMFiT, OpenAI Transformer
• Used for encoding sentences, not whole documents
• Hard to scale
15. Deep Language Models
$$p(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} p(w_i \mid w_1, \ldots, w_{i-1})$$
[Figure: a chain of four LSTM cells. Inputs: Begin, w1, w2, w3. Outputs: p(w1), p(w2|w1), p(w3|w1,w2), p(w4|w1,w2,w3).]
16. Embedding Models for Search
• Word Embedding Approaches
• Cluster Word Embeddings
• “Representing Documents and Queries as Sets of Word Embedded Vectors for Information Retrieval”
• Clustered word2vec vectors using k-means
• Documents represented as clusters of word vectors
• Query - map query vectors as similarity to cluster centroids
• Outperformed Jelinek-Mercer LM similarity using VSM
• Average Word Embeddings
• From Chapter 5 of Deep Learning for Search
• Author - Tommaso Teofili
• Query and document represented as average of word2vec vectors
• Computing a weighted average using idf worked best
• Outperformed BM25 using cosine similarity
• BM25 + word2vec – highest NDCG score
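The "average word embeddings" approach above can be sketched in a few lines. This is only an illustration under stated assumptions: the 3-dimensional word vectors are made-up toy values rather than output of a trained word2vec model, and the smoothed idf formula is one common variant, not necessarily the one used in the book or the talk.

```python
import math
from collections import Counter

# Toy 3-dimensional word vectors (made-up values, NOT from a trained model).
WORD_VECTORS = {
    "java": [0.9, 0.1, 0.0],
    "developer": [0.5, 0.5, 0.1],
    "engineer": [0.45, 0.55, 0.1],
    "nurse": [0.0, 0.1, 0.9],
}


def idf_weights(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def embed(tokens, idf):
    """idf-weighted average of the word vectors of the known tokens."""
    dims = len(next(iter(WORD_VECTORS.values())))
    total, weight_sum = [0.0] * dims, 0.0
    for tok in tokens:
        if tok in WORD_VECTORS:
            w = idf.get(tok, 1.0)
            weight_sum += w
            for i, x in enumerate(WORD_VECTORS[tok]):
                total[i] += w * x
    return [x / weight_sum for x in total] if weight_sum else total


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


docs = [["java", "developer"], ["java", "engineer"], ["nurse"]]
idf = idf_weights(docs)
query_vec = embed(["java", "developer"], idf)
scores = [cosine(query_vec, embed(doc, idf)) for doc in docs]
```

With these toy values the exact-match document scores highest, the "java engineer" document comes close behind, and the unrelated document scores lowest. In a combined setup such a score would be added to or blended with BM25 rather than replacing it.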
17. Embedding Models for Search
• Dual Embedding Space Model (DESM)
• Research from Microsoft
• Extends word2vec
• Learns a dual embedding for queries and documents
• Paper - https://arxiv.org/pdf/1602.01137.pdf
• Evaluation
• Compared BM25, LSA and DESM on Bing Query Log Data
• Metrics - NDCG@1, NDCG@3, NDCG@5
• Results
• LSA and DESM both out-performed BM25
• DESM out-performed LSA
• DESM + BM25 out-performed all other approaches
18. Agenda
• Why a Vector Representation?
• Learning Vector Representations
• Vector Based Search in an Inverted Index
19. Vectors in Search
• Dense Embedding Vector:
• Dense
• D dimensional
• D = 50-1000
• Inverted index:
• Sparse
• Pivoted by term
• V = Vocabulary
• |V| =100k+
• Fast because sparse
Example dense vector (D = 5): [+0.12, -0.34, -0.12, +0.27, +0.63]
Example inverted index:
Term | Posting List
Java | 1,5,100,102
.NET | 2,4,600,605,1000
C# | 2,88,105,800
SQL | 130,433,648,899,1200
Html | 1,2,10,30,55,202,252,30,598,
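To make the "pivoted by term" structure concrete, here is a minimal sketch of an inverted index in plain Python. The function names and the tiny document set are hypothetical; a real engine such as Lucene stores compressed posting lists on disk, which this sketch does not attempt to model.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted posting list of the doc ids containing it."""
    postings = defaultdict(set)
    for doc_id, tokens in docs.items():
        for term in tokens:
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_and(index, terms):
    """Doc ids containing every query term (intersection of posting lists)."""
    lists = [set(index.get(t, ())) for t in terms]
    return sorted(set.intersection(*lists)) if lists else []


# Hypothetical documents keyed by doc id.
docs = {1: ["java", "html"], 2: [".net", "c#", "html"], 5: ["java"], 10: ["html"]}
index = build_inverted_index(docs)
matches = search_and(index, ["java", "html"])
```

Only the posting lists for the query terms are touched, which is why sparsity makes lookups fast; a dense vector, by contrast, has a non-zero value in every dimension for every document.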
20. Searching with Word Embeddings
Approaches for using word embeddings:
• Top N terms
• Expand query using top n terms from model
• Boost expansions by cosine similarity
• Can use as a boost query, a re-rank query or a straight term expansion
• Q = “java developer”^10 OR “java j2ee developer”^0.91 OR “java architect”^0.89 OR “lead java developer”^0.87 OR “j2ee developer”^0.86 OR “java engineer”^0.86
• Term Clustering
• Cluster embeddings using a clustering algorithm
• E.g. k-means
• Compute different sized clusters, k=100,1000,10000
• Map clusters to tokens and index
• Different fields for each k
• Larger k fields – bigger boost or rely on idf scoring
• Query expands to top clusters, boosted by similarity
• Q = “java developer”^10 OR cluster_k1000:5894^5 OR cluster_k100:23^2.5 OR cluster_k10:8^1.25
• See https://github.com/DiceTechJobs/ConceptualSearch
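The "top N terms" expansion can be sketched as a small string-building helper. The function name and the neighbour list are assumptions for illustration: in practice the (phrase, similarity) pairs would come from whatever embedding model is in use, and the boost of 10 on the original phrase simply mirrors the example query above.

```python
def expand_query(phrase, neighbours, original_boost=10, top_n=5):
    """Build a boosted OR query from a phrase and its nearest neighbours.

    `neighbours` is a list of (phrase, cosine_similarity) pairs; each
    expansion is boosted by its similarity to the original phrase.
    """
    ranked = sorted(neighbours, key=lambda pair: pair[1], reverse=True)[:top_n]
    clauses = ['"%s"^%s' % (phrase, original_boost)]
    clauses += ['"%s"^%.2f' % (term, sim) for term, sim in ranked]
    return " OR ".join(clauses)


# Hypothetical neighbours, as an embedding model might return them.
neighbours = [
    ("java architect", 0.89),
    ("java j2ee developer", 0.91),
    ("java engineer", 0.86),
]
query = expand_query("java developer", neighbours, top_n=2)
```

The same string could be sent as the main query, as a boost query, or as a re-rank query, depending on how much influence the expansions should have.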
21. Searching Vectors – k-NN Search
• K-NN search
• Find the k closest neighbors to query vector according to similarity metric
• Usually cosine similarity or Euclidean distance
• Definitions
• D = number of components in the vector
• N = number of documents
• Brute Force Search:
• O(ND) = linear
• What if N and/or D is very large?
• Vs. Inverted Index
• Sublinear - makes use of sparsity of terms
• BTree or Distributed Hash Table lookup for terms, iterate posting list, re-rank matches - O(n log n)
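For reference, brute-force k-NN is short to write, which is part of why its O(ND) cost is easy to overlook. The sketch below uses cosine similarity and a hypothetical three-document collection; the function names are illustrative only.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_brute_force(query, vectors, k):
    """Score all N vectors (D components each), so the cost is O(N * D)."""
    scored = ((cosine(query, vec), doc_id) for doc_id, vec in vectors.items())
    return [(doc_id, score) for score, doc_id in heapq.nlargest(k, scored)]


# Hypothetical 2-dimensional document vectors.
vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
top2 = knn_brute_force([1.0, 0.0], vectors, k=2)
```

Every query touches every vector, so doubling either the collection size or the dimensionality doubles the work, unlike a term lookup in an inverted index.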
22. Optimal Vector Representation In An Inverted Index?
What properties would such a representation have?
• For Performance
• Sparse representation necessary to leverage inverted index
• For Relevancy
• Distributed representation
• Each document should be a collection of tokens
• Tokens represent some semantic feature of the space
• Similarity is preserved
• Similar vectors must also be similar under this new representation
• Zipfian distribution of tokens
• “We need a Zipfian Distribution” – John Berryman (Co-author of ‘Relevant Search’)
• Tokenizing Embedding Spaces
23. Zipf’s Law
• The frequency of terms in a corpus follows a power law distribution
• Small number of tokens are very common - filter out irrelevant docs
• A large number of tokens are very rare - discriminate between similar matches
• Distribution of last names - By Thekohser [CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0 )], from Wikimedia Commons
24. Approximate Nearest Neighbor Search
• Faster than full k-NN, with some loss in accuracy
• Approaches can be either:
• Data Dependent
• Learns and adjusts from the data
• Makes indexing new documents hard
• Data Independent
• Some Approaches:
• KD Tree
• LSH
• Heuristic Methods
• K-Means Tree
• Randomized KD Forest
• Paper: https://arxiv.org/abs/1603.09596
• HNSW (Hierarchical Navigable Small World Graphs) – Top on http://ann-benchmarks.com/
• Paper: https://arxiv.org/pdf/1603.09320.pdf
• Vector Thresholding
• Choice of similarity metric is important in choosing an algorithm
25. KD Trees
• Construction
• Constructs a binary search tree by partitioning the search space along each vector dimension in turn
• Partitions are chosen orthogonal to each dimension
• Usually at the median
• Querying
• Described here - https://en.wikipedia.org/wiki/K-d_tree#Complexity
• Limitations
• How to implement efficiently in an inverted index?
• Lucene 6.0 dimensional points
• See also - https://www.elastic.co/blog/lucene-points-6.0
• Not exposed in Solr and Elastic Search AFAIK
• Tree needs rebalancing on each insertion
• Curse of dimensionality
• Requires N >> 2^D - for N points and D dimensions
• Complexity essentially linear for real world vectors (D>= 50)
• Approximate KNN Search
• Possible with KD tree – limit the number of searched nodes
• Typically out-performed by other ANN approaches
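The construction and querying steps for a KD tree can be sketched as follows. This is a minimal in-memory version with exact nearest-neighbour search on a hypothetical 2-dimensional point set; it splits at the median while cycling through the dimensions, and it does not implement the node-limited approximate search or any inverted-index encoding.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_kd_tree(points, depth=0):
    """Recursively split on the median, cycling through the dimensions."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kd_tree(points[:mid], depth + 1),
        "right": build_kd_tree(points[mid + 1:], depth + 1),
    }


def nearest(node, query, best=None):
    """Exact nearest neighbour; returns (squared_distance, point)."""
    if node is None:
        return best
    d = sq_dist(node["point"], query)
    if best is None or d < best[0]:
        best = (d, node["point"])
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Only cross the splitting plane if a closer point could lie beyond it.
    if diff * diff < best[0]:
        best = nearest(far, query, best)
    return best


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kd_tree(points)
best_dist, best_point = nearest(tree, (9, 2))
```

In low dimensions the plane-crossing test prunes most of the tree. As D grows, that test succeeds almost everywhere and the search degrades towards visiting every node, which is the curse of dimensionality noted above.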
26. Locality Sensitive Hashing
• LSH hashes items to discrete buckets
• More buckets – slower but more accurate
• Locality Preserving
• Maximizes the probability that similar items occupy the same buckets
• Random Projection LSH (SimHash)
• LSH variant for cosine similarity
• Generate a random d-dimensional unit vector r, and for each vector v compute $\mathrm{hash}(v) = \mathrm{sign}(v \cdot r)$
• Produces a binary encoding, one bit for each hash function (random vector)
• Probability that two vectors’ hash bits match is 1 − θ/π, where θ is the angle between them, so it increases with cosine similarity
• Output of hash function can be indexed and searched using Hamming Distance
• Intuition - Van Durme and Lall - http://www.cs.jhu.edu/~vandurme/papers/VanDurmeLallACL10-slides.pdf
• Data independent, although data dependent variations exist
• However, for real data, it is typically out-performed by heuristic methods like k-means trees, and randomized KD-trees
• https://lear.inrialpes.fr/pubs/2011/JDS11/jegou_searching_with_quantization.pdf
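Random projection LSH is compact enough to sketch directly. The code below is an illustration under assumptions: the random directions are drawn from a Gaussian and left unnormalised (only the sign of the dot product matters, so normalising to unit length would not change the bits), a dot product of exactly zero is arbitrarily mapped to 1, and the function names and seed are hypothetical.

```python
import random


def random_hyperplanes(n_bits, dims, seed=0):
    """One random direction per hash bit (Gaussian, not normalised)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dims)] for _ in range(n_bits)]


def simhash(vector, hyperplanes):
    """One bit per hyperplane: 1 if v . r >= 0, else 0."""
    return [
        1 if sum(v * r for v, r in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    ]


def hamming(a, b):
    """Number of positions where two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


planes = random_hyperplanes(n_bits=16, dims=4)
v = [0.08, -0.16, 0.27, 0.63]
code = simhash(v, planes)
scaled_code = simhash([2 * x for x in v], planes)   # same direction as v
opposite_code = simhash([-x for x in v], planes)    # opposite direction
```

Scaling a vector leaves its code unchanged, while negating it flips every bit, which matches the intuition that the Hamming distance between codes tracks the angle between the original vectors.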
27. Encoding LSH Hash into the Index
• Hash into Bits
• Store hash fingerprint as a single token
• Store each bit as a token using its position and value
• Use mm parameter to speed up search
• Or store shingles of the binary tokens
• This is not sparse!
Embedding vector: [+0.08, -0.16, -0.12, +0.27, +0.63, -0.01, +0.16, -0.48]
Hash bits: [1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
Single-token encoding: ["10110110100101"]
OR
One token per bit: ["00_1","01_0","02_1","03_1","04_0","05_1","06_1","07_0","08_1","09_0","10_0","11_1","12_0","13_1"]
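The token encodings described on this slide can be sketched as plain string formatting. The position_value format follows the slide's example; the shingle format (start position plus a run of bits) and the shingle size of 4 are assumptions made for illustration, since the slide does not specify them.

```python
def fingerprint_token(bits):
    """The whole hash as one token: only exact fingerprint matches hit."""
    return "".join(str(b) for b in bits)


def positional_tokens(bits):
    """One token per bit, encoded as <position>_<value>."""
    return ["%02d_%d" % (i, b) for i, b in enumerate(bits)]


def shingle_tokens(bits, size=4):
    """Overlapping runs of `size` bits, each tagged with its start position."""
    text = fingerprint_token(bits)
    return ["%02d_%s" % (i, text[i:i + size]) for i in range(len(text) - size + 1)]


bits = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
single = fingerprint_token(bits)
per_bit = positional_tokens(bits)
shingles = shingle_tokens(bits)
```

With per-bit tokens, the number of matching tokens between a query and a document equals the number of agreeing hash bits, so a minimum-should-match threshold acts as a Hamming distance cut-off. Every document still carries one token per bit, which is why the slide notes that this encoding is not sparse.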
29. K-Means Tree
• Hierarchical Clustering Algorithm
• Recursively partitions vector space using k-Means clustering
• Fast - k-means runs in linear time using Lloyd’s heuristic
• Most other clustering algorithms run in quadratic time or worse
• Tree Construction
• For some branching factor b create b clusters
• Create b nodes, store centroid for each node
• For each new cluster, cluster its members into b smaller clusters
• These form child nodes of their parent clusters, forming a tree structure
• Continue until < b members per cluster
• Paper
• "Scalable Nearest Neighbor Algorithms for High Dimensional Data" - Marius Muja and David G. Lowe, 2014 – implemented in the FLANN library
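The tree construction steps above can be sketched as follows. This is a simplified pure-Python illustration, not FLANN's implementation: the dict-based node layout, the dotted node ids, and the fixed number of Lloyd iterations are assumptions made for the example.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans(points, k, rng, iters=10):
    """Lloyd's heuristic: alternate assignment and centroid update."""
    centroids = rng.sample(points, k)
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            groups[nearest].append(p)
        centroids = [mean(g) if g else centroids[i] for i, g in enumerate(groups)]
    return groups

def build_tree(points, b, rng, node_id="0"):
    """Recursively split into b clusters until a node has fewer than b members."""
    node = {"id": node_id, "centroid": mean(points), "size": len(points), "children": []}
    if len(points) < b:
        return node
    groups = [g for g in kmeans(points, b, rng) if g]
    if len(groups) < 2:  # cannot split further, e.g. all points identical
        return node
    for i, g in enumerate(groups):
        node["children"].append(build_tree(g, b, rng, f"{node_id}.{i}"))
    return node

def leaves(node):
    if not node["children"]:
        return [node]
    return [leaf for child in node["children"] for leaf in leaves(child)]

rng = random.Random(0)
points = [[rng.random(), rng.random()] for _ in range(200)]
tree = build_tree(points, b=4, rng=rng)
```

Each node keeps only its centroid, so a search descends by comparing the query against at most b centroids per level.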
31. Lucene Implementation Details
• Pre-train a k-means tree on a representative subset of the index
• Indexing:
• Convert all nodes from tree into unique tokens
• For each vector, find the closest matching leaf node
• Index vector with tokens for that leaf node, and all parent nodes
• Querying
• Find top n matching nodes from tree
• Convert nodes into a query, boosted by similarity to query vector
• 'q': 'clusters:("121"^0.9 "909"^0.88 "523"^0.91)'
• Create a re-rank query to brute force re-rank the top matching documents
• 'rq': '{!rerank reRankQuery=$rqq reRankDocs=1000 reRankWeight=99}'
• 'rqq': '{!payloadEdismax v=$vq}'
• 'vq': 'vector:("0"^-0.0136 "1"^0.05387 "2"^0.070476 "3"^0.14529 …)'
• Uses a special payload query parser (payload_score is insufficient)
• See https://github.com/DiceTechJobs/VectorsInSearch
• *Better approach – use doc values field or Lucene dimensional points
• Trade speed for accuracy depending on depth of tree search, and how many vectors are re-ranked
• Tree nodes follow a Zipfian distribution
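The indexing and querying steps above can be sketched as plain string building. The parameter names (`q`, `rq`, `rqq`, `vq`) and the `payloadEdismax` parser name come from the slide; the helper names, the dotted node-id scheme, and the two-decimal boost formatting are assumptions for illustration only.

```python
def node_tokens(leaf_id):
    """Tokens for a leaf and all of its parents, assuming dotted ids like '0.2.1'."""
    parts = leaf_id.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]

def build_params(node_sims, query_vector, rerank_docs=1000, rerank_weight=99):
    """Build Solr request parameters for a cluster query plus a vector re-rank.

    node_sims: list of (node_id, similarity to the query vector) for the top n nodes.
    """
    clusters = " ".join(f'"{nid}"^{sim:.2f}' for nid, sim in node_sims)
    components = " ".join(f'"{i}"^{w}' for i, w in enumerate(query_vector))
    return {
        "q": f"clusters:({clusters})",
        "rq": (
            "{!rerank reRankQuery=$rqq "
            f"reRankDocs={rerank_docs} reRankWeight={rerank_weight}}}"
        ),
        "rqq": "{!payloadEdismax v=$vq}",
        "vq": f"vector:({components})",
    }

params = build_params([("121", 0.9), ("909", 0.88)], [-0.0136, 0.05387])
```

Raising `rerank_docs` or the number of nodes in `node_sims` trades speed for accuracy, as the last bullets on the slide describe.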
32. Lucene Implementation Details
• Cluster Field – stores cluster tokens
• Turn off all norms, tf and idf weighting; use a custom Hamming similarity class
• Vector Field – stores vectors for re-ranking
• Stores components plus payloads, custom similarity class using payloads
• Similarity classes: https://github.com/DiceTechJobs/SolrPlugins
34. Other Heuristic Methods
• Randomized KD Forest
• Constructs a number of KD trees choosing axis to split on randomly
• Searches all trees in parallel to a fixed number of leaf nodes
• KD Trees are very deep
• How to implement efficiently in an inverted index?
• Hierarchical Navigable Small World Graphs
• Hierarchical graph based model - https://arxiv.org/pdf/1603.09320.pdf
• Consistently out-performs other ANN methods on the ANN benchmarks page - http://ann-benchmarks.com/
35. Distribution of Vector Components
• Distribution of components from our vectors is Gaussian
• Mean is 0
• This means that most vector components are very small
• These components will have minimal impact on cosine score
[Figure: histogram of components taken from 350k vectors; mean = 0.0]
36. Vector Thresholding with Tokenization
[+0.08, -0.16, -0.12, +0.27, +0.63, -0.01, +0.16, -0.48]
[ 0, 0, 0, 0, +0.63, 0, 0, -0.48]
• Drop all but the largest components
["04i+0.6", "07i-0.5"]
• Round weight to lower precision
• Encode position and weight as a single token
• Paper: “Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines”
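A minimal sketch of the thresholding and tokenization steps on this slide, reproducing the example vector. The `NNi±W` token format is taken from the slide; the function names and the choice to keep a fixed count of components (rather than a percentage or an absolute cutoff) are assumptions.

```python
def threshold(vector, keep):
    """Zero out all but the `keep` components with the largest absolute value."""
    order = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    kept = set(order[:keep])
    return [x if i in kept else 0.0 for i, x in enumerate(vector)]

def tokenize(vector, precision=1):
    """Encode each non-zero component as one token: position, 'i', rounded weight."""
    return [
        f"{i:02d}i{x:+.{precision}f}"
        for i, x in enumerate(vector)
        if x != 0.0
    ]

v = [0.08, -0.16, -0.12, 0.27, 0.63, -0.01, 0.16, -0.48]
sparse = threshold(v, keep=2)
tokens = tokenize(sparse)
```

Rounding to lower precision is what makes exact token matches likely: two documents only share a token when the same component rounds to the same value.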
37. Vector Thresholding with Payloads
[+0.08, -0.16, -0.12, +0.27, +0.63, -0.01, +0.16, -0.48]
[ 0, 0, 0, 0, +0.63, 0, 0, -0.48]
• Drop all but the largest components
• I modified the previous idea, using payload score queries
• Indexing: Store remaining (non zero) tokens in index with payloads
• Querying: Uses custom payload query parser + similarity class
• See GitHub repo, and Solr config in K-Means tree section
q=vector:("3"^-0.0136 "14"^0.05387 "56"^-0.070476 "71"^0.14529 …)&defType=payloadEdismax
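A sketch of how the payload variant might prepare its index and query strings. Unlike the tokenized version, the weight is kept at full precision as a payload rather than rounded into the token. The `|` delimiter between token and payload is an assumption (it depends on how the payload token filter is configured in the schema), and the helper names are hypothetical.

```python
def top_components(vector, fraction):
    """Indices of the largest-magnitude components, kept in positional order."""
    keep = max(1, round(len(vector) * fraction))
    order = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    return sorted(order[:keep])

def index_field(vector, fraction=0.4, delimiter="|"):
    """Field value for indexing: 'position<delimiter>weight' per kept component."""
    return " ".join(f"{i}{delimiter}{vector[i]}" for i in top_components(vector, fraction))

def query_string(vector, fraction=0.4):
    """Boosted query over the kept components, in the form shown on the slide."""
    clauses = " ".join(f'"{i}"^{vector[i]}' for i in top_components(vector, fraction))
    return f"vector:({clauses})"

v = [0.08, -0.16, -0.12, 0.27, 0.63, -0.01, 0.16, -0.48]
field = index_field(v)
query = query_string(v)
```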
38. Performance Comparison - Initial Results
• Hardware - MacBook Pro, 2.6 GHz i7 CPU, 16 GB RAM, SSD
• Search Engine:
• Solr 7.5, single shard
• Index: 700k documents
• 1000 sample vector queries, requests were single threaded
• Metric – precision @10 compared to brute force
• Updated results – check https://github.com/DiceTechJobs/VectorsInSearch
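The metric above can be sketched as follows, assuming that "compared to brute force" means the exact top 10 by cosine similarity is treated as the ground truth. The function names and the dict-of-vectors layout are illustrative assumptions, not the benchmark code from the repository.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_top_k(query, docs, k=10):
    """Exact nearest neighbours: score every document and keep the best k ids."""
    ranked = sorted(docs, key=lambda doc_id: cosine(query, docs[doc_id]), reverse=True)
    return ranked[:k]

def precision_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the approximate top k that also appear in the exact top k."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
exact = brute_force_top_k([1.0, 0.1], docs, k=2)
score = precision_at_k(["a", "b"], exact, k=2)
```

Averaging this score over the sample queries gives a single number per algorithm and parameter setting.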
39. Performance Comparison - Initial Results
• Each algorithm was run over a range of different parameter values, to show the recall–speed trade-off
40. Performance Comparison - Initial Results
Algorithm | Precision@10 | Queries Per Sec (Mean Qry Time)
LSH (Hamming Similarity) | 0.69 | 1.3 qps (757 ms)
Kmeans Tree (trained on index) | 0.88 | 9.2 qps (170 ms)
Kmeans Tree (trained on sample) | 0.85 | 9.5 qps (105 ms)
Vector Thresholding with Tokenization (top 40% of components) | 0.85 | 3.5 qps (312 ms)
Vector Thresholding with Payloads (top 40% of components) | 0.94 | 1.8 qps (547 ms)
41. The Ultimate Solution - Sparse Coding?
• Also called ‘Dictionary Learning’
• Learns a sparse ‘overcomplete’ representation of a vector
• Example Algorithms:
• Sparse Auto-Encoder
• K-SVD
• Encoding needs to preserve the Metric Space
• Similar items need to remain similar after encoding
Other Relevant Approaches
• Word2bits - learns binary quantized word vectors
• https://github.com/agnusmaximus/Word2Bits
42. Block Max WAND
• https://www.elastic.co/blog/faster-retrieval-of-top-hits-in-elasticsearch-with-block-max-wand
• ‘Weak AND’ algorithm to be integrated into Lucene 8.0 and ES 7.0
• Speeds up large OR queries by pruning clauses that won’t occur in top N matches
• Speed-up can be 40% to 13x
• Can help address performance of these larger OR queries
Metrics – recall often used for measuring synonymy and related problems, while precision and traditional IR metrics are better at measuring the efficacy at disambiguating a user’s intent
Context – bag of words
Global - learn semantic representations of terms
Address synonymy (word level)
Learn collocations (phrases)
Local – can be used to disambiguate ambiguous terms
Address polysemy
By Aelu013 [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], from Wikimedia Commons.
For LSA illustration, and an excellent explanation, see here - http://iv.slis.indiana.edu/sw/lsa.html
Word vectors - don’t learn true synonyms – don’t truly solve the synonymy problem, and don’t handle polysemy, as the same vector is used for a word regardless of its context.
Deep LMs capture the meaning of a sequence of words in context – not just individual words in isolation.
How do we represent dense vectors in a form that works inside an inverted index?
Dense
Note – important to do collocation (phrase detection) before building an embedding model.
Embeddings work better when phrases are passed as single tokens.
Excellent explanation of SimHash - Van Durme and Lall presentation, slide 15