Full-text search allows searching the full text of documents for exact matches or substrings of search terms. It examines all words in every stored document to match the search criteria. A common full-text search technique uses an inverted index to map terms to their locations in documents, making query execution fast: roughly proportional to the length of the query rather than the length of the text. Updating an inverted index is challenging because it is optimized for reads; changes are handled by marking deletions and writing new segments rather than rewriting in place.
3. Agenda
❖ What is Full-text Search
❖ Searching for exact substrings.
❖ Common search concerns.
❖ Solr Search platform.
4. Full-text Search
“In a full-text search, a search engine examines all of the
words in every stored document as it tries to match search
criteria” - wikipedia
5. Full-text Search
It is a bit like SQL LIKE:
… where LOWER(column_name) LIKE LOWER('%query_string%');
What is the big O for the above SQL query?
6. Big O for finding a word in text?
Length of text = n
Length of query = m
Start comparing the relevant characters in both strings,
and repeat from the next index if a character match fails.
Text:  Q Q W E R Q W E
Query: Q W E R
7. Find String Algorithm
Text:  Q Q W E R Q W E
Query: Q W E R
For our example:-
starting with index 0, the comparison fails at index 1
8. Find String Algorithm
Text:  Q Q W E R Q W E
Query:   Q W E R   (restarted at index 1)
We increment the index by 1; and start the comparison again,
eventually finding a match.
Comparing all characters of text with all characters of query
will take O(n*m)
9. Find String Algorithm: Example 2
Text:  Q W E Q W E R T
Index: 0 1 2 3 4 5 6 7
Query: Q W E R
Where should we start after this mismatch?
10. Find String Algorithm: Example 2
Text:  Q W E Q W E R T
Index: 0 1 2 3 4 5 6 7
Query: Q W E R
We have enough knowledge in the query word to decide whether we should
start the next match at '1', '2', or '3'!
Query preprocessing time: O(m)
Execution time: O(n)
Reference: Knuth-Morris-Pratt algorithm, O(n+m)
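For illustration, a minimal Java sketch of this idea (not from the original slides): preprocess the query into a failure table in O(m), then scan the text once in O(n).

class KmpSearch {
    // fail[i] = length of the longest proper prefix of pattern[0..i]
    // that is also a suffix of it; built in O(m).
    static int[] failureTable(String pattern) {
        int[] fail = new int[pattern.length()];
        for (int i = 1, k = 0; i < pattern.length(); i++) {
            while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) k = fail[k - 1];
            if (pattern.charAt(i) == pattern.charAt(k)) k++;
            fail[i] = k;
        }
        return fail;
    }

    // Scan the text once, never re-reading a text character: O(n).
    static int indexOf(String text, String pattern) {
        int[] fail = failureTable(pattern);
        for (int i = 0, k = 0; i < text.length(); i++) {
            while (k > 0 && text.charAt(i) != pattern.charAt(k)) k = fail[k - 1];
            if (text.charAt(i) == pattern.charAt(k)) k++;
            if (k == pattern.length()) return i - k + 1; // start of the match
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOf("QWEQWERT", "QWER")); // prints 3
    }
}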
11. Preprocessing the text instead of query
❖ What if we pre-process the text instead of query?
❖ Pre-processing the text makes the query execution
faster; decreasing the query execution time to O(m)
❖ E.g. Suffix trees
12. Pre-process the text
Original Text: Q W E Q W E R T
Index:         0 1 2 3 4 5 6 7
Find all the suffixes of the text:
Q W E Q W E R T
W E Q W E R T
E Q W E R T
Q W E R T
W E R T
E R T
R T
T
13. Pre-process the text
Original Text: Q W E Q W E R T
Index:         0 1 2 3 4 5 6 7
Take the suffixes listed on the previous slide and
make a (compressed) trie of them.
A trie is a prefix tree, where common prefixes are
extracted into the parent.
15. Search in O(m): m = length of query
[Suffix-trie diagram: the root branches on Q, W, E, R and T; common prefixes such as "QWE" become shared parent nodes, with leaf paths like "QWERT" and "RT", and "$" marking the end of a suffix.]
Each node also tracks the offset in the original string.
Search for 'QWER' by walking the query characters down from the root.
16. Suffix trie
❖ Takes O(n) space and O(n·lg n) time for construction.
❖ Efficient for exact match. O(m) execution time per
query!
❖ where m = length of query
❖ n = length of text
❖ Good for text that doesn't change often (e.g.,
search within a book).
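A toy sketch of pre-processing the text, using a sorted list of suffixes (a suffix array) instead of a trie; lookup here is O(m·lg n) rather than the trie's O(m), but the pre-processing idea is the same:

import java.util.Arrays;

class SuffixSearch {
    private final String text;
    private final Integer[] suffixes; // suffix start offsets, sorted lexicographically

    SuffixSearch(String text) {
        this.text = text;
        suffixes = new Integer[text.length()];
        for (int i = 0; i < suffixes.length; i++) suffixes[i] = i;
        // Naive construction via substring comparisons; fine for a demo.
        Arrays.sort(suffixes, (a, b) -> text.substring(a).compareTo(text.substring(b)));
    }

    // Every substring of the text is a prefix of some suffix.
    boolean contains(String query) {
        int lo = 0, hi = suffixes.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            String suffix = text.substring(suffixes[mid]);
            if (suffix.startsWith(query)) return true;
            if (suffix.compareTo(query) < 0) lo = mid + 1; else hi = mid - 1;
        }
        return false;
    }

    public static void main(String[] args) {
        SuffixSearch s = new SuffixSearch("QWEQWERT");
        System.out.println(s.contains("QWER")); // true
        System.out.println(s.contains("RTQ"));  // false
    }
}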
17. Precision and Recall
❖ Precision:
❖ Number of relevant instances that have been
retrieved / total retrieved instances
❖ Recall:
❖ Number of relevant instances that have been
retrieved / total relevant instances
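❖ Example: if a query returns 8 documents, of which 5 are relevant, and the collection holds 10 relevant documents in total, then precision = 5/8 = 0.625 and recall = 5/10 = 0.5.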
19. Search concerns: 1
❖ We commonly search on words (delimiters and spaces are
ignored)
❖ Spell correction
❖ Query ‘runnind’ should show results for running
❖ Synonyms
❖ Should query:’killer’ show results for ‘murderer’,
‘assassin’ ?
❖ Removing elision: it's -> it is, don't -> do not, etc.
20. Search concerns: 2
❖ Match words with same root even if the form is different
❖ Searching for ‘eating’ should also show results for
‘eaten’, ‘eat’ and ‘ate’.
❖ Stemming generalizes some language rules (e.g., trimming
'ing' or 'ed' from the end of words to get the root). It fails
in special cases, e.g., 'eaten'. However, it is faster.
❖ Lemmatization takes the language dictionary into
account, so the root of 'eaten' can be correctly found as 'eat'.
21. Search concerns: 3
❖ Should we skip indexing of common words (aka stop
words)? e.g., is, are, I, the, not?
❖ Advantage: Smaller index size. Mostly ignoring these
words will not affect search results, since these are
present in almost all the documents - rendering each
document equally valuable (or useless).
❖ Disadvantage: we won’t be able to search queries like:
‘to be or not to be’
22. Search concerns: 4
❖ Ignore case
❖ Even though the term doesn't match exactly, most
users would like to see results with résumé, resumé,
resume, or RESUME when searching for resumé.
23. Search Design: 1
❖ While indexing, break the text into terms.
❖ Where terms are generally defined as substrings
separated by whitespace (space, tabs, unicode space,…),
or delimiters.
❖ Create term to locations mapping.
❖ aka inverted index or postings list.
❖ Logically, you can think of:
❖ HashMap<Term, List<Occurrence>>
24. Search Design: 2
❖ For any query, split it into terms.
❖ Find each term in the index created above, giving the set of
documents that match each term.
❖ An intersection of all these sets gives the documents which
have all the terms.
❖ This is a very basic way of query execution. The real world is more
complex. A minimal sketch follows below.
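For illustration, a minimal in-memory version of this design (a sketch only; a real engine adds scoring, positions, and on-disk postings):

import java.util.*;

class MiniInvertedIndex {
    // Term -> sorted set of ids of the documents containing the term.
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+"))
            if (!term.isEmpty())
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
    }

    // Conjunctive (AND) query: intersect the postings of every term.
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : query.toLowerCase().split("\\W+")) {
            Set<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) result = new TreeSet<>(docs);
            else result.retainAll(docs);
        }
        return result == null ? Collections.emptySet() : result;
    }

    public static void main(String[] args) {
        MiniInvertedIndex index = new MiniInvertedIndex();
        index.add(1, "It is your job not mine");
        index.add(2, "It is a gold mine");
        index.add(3, "mine the bitcoin");
        System.out.println(index.search("it mine")); // [1, 2]
    }
}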
25. Inverted Index: Sample texts
Doc text                   Tokens
1: It's your job, not mine → It is your job not mine
2: It is a gold mine       → It is a gold mine
3: mine the bitcoin        → mine the bitcoin
26. Inverted Index and term dictionary
Words Doc Frequency Document Id: Number of occurrences
a 1 1 : 1
It 2 1: 1 2: 1
is 2 1: 1 2: 1
your 1 1: 1
not 1 1: 1
job 1 1: 1
mine 3 1: 1 2: 1 3:1
bitcoin 1 3:1
the 1 3:1
27. Inverted Index and term dictionary
Words Doc Frequency Document: Number of occurrences
a 1 1 : 1
It 2 1: 1 2: 1
is 2 1: 1 2: 1
your 1 1: 1
not 1 1: 1
job 1 1: 1
mine 3 1: 1 2: 1 3:1
bitcoin 1 3:1
the 1 3:1
This is the term dictionary.
It is usually small and kept in memory.
28. Inverted Index and term dictionary
Words Doc Frequency Document: Number of occurrences
a 1 1 : 1
It 2 1: 1 2: 1
is 2 1: 1 2: 1
your 1 1: 1
not 1 1: 1
job 1 1: 1
mine 3 1: 1 2: 1 3:1
bitcoin 1 3:1
the 1 3:1
These are postings lists.
They are usually kept in files; the relevant lists are read
as needed for each query execution.
Postings can be configured to track the count of each term
per document, and can also keep the term offsets within each
document occurrence instead of just the count.
29. Inverted Index and term dictionary
Words Doc Frequency Document: Number of occurrences
a 1 1 : 1
It 2 1: 1 2: 1
is 2 1: 1 2: 1
your 1 1: 1
not 1 1: 1
job 1 1: 1
mine 3 1: 1 2: 1 3:1
bitcoin 1 3:1
the 1 3:1
Doc Freq is used to calculate the score for each
matching document.
If a term is in fewer documents across the whole index,
then a match on this term should give a higher
total score for the document.
So a query 'bitcoin it' should rank document 3
higher than documents 1 and 2.
30. Calculating score for a document
❖ TF-IDF
❖ Term Frequency (TF) = proportional to the number of
occurrences of the term in the document.
❖ Inverse Document Frequency (IDF) = inverse of the
number of documents in your collection which
contain the term.
❖ Different implementations may have slightly different
formulas.
31. Calculating score for a document
Words  DF  Doc:TF
a 1 1: 1
It 2 1: 1 2: 1
is 2 1: 1 2: 1
your 1 1: 1
not 1 1: 1
job 1 1: 1
mine 3 1: 1 2: 1 3:1
bitcoin 1 3:1
the 1 3:1
Query = bitcoin it
Score Formula = TF * Inverse-DF
Term = it
DF = 2
Score (doc1, ‘it’) = 1/2
Score (doc2, ‘it’) = 1 /2
Score(doc3, ‘it’) = 0
Term = bitcoin
DF = 1
Score (doc1, ‘bitcoin’) = 0
Score (doc2, ‘bitcoin’) = 0
Score (doc3, ‘bitcoin’) = 1
Total score(doc) = Score(doc, term1) +
Score(doc, term2) …
Score(doc1) = 1/2
Score(doc2) = 1/2
Score(doc3) = 1
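The same computation as code, a sketch of the simplified formula used on this slide (score = TF × 1/DF; this is not Lucene's exact scoring function):

import java.util.*;

class TfIdfDemo {
    public static void main(String[] args) {
        // term -> (docId -> term frequency), taken from the table above
        Map<String, Map<Integer, Integer>> index = new HashMap<>();
        index.put("it", Map.of(1, 1, 2, 1));
        index.put("bitcoin", Map.of(3, 1));

        String[] query = {"bitcoin", "it"};
        for (int doc = 1; doc <= 3; doc++) {
            double score = 0;
            for (String term : query) {
                Map<Integer, Integer> posting = index.getOrDefault(term, Map.of());
                int tf = posting.getOrDefault(doc, 0);
                int df = posting.size(); // number of documents containing the term
                if (tf > 0) score += tf * (1.0 / df);
            }
            System.out.println("doc" + doc + " = " + score); // 0.5, 0.5, 1.0
        }
    }
}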
32. BM-25
❖ A relevance-score calculation with a stronger theoretical
foundation.
❖ Many search platforms are changing their default to BM25.
❖ Also, considers average length of document in the
whole collection.
❖ http://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/
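For reference, the standard BM25 formula from the literature (not spelled out on the original slide):

score(D, Q) = Σ over terms t in Q of
  IDF(t) · tf(t, D) · (k1 + 1) / ( tf(t, D) + k1 · (1 − b + b · |D| / avgdl) )

where |D| is the length of document D, avgdl is the average document length in the collection, and the typical defaults are k1 ≈ 1.2 and b ≈ 0.75.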
33. Updating the Inverted Index
❖ Inverted Index is optimized for disk-storage and reads;
but not for updates.
❖ Let's say the term 'mine' in Doc3 is replaced with
'whine'.
❖ If we delete ‘Doc3’ from occurrence list of ‘mine’, it will
require shifting the data structure on file and may
trigger a full rewrite of the file.
34. Inverted Index and Segments
❖ We will mark doc3 as
deleted in its original
segment.
❖ Append the document to
segment which is
currently open for editing.
❖ Collection of all the
segments (and their delete
files) is your complete
index.
Words DF TF
a 1 1: 1
It 2 1: 1 2: 1
is 2 1: 1 2: 1
your 1 1: 1
not 1 1: 1
job 1 1: 1
mine 3 1: 1 2: 1 3:1
bitcoin 1 3:1
the 1 3:1
Solr records the deleted document in a
"deleted" file for the same segment.
New segment (currently open for editing):
Words DF TF
whine 1 3:1
the 1 3:1
bitcoin 1 3:1
Deleted documents (original segment): Doc3
35. Inverted Index and Segments: 2
❖ Important segue: Solr keeps writing to an inverted index until a
"commit" is called, or it fills the allocated memory. At that point,
it is flushed to the drive (as an immutable Lucene segment). In case of
commit, it is also added to a segments file which contains the list of
all committed segments.
❖ All the transactions for which Solr has acknowledged success are
written to the tlog (and flushed to the drive), so even if a commit is
not executed on the index, the indexed data is not lost.
❖ Solr issues a commit before shutting down (if graceful), and it
replays the tlog (between the last commit and now) when it restarts.
36. Lucene
❖ Lucene Core
❖ A set of Java JARs for indexing and search
❖ spellchecking,
❖ hit highlighting
❖ advanced analysis/tokenization capabilities.
37. Solr
❖ search platform built on Apache Lucene™.
❖ web admin interface
❖ XML/HTTP and JSON/Python/Ruby APIs
❖ hit highlighting
❖ faceted search
❖ caching
❖ replication
38. Zookeeper
❖ Leader election for Solr replicas
❖ Zookeeper is needed when the index has multiple replicas.
❖ Optionally also used for storing 'core' config.
❖ Ref: https://lucene.apache.org/solr/guide/6_6/setting-up-an-external-zookeeper-ensemble.html
39. Nutch
❖ Website crawler
❖ Can run on a single machine or a Hadoop cluster.
❖ Specify a list of URLs to start crawling
❖ Parse the text (customizable options)
❖ Send the parsed documents to Solr
40. Start your solr server
❖ JDK 8 (get the latest; Solr hits a bug in some earlier JDK 8 releases)
❖ Ant
❖ Ivy
❖ If you want to follow my text analysis examples:
❖ https://github.com/niqbal/solr_cores
41. Documents to bag of words!
For indexing, different document formats must be converted to
streams of text. (Ref: Apache Tika / Solr Cell)
[Diagram: heterogeneous inputs (a document abstract, the movie "Tom and Jerry", a chart "Population 7 Billion", the text of a book) are all reduced to streams of text.]
42. Solr Glossary: 1
❖ Solr is a JVM process
❖ Core
❖ This is like a table in RDBMS
❖ schema.xml
❖ Schema of the “table”.
❖ Document
❖ One row in the “table”
43. Solr Glossary: 2
❖ Each solr process has
❖ zoo.cfg
❖ solr.xml
❖ One solr process can host multiple cores (“table”), each
with its own config files.
44. Solr Core configs
❖ A Solr core can be pre-created, or created using an API while the
server is already running.
❖ The following configs are specific to each core:
❖ solrconfig.xml
❖ Query parser config, Indexing memory and threads.
❖ core.properties
❖ This file identifies its containing folder as a solr core; so that it
gets discovered by the solr process.
❖ Schema: schema definition of this core
45. schema.xml: Field classes
❖ Solr has Field classes, which can be used to define Field
types. e.g.,
❖ binary : base64
❖ TextField: general text, usually multiple words/tokens
❖ DoublePointField: double
❖ DatePointField: Date type data
❖ …
46. schema.xml: Field types
❖ A field type contains the following specification:
❖ Name of field-type
❖ Parent field class (e.g., DoublePointField, TextField, etc.)
❖ DocValues: For efficient sorting, and faceting on this
field. Available for a subset of all field classes.
❖ Index-time analysis config
❖ Query-time analysis config
47. schema.xml: Field definition
❖ A Field in Solr schema is like a column in Relational databases.
❖ Field definition consists of:
❖ Field name
❖ Field type (also defined in your schema)
❖ Stored ? : Needed for hit highlighting, field retrieval, etc.
❖ When you retrieve field value from Solr, Stored values can
be returned.
❖ Indexed ?: Needed for searching in this text
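For illustration, two field definitions as they might appear in schema.xml (the field names here are made up; text_general and pdouble are stock Solr field types):

<field name="title" type="text_general" indexed="true" stored="true"/>
<field name="impactfactor" type="pdouble" indexed="true" stored="true" docValues="true"/>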
48. schema.xml: Dynamic Fields
❖ Ability to create new fields in your “table” (core in Solr
terms) on the fly.
❖ May not be needed for most use cases.
49. schema.xml: copyField
❖ copyField works like a trigger to populate another field
based on a value in one field.
❖ Possible use: if you want to do multiple types of
analyses on same value but don’t want to expose the
detail to indexing client.
❖ <copyField source="title" dest="tokenized_title"/>
50. solrconfig.xml: Index config
❖ ramBufferSizeMB and maxBufferedDocs
❖ When to flush a segment to the disk
❖ MergePolicy and MergeScheduler
❖ when to trigger merging of segments
51. solrconfig.xml: Query config
<requestHandler name="/query" class="solr.SearchHandler">
<lst name="defaults">
<str name="wt">json</str>
<str name="defType">edismax</str>
<str name="qf">text^0.5 features^1.0 name^1.2</str>
<str name="hl">on</str>
<str name="hl.fl">features name</str>
</lst>
</requestHandler>
❖ Specify a request handler for the core
❖ Configure Query parser here: Lucene, Dismax, eDismax,
etc.
hl fields should have stored=true
52. solrconfig.xml: hard & soft commits
❖ Hard commit: flushes the in-memory indexes to new
Lucene segments, and optionally makes them available
for new incoming queries.
❖ Soft commit: makes in-memory index available for new
incoming queries, without flushing to disk.
❖ You can tune both intervals independently.
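For illustration, both intervals can be configured in solrconfig.xml roughly like this (the interval values are arbitrary examples):

<autoCommit>
  <maxTime>60000</maxTime>           <!-- hard commit at most every 60 s -->
  <openSearcher>false</openSearcher> <!-- flush segments without opening a new searcher -->
</autoCommit>
<autoSoftCommit>
  <maxTime>2000</maxTime>            <!-- new documents become searchable within ~2 s -->
</autoSoftCommit>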
53. Solr Glossary: 3
❖ Shard: When data grows more than the capacity of one
core, then the core can be divided into “shards”. Each
shard will have a piece of the table.
❖ Collection: a collection of all shards of one type of data.
❖ Replica: Number of copies of identical data.
❖ Leader: For all the replicas of a shard, there is only one
leader. i.e., there is one leader for each shard/partition.
Zookeeper is used to elect the shard leader from replicas.
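For example, a collection with two shards and two replicas per shard can be created through the Collections API (host and collection name are illustrative):

curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=ferozepur&numShards=2&replicationFactor=2'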
54. Solr Cloud
❖ All replicas of each shard participate in indexing and
querying.
❖ Queries are equally distributed across all copies/
replicas of the shards.
❖ Indexing/querying continues as long as there is one
healthy replica per shard.
55. Indexing flow
❖ Index request can be sent to any node
❖ Node gets shard distribution from zookeeper and finds the leader
core of the target shard.
❖ Leader receives the indexing request and indexes locally.
❖ Even if the indexed document is not flushed to disk, the request
is still logged in tlog and persisted.
❖ Leader sends indexing request to healthy replicas.
❖ Sends SUCCESS code after sending indexing request to replicas.
(doesn’t wait for replica’s response. tlog is there for consistency)
56. Indexing flow (on each core)
❖ A core receives the indexing request, and analyzes each field based on its analysis
configured in schema.xml
❖ If indexed=true, then analyzed text is saved in index (inverted list!).
❖ If stored=true, then exact text is stored.
❖ Both indexed and stored can be used independent of each other (all 4
combinations are valid)
❖ Even if the indexed document is not flushed to disk, the request is still logged in
tlog and persisted.
❖ Leader sends indexing request to healthy replicas.
❖ Sends SUCCESS code after sending indexing request to replicas. (doesn’t wait
for replica’s response. tlog is there for consistency)
57. Query flow (SolrCloud)
❖ Query can be sent to any node.
❖ This first node becomes the controller or aggregator
node and uses information in zk to determine replicas.
❖ The query is distributed to one replica of each shard.
❖ The top n results are determined from the merged
results of all shards.
❖ A second query goes to the selected subset of shards to get more
fields of the top n documents.
58. Query flow (on each core)
❖ Query reaches a query handler. (e.g., solr.SearchHandler)
❖ Parsed using one of the parsers (Lucene, DisMax, eDisMax)
❖ Options like split-on-whitespace (sow=true/false) are important
❖ Each term in the query is then analyzed by each Field definition specified in the
call parameters. Then the term is searched in that field in all* documents of the
node.
❖ You can also add a filter so that not all documents have to be searched.
❖ Score is generated on each field of a document and then cumulative score for the
whole document.
❖ Documents are ranked and returned to client, which can be a user, or aggregator
node (in case of SolrCloud)
60. Solr standalone: 1
❖ bin/solr start -s /path/to/solr_core_root
❖ many optional parameters to specify port, solrCloud (or
no solrCloud), number of replicas, etc.
❖ default port: 8983
❖ bin/solr stop
❖ stops the solr server
❖ Production install scripts are different. ‘bin/solr’ is good
for sample code and quick-start.
61. Solr standalone: 2
❖ Index document in itu
core:
❖ http://localhost:8983/solr/#/itu/documents
❖ sample json:
❖ {"id":"3", "author":"Javed", "impactfactor":5}
❖ You can also use other api
clients.
Ensure that the double
quotes are in ASCII not
unicode.
64. Some text analysis!
❖ {"id":"4", "author_shingles":"pitb is in technology park”}
❖ /select?defType=edismax&q=technology%20park&qf=author_shingles
❖ Supporting phrase query - somewhat efficiently.
❖ {"id":"5", "author_ngram”:"kilogram is bigger”}
❖ /select?defType=edismax&q=gram&qf=author_ngram
❖ similar to wildcard functionality (without the overhead of wildcard)
❖ {"id":"6", "author_edge_ngram":"milligram is smaller”}
❖ select?defType=edismax&q=milli&qf=author_edge_ngram
❖ Similar to wildcard at end only
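A field like author_shingles could be backed by a field type along these lines (a sketch; the actual analyzer chains are whatever the solr_cores repo linked earlier defines):

<fieldType name="text_shingles" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="2"/>
  </analyzer>
</fieldType>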
65. ❖ ifconfig (ipconfig in Windows) to find your local ip
❖ SolrCloud requires an entry in hosts file for the hostname (see last entry above)
❖ /etc/hosts on Mac
Solr cloud config demo: 1
66. Solr cloud config demo: 2
❖ ./bin/solr start -e cloud
❖ a utility to create multiple Solr processes on localhost
❖ Specify
❖ number of Solr processes
❖ number of shards
❖ number of replicas for each shard
❖ if you don’t specify ports as parameter, then Solr runs on
default ports (printed in console output)
67. ❖ This script relies on embedded zookeeper. In
production, the recommended practice is to run
zookeeper on separate hosts.
❖ To stop all processes at once: bin/solr stop -all
❖ Find more: https://lucene.apache.org/solr/guide/6_6/
shards-and-indexing-data-in-solrcloud.html
❖ Some Solr documentation pages are obsolete. Make
sure you read the latest documents.
Solr cloud config demo: 3
68. Solr cloud config demo: 4
http://localhost:8983/solr/#/~cloud OR http://localhost:7574/solr/#/~cloud
SolrCloud intelligently distributes shards
among available solr processes. The replicas of a shard
*don't* reside on the same process (see port).
69. ❖ Use any host to insert
following or similar
documents.
❖ {"id":"1", “_text_":"asad"}
❖ {"id":"2", “_text_":"iqbal"}
❖ These will distributed
among the two shards.
Solr cloud config demo: 5
70. ❖ Query from any node:
❖ [Host]/solr/ferozepur/select?defType=edismax&q=asad%20OR%20iqbal
❖ where host can be any of:
❖ http://192.168.8.101:7574
❖ http://192.168.8.101:8983
❖ Note the checkbox on
‘edismax’ !
Solr cloud config demo: 6
72. Measuring search quality
❖ Click-Through Rate (CTR): What fraction of searches
resulted in navigation to a search result.
❖ Mean Reciprocal Rank (MRR): If the user clicked, was
the clicked document high in the results?
❖ Normalized Discounted Cumulative Gain (nDCG):
❖ Considers ranking of top few results instead of only
the clicked document (which is MRR).
❖ Requires more involved user studies.
73. More…
❖ Streaming
❖ Deep paging (use cursorMark)
❖ Graph search
❖ Solr Joins (employ wisely!)
❖ 1:N relationship in Solr (expensive for updates)
❖ Query Completion / Autosuggest
❖ Phrase search/ Proximity search
74. More… More…
❖ Language Identification (see: LangDetect or Tika)
❖ Custom document routing
❖ Parallel SQL
❖ Tuning caches (documentCache, queryResultsCache,
filterCache, …)
❖ Tuning soft commit and hard commit; and near-realtime
search.
75. ❖ This is the last slide.
❖ Devil is in the detail and it surfaces at scale.