Synchronizing a search application across multiple clusters is a complex challenge, and the solution evolves with our tools (Solr and Fusion AppStudio). Paul Anderson discusses how Dynatrace's cluster synchronization strategy changed over the last two years to ensure that customers worldwide have a consistent search experience. The talk focuses on two Solr features, CDCR and Streaming Expressions, explaining what each does well, where it falls down, and where it needs to improve. Paul also covers how to modify your index pipelines and signal aggregations to support cluster synchronization.
Speakers: Paul Anderson & Daniel Drahnak, Dynatrace
SolrCloud in Public Cloud: Scaling Compute Independently from Storage - Ilan ... (Lucidworks)
Running SolrCloud in Public Cloud is the future. This presentation and the code that will be contributed back to the community will allow such clusters to be highly efficient, scalable and elastic. Attendees will understand the challenges and potential of sharing index data between servers.
Speakers: Ilan Ginzburg & Yonik Seeley, Salesforce
Search at Twitter: Presented by Michael Busch, Twitter (Lucidworks)
Twitter processes over 500 million tweets per day and more than 2 billion search queries per day. The company uses a search architecture based on Lucene with custom extensions. This includes an in-memory real-time index optimized for concurrency without locks, and a schema-based document factory. Future work includes support for parallel index segments and additional Lucene features.
Twitter provides a platform for user-generated content in the form of short messages called tweets. It handles a massive volume of data, with over 230 million tweets and 2 billion search queries per day. Twitter has developed a customized search and indexing system to handle this scale. It uses a modular system that is scalable, cost-effective, and allows for incremental development. The system includes components for crawling Twitter data, preprocessing and aggregating tweets, building an inverted index, and distributing the index across server machines for low-latency search.
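The core of the pipeline described above is the inverted index. As a rough illustration (pure Python with made-up documents, not Twitter's actual code), an inverted index maps each term to the ids of the documents containing it, so a conjunctive query reduces to intersecting posting lists:

```python
# Minimal inverted-index sketch: index a few short documents, then answer
# an AND-semantics query by intersecting the per-term posting sets.
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict of doc_id -> text. Returns term -> sorted list of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, query):
    """Return ids of documents containing every query term."""
    postings = [set(index.get(t, ())) for t in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

docs = {1: "new tweet about search", 2: "search at twitter", 3: "another tweet"}
index = build_inverted_index(docs)
```

Real-time systems like Twitter's additionally append to these posting lists concurrently and serve queries from partitioned, replicated copies of the index.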
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs (Lucidworks)
This document discusses Solr distributed indexing at WalmartLabs. It describes customizing an existing MapReduce indexing tool to index large XML files in a distributed manner across multiple servers. Key points covered include using two custom utilities for index generation and merging, experiments showing indexing is CPU-bound while merging is I/O-bound, and lessons learned around data locality and using n-way merging of shards for best performance. Solutions discussed include dedicating an indexing Hadoop cluster to improve I/O speeds for merging indexes.
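The n-way merge idea mentioned in the lessons learned can be sketched abstractly (illustrative only; the talk's tooling merges Lucene index segments, not Python lists): merging all sorted per-shard runs in a single heap-driven pass avoids the repeated I/O of pairwise merging.

```python
# Merge any number of sorted per-shard term runs in one pass using a heap,
# rather than merging shards two at a time.
import heapq

def n_way_merge(sorted_runs):
    """Merge sorted iterables into one sorted list in a single pass."""
    return list(heapq.merge(*sorted_runs))

shards = [["apple", "cherry"], ["banana", "cherry"], ["apple", "durian"]]
```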
This talk is aimed at understanding how ranking of documents works in Solr and ways to improve the relevancy of your search application.
The first part of the talk will cover how a user query gets parsed in Solr and the default scoring which comes with it.
The second part of the talk covers how to customize scoring to work better with your dataset by experimenting with the available similarity implementations and writing your own similarity implementation.
Finally, I will talk about adding different relevancy signals into your ranking algorithm and customizing results for your top N queries.
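For context on the "default scoring" the first part of the talk covers, a simplified sketch of Lucene's classic TF-IDF similarity looks like this (norms, coord, and queryNorm factors are omitted here for brevity):

```python
# Simplified classic (TF-IDF) scoring in the style of Lucene's practical
# scoring function: tf = sqrt(freq), idf = 1 + ln(N / (df + 1)), and idf
# is squared in the per-term score.
import math

def tf(freq):
    return math.sqrt(freq)

def idf(num_docs, doc_freq):
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def term_score(freq, num_docs, doc_freq):
    return tf(freq) * idf(num_docs, doc_freq) ** 2
```

Customizing scoring, as the second part of the talk discusses, amounts to replacing functions like these with a similarity better suited to your dataset.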
This document discusses Flipkart's use of Solr indexes to organize product data and search. It describes how Flipkart moved from indexing all data in a single CMS to a more distributed approach using services and streams to index static vs. dynamic data separately. It also discusses challenges with partial document updates in Lucene and how Flipkart leveraged updatable docvalues and value sources to integrate real-time signals for ranking and filtering.
The document provides an overview of the Spark framework for lightning fast cluster computing. It discusses how Spark addresses limitations of MapReduce-based systems like Hadoop by enabling interactive queries and iterative jobs through caching data in-memory across clusters. Spark allows loading datasets into memory and querying them repeatedly for interactive analysis. The document covers Spark's architecture, use of resilient distributed datasets (RDDs), and how it provides a unified programming model for batch, streaming, and interactive workloads.
The document describes Twitter's search architecture. It discusses how Twitter uses modified versions of Lucene called Earlybird to build real-time and archive search indexes. The real-time indexes are partitioned and replicated across clusters. New tweets are continuously added and searchable with low latency. Archive indexes contain older tweets on HDFS and are optimized for throughput over low latency. The system uses an analyzer to preprocess tweets before indexing and a service called the Blender to merge search results.
Building a real time big data analytics platform with Solr - Trey Grainger
Having “big data” is great, but turning that data into actionable intelligence is where the real value lies. This talk will demonstrate how you can use Solr to build a highly scalable data analytics engine to enable customers to engage in lightning fast, real-time knowledge discovery.
At CareerBuilder, we utilize these techniques to report the supply and demand of the labor force, compensation trends, customer performance metrics, and many live internal platform analytics. You will walk away from this talk with an advanced understanding of faceting, including pivot-faceting, geo/radius faceting, time-series faceting, function faceting, and multi-select faceting. You'll also get a sneak peek at some new faceting capabilities just wrapping up development, including distributed pivot facets and percentile/stats faceting, which will be open-sourced.
The presentation will be a technical tutorial, along with real-world use-cases and data visualizations. After this talk, you'll never see Solr as just a text search engine again.
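To make the pivot-faceting concept concrete, here is an illustrative sketch (pure Python, not Solr code) of what a pivot facet like `facet.pivot=state,city` computes: nested value counts over two fields.

```python
# Compute nested facet counts: documents per outer-field value, with a
# nested count per inner-field value -- the shape a pivot facet returns.
from collections import Counter, defaultdict

def pivot_facet(docs, outer_field, inner_field):
    pivot = defaultdict(Counter)
    for doc in docs:
        pivot[doc[outer_field]][doc[inner_field]] += 1
    return {outer: dict(inner) for outer, inner in pivot.items()}

jobs = [
    {"state": "GA", "city": "Atlanta"},
    {"state": "GA", "city": "Atlanta"},
    {"state": "GA", "city": "Savannah"},
    {"state": "IL", "city": "Chicago"},
]
facets = pivot_facet(jobs, "state", "city")
```

Solr performs the same aggregation inside the index, which is what makes the real-time analytics described above possible at scale.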
Relevance in the Wild - Daniel Gomez Vilanueva, Findwise (Lucidworks)
This document discusses relevance in information retrieval systems. It begins with definitions of relevance and how relevance is measured. It then covers similarity functions like TF-IDF and BM25 that are used to calculate relevance scores. Configuration options for similarity in Solr are presented, including setting similarity globally or per field. The edismax query parser is described along with parameters that impact relevance. Methods for evaluating relevance through testing and analysis are provided. Finally, examples of applying relevance techniques to real systems are briefly outlined.
Object detection is a central problem in computer vision and underpins many applications from medical image analysis to autonomous driving. In this talk, we will review the basics of object detection from fundamental concepts to practical techniques. Then, we will dive into cutting-edge methods that use transformers to drastically simplify the object detection pipeline while maintaining predictive performance. Finally, we will show how to train these models at scale using Determined’s integrated deep learning platform and then serve the models using MLflow.
What you will learn:
Basics of object detection including main concepts and techniques
Main ideas from the DETR and Deformable DETR approaches to object detection
Overview of the core capabilities of Determined’s deep learning platform, with a focus on its support for effortless distributed training
How to serve models trained in Determined using MLflow
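One of the basic object-detection concepts the talk covers is intersection-over-union (IoU), the standard measure of overlap between a predicted and a ground-truth bounding box. A self-contained sketch for axis-aligned boxes given as (x1, y1, x2, y2):

```python
# IoU of two axis-aligned boxes: area of the overlap rectangle divided by
# the area of the union. Returns 0.0 for disjoint boxes.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0
```

IoU drives both evaluation (a detection "counts" above an IoU threshold) and the matching step in training pipelines such as DETR's set-based loss.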
Deep Postgres Extensions in Rust | PGCon 2019 | Jeff Davis (Citus Data)
Postgres relies heavily on an extension ecosystem, but that ecosystem is almost 100% dependent on C, which shuts out developers, libraries, and ideas from the world of Postgres. postgres-extension.rs changes that by supporting development of extensions in Rust. Rust is a memory-safe language that integrates nicely in any environment, has powerful libraries, a vibrant ecosystem, and a prolific developer community.
Rust is a unique language because it supports high-level features but all the magic happens at compile-time, and the resulting code is not dependent on an intrusive or bulky runtime. That makes it ideal for integrating with postgres, which has a lot of its own runtime, like memory contexts and signal handlers. postgres-extension.rs offers this integration, allowing the development of extensions in rust, even if deeply-integrated into the postgres internals, and helping handle tricky issues like error handling. This is done through a collection of Rust function declarations, macros, and utility functions that allow rust code to call into postgres, and safely handle resulting errors.
This document provides an excerpt from the book "Spark: The Definitive Guide" which introduces some of the core concepts of Apache Spark. It discusses Spark's basic architecture including the driver program, executors, and cluster managers. It also covers Spark applications, DataFrames, transformations and actions. Finally, it provides a sample end-to-end example reading CSV flight data to demonstrate these concepts.
Tutorial: Implementing your first Postgres extension | PGConf EU 2019 | Burak... (Citus Data)
One of the strongest features of any database is its extensibility, and PostgreSQL comes with a rich extension API. It allows you to define new functions, types, and operators. It even allows you to modify some of its core parts, like the planner, executor, or storage engine. You read that right: you can even change the behavior of the PostgreSQL planner. How cool is that?
Such freedom in extensibility created strong extension community around PostgreSQL and made way for a vast amount of extensions such as pg_stat_statements, citus, postgresql-hll and many more.
In this tutorial, we will look at how you can create your own PostgreSQL extension. We will start with more common tasks like defining new functions and types, then gradually explore less-known parts of PostgreSQL's extension API, like the C-level hooks that let you change the behavior of the planner, executor, and other core parts of PostgreSQL. We will see how to code, debug, compile, and test our extension. After that, we will also look into how to package and distribute our extension for other people to use.
To get the most out of the tutorial, C and SQL knowledge is helpful. Some knowledge of PostgreSQL internals would also be useful, but we will cover the necessary details, so it is not required.
See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011
This talk describes how you can practically apply some of Lucene 4's new features (such as flexible indexing, scoring improvements, column-stride fields) to improve your search application.
The talk will give a brief description of these new features along with example use-cases you can try yourself. We'll cover how you can configure Solr to:
Set up the schema to use Pulsing or Memory codec for a primary key field
Not use a separate spellcheck index, controlling character-level swaps from the query processor
Sorting with a different locale
Per-field similarity configurations, such as using a non-vector-space algorithm
Suneel Marthi - Deep Learning with Apache Flink and DL4J (Flink Forward)
http://flink-forward.org/kb_sessions/deep-learning-with-apache-flink-and-dl4j/
Deep Learning has become very popular over the last few years in areas such as image recognition, fraud detection, and machine translation, and it has proved very useful for handling unstructured data and extracting value from it. A big challenge in building deep learning models has been the high cost of training them. With the recent advent of distributed frameworks like Apache Flink and Apache Spark, it is faster to train deep learning models in parallel on modern platform architectures. In this talk, we'll show how to use Apache Flink Streaming with the open source deep learning framework DeepLearning4j to perform large-scale deep learning model training. We will show a demo of a recurrent neural net that is trained for language modeling and have it generate text.
From Python Scikit-learn to Scala Apache Spark—The Road to Uncovering Botnets... (Databricks)
The landscape of security threats an enterprise faces is vast. It is imperative for an organization to know when one of the machines within the network has been compromised. One layer of detection can take advantage of the DNS requests made by machines within the network. A request to a Command & Control (CNC) domain can act as an indication of compromise. It is thus advisable to find these domains before they come into play. The team at Akamai aims to do just that.
In this session, Aminov will share Akamai’s experience in porting their PoC detection algorithms, written in Python, to a reliable production-level implementation using Scala and Apache Spark. He will specifically cover their experience regarding an algorithm they developed to detect botnet domains based on passive DNS data. The session will also include some useful insights Akamai has learned while handing out solutions from research to development, including the transition from small-scale to large-scale data consumption, model export/import using PMML and sampling techniques. This information is valuable for researchers and developers alike.
Feature Extraction for Large-Scale Text CollectionsSease
Feature engineering is a fundamental but poorly documented component in LTR search applications.
As a result, there are still few open access software packages that allow researchers and practitioners to easily simulate a feature extraction pipeline and conduct experiments in a lab setting.
This talk introduces Fxt, an open-source framework to perform efficient and scalable feature extraction. Fxt may be integrated into complex, high-performance software applications to help solve a wide variety of text-based machine learning problems.
The talk details how we built and documented a reproducible feature extraction pipeline with LTR experiments using the ClueWeb09B collection.
This LTR dataset is publicly available.
We’ll also discuss some of the benefits (feature extraction efficiency, model interpretation) of having open access tooling in this area for researchers and practitioners alike.
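To give a flavor of what a feature extraction pipeline produces, here is a hypothetical sketch (this is not Fxt's API) of a few classic query-document features used in LTR datasets:

```python
# Extract simple LTR features for one query-document pair: document length,
# number of matched query terms, fraction of the query covered, and the
# summed term frequency of the matched terms.
def extract_features(query_terms, doc_terms):
    overlap = [t for t in query_terms if t in doc_terms]
    return {
        "doc_len": len(doc_terms),
        "matched_terms": len(overlap),
        "query_coverage": len(overlap) / len(query_terms),
        "sum_tf": sum(doc_terms.count(t) for t in overlap),
    }

doc = "the quick brown fox jumps over the lazy dog".split()
feats = extract_features(["quick", "dog", "cat"], doc)
```

A production extractor computes hundreds of such features per pair, efficiently and over web-scale collections, which is the gap Fxt aims to fill.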
SparkR Under the Hood with Hossein Falaki (Databricks)
SparkR is a new and evolving interface to Apache Spark. It offers a wide range of APIs and capabilities to data scientists and statisticians. Because Spark is a distributed system with a JVM core, some R users find SparkR errors unfamiliar. In this talk we will show what goes on under the hood when you interact with SparkR. We will look at SparkR architecture, performance bottlenecks, and API semantics. Equipped with those, we will show how some common errors can be eliminated. I will use debugging examples based on our experience with real SparkR use cases.
This presentation will start by introducing how Apache Lucene can be used to classify documents using data structures that already exist in your index instead of having to generate and supply external training sets. Building on the introduction the focus will be on extensions of the Lucene Classification module that come in Lucene 6.0 and the Lucene Classification module's incorporation in to Solr 6.1. These extensions will allow you to classify at a document level with individual field weighting, numeric field support, lat/lon fields etc. The Solr ClassificationUpdateProcessor will be explored, such as how it works, and how to use it including basic and advanced features like multi class support and classification context filtering. The presentation will include practical examples and real world use cases.
Grant Ingersoll presented on using Apache Solr and Apache Spark for data engineering. He discussed how Solr can be used for indexing and searching large amounts of data, while Spark enables large-scale processing on the indexed data. Lucidworks' Fusion product combines Solr and Spark capabilities to allow search-driven applications and machine learning on indexed content.
It is one thing to write an Apache Spark application that gets you to an answer. It’s another thing to know you used all the tricks in the book to make you run, run as fast as possible. This session will focus on those tricks.
Discover patterns and approaches that may not be apparent at first glance, but that can be game-changing when applied to your use cases. You'll learn about nested types, multi-threading, skew, reducing, cartesian joins, and fun stuff like that.
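One of the classic tricks for handling skew is key salting. The idea, sketched here in pure Python rather than Spark: split a hot key across N sub-keys so partial aggregates spread over N workers, then strip the salt and combine the partials.

```python
# Two-stage skewed aggregation: stage 1 counts by (key, salt) so a hot key
# is spread across num_salts buckets; stage 2 strips the salt and merges.
import random
from collections import Counter

def salted_counts(records, num_salts=4, seed=0):
    rng = random.Random(seed)
    partial = Counter((key, rng.randrange(num_salts)) for key in records)
    final = Counter()
    for (key, _salt), count in partial.items():
        final[key] += count
    return final

records = ["hot"] * 1000 + ["rare"] * 3
```

In Spark the same pattern is two `reduceByKey` stages, with the salt appended to the key in the first stage.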
Deep Learning to Big Data Analytics on Apache Spark Using BigDL with Xianyan ... (Databricks)
This document provides an overview and summary of BigDL, an open source distributed deep learning library for Apache Spark. It describes how BigDL allows users to run deep learning on Spark by supporting common deep learning frameworks and algorithms. Specific capabilities and examples discussed include using BigDL to run Deep Speech 2 for speech recognition on LibriSpeech data and using BigDL to run Faster R-CNN and SSD for object detection on PASCAL VOC data. Performance comparisons show BigDL achieving comparable or better results than other frameworks.
Finite-State Queries in Lucene:
* Background, improvement/evolution of MultiTermQuery API in 2.9 and Flex
* Implementing existing Lucene queries with NFA/DFA for better performance: Wildcard, Regex, Fuzzy
* How you can use this Query programmatically to improve relevance (I'll use an English test collection/English examples)
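As a toy illustration of the automaton idea behind queries like Wildcard: compile the pattern into NFA states and advance a state set over the term in one pass, with no backtracking, the same way Lucene runs an automaton against the term dictionary. (Pure Python sketch, not Lucene's implementation.)

```python
# Match a wildcard pattern ('?' = any one char, '*' = any run of chars)
# against a term by simulating NFA states, where a state is a position in
# the pattern. '*' states can be skipped via an epsilon closure.
def wildcard_match(pattern, term):
    def closure(states):
        out = set(states)
        changed = True
        while changed:
            changed = False
            for p in list(out):
                if p < len(pattern) and pattern[p] == "*" and p + 1 not in out:
                    out.add(p + 1)  # '*' may match the empty string
                    changed = True
        return out

    states = closure({0})
    for ch in term:
        nxt = set()
        for p in states:
            if p == len(pattern):
                continue
            if pattern[p] == "*":
                nxt.add(p)            # '*' consumes ch and stays put
            elif pattern[p] == "?" or pattern[p] == ch:
                nxt.add(p + 1)        # literal or '?' consumes ch, advances
        states = closure(nxt)
        if not states:
            return False
    return len(pattern) in states
```

Each term costs one linear scan regardless of how many wildcards the pattern contains, which is the performance win over naive expansion.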
Quick overview of other Lucene features in development, such as:
* Flexible Indexing
* "More-Flexible" Scoring: challenges/supporting BM25, more vector-space models, field-specific scoring, etc.
* Improvements to analysis
Bonus:
* Lucene / Solr merger explanation and future plans
About the presenter:
Robert Muir is a super-active Lucene developer. He works as a software developer for Abraxas Corporation. Robert received his MS in Computer Science from Johns Hopkins and BS in CS from Radford University. For the last few years Robert has been working on foreign language NLP problems - "I really enjoy working with Lucene, as it's always receptive to better int'l/language support, even though everyone seems to be a performance freak... such a weird combination!"
Building, Debugging, and Tuning Spark Machine Learning Pipelines - Joseph Bradl... (Spark Summit)
This document discusses Spark ML pipelines for machine learning workflows. It begins with an introduction to Spark MLlib and the various algorithms it supports. It then discusses how ML workflows can be complex, involving multiple data sources, feature transformations, and models. Spark ML pipelines allow specifying the entire workflow as a single pipeline object. This simplifies debugging, re-running on new data, and parameter tuning. The document provides an example text classification pipeline and demonstrates how data is transformed through each step via DataFrames. It concludes by discussing upcoming improvements to Spark ML pipelines.
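The pipeline abstraction the document describes can be sketched in miniature (pure Python, not the Spark ML API): each stage exposes fit/transform, and the pipeline fits and chains them so the whole workflow is one object.

```python
# Minimal fit/transform pipeline: a tokenizer stage feeding a count
# vectorizer stage, chained so fit() trains each stage on the output of
# the previous one.
class Tokenizer:
    def fit(self, data):
        return self
    def transform(self, data):
        return [text.lower().split() for text in data]

class CountVectorizer:
    def fit(self, token_lists):
        self.vocab = sorted({t for tokens in token_lists for t in tokens})
        return self
    def transform(self, token_lists):
        return [[tokens.count(t) for t in self.vocab] for tokens in token_lists]

class Pipeline:
    def __init__(self, stages):
        self.stages = stages
    def fit(self, data):
        for stage in self.stages:
            stage.fit(data)
            data = stage.transform(data)
        return self
    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data

pipe = Pipeline([Tokenizer(), CountVectorizer()]).fit(["Spark ML", "ML pipelines"])
```

Re-running on new data is then a single `transform` call, and parameter tuning can swap stages or parameters without rewriting the workflow, which is exactly the convenience the talk highlights.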
IBM Spark Technology Center: Real-time Advanced Analytics and Machine Learnin... (DataStax Academy)
The audience will participate in a live, interactive demo that generates high-quality recommendations using the latest Spark-Cassandra integration for real time, approximate, and advanced analytics including machine learning, graph processing, and text processing.
This document summarizes Apache Flink community updates from June 2015. It discusses the 0.9.0 release of Apache Flink, an open source platform for scalable batch and stream data processing. Key points include the addition of two new committers, blog posts and workshops promoting Flink, and various conference and meetup talks about Flink occurring that month. It encourages registration for the Flink Forward conference in October 2015.
Exploring Direct Concept Search - Steve Rowe, Lucidworks
This document discusses direct concept search using word embeddings. It describes mapping query and index terms to vector representations in a conceptual space to improve recall by expanding queries with related concepts. Word2vec is used to generate 127-dimensional word embeddings from Wikipedia text. The embeddings are indexed in Lucene to enable nearest neighbor search. Queries are expanded by searching for terms nearest to query terms in the embedding space. While building high-dimensional point indexes is slow in Lucene, this approach demonstrates the potential of using word embeddings for query expansion in information retrieval.
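The nearest-neighbor step behind this query expansion can be shown in a toy version (pure Python with tiny made-up vectors rather than real word2vec output): rank candidate terms by cosine similarity to the query term's embedding.

```python
# Rank terms by cosine similarity to a query vector and keep the top k,
# the core of embedding-based query expansion.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest_terms(query_vec, embeddings, k=2):
    """embeddings: dict of term -> vector. Returns the k most similar terms."""
    ranked = sorted(embeddings, key=lambda t: cosine(query_vec, embeddings[t]),
                    reverse=True)
    return ranked[:k]

embeddings = {
    "queen": [0.9, 0.8, 0.1],
    "king": [1.0, 0.7, 0.0],
    "banana": [0.0, 0.1, 1.0],
}
```

At index scale this brute-force scan is replaced by a nearest-neighbor structure, which is where the Lucene point-index approach described above comes in.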
Solr Compute Cloud – An Elastic Solr Infrastructure: Presented by Nitin Sharm... (Lucidworks)
Solr Compute Cloud (SC2) is an elastic Solr infrastructure that allows for dynamic provisioning of Solr clusters on demand. This allows each search pipeline or job to have its own isolated cluster, improving stability, throughput, and cost optimization. The key benefits of SC2 are pipeline isolation, dynamic scaling, production cluster safeguards, and built-in high availability and disaster recovery features through technologies like the Solr HAFT service.
Solr Compute Cloud - An Elastic SolrCloud Infrastructure - Nitin S
Scaling search platforms for serving hundreds of millions of documents with low latency and high throughput workloads at an optimized cost is an extremely hard problem. BloomReach has implemented Sc2, which is an elastic Solr infrastructure for Big Data applications, supporting heterogeneous workloads and hosted in the cloud. It dynamically grows/shrinks search servers to provide application and pipeline level isolation, NRT search and indexing, latency guarantees, and application-specific performance tuning. In addition, it provides various high availability features such as differential real-time streaming, disaster recovery, context aware replication, and automatic shard and replica rebalancing, all with a zero downtime guarantee for all consumers. This infrastructure currently serves hundreds of millions of documents in millisecond response times with a load ranging in the order of 200-300K QPS.
This presentation will describe an innovate implementation of scaling Solr in an elastic fashion. It will review the architecture and take a deep dive into how each of these components interact to make the infrastructure truly elastic, real time, and robust while serving latency needs.
Building a real time big data analytics platform with solrTrey Grainger
Having “big data” is great, but turning that data into actionable intelligence is where the real value lies. This talk will demonstrate how you can use Solr to build a highly scalable data analytics engine to enable customers to engage in lightning fast, real-time knowledge discovery.
At CareerBuilder, we utilize these techniques to report the supply and demand of the labor force, compensation trends, customer performance metrics, and many live internal platform analytics. You will walk away from this talk with an advanced understanding of faceting, including pivot-faceting, geo/radius faceting, time-series faceting, function faceting, and multi-select faceting. You’ll also get a sneak peak at some new faceting capabilities just wrapping up development including distributed pivot facets and percentile/stats faceting, which will be open-sourced.
The presentation will be a technical tutorial, along with real-world use-cases and data visualizations. After this talk, you'll never see Solr as just a text search engine again.
Relevance in the Wild - Daniel Gomez Vilanueva, FindwiseLucidworks
This document discusses relevance in information retrieval systems. It begins with definitions of relevance and how relevance is measured. It then covers similarity functions like TF-IDF and BM25 that are used to calculate relevance scores. Configuration options for similarity in Solr are presented, including setting similarity globally or per field. The edismax query parser is described along with parameters that impact relevance. Methods for evaluating relevance through testing and analysis are provided. Finally, examples of applying relevance techniques to real systems are briefly outlined.
Object detection is a central problem in computer vision and underpins many applications from medical image analysis to autonomous driving. In this talk, we will review the basics of object detection from fundamental concepts to practical techniques. Then, we will dive into cutting-edge methods that use transformers to drastically simplify the object detection pipeline while maintaining predictive performance. Finally, we will show how to train these models at scale using Determined’s integrated deep learning platform and then serve the models using MLflow.
What you will learn:
Basics of object detection including main concepts and techniques
Main ideas from the DETR and Deformable DETR approaches to object detection
Overview of the core capabilities of Determined’s deep learning platform, with a focus on its support for effortless distributed training
How to serve models trained in Determined using MLflow
Deep Postgres Extensions in Rust | PGCon 2019 | Jeff DavisCitus Data
Postgres relies heavily on an extension ecosystem, but that is almost 100% dependent on C; which cuts out developers, libraries, and ideas from the world of Postgres. postgres-extension.rs changes that by supporting development of extensions in Rust. Rust is a memory-safe language that integrates nicely in any environment, has powerful libraries, a vibrant ecosystem, and a prolific developer community.
Rust is a unique language because it supports high-level features but all the magic happens at compile-time, and the resulting code is not dependent on an intrusive or bulky runtime. That makes it ideal for integrating with postgres, which has a lot of its own runtime, like memory contexts and signal handlers. postgres-extension.rs offers this integration, allowing the development of extensions in rust, even if deeply-integrated into the postgres internals, and helping handle tricky issues like error handling. This is done through a collection of Rust function declarations, macros, and utility functions that allow rust code to call into postgres, and safely handle resulting errors.
This document provides an excerpt from the book "Spark: The Definitive Guide" which introduces some of the core concepts of Apache Spark. It discusses Spark's basic architecture including the driver program, executors, and cluster managers. It also covers Spark applications, DataFrames, transformations and actions. Finally, it provides a sample end-to-end example reading CSV flight data to demonstrate these concepts.
Tutorial: Implementing your first Postgres extension | PGConf EU 2019 | Burak...Citus Data
One of the strongest features of any database is its extensibility and PostgreSQL comes with a rich extension API. It allows you to define new functions, types, and operators. It even allows you to modify some of its core parts like planner, executor or storage engine. You read it right, you can even change the behavior of PostgreSQL planner. How cool is that?
Such freedom in extensibility created strong extension community around PostgreSQL and made way for a vast amount of extensions such as pg_stat_statements, citus, postgresql-hll and many more.
In this tutorial, we will look at how you can create your own PostgreSQL extension. We will start with more common stuff like defining new functions and types but gradually explore less known parts of the PostgreSQL's extension API like C level hooks which lets you change the behavior of planner, executor and other core parts of the PostgreSQL. We will see how to code, debug, compile and test our extension. After that, we will also look into how to package and distribute our extension for other people to use.
To get the most out of the tutorial, C and SQL knowledge would be beneficial. Some knowledge of PostgreSQL internals would also help, but we will cover the necessary details, so it is not required.
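For reference, the packaging step the tutorial mentions follows a fixed layout: a control file describing the extension plus a versioned SQL script, installed into Postgres's share/extension directory. A minimal illustrative pair (the extension name and function are hypothetical):

```
# my_ext.control -- metadata read by CREATE EXTENSION
comment = 'toy example extension'
default_version = '1.0'
relocatable = true
```

```sql
-- my_ext--1.0.sql: objects created when you run CREATE EXTENSION my_ext;
CREATE FUNCTION add_one(integer) RETURNS integer
    AS $$ SELECT $1 + 1 $$
    LANGUAGE SQL IMMUTABLE;
```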
See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011
This talk describes how you can practically apply some of Lucene 4's new features (such as flexible indexing, scoring improvements, column-stride fields) to improve your search application.
The talk will give a brief description of these new features and some example use cases you can try yourself in and around the features now available in Lucene 4. We'll cover how you can configure Solr to:
Set up the schema to use Pulsing or Memory codec for a primary key field
Skip a separate spellcheck index by controlling character-level swaps from the query processor
Sort with a different locale
Use per-field similarity configurations, such as a non-vector-space algorithm
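As a concrete illustration of two of those items, per-field postings formats and similarities are declared directly in schema.xml in Solr 4.x; the field type names below are made up:

```xml
<!-- schema.xml (Solr 4.x): per-field codec and similarity, illustrative names -->
<fieldType name="id_memory" class="solr.StrField"
           postingsFormat="Memory"/>   <!-- Memory postings format for a primary-key field -->

<fieldType name="text_bm25" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <similarity class="solr.BM25SimilarityFactory"/>  <!-- non-vector-space scoring -->
</fieldType>
```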
Suneel Marthi - Deep Learning with Apache Flink and DL4JFlink Forward
http://flink-forward.org/kb_sessions/deep-learning-with-apache-flink-and-dl4j/
Deep Learning has become very popular over the last few years in areas such as image recognition, fraud detection, and machine translation, and it has proved to be very useful for handling unstructured data and extracting value from it. A big challenge in building deep learning models has been the high cost of training them. With the advent of distributed frameworks like Apache Flink and Apache Spark, it's faster to train deep learning models in parallel on modern platform architectures. In this talk, we'll show how to use Apache Flink Streaming with the open source deep learning framework DeepLearning4j to perform large-scale deep learning model training. We will show a demo of a recurrent neural net that is trained for language modeling and have it generate text.
From Python Scikit-learn to Scala Apache Spark—The Road to Uncovering Botnets...Databricks
The landscape of security threats an enterprise faces is vast. It is imperative for an organization to know when one of the machines within the network has been compromised. One layer of detection can take advantage of the DNS requests made by machines within the network. A request to a Command & Control (CNC) domain can act as an indication of compromise. It is thus advisable to find these domains before they come into play. The team at Akamai aims to do just that.
In this session, Aminov will share Akamai's experience in porting their PoC detection algorithms, written in Python, to a reliable production-level implementation using Scala and Apache Spark. He will specifically cover their experience with an algorithm they developed to detect botnet domains based on passive DNS data. The session will also include some useful insights Akamai has learned while handing off solutions from research to development, including the transition from small-scale to large-scale data consumption, model export/import using PMML, and sampling techniques. This information is valuable for researchers and developers alike.
Feature Extraction for Large-Scale Text CollectionsSease
Feature engineering is a fundamental but poorly documented component in LTR search applications.
As a result, there are still few open access software packages that allow researchers and practitioners to easily simulate a feature extraction pipeline and conduct experiments in a lab setting.
This talk introduces Fxt, an open-source framework to perform efficient and scalable feature extraction. Fxt may be integrated into complex, high-performance software applications to help solve a wide variety of text-based machine learning problems.
The talk details how we built and documented a reproducible feature extraction pipeline with LTR experiments using the ClueWeb09B collection.
This LTR dataset is publicly available.
We’ll also discuss some of the benefits (feature extraction efficiency, model interpretation) of having open access tooling in this area for researchers and practitioners alike.
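To make "feature extraction" concrete, here is a toy pure-Python sketch of the kind of query-document features an LTR pipeline computes; Fxt itself does this efficiently at collection scale, and the feature names below are illustrative:

```python
import math

# Toy LTR feature extraction: for each (query, document) pair, emit a
# small feature vector of the kind an LTR pipeline computes at scale.
def extract_features(query, doc):
    q_terms = query.lower().split()
    d_terms = doc.lower().split()
    tf = {t: d_terms.count(t) for t in q_terms}
    return {
        "sum_tf": sum(tf.values()),                  # raw term-frequency mass
        "doc_len": len(d_terms),                     # input to length normalization
        "coverage": sum(1 for t in q_terms if tf[t] > 0) / len(q_terms),
        "sum_log_tf": sum(math.log(1 + c) for c in tf.values()),
    }

feats = extract_features("apache lucene", "lucene is a search library from apache apache")
print(feats["coverage"])  # both query terms appear in the document → 1.0
```

A real pipeline would add corpus-level statistics (IDF, BM25 scores, field-specific variants) computed from the inverted index.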
Spark r under the hood with Hossein FalakiDatabricks
SparkR is a new and evolving interface to Apache Spark. It offers a wide range of APIs and capabilities to data scientists and statisticians. Because Spark is a distributed system with a JVM core, some R users find SparkR errors unfamiliar. In this talk we will show what goes on under the hood when you interact with SparkR. We will look at SparkR architecture, performance bottlenecks, and API semantics. Equipped with those, we will show how some common errors can be eliminated. I will use debugging examples based on our experience with real SparkR use cases.
This presentation will start by introducing how Apache Lucene can be used to classify documents using data structures that already exist in your index, instead of having to generate and supply external training sets. Building on the introduction, the focus will be on the extensions of the Lucene Classification module that arrive in Lucene 6.0 and the module's incorporation into Solr 6.1. These extensions allow you to classify at the document level with individual field weighting, numeric field support, lat/lon fields, etc. We will explore the Solr ClassificationUpdateProcessor: how it works and how to use it, including basic and advanced features like multi-class support and classification context filtering. The presentation will include practical examples and real-world use cases.
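As a rough sketch of the Solr side, the processor is wired into an update chain in solrconfig.xml; the parameter names below follow the Solr reference guide for the ClassificationUpdateProcessorFactory, while the field names are illustrative:

```xml
<!-- solrconfig.xml: classify each incoming document before it is indexed -->
<updateRequestProcessorChain name="classification">
  <processor class="solr.ClassificationUpdateProcessorFactory">
    <str name="inputFields">title,content</str>  <!-- fields used as classification input -->
    <str name="classField">category</str>        <!-- field receiving the assigned class -->
    <str name="algorithm">knn</str>              <!-- knn or bayes -->
    <str name="knn.k">20</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```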
Grant Ingersoll presented on using Apache Solr and Apache Spark for data engineering. He discussed how Solr can be used for indexing and searching large amounts of data, while Spark enables large-scale processing on the indexed data. Lucidworks' Fusion product combines Solr and Spark capabilities to allow search-driven applications and machine learning on indexed content.
It is one thing to write an Apache Spark application that gets you to an answer. It's another thing to know you used all the tricks in the book to make it run as fast as possible. This session will focus on those tricks.
Discover patterns and approaches that may not be apparent at first glance, but that can be game-changing when applied to your use cases. You'll learn about nested types, multithreading, skew, reducing, cartesian joins, and fun stuff like that.
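One of the classic skew tricks is key salting: a hot key would otherwise send all of its records to one partition, so you append a random salt to spread it over several buckets. A pure-Python illustration of the idea (not Spark code; in a real join the small side is replicated once per salt value):

```python
import random
from collections import Counter

# Key salting: spread a hot key ("us") over SALT sub-keys so no single
# partition receives all of its records.
random.seed(0)  # deterministic for the illustration
SALT = 4
records = [("us", i) for i in range(1000)] + [("nz", i) for i in range(10)]

def salted_key(key):
    return f"{key}#{random.randrange(SALT)}"

partitions = Counter(salted_key(k) for k, _ in records)
hot_buckets = [p for p in partitions if p.startswith("us#")]
print(len(hot_buckets))  # the hot "us" key now spans all 4 salt buckets
```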
Deep Learning to Big Data Analytics on Apache Spark Using BigDL with Xianyan ...Databricks
This document provides an overview and summary of BigDL, an open source distributed deep learning library for Apache Spark. It describes how BigDL allows users to run deep learning on Spark by supporting common deep learning frameworks and algorithms. Specific capabilities and examples discussed include using BigDL to run Deep Speech 2 for speech recognition on LibriSpeech data and using BigDL to run Faster R-CNN and SSD for object detection on PASCAL VOC data. Performance comparisons show BigDL achieving comparable or better results than other frameworks.
Finite-State Queries in Lucene:
* Background, improvement/evolution of MultiTermQuery API in 2.9 and Flex
* Implementing existing Lucene queries with NFA/DFA for better performance: Wildcard, Regex, Fuzzy
* How you can use this Query programmatically to improve relevance (I'll use an English test collection/English examples)
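The automaton idea behind those query implementations can be sketched in a few lines: compile a wildcard pattern into an NFA over characters and run terms through it. Lucene 4 does this far more efficiently, intersecting the automaton with the term dictionary, but the core mechanics look like this toy sketch:

```python
# Toy NFA simulation for wildcard patterns ('?' = any one char, '*' = any run).
# An NFA state is an index into the pattern; we track the set of live states.
def closure(states, pattern):
    # epsilon-closure: a '*' may match the empty string, so also skip past it
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        if s < len(pattern) and pattern[s] == "*" and s + 1 not in seen:
            seen.add(s + 1)
            stack.append(s + 1)
    return seen

def wildcard_match(pattern, term):
    states = closure({0}, pattern)
    for ch in term:
        nxt = set()
        for s in states:
            if s >= len(pattern):
                continue
            p = pattern[s]
            if p == "*":
                nxt.add(s)            # '*' consumes ch and stays on the star
            elif p == "?" or p == ch:
                nxt.add(s + 1)        # literal or single-char wildcard advances
        states = closure(nxt, pattern)
    return len(pattern) in states     # accept if the whole pattern is consumed

print(wildcard_match("te?t*", "testing"))  # → True
print(wildcard_match("te?t*", "team"))     # → False
```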
Quick overview of other Lucene features in development, such as:
* Flexible Indexing
* "More-Flexible" Scoring: challenges/supporting BM25, more vector-space models, field-specific scoring, etc.
* Improvements to analysis
Bonus:
* Lucene / Solr merger explanation and future plans
About the presenter:
Robert Muir is a super-active Lucene developer. He works as a software developer for Abraxas Corporation. Robert received his MS in Computer Science from Johns Hopkins and BS in CS from Radford University. For the last few years Robert has been working on foreign language NLP problems - "I really enjoy working with Lucene, as it's always receptive to better int'l/language support, even though everyone seems to be a performance freak... such a weird combination!"
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...Spark Summit
This document discusses Spark ML pipelines for machine learning workflows. It begins with an introduction to Spark MLlib and the various algorithms it supports. It then discusses how ML workflows can be complex, involving multiple data sources, feature transformations, and models. Spark ML pipelines allow specifying the entire workflow as a single pipeline object. This simplifies debugging, re-running on new data, and parameter tuning. The document provides an example text classification pipeline and demonstrates how data is transformed through each step via DataFrames. It concludes by discussing upcoming improvements to Spark ML pipelines.
IBM Spark Technology Center: Real-time Advanced Analytics and Machine Learnin...DataStax Academy
The audience will participate in a live, interactive demo that generates high-quality recommendations using the latest Spark-Cassandra integration for real time, approximate, and advanced analytics including machine learning, graph processing, and text processing.
This document summarizes Apache Flink community updates from June 2015. It discusses the 0.9.0 release of Apache Flink, an open source platform for scalable batch and stream data processing. Key points include the addition of two new committers, blog posts and workshops promoting Flink, and various conference and meetup talks about Flink occurring that month. It encourages registration for the Flink Forward conference in October 2015.
Exploring Direct Concept Search - Steve Rowe, LucidworksLucidworks
This document discusses direct concept search using word embeddings. It describes mapping query and index terms to vector representations in a conceptual space to improve recall by expanding queries with related concepts. Word2vec is used to generate 127-dimensional word embeddings from Wikipedia text. The embeddings are indexed in Lucene to enable nearest neighbor search. Queries are expanded by searching for terms nearest to query terms in the embedding space. While building high-dimensional point indexes is slow in Lucene, this approach demonstrates the potential of using word embeddings for query expansion in information retrieval.
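The nearest-neighbor step can be sketched directly: given word vectors, expand a query term with its closest neighbors by cosine similarity. A toy pure-Python version with hand-made 3-d vectors standing in for learned embeddings (the talk's were 127-dimensional, trained on Wikipedia):

```python
import math

# Toy query expansion via nearest neighbors in a word-embedding space.
vectors = {
    "dog":   (0.90, 0.10, 0.00),
    "puppy": (0.85, 0.15, 0.05),
    "cat":   (0.70, 0.30, 0.00),
    "car":   (0.00, 0.10, 0.95),
}

def cosine(a, b):
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

def expand(term, k=1):
    # Rank every other vocabulary term by similarity to `term`.
    v = vectors[term]
    ranked = sorted((w for w in vectors if w != term),
                    key=lambda w: cosine(v, vectors[w]), reverse=True)
    return ranked[:k]

print(expand("dog"))  # → ['puppy']
```

The indexing approach in the talk replaces this linear scan with Lucene's high-dimensional point structures so the neighbor search scales to a real vocabulary.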
Solr Compute Cloud – An Elastic Solr Infrastructure: Presented by Nitin Sharm...Lucidworks
Solr Compute Cloud (SC2) is an elastic Solr infrastructure that allows for dynamic provisioning of Solr clusters on demand. This allows each search pipeline or job to have its own isolated cluster, improving stability, throughput, and cost optimization. The key benefits of SC2 are pipeline isolation, dynamic scaling, production cluster safeguards, and built-in high availability and disaster recovery features through technologies like the Solr HAFT service.
Solr Compute Cloud - An Elastic SolrCloud Infrastructure Nitin S
Scaling search platforms for serving hundreds of millions of documents with low latency and high throughput workloads at an optimized cost is an extremely hard problem. BloomReach has implemented SC2, an elastic Solr infrastructure for Big Data applications, supporting heterogeneous workloads and hosted in the cloud. It dynamically grows/shrinks search servers to provide application and pipeline level isolation, NRT search and indexing, latency guarantees, and application-specific performance tuning. In addition, it provides various high availability features such as differential real-time streaming, disaster recovery, context aware replication, and automatic shard and replica rebalancing, all with a zero downtime guarantee for all consumers. This infrastructure currently serves hundreds of millions of documents in millisecond response times with loads on the order of 200-300K QPS.
This presentation will describe an innovate implementation of scaling Solr in an elastic fashion. It will review the architecture and take a deep dive into how each of these components interact to make the infrastructure truly elastic, real time, and robust while serving latency needs.
The document discusses Solr Compute Cloud (SC2), an elastic Solr infrastructure developed by BloomReach to address challenges of scaling search platforms for big data applications. SC2 dynamically provisions Solr clusters in the cloud for pipelines and indexing jobs, providing isolation. It ensures latency guarantees, dynamic scaling, high availability and disaster recovery. SC2 addresses issues BloomReach faced with a shared cluster approach like throughput limitations, stability problems and indexing challenges.
This document discusses how to download and play the mobile game Subway Surfers on a personal computer. It describes using BlueStacks, an Android emulator, to install and run the game normally played on phones and tablets. BlueStacks allows users to access Google Play to download Subway Surfers and other Android apps. Once installed through BlueStacks, the game can be played offline on a PC like a mobile game, allowing users to enjoy Subway Surfers on a larger screen without being limited to a phone.
EclipseCon 2016 - OCCIware : one Cloud API to rule them allMarc Dutoo
This document provides an overview of OCCIware, a project that aims to create a cloud consumer platform using the Open Cloud Computing Interface (OCCI) standard. It discusses the need for such a platform given the fragmented state of existing cloud solutions. OCCIware takes a model-driven engineering approach, using Eclipse modeling tools to generate an OCCI extension, designer, and runtime configuration from a domain model. The document demonstrates using these tools to model a Linked Data application and deploy its configuration to Docker. Upcoming work on OCCIware includes improving existing generators, integrating additional capabilities like simulation, and contributing back to the OCCI standard.
OCCIware Project at EclipseCon France 2016, by Marc Dutoo, Open WideOCCIware
Hear hear, dev & ops alike - ever got bitten by the fragmentation of the Cloud space at deployment time, by AWS vs Azure, OpenShift vs Heroku? In a word, ever dreamt of configuring your Cloud application at once, along with both its VMs and its database? Well, the extensible Open Cloud Computing Interface (OCCI) REST API (see http://occi-wg.org/) allows just that, by addressing the whole XaaS spectrum.
And now, OCCI is getting power-boosted by Eclipse Modeling and formal foundations. Enter Cloud Designer and the other outputs of the OCCIware project (see http://www.occiware.org): multiple visual representations, one per Cloud layer and technology; XaaS Cloud extension model validation, documentation, and ops scripting generation; simulation and decision-making comparison; connectors that bring those models to life by getting their status from common Cloud services; runtime middleware, deployed, monitored, administered. And tackling the very interesting challenge of modeling a meta-API in EMF's metamodel, while staying true to EMF, Eclipse tools, and the OCCI standard.
Featuring Eclipse Sirius, Acceleo generators, and EMF at runtime. Coming soon to a new Eclipse Foundation project near you, if you'd like.
This talk includes a demonstration of the Docker connector and of how to use Cloud Designer to configure a simple Cloud application's deployment on the Roboconf PaaS system and OpenStack infrastructure.
"Introducing Distributed Tracing in a Large Software System", Kostiantyn Sha...Fwdays
Software systems are growing in size and complexity when the business is growing, and sometimes it is hard to figure out what is going on. Various teams make different changes for different business capabilities. Distributed Tracing is a useful way to look under the hood and see for yourself what operations are being performed, what services are used in a certain use case, and how performant are they. In this talk, I will present what Distributed Tracing is and how we introduced it into our software system with some tips and tricks on what you should focus on if you want to do the same.
What's Next in OpenStack? A Glimpse At The RoadmapShamailXD
YouTube Recording: https://www.youtube.com/watch?v=cCdqOxD5G0M
Whether you are a newbie to OpenStack looking at building your first cloud or an experienced operator with years of OpenStack success behind you, you've probably spent some time wondering what to expect from the OpenStack project over the next several releases. Will it finally support that new capability you've been waiting for? Should you plan for an upgrade in the next 6 months? While the development community is always working on and planning new features, it takes a lot of time on IRC to get a complete view across the different projects. The OpenStack Product WG spent time this cycle working with the project teams and PTLs to understand their priorities for the next several OpenStack releases. Where we have always had an understanding of what's to come in the next release, we're hoping to present a long-term view of the future landscape of OpenStack. In this session, we'll present our findings across the different projects in an effort to give users a glimpse into the OpenStack roadmap.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2lGNybu.
Stefan Krawczyk discusses how his team at StitchFix use the cloud to enable over 80 data scientists to be productive. He also talks about prototyping ideas, algorithms and analyses, how they set up & keep schemas in sync between Hive, Presto, Redshift & Spark and make access easy for their data scientists, etc. Filmed at qconsf.com.
Stefan Krawczyk is Algo Dev Platform Lead at StitchFix, where he’s leading development of the algorithm development platform. He spent formative years at Stanford, LinkedIn, Nextdoor & Idibon, working on everything from growth engineering, product engineering, data engineering, to recommendation systems, NLP, data science and business intelligence.
The document provides an overview of distributed systems patterns and practices. It discusses why distributed systems are used to solve problems like single points of failure and elastic demand. Common distributed system patterns are explained, including leader-follower models, data replication across nodes, and handling failures. Specific distributed systems like Zookeeper, HDFS and Cassandra are described as examples of implementing patterns like quorum management and consistent hashing for replicated data.
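Consistent hashing, mentioned above, can be sketched minimally: place nodes on a hash ring and route each key to the first node clockwise, so adding or removing a node only remaps keys in one arc. A toy sketch (real systems such as Cassandra also use virtual nodes for better balance):

```python
import bisect
import hashlib

# Minimal consistent-hash ring: nodes are placed on a ring by hash, and a
# key is routed to the first node clockwise from its own hash position.
def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.ring = sorted((h(n), n) for n in nodes)
        self.keys = [hv for hv, _ in self.ring]

    def node_for(self, key):
        # first node whose hash is >= the key's hash, wrapping around the ring
        i = bisect.bisect(self.keys, h(key)) % len(self.ring)
        return self.ring[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")
print(owner in {"node-a", "node-b", "node-c"})  # → True, and stable across calls
```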
Operational Visibiliy and Analytics - BU SeminarCanturk Isci
The document discusses building operational visibility and analytics directly into cloud platforms. It describes an agentless system crawler that can provide deep visibility into cloud instances without requiring any action from end users. The crawler collects various system data which is then analyzed to provide operational insights and solve real-world problems. Specific applications discussed include vulnerability advising, configuration analysis, and license discovery. The goal is to design monitoring and analytics that are seamlessly integrated and optimized for cloud environments.
Splunk All the Things: Our First 3 Months Monitoring Web Service APIs - Splun...Dan Cundiff
A presentation titled "Splunk All the Things: Our First 3 Months Monitoring Web Service APIs" that Dan Cundiff and Eric Helgeson from Target Corporation gave at Splunk .conf2012.
BloomReach developed an elastic Solr infrastructure called Solr Compute Cloud (SC2) to address the challenges of scaling their search platform. SC2 allows search pipelines and indexing jobs to dynamically provision isolated Solr clusters from an API to run in, improving throughput, stability and availability. It utilizes a Solr HAFT service to replicate data between clusters and provide disaster recovery by cloning clusters. This elastic approach isolates workloads, allows individual scaling and prevents performance issues caused by shared clusters.
The document describes an Android application developed for the Remote Triggered Laboratory project. It provides a brief overview of the app's objectives, development tools used, and design structure. The app was created to act as an interface between users and experiments on mobile devices. It communicates with a LabVIEW server application through the SCCT library. The app's code is organized modularly into packages for each experiment. Future improvements could include adding a login system and developing a hybrid version using PhoneGap and HTML5 for increased flexibility.
The data streaming processing paradigm and its use in modern fog architecturesVincenzo Gulisano
Invited lecture at the University of Trieste.
The lecture covers (briefly) the data streaming processing paradigm, research challenges related to distributed, parallel, and deterministic streaming analysis, and the research of the DCS (Distributed Computing and Systems) group at Chalmers University of Technology.
Production Readiness Strategies in an Automated WorldSean Chittenden
This document discusses strategies for making a software service production ready. It begins by outlining the typical software life cycle from idea to production. It then discusses some of the organizational prerequisites needed for a production service, including standardized terminology, naming conventions, and rules for incident response. The document also provides examples of what to include in a production readiness checklist, such as an overview of the service, its consumers, release process, health metrics, and quality metrics.
Druid is a high-performance, column-oriented distributed data store that is widely used at Oath for big data analysis. Druid has a JSON schema as its query language, making it difficult for new users unfamiliar with the schema to start querying Druid quickly. The JSON schema is designed to work with the data ingestion methods of Druid, so it can provide high-performance features such as data aggregations in JSON, but many are unable to utilize such features because they are not familiar with the specifics of how to optimize Druid queries. However, most new Druid users at Yahoo are already very familiar with SQL, and the queries they want to write for Druid can be converted to concise SQL.
We found that our data analysts wanted an easy way to issue ad-hoc Druid queries and view the results in a BI tool in a way that's presentable to nontechnical stakeholders. In order to achieve this, we had to bridge the gap between Druid, SQL, and our BI tools such as Apache Superset. In this talk, we will explore different ways to query a Druid datasource in SQL and discuss which methods were most appropriate for our use cases. We will also discuss our open source contributions so others can utilize our work. GURUGANESH KOTTA, Software Dev Eng, Oath and JUNXIAN WU, Software Engineer, Oath Inc.
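The gap the talk describes is easy to see side by side: a native Druid timeseries query next to the SQL most analysts would rather write (datasource and column names are made up):

```json
{
  "queryType": "timeseries",
  "dataSource": "pageviews",
  "granularity": "day",
  "aggregations": [{"type": "longSum", "name": "views", "fieldName": "views"}],
  "intervals": ["2018-01-01/2018-02-01"]
}
```

```sql
SELECT FLOOR(__time TO DAY) AS "day", SUM(views) AS views
FROM pageviews
WHERE __time >= TIMESTAMP '2018-01-01' AND __time < TIMESTAMP '2018-02-01'
GROUP BY 1
```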
Similar to Synchronizing Clusters in Fusion: CDCR and Streaming Expressions (20)
Search is the Tip of the Spear for Your B2B eCommerce StrategyLucidworks
With ecommerce experiencing explosive growth, it seems intuitive that the B2B segment of that ecosystem is mirroring the same trajectory. That said, B2B has very different needs when it comes to transacting with the same style of experiences that we see in B2C. For instance, B2B ecommerce is about precision findability, whereas B2C customers can convert at higher rates when they’re just browsing online. In order for the B2B buying experience to be successful, search needs to be tuned to meet the unique needs of the segment.
In this webinar with Forrester senior analyst Joe Cicman, you’ll learn:
-Which verticals in B2B will drive the most growth, and how machine-learning powered personalization tactics can be deployed to support those specific verticals
-Why an omnichannel selling approach must be deployed in order to see success in B2B
-How deploying content search capabilities will support a longer sales cycle at scale
-What the next steps are to support a robust B2B commerce strategy supported by new technology
Speakers
Joe Cicman, Senior Analyst, Forrester
Jenny Gomez, VP of Marketing, Lucidworks
Customer loyalty starts with quickly responding to your customer’s needs. When it comes to resolving open support cases, time is of the essence. Time spent searching for answers adds up and creates inefficiencies in resolving cases at scale. Relevant answers need to be a few clicks away and easily accessible for agents directly from their service console.
We will explore how Lucidworks’ Agent Insights application automatically connects agents with the correct answers and resources. You’ll learn how to:
-Configure a proactive widget in an agent’s case view page to access resources across third-party systems (such as Sharepoint, Confluence, JIRA, Zendesk, and ServiceNow).
-Easily set up query pipelines to autonomously route assets and resources that are relevant to the case-at-hand—directly to the right agent.
-Identify subject matter experts within your support data and access tribal knowledge with lightning-fast speed.
How Crate & Barrel Connects Shoppers with Relevant ProductsLucidworks
Lunch and Learn during Retail TouchPoints #RIC21 virtual event.
***
Crate & Barrel’s previous search solution couldn’t provide its shoppers with an online search and browse experience consistent with the customer-centric Crate & Barrel brand. Meanwhile, Crate & Barrel merchandisers spent the bulk of their time manually creating and maintaining search rules. The search experience impacted customer retention, loyalty, and revenue growth.
Join this lunch & learn for an interactive chat on how Crate & Barrel partnered with Lucidworks to:
-Improve search and browse by modernizing the technology stack with ML-based personalization and merchandising solutions
-Enhance the experience for both shoppers and merchandisers
-Explore signals to transform the omnichannel shopping experience
Questions? Visit https://lucidworks.com/contact/
Learn how to guide customers to relevant products using eCommerce search, hyper-personalisation, and recommendations in our ‘Best-In-Class Retail Product Discovery’ webinar.
Nowadays, shoppers want their online experience to be engaging, inspirational and fulfilling. They want to find what they’re looking for quickly and easily. If the sought-after item isn’t available, they want the next best product or content surfaced to them. They want a website to understand their goals as though they were talking to a sales assistant in person, in-store.
In this webinar, we explore IMRG industry data insights and a best-in-class example of retail product discovery. You’ll learn:
- How AI can drive increased revenue through hyper-personalised experiences
- How user intent can be easily understood and results displayed immediately
- How merchandisers can be empowered to curate results and product placement – all without having to rely on IT.
Presented by:
Dave Hawkins, Principal Sales Engineer - Lucidworks
Matthew Walsh, Director of Data & Retail - IMRG
Connected Experiences Are Personalized ExperiencesLucidworks
Many companies claim personalization and omnichannel capabilities are top priorities. Few are able to deliver on those experiences.
For a recent Lucidworks-commissioned study, Forrester Consulting surveyed 350+ global business decision-makers to see what gets in the way of achieving these goals. They discovered that inefficient technology, lack of behavioral insights, and failure to tie initiatives to enterprise-wide goals are some of the most frequent blockers to personalization success.
Join guest speaker, Forrester VP and Principal Analyst, Brendan Witcher, and Lucidworks CEO, Will Hayes, to hear the results of the Forrester Consulting study, how to avoid “digital blindness,” and how to apply VoC data in real-time to delight customers with personalized experiences connected across every touchpoint.
In this webinar, you’ll learn:
- Why companies who utilize real-time customer signals report more effective personalization
- How to connect employees and customers in a shared experience through search and browse
- How Lucidworks clients Lenovo, Morgan Stanley and Red Hat fast-tracked improvements in conversion, engagement and customer satisfaction
Featuring
- Will Hayes, CEO, Lucidworks
- Brendan Witcher, VP, Principal Analyst, Forrester
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...Lucidworks
Intelligent Policing. Leveraging Data to more effectively Serve Communities.
Policing in the next decade is anticipated to be very different from historical methods. More data driven, more focused on the intricacies of communities they serve and more open and collaborative to make informed recommendations a reality. Whether it's social populations, NIBRS or organization improvement that's the driver, the IT requirement is largely the same. Provide 360 access to large volumes of siloed data to gain a full 360 understanding of existing connections and patterns for improved insight and recommendation.
Join us for a round table discussion of how the Toronto Police Service is better serving their community through deploying a unified intelligent data platform.
Data innovation improves officers' engagement with existing data and streamlines investigation workflows by enhancing collaboration. This improved visibility into existing police data allows for a more intelligent and responsive police force.
In this webinar, we'll cover:
-The technology needs of an intelligent police force.
-How a Global Search improves an officer's interaction with existing data.
Featuring:
-Simon Taylor, VP, Worldwide Channels & Alliances, Lucidworks
-Michael Cizmar, Managing Director, MC+A
-Ian Williams, Manager of Analytics & Innovation, Toronto Police Service
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...Lucidworks
Policing in the next decade is anticipated to be very different from historical methods. More data driven, more focused on the intricacies of communities they serve and more open and collaborative to make informed recommendations a reality. Whether it's social populations, NIBRS or organization improvement that's the driver, the IT requirement is largely the same. Provide 360 access to large volumes of siloed data to gain a full 360 understanding of existing connections and patterns for improved insight and recommendation.
Join us for a round table discussion of how the Toronto Police Service is better serving their community through deploying a unified intelligent data platform.
Data innovation improves officers' engagement with existing data and streamlines investigation workflows by enhancing collaboration. This improved visibility into existing police data allows for a more intelligent and responsive police force.
In this webinar, we'll cover:
The technology needs of an intelligent police force.
How a Global Search improves an officer's interaction with existing data.
Featuring
-Simon Taylor, VP, Worldwide Channels & Alliances, Lucidworks
-Michael Cizmar, Managing Director, MC+A
-Ian Williams, Manager of Analytics & Innovation, Toronto Police Service
Preparing for Peak in Ecommerce | eTail Asia 2020Lucidworks
This document provides a framework for prioritizing onsite search problems and key performance indicators (KPIs) to measure for e-commerce search optimization. It recommends prioritizing fixing searches that yield no results, improving relevance of results, and reducing false positives. The most essential KPIs to measure include query latency, throughput, result relevance through click-through rates and NDCG scores. The document also provides tips for self-benchmarking search performance and examples of search performance benchmarks across nine e-commerce sites from various industries.
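Among the KPIs listed, NDCG is the one that usually needs unpacking; it can be computed in a few lines. A minimal sketch (the relevance grades are invented, and the ideal ranking here is taken over the same truncated list):

```python
import math

# NDCG@k: discounted cumulative gain of the ranked results, normalized by
# the DCG of the ideal (relevance-sorted) ordering of the same results.
def dcg(rels):
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))

def ndcg(rels, k=None):
    rels = rels[:k] if k else rels
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

# Relevance grades (0-3) of the top results as ranked by the engine:
print(round(ndcg([3, 2, 3, 0, 1]), 3))  # → 0.972
```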
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...Lucidworks
Wish your conversion rates were higher? Can’t figure out how to efficiently and effectively serve all the visitors on your site? Embarrassed by the quality of your product discovery experience? The bar is high and the influx of online shopping over recent months has reminded us that the opportunities are real. We’re all deep in holiday prep, but let’s take a few minutes to think about January 2021 and beyond. How can we position ourselves for success with our customers and against our competition?
Grab your lunch and let’s dive into three strategies that need to be part of your 2021 roadmap. You don’t need an army to get there. But you do need to take action and capitalize on the shoppers abandoning the product discovery journey on your site.
In this session, attendees will find out how to:
-Take control of merchandising at scale;
-Implement hands-free search relevancy; and
-Address personalization challenges.
AI-Powered Linguistics and Search with Fusion and RosetteLucidworks
For a personalized search experience, search curation requires robust text interpretation, data enrichment, relevancy tuning and recommendations. In order to achieve this, language and entity identification are crucial.
For teams working on search applications, advanced language packages allow them to achieve greater recall without sacrificing precision.
Join us for a guided tour of our new Advanced Linguistics packages, available in Fusion, thanks to the technology partnership between Lucidworks and Basistech.
We’ll explore the application of language identification and entity extraction in the context of search, along with practical examples of personalizing search and enhancing entity extraction.
In this webinar, we’ll cover:
-How Fusion uses the Rosette Basic Linguistics and Entity Extraction packages
-Tips for improving language identification and treatment as well as data enrichment for personalization
-Speech2 demo modeling Active Recommendation
-Use Rosette’s packages with Fusion Pipelines to build custom entities for specific domain use cases
Featuring:
-Radu Miclaus, Director of Product, AI and Cloud, Lucidworks, Lucidworks
-Robert Lucarini, Senior Software Engineer, Lucidworks
-Nick Belanger, Solutions Engineer, Basis Technology
The Service Industry After COVID-19: The Soul of Service in a Virtual MomentLucidworks
Before COVID-19, almost 80% of the US workforce worked service in jobs that involve in-person interaction with strangers. Now, leaders of service organizations must reshape their offerings during the pandemic and prepare for whatever the new normal turns out to be. Our three panelists will share ideas for adapting their service businesses, now that closer-than-six-feet isn’t an option.
Join Lucidworks as we talk shop with 3 service business leaders, covering:
-Common impacts of the pandemic on service businesses (and what to do about them),
-How service teams can maintain a human touch across virtual channels, and
-Plans for the future, before and after the pandemic subsides.
Featuring
-Sara Nathan, President & CEO, AMIGOS
-Anthony Carruesco, Founder, AC Fly Fishing
-sara bradley, chef and proprietor, freight house
-Justin Sears, VP Product Marketing, Lucidworks
Webinar: Smart answers for employee and customer support after covid 19 - EuropeLucidworks
The COVID-19 pandemic has forced companies to support far more customers and employees through digital channels than ever before. Many are turning to chatbots to help meet increasing demand, but traditional rules-based approaches can’t keep up. Our new Smart Answers add-on to Lucidworks Fusion makes existing chatbots and virtual assistants more intelligent and more valuable to the people you serve.
Smart Answers for Employee and Customer Support After COVID-19Lucidworks
Watch our on-demand webinar showcasing Smart Answers on Lucidworks Fusion. This technology makes existing chatbots and virtual assistants more intelligent and more valuable to the people you serve.
In this webinar, we’ll cover off:
-How search and deep learning extend conversational frameworks for improved experiences
-How Smart Answers improves customer care, call deflection, and employee self-service
-A live demo of Smart Answers for multi-channel self-service support
Applying AI & Search in Europe - featuring 451 ResearchLucidworks
In the current climate, it’s now more important than ever to digitally enable your workforce and customers.
Hear from Simon Taylor, VP Global Partners & Alliances, Lucidworks and Matt Aslett, Research Vice President, 451 Research to get the inside scoop on how industry leaders in Europe are developing and executing their digital transformation strategies.
In this webinar, we’ll discuss:
The top challenges and aspirations European business and technology leaders are solving using AI and search technology
Which search and AI use cases are making the biggest impact in industries such as finance, healthcare, retail and energy in Europe
What technology buyers should look for when evaluating AI and search solutions
Webinar: Accelerate Data Science with Fusion 5.1Lucidworks
This document introduces Fusion 5.1 and its new capabilities for integrating with data science tools like Tensorflow, Scikit-Learn, and Spacy.
It provides an overview of Fusion's capabilities for understanding content, users, and delivering insights at scale. The document then demonstrates Fusion's Jupyter Notebook integration for reading and writing data and running SQL queries.
Finally, it shows how Fusion integrates with Seldon Core to easily deploy machine learning models with tools like Tensorflow and Scikit-Learn. A live demo is provided of deploying a custom model and using it in Fusion's query and indexing pipelines.
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce StrategyLucidworks
In this webinar with 451 Research, you'll understand how retailers are using AI to predict customer intent and learn which key performance metrics are used by more than 120 online retailers in Lucidworks’ 2019 Retail Benchmark Survey.
In this webinar, you’ll learn:
● What trends and opportunities are facing the ecommerce industry in 2020
● Why search is the universal path to understanding customer intent
● How large online retailers apply AI to maximize the effectiveness of their personalization efforts
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...Lucidworks
Nordstrom Rack | Hautelook curates and serves customers a wide selection of on-trend apparel, accessories, and shoes at an everyday savings of up to 75 percent off regular prices. With over a million visitors shopping across different platforms every day, and a realization that customers have become accustomed to robust and personalized search interactions, Nordstrom Rack | Hautelook launched an initiative over a year ago to provide data science-driven digital experiences to their customers.
In this session, we’ll discuss Nordstrom Rack | Hautelook’s journey of operationalizing a hefty strategy, optimizing a fickle infrastructure, and rallying troops around a single vision of building an expansible machine-learning driven product discovery engine.
The audience will learn about:
-The key technical challenges and outcomes that come with onboarding a solution
-The lessons learned of creating and executing operational design
-The use of Lucidworks Fusion to plug custom data science models into search and browse applications to understand user intent and deliver personalized experiences
Apply Knowledge Graphs and Search for Real-World Decision IntelligenceLucidworks
Knowledge graphs and machine learning are on the rise as enterprises hunt for more effective ways to connect the dots between the data and the business world. With newer technologies, the digital workplace can dramatically improve employee engagement, data-driven decisions, and actions that serve tangible business objectives.
In this webinar, you will learn
-- Introduction to knowledge graphs and where they fit in the ML landscape
-- How breakthroughs in search affect your business
-- The key features to consider when choosing a data discovery platform
-- Best practices for adopting AI-powered search, with real-world examples
Webinar: Building a Business Case for Enterprise SearchLucidworks
The document discusses building a business case for enterprise search. It notes that 85% of information is unstructured data locked in various locations and applications. Many knowledge workers spend a significant portion of their day searching across multiple systems for information. The rise of unstructured data and AI capabilities can help organizations unlock value from their information assets. Effective enterprise search powered by AI can provide real-time intelligence, personalized information, and more efficient research to help knowledge workers.
Taking AI to the Next Level in Manufacturing.pdfssuserfac0301
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
6. Ideas and approaches to help build your organization's AI strategy.
Synchronizing Clusters in Fusion: CDCR and Streaming Expressions
1.
2. STAY CONNECTED
Twitter @activate_conf
Facebook @activateconf
#Activate19
Log in to wifi, follow Activate on social media,
and download the event app where you can
submit an evaluation after the session
WIFI NETWORK: Activate2019
PASSWORD: Lucidworks
DOWNLOAD THE ACTIVATE 2019 MOBILE APP
Search Activate2019 in the App/Play store
Or visit: http://crowd.cc/activate19
3. Today’s speaker…
Who is the hippie mad scientist giving this talk?
PAUL ANDERSON
Information Architect
Dynatrace
Synchronizing Clusters in Fusion: CDCR and Streaming
Expressions
Synchronizing a search application across multiple clusters is a complex
challenge and the solution evolves with our tools (Solr and Fusion
AppStudio). Paul Anderson discusses how Dynatrace's cluster
synchronization strategy changed over the last two years to ensure that
customers worldwide have a consistent search experience. The talk
focuses on two Solr features, CDCR, and Streaming Expressions,
explaining what they do well, where they fall down, and where they need
to improve. Paul also covers how to modify your index pipelines and
signal aggregations to support cluster synchronization.
4. Dynatrace is software intelligence built
for the enterprise cloud
Go beyond APM with the Dynatrace all-in-one platform
Software Intelligence Platform:
• Application performance monitoring
• Cloud infrastructure monitoring
• AIOps
• Digital experience management
5. Dynatrace is the clear leader
• #1 in Gartner APM: highest ability to execute and furthest completeness of vision
• #1 Ecosystem
• 25 major releases per year
• 2,000+ employees
6. Agenda
• Why multiple clusters?
• What needs to be synced?
• How to sync?
• How to monitor the sync?
• Q & A
8. Multiple clusters enhance performance
Search Apdex with one cluster in US
Apdex is an open standard for measuring performance of software applications in computing (https://en.wikipedia.org/wiki/Apdex).
Search Apdex with two clusters, US and EU
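Apdex is simple enough to compute by hand; a minimal sketch of the standard formula (satisfied requests count fully, tolerating requests count half, frustrated requests count zero):

```python
def apdex(satisfied, tolerating, total):
    """Apdex = (satisfied + tolerating/2) / total, per the open Apdex standard."""
    return (satisfied + tolerating / 2) / total

# 700 satisfied, 200 tolerating, 100 frustrated out of 1000 samples
score = apdex(satisfied=700, tolerating=200, total=1000)
print(score)  # 0.8
```

Comparing per-cluster scores like this is what the one-cluster versus two-cluster comparison on this slide summarizes.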
9. Multiple clusters support failover
[Diagram: incoming search traffic is routed between the us-east-1 and eu-west-1 region datacenters, so either one can take over if the other fails.]
11. First, the obvious stuff…
• Infrastructure (same for on-premises or in the cloud)
• Installed applications
– Java
– Fusion
• Your Fusion application
– Some cluster-specific differences in fusion.properties, solrconfig, etc.
• With the obvious out of the way… what about collection data?
12. Your search data…
• Search index
– Can’t you just index independently in each cluster?
– Sure, but indexing is expensive
– A recent test showed that the only slow search requests occurred during a crawl.
– Syncing the index is preferable
• Signal data
– Really? Why?
13. Why sync signals?
Signals power several aspects of relevance boosting
• They provide click counts for determining popularity
– Perhaps a slight impact, based on locale-based differences in user preference
• They can power re-ranking algorithms
• They can serve as the ground truth for our learning-to-rank models
• If you want consistent results across clusters, signals should be synced
14. Why sync signals?
Personal boosting example
[Diagram: the personal boosting example, with incoming search traffic split between the us-east-1 and eu-west-1 region datacenters.]
15. Search sync to-do list
• Main document collection (search index)
• Signals
– In both directions (Yikes!)
• Anything else?
– User permission data? Transient, short lived cache, no big advantage.
– System logs? Heck no, that needs to remain unique.
– Aggregated signal data? Easy to regenerate, no big advantage.
– Bueller?
17. How to sync the search index?
Three options: none of them perfect…
• Use Solr’s Cross Data Center Replication service (aka. CDCR)
– Configure one cluster as the source, the other as the target.
– Crawl in the source cluster, all changes (adds, updates, deletes) replicated to target.
• Set up separate crawl schedules in each cluster
– Doing each crawl twice; not ideal.
– Shouldn’t crawl in both clusters at the same time (leave one fully available for queries).
– More chance for minor differences in search index.
• What about streaming expressions?
– Negative; can’t handle deletions.
• Which one to pick…?
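For reference, unidirectional CDCR is configured in solrconfig.xml. A minimal sketch of the source-cluster side follows; the ZooKeeper addresses and collection name are placeholders, and the target cluster needs its own /cdcr handler plus a cdcr-processor-chain (see the Solr Reference Guide for your version):

```xml
<!-- Source cluster solrconfig.xml: replicate my_collection to the target cluster -->
<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
  <lst name="replica">
    <str name="zkHost">eu-zk1:2181,eu-zk2:2181/solr</str>
    <str name="source">my_collection</str>
    <str name="target">my_collection</str>
  </lst>
  <lst name="replicator">
    <str name="threadPoolSize">2</str>
    <str name="schedule">1000</str>
    <str name="batchSize">128</str>
  </lst>
  <lst name="updateLogSynchronizer">
    <str name="schedule">60000</str>
  </lst>
</requestHandler>
```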
18. Isn’t CDCR the better option?
It can be, but…
• We used unidirectional CDCR with great success for over a year
• We introduced new datasources with different update logic with no crawl DB
– Delete existing docs first…
– Then recrawl the same docs (single fast REST call)
• CDCR stopped working... and never worked correctly again
• Suggestion:
– Try CDCR in a test environment (two test clusters)
– If it works, try it in production…
– …but have a crawl schedule for each cluster ready to go.
19. Search index recommendation
• Make crawl schedules for your clusters, even if you plan to try CDCR first
– Avoid crawl schedule overlap between clusters for maximum performance
– A traffic policy based on latency and cluster health is a really good idea
• If CDCR fails, enable the crawl jobs in the target cluster
• More about the future of CDCR later
20. How to sync signals?
• That depends on your Fusion version…
21. Signal sync in Fusion 3.x
Nothing but click signals to worry about…
[Diagram: EU and US Twigkit/Appkit front ends and other web properties (API) send click signals through the US _signals_ingest index pipeline into the US signals collection (Solr). Unidirectional CDCR replicates signal data to the EU signals collection (Solr), and each cluster runs its own signals aggregation job into its own aggregated-signals collection. Annotation: "Tried this instead."]
22. But I thought CDCR was bad?
• Signals collections are all about adding
– Never an update
– Rarely a delete (periodic history cleanup for GDPR)
• In this scenario, we found that CDCR can be relatively stable
23. Signal sync in Fusion 4.x
Click signals and response signals and session signals, oh my…
[Diagram: in Fusion 4.x, EU and US Appkit/Appstudio front ends and other web properties (API) feed each cluster's own query pipeline and _signals_ingest index pipeline, writing into that cluster's signals collection (Solr). Each cluster also runs its own session rollup job and signals aggregation job into its own aggregated-signals collection. The two signals collections would be linked by bidirectional CDCR (?), flagged as the challenging part.]
24. Perfect job for bidirectional CDCR?
Sadly, no.
• Bidirectional CDCR was designed for:
– Easy failover,…
– …without having to edit your Solrconfig.
– Source and target can swap their behavior automatically
• Activate 2018 discussions…
– Since signals are always additive…
– …and neither cluster will create the same signal ID…
– …it… should… work!
• But it doesn’t
25. Bidirectional CDCR’s fatal flaw
• The source/target swap logic is not very fault tolerant.
• In a test environment, it works quite nicely.
• In a production environment, under load, it quietly stops working,…
– Replicating in one direction only,…
– And starts accumulating tlog files like a banshee.
26. The verdict on CDCR
• Simple unidirectional implementations can work, but…
• It’s too fragile
• It fails for unexplained reasons
• It doesn’t support the Solr authentication or authorization plugins
• Any bidirectional implementation is bound to fail
• Solr committers admit that it has serious design flaws
• I can’t recommend it right now.
• Just say no… for now
27. Hope for CDCR
Dying, but due for resuscitation
• A band of Lucidworks developers want to champion fixes for CDCR
• They need your help to gather buy-in from the rest of the Solr project
• If you need/want CDCR to be righteous, let the Solr project know
29. What are streaming expressions?
• The gospel according to the Solr doc:
Streaming expressions are a suite of functions that can be combined to
perform many different parallel computing tasks.
• How many of you are already using streaming expressions?
• What do we need for our signal sync scenario?
– We need one streaming expression to push new signals from the US cluster to the EU cluster.
– We need one streaming expression to push new signals from the EU cluster to the US cluster.
• We’re going to leverage:
– A Topic stream source, nested inside…
– An Update stream decorator, nested inside…
– A Daemon stream decorator.
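Assembled, the nesting above looks roughly like the following sketch, run on the US cluster to push its signals to the EU cluster. This is illustrative only: the daemon id and runInterval are arbitrary, and the zkHost parameter on update() pointing at the remote cluster is an assumption to verify against the Streaming Expressions documentation for your Solr version.

```
daemon(id="signals_to_eu",
  runInterval="60000",
  update(my_signals,
    batchSize=250,
    zkHost="eu-zk1:2181,eu-zk2:2181/solr",
    topic(stream_checkpoints,
      my_signals,
      q="cluster:useast1",
      fl="*",
      id="signals_topic")))
```

Once started via the /stream request handler, the daemon re-runs the inner expression on each interval, and the topic's stored checkpoints ensure each signal is pushed only once.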
30. Configuring the Topic
First, a supporting requirement
• The topic is a stream source, in this case, a query into the signals collection
• We only want to query signals that originated in the US
– Remember, we’re going to create another streaming expression to send EU signals to the US
– We don’t want the new signals being passed back and forth forever
• In Fusion, create a field in the _signals_ingest index pipeline that records the originating cluster.
– useast1
– euwest1
– From a source control standpoint, this is a difference between the clusters.
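Logic-wise, the pipeline change only has to stamp one field. A generic Python sketch of the idea (not actual Fusion pipeline code; the field name and values are from the slide):

```python
CLUSTER_NAME = "useast1"  # set to "euwest1" in the EU cluster's pipeline config

def stamp_cluster(signal: dict) -> dict:
    """Tag an incoming signal with the cluster it originated in, so the
    topic query (q="cluster:useast1") only picks up locally created signals."""
    signal.setdefault("cluster", CLUSTER_NAME)
    return signal

doc = stamp_cluster({"type": "click", "query": "dashboards"})
print(doc["cluster"])  # useast1
```

Using a set-if-absent write matters: a signal replicated in from the other cluster keeps its original tag, which is exactly what stops signals from ping-ponging between clusters forever.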
31. Configuring the Topic
Second, another supporting requirement
• Session signals (rollup aggregation) are written back into the signals collection
• If you want them synced, they need to have a cluster field value
– From a source control standpoint, this is a global change common to all clusters.
WITH session_agg AS (
  SELECT COUNT(1) AS activity_count,
         MIN(timestamp_tdt) AS start,
         MAX(timestamp_tdt) AS end,
         timediff(MAX(timestamp_tdt), MIN(timestamp_tdt), "MINUTES") AS duration,
         'session' AS type,
         first(user_id) AS user,
         first(cluster) AS cluster,
         session_keywords(query) AS keywords,
         session
  FROM ${inputCollection}
  WHERE timestamp_tdt IS NOT NULL
    AND type != 'session'
    AND session IS NOT NULL
    AND session NOT IN (SELECT session FROM ${inputCollection} WHERE type = 'session' AND session IS NOT NULL)
  GROUP BY session
  HAVING timediff(current_timestamp(), MAX(timestamp_tdt), "SECONDS") >= ${elapsedSecsSinceLastActivity}
      OR timediff(current_timestamp(), MIN(timestamp_tdt), "SECONDS") >= ${elapsedSecsSinceSessionStart})
SELECT activity_count, start, end, duration, type, user, cluster, keywords, session FROM session_agg
32. Configuring the Topic
• A collection to store the stream progress
(checkpoints)
• The collection to query
• The query to execute (cluster specific)
• The fields to return (all)
• An ID to associate with the stored
checkpoints
topic(stream_checkpoints,
my_signals,
q="cluster:useast1",
fl="*",
id="signals_topic"
)
The expression itself, in order of parameters
33. Configuring the Topic
• A collection to store the checkpoints for the
stream
• You'll need to create this collection before
you start the streaming expression
daemon.
• I used a two-shard, single-replica
collection for this purpose, but it could
just as easily have been single-shard with
two replicas.
• In this example, the collection is
stream_checkpoints. If you have multiple
streams configured, you'll want a more
descriptive name.
topic(stream_checkpoints,
my_signals,
q="cluster:useast1",
fl="*",
id="signals_topic"
)
Checkpoints collection, part 1
34. Configuring the Topic
• The checkpoint “document” in the
stream_checkpoints collection is shown
to the right.
• The id for the topic (signals_topic)
identifies the checkpoints, so you can use
a single checkpoints collection to store all
the checkpoints for the topics in your
cluster.
• One checkpoint is stored per shard and it
is the version number of the last
processed document; version numbers
always go up.
{
"id":"signals_topic",
"checkpoint_ss":[
"shard2~1643423444105166848",
"shard1~1643424248539119616"
],
"_version_":1643424257580990464
}
Checkpoints collection, part 2
35. Configuring the Topic
• The target collection for the query:
my_signals
• No zkHost string is necessary since the
target collection is local.
topic(stream_checkpoints,
my_signals,
q="cluster:useast1",
fl="*",
id="signals_topic"
)
Target collection
36. Configuring the Topic
• The query to run against the target
collection
• Query should leverage the new cluster
field we added to:
– The _signals_ingest index pipeline
– The session rollup aggregation job
topic(stream_checkpoints,
my_signals,
q="cluster:useast1",
fl="*",
id="signals_topic"
)
Query to execute
37. Configuring the Topic
• The fields to return in the results.
– For signals, return them all: "*".
• Note that we don't have to exclude the
_version_ field. After the push to the
target cluster, the same signal will have
the same ID in both clusters but different
_version_ field values.
topic(stream_checkpoints,
my_signals,
q="cluster:useast1",
fl="*",
id="signals_topic"
)
Fields to return
38. Configuring the Topic
• A name for the signals topic, used to
identify its checkpoints in the checkpoints
collection.
• It appears in the checkpoints collection as
seen below.
topic(stream_checkpoints,
my_signals,
q="cluster:useast1",
fl="*",
id="signals_topic"
)
Topic ID
{
"id":"signals_topic",
"checkpoint_ss":[
"shard2~1643423444105166848",
"shard1~1643424248539119616"
],
"_version_":1643424257580990464
}
39. One note about Topics
• In the Solr doc on topics, you'll notice the following warning:
The topic function should be considered in beta until SOLR-8709 is
committed and released.
• This has to do with the possibility of out-of-order version numbers that
would make the topic miss certain documents because a new document
appeared with a lower version number than in the checkpoint for the last
execution.
• My spy network of Solr committers reports that several efforts have been
made to break topic with out-of-order version numbers, but nobody has
been successful.
• In other words, nothing to see here…
Solr doc needs an update…
40. Configuring the Update
This sets the target of the results returned by
the topic, which we want to be the same
collection in the target cluster.
• The collection to write to (my_signals)
• The batch size.
• The zkhost string for the target cluster,
including solr path
• The topic expression we created earlier.
update(my_signals,
batchSize=500,
zkHost="10.123.1.7:9983,
10.123.1.8:9983,
10.123.1.9:9983/lwfusion/4.1.2/solr",
topic(stream_checkpoints,
my_signals,
q="cluster:useast1",
fl="*",
id="signals_topic"
)
)
The expression itself, in order of parameters
41. Configuring the Update
• The collection to write to (my_signals)
• The collection must already exist in the
target cluster
• Unlike CDCR configurations, you're not
required to maintain the same number of
shards and replicas for this collection
across clusters.
– I do anyway, but you don't have to.
update(my_signals,
batchSize=500,
zkHost="10.123.1.7:9983,
10.123.1.8:9983,
10.123.1.9:9983/lwfusion/4.1.2/solr",
topic(stream_checkpoints,
my_signals,
q="cluster:useast1",
fl="*",
id="signals_topic"
)
)
Target collection
42. Configuring the Update
• The number of documents in each batch
sent.
• Signals are small, so I set it to 500.
• I like to set this in concert with the
runInterval (described later) to, if
possible, process all new signals in a
single batch.
update(my_signals,
batchSize=500,
zkHost="10.123.1.7:9983,
10.123.1.8:9983,
10.123.1.9:9983/lwfusion/4.1.2/solr",
topic(stream_checkpoints,
my_signals,
q="cluster:useast1",
fl="*",
id="signals_topic"
)
)
Batch size
43. Configuring the Update
• The zkHost string of the target cluster,
including any solr path specification.
• Make sure you open up the Zookeeper
port 9983 between your clusters.
Note: zkhost shown on multiple lines for
convenience; don’t break it up.
update(my_signals,
batchSize=500,
zkHost="10.123.1.7:9983,
10.123.1.8:9983,
10.123.1.9:9983/lwfusion/4.1.2/solr",
topic(stream_checkpoints,
my_signals,
q="cluster:useast1",
fl="*",
id="signals_topic"
)
)
zkHost string
44. Configuring the Daemon
A daemon decorator that wraps the update
decorator and topic stream source
• Give the daemon an ID
• How often to run?
• Whether to keep running
• The update and topic we configured earlier
daemon(id="signals_daemon",
runInterval="10000",
terminate="false",
update(my_signals,
batchSize=500,
zkHost="10.123.1.7:9983,
10.123.1.8:9983,
10.123.1.9:9983/lwfusion/4.1.2/solr",
topic(stream_checkpoints,
my_signals,
q="cluster:useast1",
fl="*",
id="signals_topic"
)
)
)
The expression itself, in order of parameters
45. Configuring the Daemon
• The id of the daemon: signals_daemon
• This name will appear in subsequent
action list requests that report on the
status of each daemon (below).
• If you have multiple daemons, you would
want a more descriptive name.
daemon(id="signals_daemon",
runInterval="10000",
terminate="false",
update(my_signals,
batchSize=500,
zkHost="10.123.1.7:9983,
10.123.1.8:9983,
10.123.1.9:9983/lwfusion/4.1.2/solr",
topic(stream_checkpoints,
my_signals,
q="cluster:useast1",
fl="*",
id="signals_topic"
)
)
)
Daemon ID
{
"result-set":{
"docs":[{
"startTime":1567272917772,
"stopTime":0,
"id":"signals_daemon",
"state":"TIMED_WAITING",
"iterations":888643}
,{
"EOF":true}]}}
46. Configuring the Daemon
• The run interval in milliseconds: 10000
• This is how often the daemon will run the
topic query and send the results to the
specified target in the update decorator.
• I like to set this in concert with the
batchSize to, if possible, process all
new signals in a single batch.
daemon(id="signals_daemon",
runInterval="10000",
terminate="false",
update(my_signals,
batchSize=500,
zkHost="10.123.1.7:9983,
10.123.1.8:9983,
10.123.1.9:9983/lwfusion/4.1.2/solr",
topic(stream_checkpoints,
my_signals,
q="cluster:useast1",
fl="*",
id="signals_topic"
)
)
)
Run interval
47. Configuring the Daemon
• Whether the daemon terminates: false
• If true, the daemon will stay resident, but
will only run the topic query and send
results once.
• To keep running at the interval, set this to
false.
daemon(id="signals_daemon",
runInterval="10000",
terminate="false",
update(my_signals,
batchSize=500,
zkHost="10.123.1.7:9983,
10.123.1.8:9983,
10.123.1.9:9983/lwfusion/4.1.2/solr",
topic(stream_checkpoints,
my_signals,
q="cluster:useast1",
fl="*",
id="signals_topic"
)
)
)
Termination(?)
48. Starting the Daemon
curl http://localhost:8983/solr/stream_daemon_host/stream -d 'expr=
daemon(id="signals_daemon",
runInterval="10000",
terminate="false",
update(my_signals,
batchSize=500,
zkHost="10.123.1.7:9983,10.123.1.8:9983,10.123.1.9:9983/lwfusion/4.1.2/solr",
topic(stream_checkpoints,
my_signals,
q="cluster:useast1",
fl="*",
id="signals_topic"
)
)
)
'
The entire request, as run from the instance hosting the daemon
49. Starting the Daemon
• Make a request to the stream API for a given collection and attach the full
daemon expression as a payload.
– The entire request must be one line (mind your whitespace).
• The collection (stream_daemon_host) is where the daemon will be created
as a new thread for that collection.
What are we actually doing?
curl http://localhost:8983/solr/stream_daemon_host/stream -d 'expr=
daemon(id="signals_daemon",
runInterval="10000",
terminate="false",
…
)
'
50. Starting the Daemon
• The temptation is to use the same collection that you're querying.
• If you only specify a collection name, Solr will randomly select a specific
shard and replica for the daemon thread to attach to but it won't tell you
which one.
• When you subsequently make a request for daemon status via a stream
action list, Solr will, again, randomly select a shard/replica and send the
request to that shard/replica.
• Consequently, you can successfully start a daemon and then be unable to
find it.
Avoiding the vanishing daemon, part 1
51. Starting the Daemon
• Possible solution: specify a specific shard/replica in your request
– Just hope that you don't delete/re-create that replica later.
• Better solution: Create a single-shard, single-replica collection exclusively to
host the daemon (create it before you run the daemon).
– That way, all requests to the daemon host collection are consistent.
• I usually create this collection on the same instance in the cluster where I'll
be running the daemon.
• I like to put the daemon start code in a shell script and then run that script
from our Jenkins pipeline during builds.
Avoiding the vanishing daemon, part 2
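Those setup steps can be sketched as a small shell script. This is a minimal sketch, not the author's actual script: the host, port, and collection name follow the slides, and `build_create_url` is a helper introduced here for illustration.

```shell
#!/bin/sh
# Create the single-shard, single-replica collection that will host the
# streaming expression daemon thread. Run once, before starting the
# daemon, on the instance where the daemon will live.

# Compose a Collections API CREATE request for a 1-shard, 1-replica collection
build_create_url() {
  echo "http://$1/solr/admin/collections?action=CREATE&name=$2&numShards=1&replicationFactor=1"
}

URL=$(build_create_url "localhost:8983" "stream_daemon_host")
# curl "$URL"   # uncomment to run against a live cluster
echo "$URL"
```

With a dedicated host collection created this way, every subsequent stream request (start, status, self-heal) can target `stream_daemon_host` and land on the same shard/replica.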
53. Checking the search index sync
• Monitor the document counts per cluster
– OK: Entire collection
– Better: By datasource
• I wrote a shell script that:
– Performs a query on each datasource in each cluster: _lw_data_source_s:my_datasource_name
– Parses out the numFound number
– Adds all the counts to a row in a CSV file that matches an Excel report spreadsheet we have
• Backlog item: Extend this process to a report in our Business Intelligence system
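A minimal sketch of that count-check script follows. The hosts, collection, and datasource names are placeholders (the actual values aren't in the slides), and the numFound parsing assumes Solr's default JSON response format.

```shell
#!/bin/sh
# Per-datasource document count check across clusters, appended as a CSV row.
# Hosts, collection, and datasource names below are illustrative placeholders.
CLUSTERS="10.123.1.7:8983 10.45.2.11:8983"
DATASOURCES="docs_site blog_crawl support_kb"

# Pull the numFound value out of a Solr JSON response body
extract_numfound() {
  echo "$1" | sed -n 's/.*"numFound":\([0-9]*\).*/\1/p'
}

collect_counts() {
  row="$(date +%F)"
  for ds in $DATASOURCES; do
    for host in $CLUSTERS; do
      resp=$(curl -s -m 10 "http://${host}/solr/my_collection/select?q=_lw_data_source_s:${ds}&rows=0&wt=json")
      row="${row},$(extract_numfound "$resp")"
    done
  done
  echo "$row" >> counts.csv   # one row per run, matching the report layout
}

# collect_counts   # uncomment to run against live clusters
```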
54. Checking the signal sync
• Monitor the signal counts in each cluster
– Timing of the daemon intervals means the total is rarely exactly equal
• That shell script (previous slide) also captures signal counts
Keep one eye on the collection…
55. Checking the signal sync
• Monitor the daemon status with a Stream API call
http://host-name-or-ip:8983/solr/stream_daemon_host/stream?action=list
• If you're running it on the local instance:
http://localhost:8983/solr/stream_daemon_host/stream?action=list
Keep the other eye on the streaming expression daemons…
56. Healthy daemon response
• The id is the daemon ID from your
daemon decorator fields
• The startTime and stopTime are UNIX
epoch dates (down to milliseconds)
• A stopTime of 0 means the daemon is still
active and running
• The state can be WAITING (no new
signals) or TIMED_WAITING (between
intervals)
• The iterations are the number of
documents sent.
{
"result-set":{
"docs":[{
"startTime":1567272866535,
"stopTime":0,
"id":"signals_daemon",
"state":"WAITING",
"iterations":549636}
,{
"EOF":true}]}}
57. Terminated daemon response
• A state of TERMINATED, combined with a
non-zero stopTime, means the daemon
has failed for some reason.
• We’ll talk about responding to this status a
little later.
{
"result-set":{
"docs":[{
"startTime":1564408742092,
"stopTime":1566732689909,
"id":"signals_daemon",
"state":"TERMINATED",
"iterations":2340808}
,{
"EOF":true}]}}
59. Typical daemon startup activity
• Starting state:
– There are existing documents (signals) in the collection
– There are no existing checkpoints in the checkpoint collection
• Daemon actions:
– Set the checkpoints to NOW
– Wait for the next interval
60. Normal daemon process interval
• Starting state:
– There are existing documents (signals) in the collection
– There are existing checkpoints in the checkpoint collection
– There are new signals added since the last checkpoint
• Daemon actions:
– Query for new documents (signals) added since the last checkpoint
– Send those documents to the target cluster
– Update the checkpoints to the last signal sent (tracked by shard)
– Wait for the next interval
61. Daemon restart activity
• Starting state:
– There are existing documents (signals) in the collection
– There are existing checkpoints in the checkpoint collection
– There are new signals added (very likely) since the last checkpoint
• Daemon actions:
– Query for new documents (signals) added since the last checkpoint
– Send those documents to the target cluster
– Update the checkpoints to the last signal sent (tracked by shard)
– Wait for the next interval
This is what happens after a daemon failure and you restart the daemon
62. Daemon bootstrap copy
• Starting state:
– There are existing documents (signals) in the collection
• Intervention actions:
– Delete any stored checkpoints for this topic in the checkpoint collection (if they exist)
– Start the daemon with an extra initialCheckpoint=0 parameter in the topic (below)
• Daemon actions:
– Query for all documents (signals) in the collection
– Send those documents to the target cluster
– Update the checkpoints to the last signal sent (tracked by shard)
– Wait for the next interval
• Bulk copy is faster (and arguably safer) than this
Process to copy entire collection to the target cluster
topic(stream_checkpoints,
my_signals,
q="cluster:useast1",
fl="*",
id="signals_topic",
initialCheckpoint=0
)
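The intervention above can be sketched in shell. The endpoints and expression follow the slides; the two helper function names are introduced here, and the expression payload is collapsed to one line per the whitespace warning on slide 49.

```shell
#!/bin/sh
# Bootstrap-copy intervention: clear this topic's stored checkpoints, then
# start the daemon with initialCheckpoint=0 so the topic replays everything.
SOLR="http://localhost:8983/solr"

# Step 1: delete the stored checkpoint document (its id matches the topic id)
clear_checkpoints() {
  curl "${SOLR}/stream_checkpoints/update?commit=true" \
    -H 'Content-Type: application/json' \
    -d '{"delete":{"id":"signals_topic"}}'
}

# Step 2: start the daemon with initialCheckpoint=0 added to the topic
start_bootstrap_daemon() {
  curl "${SOLR}/stream_daemon_host/stream" -d 'expr=daemon(id="signals_daemon",runInterval="10000",terminate="false",update(my_signals,batchSize=500,zkHost="10.123.1.7:9983,10.123.1.8:9983,10.123.1.9:9983/lwfusion/4.1.2/solr",topic(stream_checkpoints,my_signals,q="cluster:useast1",fl="*",id="signals_topic",initialCheckpoint=0)))'
}

# clear_checkpoints && start_bootstrap_daemon   # run against a live cluster
```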
63. Streaming expressions Rock!
Any downsides?
• Streaming expressions do not yet support:
– Solr authentication plugin
– Solr authorization plugin
65. Daemons occasionally fail
• Streaming expression daemon failures are not very common.
– Five (5) failures in five (5) months
• All our failures have been due to Zookeeper connection timeouts.
– We doubled our ZK timeout from 30 to 60 seconds
– This helped reduce failures, though 60 seconds seems excessive
• The process for recovering from a failure is really easy:
– Restart the daemon
– The persisted checkpoints allow updates to continue from where the topic left off
66. And they can self-heal…
• Run a self-heal shell script as a cron job on the same instance as the daemon
– Do this in each cluster
• The script:
– Runs a test to see if the daemon is TERMINATED
– If it is not, log an OK message
– If it is TERMINATED, log the failure, and restart the daemon
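A minimal sketch of such a self-heal script, suitable for a cron entry. The log path and the daemon start script location are assumptions; the status check matches the stream `action=list` responses shown earlier.

```shell
#!/bin/sh
# Cron self-heal: check the daemon status and restart it if TERMINATED.
STATUS_URL="http://localhost:8983/solr/stream_daemon_host/stream?action=list"
LOG="/var/log/signals_daemon_selfheal.log"          # assumed log location
START_SCRIPT="/opt/scripts/start_signals_daemon.sh" # assumed start script

# Succeeds (exit 0) if the status JSON reports a TERMINATED daemon
is_terminated() {
  echo "$1" | grep -q '"state":"TERMINATED"'
}

check_and_heal() {
  status=$(curl -s -m 10 "$STATUS_URL")
  if is_terminated "$status"; then
    echo "$(date): daemon TERMINATED, restarting" >> "$LOG"
    sh "$START_SCRIPT"   # checkpoints let it resume where it left off
  else
    echo "$(date): daemon OK" >> "$LOG"
  fi
}

# check_and_heal   # invoke from cron, e.g.: */5 * * * * /opt/scripts/selfheal.sh
```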
69. Adding another cluster
Three region datacenters (us-east-1, eu-west-1, ap-southeast-1) synced in a
ring; each daemon forwards its own cluster's signals plus those received from
its upstream neighbor:
• us-east-1 → eu-west-1: q="cluster:useast1 OR cluster:apsoutheast1"
• eu-west-1 → ap-southeast-1: q="cluster:euwest1 OR cluster:useast1"
• ap-southeast-1 → us-east-1: q="cluster:apsoutheast1 OR cluster:euwest1"
(With only two clusters, the queries were simply q="cluster:useast1" and
q="cluster:euwest1".)
70. Summary
• Sync search index with crawl schedules in each cluster
– Until CDCR 2 comes out
• Sync signals with streaming expression daemons
– Unless you have to use the Solr authentication or authorization plugins
Could use a new cluster on the Pacific Rim, perhaps Singapore.
On August 31, 2019, the AWS North Virginia datacenter (us-east-1 region), which isn't very far from where we are sitting, had a power outage. Just like clockwork, several backup generators engaged to keep the datacenter alive, but one of them failed about an hour after the incident began, taking down a portion of the datacenter. As a result, several well known services, such as Twitter, Reddit, and Sling, experienced partial service outages. The Dynatrace search service also runs in the North Virginia datacenter and it, too, was impacted. But since we had a failover traffic policy in place, all incoming search requests were automatically routed to the Europe datacenter and our customers performing searches were blissfully unaware that anything was amiss.
Without the second cluster, customers performing searches would encounter never-ending spinning wheels and there would be much weeping and gnashing of teeth.
The temptation is to use the same collection that you're querying, but multi-shard and multi-replica collections can complicate this a bit. If you only specify a collection name, Solr will randomly select a specific shard and replica for the daemon thread to attach to but it won't tell you which one. When you subsequently make a request for daemon status via a stream action list, Solr will, again, randomly select a shard/replica and send the request to that shard/replica. Consequently, you can successfully start a daemon and subsequently can't find it. To avoid this, you can specify a specific shard/replica in your request, and hope that you don't delete/re-create that replica later. My strategy is to create a single shard, single replica collection exclusively to host the daemon. That way, all requests to the daemon host collection are consistent. For consistency, I usually create this collection on the same instance in the cluster where I'll be running the daemon.