Solr for Data Science

Solr is a great tool to have in the data scientist's toolbox. In this talk, I walk through several demos of using Solr for data science activities and explore various use cases for Solr in data science.

Published in: Technology

  1. Solr for Data Science. Scalable search and analytics in one. Grant Ingersoll, CTO: @gsingers
  2. http://github.com/lucidworks/solr-for-datascience
  3. Solr in a nutshell
     • Solr is both established and growing: 8M+ total downloads, 250,000+ monthly downloads.
     • Largest community of developers. 2,500+ open Solr jobs.
     • Solr is the most widely used search solution on the planet.
     • Lucidworks: unmatched Solr expertise, with 1/3 of the active committers and 70% of the open source code committed.
     • Lucene/Solr Revolution: the world's largest open source user conference dedicated to Lucene/Solr.
     • Solr has tens of thousands of applications in production. You use Solr every day.
  4. Solr’s Key Features
     • Full text search (Info Retr.)
     • Facets/Guided Nav galore!
     • Lots of data types
     • Spelling, auto-complete, highlighting
     • Cursors
     • More Like This
     • De-duplication
     • Apache Lucene
     • Grouping and Joins
     • Stats, expressions, transformations and more
     • Lang. Detection
     • Extensible
     • Massive Scale/Fault tolerance
  5. It is increasingly important to know what is important! Corollary: the faster you know what is important, the better.
  6. Data Exploration
  7. SiLK
     • Solr - Logstash - Kibana
     • http://lucidworks.com/product/integrations/silk/
     • Open source at:
       • https://github.com/LucidWorks/banana
       • https://github.com/LucidWorks/solrlogmanager
  8. Feature Selection and Data Reduction
     • Feature Selection
       • Analyzers for all types
       • Easily get weights for terms
       • Term Vectors
     • Data Reduction
       • Filters
       • Analyzers
       • Data quality tools
  9. Classification and Clustering
     • Quick and dirty:
       • kNN, others
     • Carrot^2 integration for search result clustering
     • Integration with Mahout
     • Lucene provides Bayesian classifiers built on index
     • Easily build training and test sets via filter queries
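
The last bullet on this slide, building training and test sets via filter queries, can be sketched roughly as follows. This is a minimal illustration, not the demo code from the talk: the Solr URL, the collection name `products`, and the fields `category` and `split` are hypothetical, and it assumes each document was tagged with a `split` value at index time. Only standard Solr request parameters (`q`, `fq`, `fl`, `rows`, `wt`) are used, and the code only builds request URLs without contacting a server.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed endpoint: a local Solr with a hypothetical "products" collection.
SOLR_SELECT = "http://localhost:8983/solr/products/select"


def build_split_query(label_filter, split, rows=100):
    """Build a select URL that restricts results with two filter queries.

    label_filter: a filter such as "category:electronics" (hypothetical field).
    split: "train" or "test", matched against a hypothetical "split" field.
    """
    params = [
        ("q", "*:*"),              # match everything, let the filters narrow it
        ("fq", label_filter),      # first filter query: the class of interest
        ("fq", f"split:{split}"),  # second filter query: which partition
        ("fl", "id,name,category"),
        ("rows", str(rows)),
        ("wt", "json"),
    ]
    return SOLR_SELECT + "?" + urlencode(params)


train_url = build_split_query("category:electronics", "train")
test_url = build_split_query("category:electronics", "test")

# Decode the query string again to show which filters each request carries.
train_filters = parse_qs(urlsplit(train_url).query)["fq"]
test_filters = parse_qs(urlsplit(test_url).query)["fq"]
print(train_filters)
print(test_filters)
```

Each `fq` narrows the result set without affecting scoring, which is why swapping a single filter is enough to switch between the two partitions.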
  10. Math
     • Built-in expressions, stats, function queries make custom ranking a snap!
     • Search is essentially vector * matrix
     • Lucene index is a ranking optimized matrix
     • More coming!
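
The "search is essentially vector * matrix" idea can be illustrated with a toy example. The sketch below is an assumption-laden simplification: it uses raw term counts as weights and a dense matrix, whereas Lucene uses an inverted index and configurable similarity models. The documents and the function names `build_matrix` and `score` are made up for illustration.

```python
from collections import Counter


def build_matrix(docs):
    """Return (vocab, rows): one row of term counts per document."""
    vocab = sorted({term for text in docs for term in text.lower().split()})
    rows = []
    for text in docs:
        counts = Counter(text.lower().split())
        rows.append([counts[term] for term in vocab])
    return vocab, rows


def score(query, vocab, rows):
    """Multiply the document-term matrix by the query vector."""
    q_counts = Counter(query.lower().split())
    q_vec = [q_counts[term] for term in vocab]
    # One dot product per document row gives one score per document.
    return [sum(w * q for w, q in zip(row, q_vec)) for row in rows]


docs = [
    "solr search analytics",
    "search search ranking",
    "data science toolbox",
]
vocab, rows = build_matrix(docs)
scores = score("search ranking", vocab, rows)

# Rank document indices by descending score.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```

Swapping the raw counts for other weights changes the ranking without changing the matrix-vector structure, which is the sense in which custom ranking becomes a matter of choosing weights.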
  11. Signals power relevance: clicks, tweets, ratings, locations and much more can all be leveraged to provide high quality recommendations to users and deeper insight for data scientists.
     • Index Time Enrichment: increase the findability of documents and records with automatic creation of tags, fields and metadata.
     • Query Modification: perform real-time decision making and routing in order to map a user's intention or enterprise policy.
     • Result Manipulation: curate the user experience in your application with artificial result ranking, document injections and obfuscation.
  12. Lucidworks Fusion
     • http://www.lucidworks.com/products/fusion
     • Ships w/ built-in Solr-based Recommender OOTB, but easy to extend
     • Demo: eCommerce data set
       • ~1.2M products
       • ~4M clicks
  13. Solr and Your Tools
     • Data ingest:
       • JSON, CSV, XML, Rich types (PDF, etc.), custom
     • Clients for Python, R, Java, .NET and more
       • http://cran.r-project.org/web/packages/solr/index.html, amongst others
     • Output formats: JSON, CSV, XML, custom
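
As a rough illustration of the output formats listed on this slide, the sketch below requests JSON or CSV through Solr's `wt` parameter and extracts documents from a JSON response. The endpoint, the collection name `products`, the field `name`, and the sample response body are all assumptions made for illustration. The body is hard-coded so the example runs without a Solr server, and it follows the usual `response` / `docs` layout of Solr's JSON output.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint: a local Solr with a hypothetical "products" collection.
SOLR_SELECT = "http://localhost:8983/solr/products/select"


def select_url(query, wt="json", rows=10):
    """Build a select URL; wt picks the response format (json, csv, xml)."""
    return SOLR_SELECT + "?" + urlencode({"q": query, "rows": rows, "wt": wt})


def docs_from_json(body):
    """Extract the list of matching documents from a Solr JSON response."""
    return json.loads(body)["response"]["docs"]


# Illustrative response body, not output captured from a real server.
SAMPLE_BODY = (
    '{"responseHeader": {"status": 0},'
    ' "response": {"numFound": 2, "start": 0, "docs": ['
    '{"id": "1", "name": "laptop"}, {"id": "2", "name": "phone"}]}}'
)

json_url = select_url("name:laptop")
csv_url = select_url("name:laptop", wt="csv")
names = [doc["name"] for doc in docs_from_json(SAMPLE_BODY)]
print(json_url)
print(csv_url)
print(names)
```

In practice the URL would be fetched with an HTTP client or one of the client libraries mentioned above, and the CSV variant can be loaded directly into a data frame.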
  14. • Vector Space or Probabilistic, it’s your choice!
     • Killer FST
     • Wicked fast
     • Pluggable compression, queries, indexing and more
     • Advanced Similarity Models
       • Lang. Modeling, Divergence from Random, more
     • Easy to plug in ranking for Data Science
  15. But what about?
  16. What’s coming?
     • More Facets/Stats
       • Combine pivots, ranges and stats
       • Percentiles via t-digest
       • hyper-log-log
     • Deeper Spark integration for Solr
     • Custom distributed computation and aggregations/maths
     • Advanced schema on read options
     • Time series? Trends? Anomaly Detection?
     • Learn to rank?
  17. Lucidworks Open Source
     • Logstash for Solr: https://github.com/LucidWorks/solrlogmanager
     • Banana (Kibana for Solr): https://github.com/LucidWorks/banana
     • Effortless AWS deployment and monitoring: http://www.github.com/lucidworks/solr-scale-tk
     • Data Quality Toolkit: https://github.com/LucidWorks/data-quality
     • Spark Integration: https://github.com/LucidWorks/spark-solr
  18. Resources
     • This code: http://github.com/lucidworks/solr-for-datascience
     • Company: http://www.lucidworks.com
     • Our blog: http://www.lucidworks.com/blog
     • Book: http://www.manning.com/ingersoll
     • Solr: http://lucene.apache.org/solr
     • Fusion: http://www.lucidworks.com/products/fusion
     • Twitter: @gsingers