IR: Open source state

  1. IR: open source state Dmitry Kan, AlphaSense, Insider Solutions University of Helsinki, Information Retrieval and Search Engines course, Feb 21, 2017
  2. About me ● PhD in CS (Saint Petersburg State University), 2011 ● Running a Search Engine team at AlphaSense since 2014 ● Founded Insider Solutions in 2009: text analytics solutions + consulting ● Co-committer on luke project: toolbox for Lucene index since 2013
  3. What is AlphaSense ● Google for financial analysts ● Semantic research engine ● Edit, tag, annotate, share your data in a team ● Oracle, JP Morgan, Credit Suisse ● Engineering is 98% in Helsinki + 1% NYC + 1% India ● #1 fastest growing IT startup in Finland by Deloitte (2015) www.alpha-sense.com
  4. ● Founded 2009 ● BigText Analytics APIs and on-premise solutions ○ Sentiment analysis: Russian, Chinese, English ○ Searchable trend extraction ● Consulting: startups and corporates https://semanticanalyzer.info Insider Solutions
  5. Outline ● Search engine architecture ● Open source search ecosystem ● Research directions for applied IR
  6. Search engine: building blocks ● Web crawler: Apache Nutch (based on Hadoop) ● Data ingestion pipeline: receiving, cleaning, data extraction ● SolrCloud OR Elasticsearch (both based on Lucene) ● Shards: storing index on disk and / or memory
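
The idea of splitting an index into shards can be illustrated with hash-based document routing. This is a minimal sketch, not the routing code of SolrCloud or Elasticsearch: the function name `shard_for` and the use of MD5 as the hash are assumptions made for illustration only.

```python
import hashlib

def shard_for(doc_id, num_shards):
    """Pick a shard for a document by hashing its id (hypothetical helper).

    MD5 is used here only because it is deterministic across runs;
    it is an assumption, not what any particular engine uses.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Distribute a few documents over 4 shards.
num_shards = 4
assignment = {doc_id: shard_for(doc_id, num_shards)
              for doc_id in ["doc-1", "doc-2", "doc-3", "doc-4", "doc-5"]}
print(assignment)
```

Because the shard is derived from the document id alone, the same document is always routed to the same shard, which is what lets updates and deletes find the copy they need to replace.
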
  7. Lucene / Solr history timeline
  8. Inject URLs Create segments New URLs
  9. Search Engine Software Components ● Schema ● Query parser ● Scoring algorithm ● Snippet highlighter ● Index (on-disk or in-memory)
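
To make the "index" and "query parser" components concrete, here is a toy inverted index with boolean AND search. It is a sketch under simplifying assumptions: `analyze`, `build_index` and `search` are hypothetical names, the analyzer only lowercases and splits on whitespace, and there is no scoring or highlighting.

```python
from collections import defaultdict

def analyze(text):
    """Lowercase and split on whitespace; a stand-in for a real analyzer chain."""
    return text.lower().split()

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = analyze(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {
    1: "Apache Lucene index",
    2: "Apache Solr search server",
    3: "Lucene scoring formula",
}
index = build_index(docs)
print(search(index, "apache lucene"))  # {1}
```
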
  10. Query analysis and suggestions
  11. British vs US English handling
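
One common way to handle spelling variants is to map them to a single canonical form in the analysis chain, at both index time and query time. The sketch below assumes that approach; the slide does not say which technique the author uses, and the mapping `BRITISH_TO_US` is a tiny hypothetical sample.

```python
# Hypothetical British -> US mapping; a real list would be far larger.
BRITISH_TO_US = {
    "colour": "color",
    "analyse": "analyze",
    "centre": "center",
}

def normalize(tokens):
    """Lowercase tokens and replace British spellings with US ones."""
    normalized = []
    for token in tokens:
        lowered = token.lower()
        normalized.append(BRITISH_TO_US.get(lowered, lowered))
    return normalized

print(normalize("Colour analysis centre".split()))
# ['color', 'analysis', 'center']
```

Applying the same function to documents and queries means that a search for "color" and a search for "colour" hit the same index term.
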
  12. One shard of the index
  13. Content extraction Apache Tika for parsing formats: ● HTML, XML ● PDF ● Microsoft Office & iWork document formats ● Audio, image, video ● Mail ● Source code
  14. Inspecting Lucene index with Luke Implemented by Andrzej Bialecki. Since 2013 → by Dmitry Kan (Finland) and Tomoko Uchida (Japan) ● Perform index maintenance ● Prototype similarity functions ● Search for documents, reconstruct field values from the index ● Read index from HDFS (Hadoop’s distributed file system) ● Supports Apache Solr and Elasticsearch
  15. Learning to rank: Solr ● Contributed by Bloomberg ● Machine-learned model for reranking documents based on user feedback ● Trained on features: views, popularity, whether the hit was in the title, length, whether the document can be viewed on a mobile device ● Models: LambdaMART, RankSVM
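
Reranking with a learned model can be sketched as scoring each candidate's feature vector and sorting by that score. The example assumes a simple linear model; the feature names echo the slide, but the weights and feature values are invented for illustration and are not taken from the Bloomberg contribution.

```python
def score(features, weights):
    """Linear model: weighted sum of feature values."""
    return sum(weights[name] * value for name, value in features.items())

def rerank(candidates, weights):
    """Sort candidate documents by model score, best first."""
    return sorted(candidates,
                  key=lambda c: score(c["features"], weights),
                  reverse=True)

# Hypothetical weights, as if produced by offline training.
weights = {"views": 0.2, "title_hit": 1.0, "mobile": 0.1}

# Top documents returned by the first-pass query, with extracted features.
candidates = [
    {"id": "a", "features": {"views": 0.9, "title_hit": 0.0, "mobile": 1.0}},
    {"id": "b", "features": {"views": 0.1, "title_hit": 1.0, "mobile": 0.0}},
    {"id": "c", "features": {"views": 0.5, "title_hit": 1.0, "mobile": 1.0}},
]

print([c["id"] for c in rerank(candidates, weights)])  # ['c', 'b', 'a']
```

Only the top results of the original query are rescored, so the cost of the model stays small compared with scoring the whole index.
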
  16. Lucene scoring formula
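
For reference, the practical scoring function documented for TFIDFSimilarity in Lucene 4.0 (the page cited in the references) has the following form. The tf and idf definitions shown are the defaults of DefaultSimilarity; `boost(t)` stands for the query-time boost of term t.

```latex
\[
\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot
\sum_{t \in q} \Big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^{2} \cdot
\mathrm{boost}(t) \cdot \mathrm{norm}(t,d) \Big)
\]
\[
\mathrm{tf}(t \in d) = \sqrt{\mathrm{freq}(t,d)}, \qquad
\mathrm{idf}(t) = 1 + \ln\!\left(\frac{\mathrm{numDocs}}{\mathrm{docFreq}(t) + 1}\right)
\]
```
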
  17. Feature: is person and executive?
  18. Feature: recency of the document
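
A recency feature is often computed as a reciprocal of document age, in the style of Solr's `recip(x, m, a, b) = a / (m*x + b)` function. The sketch below assumes that form with the constants commonly used for it (m is roughly one over the number of milliseconds in a year); the slide itself does not show how the feature is computed, and `recency_boost` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

def recip(x, m, a, b):
    """Reciprocal function: a / (m * x + b)."""
    return a / (m * x + b)

def recency_boost(doc_date, now, m=3.16e-11, a=1.0, b=1.0):
    """Return 1.0 for a brand-new document, about 0.5 for a one-year-old one."""
    age_ms = (now - doc_date).total_seconds() * 1000.0
    return recip(age_ms, m, a, b)

now = datetime(2017, 2, 21, tzinfo=timezone.utc)
for days in (0, 30, 365, 3650):
    boost = recency_boost(now - timedelta(days=days), now)
    print(days, round(boost, 3))
```

The value decays smoothly with age and never reaches zero, so old documents are demoted rather than excluded.
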
  19. Features as signal of result importance
  20. Learnt model
  21. Word vectors with Lucene ● Word2vec was open-sourced by Google ● It is possible to train word2vec on a Lucene index: https://github.com/kojisekig/word2vec-lucene ● No need to provide a text file besides the Lucene index ● No need to normalize text: normalization is already done in the index, or the Analyzer does it for you when processing ● Use part of the index by specifying a filter query
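
Once word vectors are trained, related terms are typically found by cosine similarity. The sketch below shows only that lookup step on made-up three-dimensional vectors; it does not use word2vec-lucene, and the vectors and the `most_similar` helper are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors):
    """Return the other word whose vector is closest to `word`."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine(vectors[word], vectors[w]))

# Toy vectors; real word2vec vectors usually have 100+ dimensions.
vectors = {
    "solr": [0.9, 0.1, 0.0],
    "lucene": [0.8, 0.2, 0.1],
    "finance": [0.0, 0.1, 0.9],
}

print(most_similar("solr", vectors))  # lucene
```
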
  22. Questions? Reach me at: dk@semanticanalyzer.info Twitter: @dmitrykan Quora: https://www.quora.com/profile/Dmitry-Kan
  23. References 1. Luke: https://github.com/DmitryKey/luke 2. My blog: http://dmitrykan.blogspot.fi/ 3. Solr vs Elasticsearch (overview): https://sematext.com/blog/2015/01/30/solr-elasticsearch-comparison/ 4. Solr vs Elasticsearch (in-depth): https://sematext.com/blog/2012/08/23/solr-vs-elasticsearch-part-1-overview/ 5. Introduction to Apache Solr http://www.slideshare.net/ChristosManios/introduction-to-apache-solr-54076189 6. Word2vec-lucene: https://github.com/kojisekig/word2vec-lucene 7. Apache Tika: https://tika.apache.org/ 8. Apache Solr: http://lucene.apache.org/solr/ 9. Elasticsearch: https://github.com/elastic/elasticsearch 10. Learning to rank in Solr (video): https://www.youtube.com/watch?v=M7BKwJoh96s 11. Learning to rank in Solr (slides): https://lucidworks.com/2016/08/17/learning-to-rank-solr/ 12. Word2vec: https://en.wikipedia.org/wiki/Word2vec#Analysis 13. Lucene scoring formula: https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html