From AlphaCSP's Java conference, JavaEdge09: the presentation that Evgeny Borisov and I gave on 'Java Indexing and Searching'.
In this session we discussed the need for Full Text Search (as opposed to regular textual/SQL search), Lucene and its OO mismatches, the solution that Hibernate Search provides to those mismatches, and then a bit about Lucene's scoring algorithm.
5. We’d like to: index the information efficiently; answer queries using that index. More common than you think. Full Text Search Motivation
6. Integrated full text search engine in the database, e.g. DBSight, recent versions of MySQL, MS SQL Server, Oracle Text, etc. Out-of-the-box search appliances, e.g. Google Search Appliance. Third-party libraries. Full Text Search Solutions Motivation
8. The most popular full text search library Scalable and high performance Around for about 9 years Open source Supported by the Apache Software Foundation Apache Lucene Lucene Intro
10. “Word-oriented” search Powerful query syntax Wildcards, typos, proximity search. Sorting by relevance (Lucene’s scoring algorithm) or any other field Fast searching, fast indexing Inverted index. Lucene’s Features Lucene Intro
11. Lucene Intro Inverted Index. DB documents and their ids: Head First Java (0), Best of the best of the best (1), Chuck Norris in action (2), JBoss in action (3)
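A minimal sketch of the idea behind the inverted index slide, using the four titles and ids listed there. It is a toy in-memory map from word to document ids, not Lucene's actual index format; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy inverted index: word -> ids of the documents that contain it.
// Illustration only; Lucene's real index structure is far more elaborate.
class InvertedIndexSketch {

    static Map<String, List<Integer>> build(List<String> docs) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String word : docs.get(id).toLowerCase().split("\\s+")) {
                List<Integer> postings = index.computeIfAbsent(word, k -> new ArrayList<>());
                // A word repeated inside one document is recorded only once.
                if (postings.isEmpty() || postings.get(postings.size() - 1) != id) {
                    postings.add(id);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "Head First Java",
                "Best of the best of the best",
                "Chuck Norris in action",
                "JBoss in action");
        Map<String, List<Integer>> index = build(docs);
        // Looking up a word is a single map access instead of a scan over all titles.
        System.out.println("action -> " + index.get("action"));
        System.out.println("java   -> " + index.get("java"));
    }
}
```

Searching then means looking up each query word and combining the id lists, which is why lookups stay fast as the number of documents grows.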
12. A Field is a key+value. Value is always represented as a String (textual). A Document can contain as many Fields as we’d like. Lucene’s index is a collection of Documents. Basic Definitions Lucene Intro
13. Lucene Intro Using Lucene API… IndexSearcher is = new IndexSearcher("BookIndex"); QueryParser parser = new QueryParser("title", analyzer); Query query = parser.parse("Good practices for Gava"); return is.search(query);
15. The Structural Mismatch: converting objects to strings and vice versa; no representation of relations between Documents. The Synchronization Mismatch: the DB must be kept in sync with the index. The Retrieval Mismatch: retrieving documents (= pairs of key + value) and not objects. Object vs Flat text mismatches Lucene Intro
17. Leverages ORM and Lucene together to solve those mismatches Complements Hibernate Core by providing FTS on persistent domain models. It’s actually a bridge that hides the sometimes complex Lucene API usage. Open source. Hibernate Search
18. Document = Class (Mapped POJO) Hibernate Search metadata can be described by Annotations only Regardless, you can still use Hibernate Core with XML descriptors (hbm files) Let’s create our first mapping – Book Mapping Hibernate Search
20. Types will be converted via “Field Bridge”. It is a bridge between the Java type and its representation in Lucene (aka String) Hibernate Search comes with a set for most standard types (Numbers – primitives and wrappers, Date, Class etc) They are extendable, of course Bridges Hibernate Search
21. We can use a field bridge… @FieldBridge(impl = MyPaddedFieldBridge.class, params = {@Parameter(name="padding", value="5")} ) public Double getPrice(){ return price; } Or a class bridge, in case the data we want to index is more than just the field itself, e.g. a concatenation of 2 fields. Custom Bridges Hibernate Search
22. In order to create a custom bridge we need to implement the interface StringBridge. ParameterizedBridge: to inject params. Custom Bridges Hibernate Search
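A rough sketch of what a bridge like the MyPaddedFieldBridge mentioned on the custom bridges slides could look like. The slides do not show its body, so everything inside the class is an assumption. The two interfaces are declared locally as stand-ins so the example compiles on its own; in a real project they would be the Hibernate Search interfaces of the same names, whose exact signatures depend on the version in use.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;

// Local stand-ins so the sketch is self-contained (assumption: they mirror
// the Hibernate Search interfaces; check the signatures of your version).
interface StringBridge {
    String objectToString(Object object);
}

interface ParameterizedBridge {
    void setParameterValues(Map<String, String> parameters);
}

// Hypothetical implementation: zero-pads numbers so that the String order
// stored in Lucene matches the numeric order.
class MyPaddedFieldBridge implements StringBridge, ParameterizedBridge {
    private int padding = 5; // number of integer digits; default is an assumption

    @Override
    public void setParameterValues(Map<String, String> parameters) {
        String value = parameters.get("padding");
        if (value != null) {
            padding = Integer.parseInt(value);
        }
    }

    @Override
    public String objectToString(Object object) {
        if (object == null) {
            return null;
        }
        double number = ((Number) object).doubleValue();
        // Total width = integer digits + '.' + two decimals, left-padded with zeros.
        // Negative numbers are not handled in this sketch.
        return String.format(Locale.ROOT, "%0" + (padding + 3) + ".2f", number);
    }

    public static void main(String[] args) {
        MyPaddedFieldBridge bridge = new MyPaddedFieldBridge();
        bridge.setParameterValues(Collections.singletonMap("padding", "5"));
        System.out.println(bridge.objectToString(9.5));
        System.out.println(bridge.objectToString(120.0));
    }
}
```

The padding matters because values are stored as Strings: unpadded, "9.5" sorts after "120.0", while the padded forms compare in numeric order.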
23. Directory is where Lucene stores its index structure. Filesystem Directory Provider In-memory Directory Provider Clustering Directory Providers Hibernate Search
24. Default Most efficient Limited only by the disk’s free space Can be easily replicated Luke support Filesystem Directory Provider Hibernate Search
25. Index dies as soon as the SessionFactory is closed. Very useful for unit testing (alongside in-memory DBs). Data can be made persistent at any moment, if needed. Obviously, be aware of OutOfMemoryError. In-memory Directory Provider Hibernate Search
27. Correlated queries - How do we navigate from one entity to another? Lucene doesn’t support relationships between documents Hibernate Search to the rescue - Denormalization Relationships Hibernate Search
30. Entities can be referenced by other entities. Relationships – Denormalization Pitfall Hibernate Search
33. The solution: the association pointing back to the parent will be marked with @ContainedIn. @Entity @Indexed public class Book { @ManyToOne @IndexedEmbedded private Author author; } @Entity @Indexed public class Author { @OneToMany(mappedBy="author") @ContainedIn private Set<Book> books; } Relationships – Solution Hibernate Search
34. Responsible for tokenizing and filtering words. Tokenizing: not as trivial as it seems. Filtering: clearing the noise (case, stop words etc.) and applying “other” operations. Creating a custom analyzer is easy. The default analyzer is Standard Analyzer. Analyzers Hibernate Search
35. StandardTokenizer: splits words and removes punctuation. StandardFilter: removes apostrophes and dots from acronyms. LowerCaseFilter: lower-cases words. StopFilter: eliminates common words. Standard Analyzer Hibernate Search
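The four stages of the Standard Analyzer slide can be imitated in a few lines of plain Java. This is a simplified sketch of the tokenize, filter, lower-case, stop-word chain and not the real Lucene classes; the stop list and the regular expressions are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified imitation of an analyzer chain (not Lucene's StandardAnalyzer).
class AnalyzerChainSketch {
    // Small illustrative stop list, not Lucene's default set.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "in", "for"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // 1. Tokenize: split on anything that is not a letter, digit or apostrophe.
        for (String raw : text.split("[^\\p{L}\\p{N}']+")) {
            // 2. Drop a trailing possessive 's and any remaining apostrophes.
            String token = raw.replaceAll("'[sS]$", "").replace("'", "");
            // 3. Lower-case the token.
            token = token.toLowerCase(Locale.ROOT);
            // 4. Discard empty tokens and stop words.
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Lord of the Rings: Frodo's journey!"));
    }
}
```

The same chain has to run at index time and at query time, otherwise the query words will not match the stored tokens.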
37. N-Gram algorithm – Indexing a sequence of n consecutive characters. Usually when a typo occurs, part of the word is still correct Encyclopedia in 3-grams = Enc | ncy | cyc | ycl | clo | lop | ope | ped | edi | dia Approximative Search Hibernate Search
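A small sketch of the n-gram idea from the slide: split a word into overlapping sequences of n characters and count how many of them a misspelled query still shares with the indexed word. The class name, the helper names and the misspelling "encyclopdia" are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Character n-grams, as used for approximate (typo-tolerant) matching.
class NGramSketch {

    // All overlapping substrings of length n, in order of appearance.
    static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    // Number of n-grams of the query that also occur in the indexed word.
    static int shared(String indexed, String query, int n) {
        List<String> indexedGrams = ngrams(indexed, n);
        int count = 0;
        for (String gram : ngrams(query, n)) {
            if (indexedGrams.contains(gram)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("Encyclopedia", 3));
        // A query with a dropped letter still shares most of its 3-grams.
        System.out.println(shared("encyclopedia", "encyclopdia", 3));
    }
}
```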
38. Algorithms for indexing of words by their pronunciation The most widely known algorithm is Soundex Other Algorithms that are available : RefinedSoundex, Metaphone, DoubleMetaphone Phonetic Approximation Hibernate Search
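To make the phonetic idea concrete, here is a compact sketch of the classic American Soundex rules written from scratch in plain Java. It is not the implementation that ships with any particular library, and the class name is an assumption.

```java
// Classic American Soundex: first letter plus three digits.
class SoundexSketch {
    // Digit for each letter a..z; '0' means the letter is not coded.
    private static final String CODES = "01230120022455012623010202";

    static String encode(String word) {
        StringBuilder out = new StringBuilder();
        char last = '0';
        for (char raw : word.toCharArray()) {
            char c = Character.toLowerCase(raw);
            if (c < 'a' || c > 'z') {
                continue; // ignore anything that is not a basic Latin letter
            }
            char code = CODES.charAt(c - 'a');
            if (out.length() == 0) {
                out.append(Character.toUpperCase(c)); // keep the first letter as is
                last = code;
            } else {
                if (code != '0' && code != last) {
                    out.append(code);
                }
                // 'h' and 'w' do not separate two letters with the same code.
                if (c != 'h' && c != 'w') {
                    last = code;
                }
            }
            if (out.length() == 4) {
                break;
            }
        }
        if (out.length() == 0) {
            return "";
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Names that sound alike end up with the same code.
        System.out.println(encode("Robert") + " " + encode("Rupert"));
    }
}
```

Indexing the code next to the original word lets a search for one spelling find documents that use another spelling with the same pronunciation.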
39. Synonyms: you can expand your synonym dictionary with your own rules (e.g. business-oriented words). Stemming: the process of reducing words to their stem, base or root form. “Fishing”, “Fisher”, “Fish” and “Fished” → “Fish”. Snowball stemming language: supports over 15 languages. Synonyms & Stemming Hibernate Search
40. Lucene is bundled with the basic analyzers, tokenizers and filters. More can be found in Lucene’s contrib section and in Apache Solr. Additional Analyzers Hibernate Search
41. No free Hebrew analyzer for Lucene. Itamar Syn-Hershko: involved in the creation of CLucene (the C++ port of Lucene); creating a Hebrew analyzer as a side project; looking to join forces. itamar@divrei-tora.com Hebrew? Hibernate Search
44. When has data changed? Which data has changed? When to index the changed data? How to do it all efficiently? Hibernate Search will do it for you! Transparent indexing Indexing
45. Indexing – On Rollback (diagram). Labels: Application; Session (Entity Manager); Queue; DB; Lucene Index; Start Transaction; Insert/update/delete.
46. Indexing – On Rollback (diagram). Labels: Transaction failed; Rollback; Application; Session (Entity Manager); Queue; DB; Lucene Index; Start Transaction; Insert/update/delete.
47. Indexing – On Commit (diagram). Labels: Transaction Committed; Application; Session (Entity Manager); Queue; DB; Lucene Index; Insert/update/delete √.
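The rollback and commit diagrams can be read as a small queueing rule: index work is collected while the transaction runs, applied to the index on commit, and thrown away on rollback. The toy model below only illustrates that rule; it does not use Hibernate Search, and every name in it is made up.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of transaction-scoped index work (not the Hibernate Search internals).
class IndexQueueSketch {
    private final List<String> queue = new ArrayList<>();       // pending index work
    private final List<String> luceneIndex = new ArrayList<>(); // stand-in for the index

    // Called for every insert/update/delete made through the session.
    void onEntityChange(String operation) {
        queue.add(operation);
    }

    // Transaction committed: the queued work reaches the index.
    void commit() {
        luceneIndex.addAll(queue);
        queue.clear();
    }

    // Transaction failed: the queued work is dropped, the index is untouched.
    void rollback() {
        queue.clear();
    }

    List<String> indexContents() {
        return new ArrayList<>(luceneIndex);
    }

    public static void main(String[] args) {
        IndexQueueSketch session = new IndexQueueSketch();
        session.onEntityChange("insert Book#1");
        session.rollback();
        session.onEntityChange("insert Book#2");
        session.commit();
        System.out.println(session.indexContents());
    }
}
```

Because nothing touches the index before the commit, a failed transaction cannot leave the index ahead of the database.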
51. tx = fullTextSession.beginTransaction(); /* read the data from the database */ Criteria criteria = fullTextSession.createCriteria(Book.class); List<Book> books = criteria.list(); for (Book book : books) { fullTextSession.index(book); } tx.commit(); Manual indexing Indexing
52. tx = fullTextSession.beginTransaction(); List<Integer> ids = getIds(); for (Integer id : ids) { if(…){ fullTextSession.purge(Book.class, id ); } } tx.commit(); fullTextSession.purgeAll(Book.class); Removing objects from the Lucene index Indexing
56. title:lord title:rings | +title:lord +title:rings | title:lord -author:Tolkien | title:r?ngs | title:r*gs | title:"Lord of the Rings" | title:"Lord Rings"~5 | title:rengs~0.8 | title:lord author:Tolkien^2 | And more… Lucene’s Query Syntax Searching
57. To build FTS queries we need to: Create a Lucene query Create a Hibernate Search query that wraps the Lucene query Why? No need to build framework around Lucene Converting document to object happens transparently. Seamless integration with Hibernate Core API Querying Searching
58. String stringToSearch = "rings"; Term term = new Term("title", stringToSearch); TermQuery query = new TermQuery(term); FullTextQuery hibQuery = session.createFullTextQuery(query, Book.class); List<Book> results = hibQuery.list(); Hibernate Queries Examples Searching
59. String stringToSearch = "r??gs"; Term term = new Term("title", stringToSearch); WildcardQuery query = new WildcardQuery(term); ... List<Book> results = hibQuery.list(); WildcardQuery Example Searching
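For readers unfamiliar with the wildcard characters used in the query examples, the sketch below shows what `?` (exactly one character) and `*` (zero or more characters) accept. It translates the pattern into a java.util.regex expression purely to illustrate the matching semantics; this is not how Lucene evaluates wildcard queries internally, and the class name is an assumption.

```java
import java.util.regex.Pattern;

// Illustrates wildcard semantics only: '?' = one character, '*' = any run of characters.
class WildcardSketch {

    static boolean matches(String wildcard, String term) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*");
            } else {
                // Quote every literal character so it cannot act as a regex operator.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.matches(regex.toString(), term);
    }

    public static void main(String[] args) {
        System.out.println(matches("r??gs", "rings")); // two characters between r and gs
        System.out.println(matches("r??gs", "rigs"));  // only one character: no match
        System.out.println(matches("r*gs", "rings"));
    }
}
```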
61. HS Query Flowchart: Client → Hibernate Search Query → query the Lucene index → receive matching ids → load objects from the Persistence Context (DB access if needed). Searching
62. You can use list(), uniqueResult(), iterate(), scroll() – just like in Hibernate Core ! Multistage search engine Sorting Explanation object Querying tips Searching
68. Scoring example. Search for: “best java in action books”. Titles: Head First Java; Best of the best of the best; Best examples from Hibernate in action; The best action of Chuck Norris. Score values shown: 0.60206, 0.12494, 0.30103
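The transcript does not say how the three score values were obtained. One observation, offered as an assumption rather than as the presenters' method: they coincide with log10(N / df) for the query words "java", "best" and "action", where N = 4 is the number of titles and df is the number of titles containing the word. The sketch below reproduces that arithmetic; Lucene's real scoring combines more factors than this single term.

```java
import java.util.Locale;

// Reproduces log10(N / df) for the four titles of the scoring example.
// Assumption: this is only one ingredient of a relevance score, shown for illustration.
class IdfSketch {
    static final String[] DOCS = {
        "Head First Java",
        "Best of the best of the best",
        "Best examples from Hibernate in action",
        "The best action of Chuck Norris"
    };

    // Number of titles that contain the term at least once (case-insensitive).
    static int docFreq(String term) {
        int count = 0;
        for (String doc : DOCS) {
            for (String word : doc.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (word.equals(term)) {
                    count++;
                    break;
                }
            }
        }
        return count;
    }

    // Rare words get a high value, common words a low one; unknown words get 0.
    static double idf(String term) {
        int df = docFreq(term);
        return df == 0 ? 0.0 : Math.log10((double) DOCS.length / df);
    }

    public static void main(String[] args) {
        for (String term : new String[] {"java", "action", "best"}) {
            System.out.printf(Locale.ROOT, "%s: df=%d idf=%.5f%n", term, docFreq(term), idf(term));
        }
    }
}
```

The rarest word, "java", gets the highest weight and "best", which appears in three of the four titles, gets the lowest.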
72. Alternatives Distributed Spring support Simple Lucene based Integrates with popular ORM frameworks Configurable via XML or annotations Local & External TX Manager
74. Enterprise Search Server Supports multiple protocols (xml, json, ruby, etc...) Runs as a standalone Full Text Search server within a servlet e.g. Tomcat Heavily based on Lucene JSA – Java Search API (based on JPA) ODM (Object/Document Mapping) Spring integration (Transactions) Apache Solr Alternatives
75. Powerful Web Administration Interface Can be tailored without any Java coding! Extensive plugin architecture Server statistics exposed over JMX Scalability – easily replicated Apache Solr Alternatives
76. Resources: Lucene; Lucene contrib; Hibernate Search; Hibernate Search in Action / Emmanuel Bernard, John Griffin; Compass; Apache Solr