JavaEdge09 : Java Indexing and Searching

Java Indexing and Searching
By: Shay Sofer & Evgeny Borisov

Agenda
- Motivation
- Lucene Intro
- Hibernate Search
- Indexing
- Searching
- Scoring
- Alternatives

Motivation: What is Full Text Search and why do I need it?

Motivation: Use case
- “Book” table
- Search: “Good practices for Gava”

Motivation: Full Text Search
- We’d like to:
  - Index the information efficiently
  - Answer queries using that index
- More common than you think

Motivation: Full Text Search Solutions
- Integrated full text search engine in the database, e.g. DBSight, recent versions of MySQL, MS SQL Server, Oracle Text, etc.
- Out-of-the-box search appliances, e.g. Google Search Appliance
- Third-party libraries

Lucene Intro: Apache Lucene
- The most popular full text search library
- Scalable and high performance
- Around for about 9 years
- Open source
- Supported by the Apache Software Foundation

Lucene Intro: Lucene’s Features
- “Word-oriented” search
- Powerful query syntax: wildcards, typos, proximity search
- Sorting by relevance (Lucene’s scoring algorithm) or by any other field
- Fast searching, fast indexing: inverted index

Lucene Intro: Inverted Index
DB (document id: title):
- 0: Head First Java
- 1: Best of the best of the best
- 2: Chuck Norris in action
- 3: JBoss in action
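
To make the idea concrete, here is a minimal sketch of an inverted index over the four titles above. It is plain Java and only illustrates the term-to-document-ids mapping; the class name `InvertedIndexSketch` and its methods are made up for this example and say nothing about how Lucene stores its index internally.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative only: a toy inverted index, not Lucene's actual data structure.
class InvertedIndexSketch {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    // Splits the text on whitespace, lowercases each token and records the document id.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Returns the ids of all documents containing the term (empty if none).
    Set<Integer> lookup(String term) {
        Set<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? Collections.<Integer>emptySet() : ids;
    }
}
```

After adding the four titles with ids 0 to 3, `lookup("action")` returns the ids 2 and 3, and `lookup("best")` returns only id 1: answering a query is a map lookup rather than a scan over every title.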

Lucene Intro: Basic Definitions
- A Field is a key + value. The value is always represented as a String (textual)
- A Document can contain as many Fields as we’d like
- Lucene’s index is a collection of Documents

Lucene Intro: Using Lucene API…
    // open the index and parse the user's text against the "title" field
    IndexSearcher is = new IndexSearcher("BookIndex");
    QueryParser parser = new QueryParser("title", analyzer);
    Query query = parser.parse("Good practices for Gava");
    return is.search(query);

Lucene Intro: OO domain model vs. Lucene’s index structure

Lucene Intro: Object vs. flat text mismatches
- The Structural Mismatch
  - Converting objects to strings and vice versa
  - No representation of relations between Documents
- The Synchronization Mismatch
  - The DB must be synced with the index
- The Retrieval Mismatch
  - Retrieving documents (= pairs of key + value) and not objects

Hibernate Search Emmanuel Bernard

Hibernate Search
- Leverages ORM and Lucene together to solve those mismatches
- Complements Hibernate Core by providing FTS on persistent domain models
- It’s actually a bridge that hides the sometimes complex Lucene API usage
- Open source

Hibernate Search: Mapping
- Document = Class (mapped POJO)
- Hibernate Search metadata can be described by annotations only
- Regardless, you can still use Hibernate Core with XML descriptors (hbm files)
- Let’s create our first mapping: Book

Hibernate Search
    @Entity
    @Indexed
    public class Book implements Serializable {
        @Id
        private Long id;
        @Boost(2.0f)
        @Field
        private String title;
        @Field
        private String description;
        private String imageURL;
        @Field(index = Index.UN_TOKENIZED)
        private String isbn;
        …
    }

Hibernate Search: Bridges
- Types are converted via a “Field Bridge”: a bridge between the Java type and its representation in Lucene (i.e. a String)
- Hibernate Search comes with a set of bridges for most standard types (numbers, both primitives and wrappers; Date; Class; etc.)
- They are extendable, of course

Hibernate Search: Custom Bridges
- We can use a field bridge…
    @FieldBridge(impl = MyPaddedFieldBridge.class,
                 params = {@Parameter(name = "padding", value = "5")})
    public Double getPrice() {
        return price;
    }
- Or a class bridge, in case the data we want to index is more than just the field itself, e.g. a concatenation of 2 fields

Hibernate Search: Custom Bridges
- In order to create a custom bridge we need to implement the interface StringBridge
- ParameterizedBridge: to inject params
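
Why pad at all? Lucene compares field values as strings, so unpadded numbers sort in the wrong order. The sketch below shows only that padding step in plain Java. It is a stand-in, not an implementation of the Hibernate Search `StringBridge` interface: the class `PaddingSketch`, its `pad` method and the restriction to non-negative whole numbers are assumptions made to keep the example self-contained.

```java
// Illustrative stand-in: plain Java, not the Hibernate Search bridge API.
class PaddingSketch {
    // Left-pads a non-negative whole number with zeros to a fixed width,
    // so that lexicographic (string) order matches numeric order.
    static String pad(long value, int padding) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values are not handled in this sketch");
        }
        String raw = Long.toString(value);
        if (raw.length() > padding) {
            throw new IllegalArgumentException("value is wider than the padding");
        }
        StringBuilder padded = new StringBuilder();
        for (int i = raw.length(); i < padding; i++) {
            padded.append('0');
        }
        return padded.append(raw).toString();
    }
}
```

With a padding of 5, as in the annotation above, `pad(42, 5)` gives `"00042"` and `pad(100, 5)` gives `"00100"`. As raw strings, `"42"` sorts after `"100"`; the padded forms sort in numeric order.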

Hibernate Search
- Directory is where Lucene stores its index structure
  - Filesystem Directory Provider
  - In-memory Directory Provider
  - Clustering Directory Providers

Hibernate Search: Filesystem Directory Provider
- Default
- Most efficient
- Limited only by the disk’s free space
- Can be easily replicated
- Luke support

Hibernate Search: In-memory Directory Provider
- The index dies as soon as the SessionFactory is closed
- Very useful when unit testing (alongside in-memory DBs)
- Data can be made persistent at any moment, if needed
- Obviously, be aware of OutOfMemoryError

Hibernate Search: Directory Providers Config Example
    <property name="hibernate.search.default.directory_provider">org.hibernate.search.store.FSDirectoryProvider</property>
    <property name="hibernate.search.com.alphacsp.Book.directory_provider">org.hibernate.search.store.RAMDirectoryProvider</property>

Hibernate Search: Relationships
- Correlated queries: how do we navigate from one entity to another?
- Lucene doesn’t support relationships between documents
- Hibernate Search to the rescue: denormalization

Hibernate Search: Relationships
    @Entity
    @Indexed
    public class Book {
        @ManyToOne
        @IndexEmbedded
        private Author author;
    }
    @Entity
    @Indexed
    public class Author {
        private String firstName;
    }
- Object navigation is easy (author.firstName)
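
One way to picture denormalization: the associated Author is flattened into the Book’s own document, under a dotted field name. The sketch below is a conceptual illustration in plain Java, not how Hibernate Search builds documents internally. The `title` field, the `toDocument` helper and the use of a `Map` as a stand-in for a Lucene Document are all assumptions for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Conceptual illustration of denormalization, not Hibernate Search code.
class DenormalizationSketch {
    static class Author {
        final String firstName;
        Author(String firstName) { this.firstName = firstName; }
    }

    static class Book {
        final String title;
        final Author author;
        Book(String title, Author author) { this.title = title; this.author = author; }
    }

    // Flattens a Book and its embedded Author into a single key/value document.
    static Map<String, String> toDocument(Book book) {
        Map<String, String> document = new LinkedHashMap<>();
        document.put("title", book.title);
        if (book.author != null) {
            // the association becomes a plain field of the Book document
            document.put("author.firstName", book.author.firstName);
        }
        return document;
    }
}
```

Because the author’s first name is copied into every Book document, a query on `author.firstName` needs no join. The price is that the copies must be refreshed whenever the Author changes.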

Hibernate Search: Relationships – Denormalization Pitfall
- Entities can be referenced by other entities

Hibernate Search: Relationships – Solution
- The solution: the association pointing back to the parent is marked with @ContainedIn
    @Entity
    @Indexed
    public class Book {
        @ManyToOne
        @IndexEmbedded
        private Author author;
    }
    @Entity
    @Indexed
    public class Author {
        @OneToMany(mappedBy = "author")
        @ContainedIn
        private Set<Book> books;
    }

Hibernate Search: Analyzers
- Responsible for tokenizing and filtering words
- Tokenizing: not as trivial as it seems
- Filtering: clearing the noise (case, stop words, etc.) and applying “other” operations
- Creating a custom analyzer is easy
- The default analyzer is StandardAnalyzer

Hibernate Search: Standard Analyzer
- StandardTokenizer: splits words and removes punctuation
- StandardFilter: removes apostrophes and dots from acronyms
- LowerCaseFilter: lowercases words
- StopFilter: eliminates common words
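
The chain above can be imitated in a few lines of plain Java. This is a rough sketch of the tokenize, filter, lowercase, stop-word sequence, not Lucene’s StandardAnalyzer: the stop-word list is a small assumed sample, acronym handling is left out, and `AnalyzerSketch` is a made-up name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Rough imitation of an analyzer chain; Lucene's real classes behave differently in detail.
class AnalyzerSketch {
    // A tiny, assumed stop-word list; Lucene's own list differs.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "in", "of", "the"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Tokenize: split on anything that is not a letter, a digit or an apostrophe.
        for (String raw : text.split("[^\\p{L}\\p{Nd}']+")) {
            // Filter: drop a trailing possessive 's and any remaining apostrophes.
            String token = raw.replaceAll("'s$", "").replace("'", "");
            // Lowercase, then drop empty tokens and stop words.
            token = token.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

For the input `"The Best of Chuck Norris's Action!"` the sketch yields the tokens `best`, `chuck`, `norris`, `action`.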

Hibernate Search: Other cool filters…

Hibernate Search: Approximative Search
- N-Gram algorithm: indexing sequences of n consecutive characters
- Usually when a typo occurs, part of the word is still correct
- Encyclopedia in 3-grams = Enc | ncy | cyc | ycl | clo | lop | ope | ped | edi | dia
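
The 3-gram split above is easy to reproduce. The sketch below is plain Java and does not use Lucene’s n-gram filters. The names `NGramSketch`, `ngrams` and `sharedGrams` are invented for this example, and counting shared grams is just one simple way to show why a typo still matches.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative n-gram helper; not Lucene's n-gram filter.
class NGramSketch {
    // Returns every run of n consecutive characters, in order of appearance.
    static List<String> ngrams(String word, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    // Counts the distinct n-grams that two words have in common.
    static int sharedGrams(String a, String b, int n) {
        Set<String> common = new HashSet<>(ngrams(a, n));
        common.retainAll(new HashSet<>(ngrams(b, n)));
        return common.size();
    }
}
```

`ngrams("Encyclopedia", 3)` produces the ten grams listed on the slide. A misspelling such as "Encyklopedia" still shares seven of them with the correct word, which is why a search over indexed n-grams tolerates the typo.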

Hibernate Search: Phonetic Approximation
- Algorithms for indexing words by their pronunciation
- The most widely known algorithm is Soundex
- Other available algorithms: RefinedSoundex, Metaphone, DoubleMetaphone
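
As an illustration of pronunciation-based indexing, here is a compact sketch of the commonly described American Soundex rules: keep the first letter, map the following consonants to digits, collapse repeated codes, and pad to four characters. It is written from those rules in plain Java and is not the implementation a phonetic filter would actually plug in. The class name `SoundexSketch` is made up.

```java
import java.util.Locale;

// Sketch of the classic American Soundex rules; not a library implementation.
class SoundexSketch {
    // Digit for each letter A..Z; '0' marks letters that produce no digit.
    //                                   ABCDEFGHIJKLMNOPQRSTUVWXYZ
    private static final String CODES = "01230120022455012623010202";

    static String encode(String word) {
        StringBuilder out = new StringBuilder();
        char last = '0'; // code of the previous significant letter
        for (char ch : word.toUpperCase(Locale.ROOT).toCharArray()) {
            if (ch < 'A' || ch > 'Z') {
                continue; // ignore anything that is not a basic Latin letter
            }
            char code = CODES.charAt(ch - 'A');
            if (out.length() == 0) {
                out.append(ch); // the first letter is kept as-is
                last = code;
            } else if (ch == 'H' || ch == 'W') {
                continue; // H and W are skipped and do not separate equal codes
            } else {
                if (code != '0' && code != last) {
                    out.append(code);
                }
                last = code; // a vowel resets 'last', so a repeated code counts again
            }
            if (out.length() == 4) {
                break;
            }
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }
}
```

"Robert" and "Rupert" both encode to `R163`, so an index keyed on the code treats them as the same term.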

Hibernate Search: Synonyms & Stemming
- Synonyms
  - You can expand your synonym dictionary with your own rules (e.g. business-oriented words)
- Stemming
  - Stemming is the process of reducing words to their stem, base or root form
  - “Fishing”, “Fisher”, “Fish” and “Fished” → Fish
  - Snowball stemming language: supports over 15 languages

Hibernate Search: Additional Analyzers
- Lucene is bundled with the basic analyzers, tokenizers and filters
- More can be found in Lucene’s contrib part and in Apache Solr

Hibernate Search: Hebrew?
- No free Hebrew analyzer for Lucene
- Itamar Syn-Hershko
  - Involved in the creation of CLucene (the C++ port of Lucene)
  - Creating a Hebrew analyzer as a side project
  - Looking to join forces
  - itamar@divrei-tora.com

Hibernate Search: שר הטבעות, גירסה ראשונה:אחוות הטבעת (“The Lord of the Rings, first version: The Fellowship of the Ring”)

Indexing: Transparent indexing
- When has the data changed?
- Which data has changed?
- When to index the changing data?
- How to do it all efficiently?
- Hibernate Search will do it for you!

Indexing: On Rollback
[Diagram: Application, Session (Entity Manager), Queue, DB, Lucene Index. Labels: Start Transaction; Insert/update/delete]

Indexing: On Rollback
[Diagram: Application, Session (Entity Manager), Queue, DB, Lucene Index. Labels: Start Transaction; Insert/update/delete; Transaction failed; Rollback]

Indexing: On Commit
[Diagram: Application, Session (Entity Manager), Queue, DB, Lucene Index. Labels: Insert/update/delete; Transaction Committed √]
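
The three diagrams suggest that index work is collected in a queue during the transaction and reaches the Lucene index only on commit. The sketch below models that reading in plain Java. It is an interpretation of the slides, not Hibernate Search’s worker code, and `IndexQueueSketch` with its methods is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of transaction-scoped index work; not Hibernate Search internals.
class IndexQueueSketch {
    private final List<String> pending = new ArrayList<>(); // work collected during the transaction
    private final List<String> index = new ArrayList<>();   // stands in for the Lucene index

    // Called for every insert/update/delete made inside the transaction.
    void enqueue(String change) {
        pending.add(change);
    }

    // Transaction committed: apply the queued work to the index.
    void commit() {
        index.addAll(pending);
        pending.clear();
    }

    // Transaction failed: drop the queued work, the index is never touched.
    void rollback() {
        pending.clear();
    }

    // Returns a copy of what has reached the index so far.
    List<String> indexed() {
        return new ArrayList<>(index);
    }
}
```

A change that is enqueued and then rolled back never shows up in `indexed()`; only changes from committed transactions do, which keeps the index in step with the database.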

Indexing: hibernate.cfg.xml
    <property name="org.hibernate.worker.execution">async</property>
    <property name="org.hibernate.worker.thread_pool.size">2</property>
    <property name="org.hibernate.worker.buffer_queue.max">10</property>

Indexing: It’s too late! I already have a database without Lucene!

Indexing: Manual indexing
- FullTextSession extends Hibernate Core’s Session
    Session session = sessionFactory.openSession();
    FullTextSession fts = Search.getFullTextSession(session);
- index(Object entity)
- purge(Class entityType, Serializable id)
- purgeAll(Class entityType)

Indexing: Manual indexing
    Transaction tx = fullTextSession.beginTransaction();
    // read the data from the database
    Criteria criteria = fullTextSession.createCriteria(Book.class);
    List<Book> books = criteria.list();
    for (Book book : books) {
        // add each entity to the Lucene index
        fullTextSession.index(book);
    }
    tx.commit();

Indexing: Removing objects from the Lucene index
    Transaction tx = fullTextSession.beginTransaction();
    List<Integer> ids = getIds();
    for (Integer id : ids) {
        if (…) {
            // remove a single Book from the index
            fullTextSession.purge(Book.class, id);
        }
    }
    tx.commit();
    // remove every Book from the index
    fullTextSession.purgeAll(Book.class);

Indexing: Rrrr!!! I got an OutOfMemoryError!

Indexing
    // session is a FullTextSession here
    session.setFlushMode(FlushMode.MANUAL);
    session.setCacheMode(CacheMode.IGNORE);
    Transaction tx = session.beginTransaction();
    ScrollableResults results = session.createCriteria(Item.class)
            .scroll(ScrollMode.FORWARD_ONLY);
    int index = 0;
    while (results.next()) {
        index++;
        session.index(results.get(0));
        if (index % BATCH_SIZE == 0) {
            // apply the queued changes to the index and free memory
            session.flushToIndexes();
            session.clear();
        }
    }
    tx.commit();

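Searching: Lucene’s Query Syntax
- title:lord title:rings (either term)
- +title:lord +title:rings (both terms required)
- title:lord -author:Tolkien (exclude a term)
- title:r?ngs (single-character wildcard)
- title:r*gs (multi-character wildcard)
- title:"Lord of the Rings" (exact phrase)
- title:"Lord Rings"~5 (proximity: words within 5 positions of each other)
- title:rengs~0.8 (fuzzy match with a minimum similarity of 0.8)
- title:lord author:Tolkien^2 (boost a term)
- And more…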

Searching: Querying
- To build FTS queries we need to:
  - Create a Lucene query
  - Create a Hibernate Search query that wraps the Lucene query
- Why?
  - No need to build a framework around Lucene
  - Converting documents to objects happens transparently
  - Seamless integration with the Hibernate Core API

Searching: Hibernate Queries Examples
    String stringToSearch = "rings";
    Term term = new Term("title", stringToSearch);
    TermQuery query = new TermQuery(term);
    // wrap the Lucene query so that the results come back as Book entities
    FullTextQuery hibQuery = session.createFullTextQuery(query, Book.class);
    List<Book> results = hibQuery.list();

Searching: WildcardQuery Example
    String stringToSearch = "r??gs";
    Term term = new Term("title", stringToSearch);
    WildcardQuery query = new WildcardQuery(term);
    ...
    List<Book> results = hibQuery.list();
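
To show what the wildcard characters mean, the sketch below translates a wildcard term into a regular expression: `?` stands for exactly one character and `*` for any run of characters. This illustrates the matching semantics only; Lucene does not evaluate WildcardQuery this way, and `WildcardSketch` is a hypothetical helper.

```java
import java.util.regex.Pattern;

// Illustrates wildcard semantics with java.util.regex; not how Lucene matches terms.
class WildcardSketch {
    // '?' matches exactly one character, '*' matches any run of characters (including none).
    static Pattern toPattern(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char ch : wildcard.toCharArray()) {
            if (ch == '?') {
                regex.append('.');
            } else if (ch == '*') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(ch))); // literal character
            }
        }
        return Pattern.compile(regex.toString());
    }

    // True when the whole term matches the wildcard expression.
    static boolean matches(String wildcard, String term) {
        return toPattern(wildcard).matcher(term).matches();
    }
}
```

Both `r??gs` and `r*gs` match the indexed term `rings`, while `r?gs` does not, because a single `?` cannot cover two characters.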

Motivation: Use case
- “Book” table
- Search: “Good practices for Gava”

Searching: HS Query Flowchart
[Diagram: Client → Hibernate Search Query → query the Lucene Index → receive matching ids → load objects from the Persistence Context, with DB access if needed]

Searching: Querying tips
- You can use list(), uniqueResult(), iterate(), scroll(), just like in Hibernate Core!
- Multistage search engine
- Sorting
- Explanation object

Score: Mostly based on Salton’s Vector Space Model

Score: Term Rating
- Term weight: $w_i = \log(N / n_i)$
- $N$: number of documents in the index
- $n_i$: total number of documents containing term $i$

Score: Scoring example
Search for: “best java in action books”
Documents:
- Head First Java
- Best of the best of the best
- Best examples from Hibernate in action
- The best action of Chuck Norris
Values shown: 0.60206, 0.12494, 0.30103
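
The three values on this slide are consistent with a base-10 logarithm of (number of documents / number of documents containing the term) over the four titles. The sketch below checks that reading in plain Java. The base-10 choice, the whitespace tokenization and the weight of 0 for a term found in no document are assumptions of this example, and this is not Lucene’s full scoring formula.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Computes log10(N / n_i) over a small document list; a sketch, not Lucene's Similarity.
class TermWeightSketch {
    static double weight(String term, List<String> documents) {
        String needle = term.toLowerCase(Locale.ROOT);
        int containing = 0; // n_i: documents that contain the term at least once
        for (String document : documents) {
            List<String> words = Arrays.asList(document.toLowerCase(Locale.ROOT).split("\\s+"));
            if (words.contains(needle)) {
                containing++;
            }
        }
        // Assumption of this sketch: a term found in no document gets weight 0.
        if (containing == 0) {
            return 0.0;
        }
        return Math.log10((double) documents.size() / containing);
    }
}
```

Over the four titles, "java" appears in one document (log10(4/1) ≈ 0.60206), "best" in three (log10(4/3) ≈ 0.12494) and "action" in two (log10(4/2) ≈ 0.30103). The rarer the term, the more it contributes to the score.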

Score: Lucene’s scoring approach
- Conventional Boolean retrieval
- Calculating score for only matching documents
- Customizing similarity algorithm
- Query boosting
- Custom scoring algorithms

Alternatives
- Distributed
- Spring support
- Simple
- Lucene based
- Integrates with popular ORM frameworks
- Configurable via XML or annotations
- Local & External TX Manager

Alternatives: Apache Solr
- Enterprise Search Server
- Supports multiple protocols (xml, json, ruby, etc.)
- Runs as a standalone Full Text Search server within a servlet container, e.g. Tomcat
- Heavily based on Lucene
- JSA: Java Search API (based on JPA)
- ODM (Object/Document Mapping)
- Spring integration (Transactions)

Alternatives: Apache Solr
- Powerful Web Administration Interface
- Can be tailored without any Java coding!
- Extensive plugin architecture
- Server statistics exposed over JMX
- Scalability: easily replicated

Resources
- Lucene
- Lucene contrib part
- Hibernate Search
- Hibernate Search in Action / Emmanuel Bernard, John Griffin
- Compass
- Apache Solr
