'Hibernate Search 5: Adding Full-Text Query Super-Powers to Your JPA!'
by Sanne Grinovero, Principal Software Engineer at Red Hat
London JBoss User Group, 14th of January 2015
London JBUG is organised & sponsored by C2B2 - The Leading Independent Middleware Experts: http://www.c2b2.co.uk/home
See the video at: http://youtu.be/HNgwoKmxg8s
About the talk:
With the search box being the primary user interaction element of any modern application, providing a responsive and smart search engine is becoming a primary concern for many developers. Users and product managers alike are accustomed to the web experience provided by websites such as Google and Amazon, and will expect or demand the same experience when using your new service.
With the right tool it doesn't have to be complex - you just have to be familiar with the subject and anticipate this new form of requirements!
Hibernate Search is a popular extension of the Hibernate ORM library which integrates Apache Lucene indexing and search capabilities into the traditional Hibernate domain model.
In this talk Sanne will explain what Lucene can do for you, and show how to introduce the technology into an existing JPA/Hibernate based application.
We'll also look at the novelties of the freshly released Hibernate Search version 5.0, and how it integrates with other JBoss community projects such as WildFly and Infinispan.
1. Hibernate Search 5: Adding Full-Text Query Super-Powers to Your JPA
Sanne Grinovero
Principal Software Engineer at Red Hat
@SanneGrinovero
Organised and
Sponsored by
2. Goals
● What's new in Hibernate
● What is “full-text” and how it can help you
● Get you up to speed with Hibernate Search
3. Who am I?
Engineering team: Hibernate
– Hibernate Search project lead
– Hibernate OGM team
– Hibernate ORM
● Query parser and performance
– Infinispan
● Contributor for fun and need
● Designed some recent improvements
● Driving some of the Hibernate and Apache Lucene integrations
– @SanneGrinovero on Twitter
5. Hibernate ORM
● Hibernate 5 is coming
● Small performance improvements every week
● Jandex work
● Metamodel work
● New query parser
6. Hibernate Validator
● Bean Validation spec work
● Improved Java8 support
– See version 5.2 (work in progress)
7. Hibernate OGM
● Just released the first stable version
● Lots of new features coming
8. Hibernate Search
● Just released version 5.0.0.Final in December
– (and 5.0.1.Final this week)
● Let's see...
10. The Search problem
● Whoever searches doesn't know exactly what they are searching for:
– Please, don't ask the user to provide the primary key
– They don't know the exact content of the document either
14. Does it work? How about these:
String author = "Fabrizio De André";
String title = "Nuvole barocche";
List<Product> list = s.createQuery( " ...? " ).list();
String author = "De André, Fabrizio";
String title = "Nuvole barocche";
List<Product> list = s.createQuery( " ...? " ).list();
String author = "De Andre, Fabrizio";
String title = "Nuvole barocche";
List<Product> list = s.createQuery( " ...? " ).list();
15. More requirements
● Unique search input
– Might contain either/both author, title names
– Both entities might be composed of multiple terms
● Relevance
– Products matching both should be listed on top
– Exact word matches should be scored better
● So you need approximate word matches?
● How about typos?
16. List<Product> list = s.createQuery(
" ...? "
).setParameter("F. de André nuvole barocche")
.list();
● Mixed case, accents
● Relative order of terms, distance
● Abbreviations, typos
● Match on multiple fields
● 18,800 results in 0,41 seconds
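The mixed case, accent and term order problems listed above can be illustrated with a small plain-Java sketch. It is only an approximation of what a full-text analyzer does: the `NameNormalizer` class and its `tokens` method are hypothetical names invented for this example and are not part of Hibernate Search or Lucene.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not a Hibernate Search or Lucene API.
class NameNormalizer {

    // Strips accents, lower-cases the input and splits it on anything
    // that is not a letter or a digit. Returns the distinct tokens, sorted.
    static Set<String> tokens(String input) {
        // NFD decomposition separates "é" into "e" plus a combining accent mark.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Remove the combining marks, leaving the base letters.
        String withoutAccents = decomposed.replaceAll("\\p{M}", "");
        Set<String> result = new TreeSet<>();
        for (String token : withoutAccents.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // All three spellings from the earlier slide reduce to the same token set.
        System.out.println(tokens("Fabrizio De André"));
        System.out.println(tokens("De André, Fabrizio"));
        System.out.println(tokens("De Andre, Fabrizio"));
    }
}
```

Comparing token sets instead of raw strings makes the three author spellings equivalent, which an equality test in SQL or HQL would not do.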
17. More useful stuff:
● Similarity:
– hibernate ~ hybernat
● Proximity, synonyms, abbreviations:
– 'JPA' or 'Java Persistence API'
● Boosting, field adjustments:
– Is a match in the title "worth" more than one in the text content?
● Stemming (is language specific!)
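One common way to quantify the similarity between "hibernate" and "hybernat" is edit distance: the number of single-character insertions, deletions and substitutions needed to turn one word into the other. The sketch below is a textbook Levenshtein implementation in plain Java, given here only to make the idea concrete; the `EditDistance` class is a hypothetical name, and this is not the code a search engine would use internally.

```java
// Hypothetical helper for illustration only.
class EditDistance {

    // Classic dynamic-programming Levenshtein distance using two rows.
    static int levenshtein(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(
                        Math.min(current[j - 1] + 1, previous[j] + 1),
                        previous[j - 1] + substitution);
            }
            // Reuse the arrays: the row just computed becomes the previous row.
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("hibernate", "hybernat"));
    }
}
```

A fuzzy match would then accept terms whose distance from the query term stays below a chosen threshold.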
18. Would you still use Google if it returned matches in alphabetical order?
“hibernate search”
About 3.580.000 results (0.04 seconds)
19. So what..
● The database is not a good fit
– SQL is not appropriate for the task
– Still SQL is very handy for other tasks
● Need to use the best tool for each task:
– Relational databases
– filesystems
– Fulltext search engines
– key-value stores
● Need to integrate them, keep data integrity and consistency
20. Apache Lucene
● Open source Apache™ top-level project
● Very advanced implementation
● Main language is Java, ports exist in other languages
● Rich open source ecosystem around it
● Many products embed it
26. Lucene: Stopwords
● Removes frequently used terms that are not suited as search keywords (which terms qualify might depend on your domain!)
a,able,about,across,after,all,almost,also,am,among,an,and,any,are,as,at,
be,because,been,but,by,can,cannot,could,dear,did,do,does,either,else,
ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,how,however,
i,if,in,into,is,it,its,just,least,let,like,likely,may,me,might,most,must,
my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,
say,says,she,should,since,so,some,than,that,the,their,them,then,there,
these,they,this,tis,to,too,twas,us,wants,was,we,were,what,when,where,
which,while,who,whom,why,will,with,would,yet,you,your
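Stopword removal itself is a simple filtering step, sketched here in plain Java. The `StopwordFilter` class is a hypothetical name, and its stopword set is a small subset of the list above, chosen only to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not the Lucene implementation.
class StopwordFilter {

    // A deliberately tiny subset of the English stopword list.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "in", "of", "the", "to"));

    // Lower-cases the text, splits it into words and drops the stopwords.
    // Note: \W+ only treats ASCII letters, digits and '_' as word characters.
    static List<String> filter(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty() && !STOPWORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(filter("The power of the search in a JPA application"));
    }
}
```

Whether a term belongs in the set is a domain decision: a word that carries no meaning in general English text may be a meaningful keyword in a specialised catalogue.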
27. Apache Lucene: Index
● It requires an Index
– On filesystem
– In memory
– ...
● Made of immutable segments
– Optimized for search speed, not for updates
● A world of strings and frequencies
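The phrase "a world of strings and frequencies" can be made concrete with a toy inverted index: a map from each term to the documents containing it, together with how often the term occurs in each. This is a minimal in-memory sketch in plain Java with a hypothetical `TinyInvertedIndex` class; a real Lucene index is far more elaborate (immutable segments, compression, positions) and is not structured like this code.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy structure for illustration only.
class TinyInvertedIndex {

    // term -> (document id -> number of occurrences of the term in that document)
    private final Map<String, Map<Integer, Integer>> postings = new TreeMap<>();

    // Splits the text into lower-cased words and counts each one for the given document.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(term, t -> new TreeMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // Returns the documents containing the term with their frequencies,
    // or an empty map when the term was never indexed.
    Map<Integer, Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptyMap());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "hibernate search and hibernate orm");
        index.add(2, "lucene search");
        System.out.println(index.lookup("hibernate"));
        System.out.println(index.lookup("search"));
    }
}
```

A lookup goes straight from the term to the matching documents, which is why such a structure favours search speed over cheap updates.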
28. So, back to our database..
● The index structure is very different from a relational
database: not everything is possible
● You need to keep the data in sync
– In case they are not, which one should be trusted more?
● What do queries look like?
● What do queries return?
29. Different worlds
● A Lucene Document is unstructured (schemaless),
something close to
Map<String,String>
● A Hibernate model is structured to work as a
representation of your business model
● Entities returned by an EntityManager or Hibernate
Session are managed, to keep the database in sync
● A bridge is needed
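To illustrate the gap between the two worlds, the sketch below flattens a typed entity into the kind of `Map<String,String>` that a schemaless document resembles. The `Product` class, its fields and the `ProductBridge` helper are all hypothetical and written in plain Java; Hibernate Search derives this kind of mapping from annotations on the entity rather than from hand-written code like this.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical entity for illustration only: typed, structured fields.
class Product {
    final long id;
    final String title;
    final String author;

    Product(long id, String title, String author) {
        this.id = id;
        this.title = title;
        this.author = author;
    }
}

// Hypothetical hand-written bridge from the structured model to a flat document.
class ProductBridge {

    // Every value becomes a string, and the id is kept so that a search hit
    // can later be resolved back to the managed entity.
    static Map<String, String> toDocument(Product product) {
        Map<String, String> document = new LinkedHashMap<>();
        document.put("id", Long.toString(product.id));
        document.put("title", product.title);
        document.put("author", product.author);
        return document;
    }

    public static void main(String[] args) {
        System.out.println(toDocument(new Product(42L, "Nuvole barocche", "Fabrizio De André")));
    }
}
```

Storing the identifier in the document is the important design choice: the index can answer the query, while the database remains the source of the managed entity.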
52. Infinispan
● Open source highly scalable data grid platform
– Distribution or Replication
– Sync or Async
– Transactional
– Persists contents using a CacheLoader
● Write-through or write-behind
● Shared or per cluster node
– Hibernate second-level cache
– State of the art eviction strategies
– Technology used by WildFly and JBoss EAP for many clustering features