This presentation was provided by Bob Kasenchak of Access Innovations, during the NISO Webinar "Discovery and Online Search, Part Two: Personalized Content, Personal Data," which was held on June 19, 2019.
1. What Is Semantic Search?
And Why Is It Important?
Bob Kasenchak
Access Innovations
@taxobob
bob_kasenchak@accessinn.com
NISO Webinar
“Discovery and Online Search:
Personalized Content, Personal Data”
2. Outline
Semantic Search
● What Is It? (Basics)
● Why Do We Need It? (Why Does Search Fail?)
● So…What Is It? (Specifics)
● Examples and Implementations
3. What Is Semantic Search?
Semantic Search goes beyond
keyword searches
to examine
Context
Google says “Things Not Strings”
4. Why “Basic” Search Fails
Why does search fail?
● Simple search simply matches text strings
● Language is ambiguous
● There is a *lot* of content
9. Why “Basic” Search Fails
Search fails because simple string matching is not adequate
for large, specialized repositories of content with technical
language that evolves over time.
(Also, language is ambiguous.)
10. What Is Semantic Search?
Semantic Search goes beyond
keyword string-matching
to examine
Context
using a variety of means
Google says “Things Not Strings”
11. What Is Semantic Search?
Semantic Search
Examines the semantic context of the search query to drive
relevant results.
This can include: taxonomies, lexical variants, location, your
previous searches, previous similar searches, ontologies,
knowledge graphs, and other strategies.
12. Allow Lexical Variants
“Fuzzy Matching” and Similar Techniques
● Use Levenshtein distance (or similar) to match
misspellings and variants
● Stem words for search
● Instead of exact string matches
● This can cause noise; be careful!
17. Google Knowledge Graph
Google (again): Things not Strings
The Google Knowledge Graph connects search with e.g.
known facts about entities
(Driven by a big old ontology)
21. Taxonomies and Tagging
Controlled Vocabularies
● Search tags before free text (search engine tuning)
● Allow users to browse (in addition to querying)
● Suggest topics using type-ahead or “did you mean”
● Leverage synonymy to deliver same relevant results from
various inputs
22. Taxonomies and Tagging
The Irony of Document Categorization
● We’re interested in concepts
● Words are ambiguous
● But words in the text are all we have to go on
○ Unless we apply good subject metadata
23. PLOS
Taxonomies and Tagging
● 9000+ term thesaurus
● And ~4000 synonyms (!)
● Applied to documents
● Exposed in browse (!)
● Used to redirect search queries for synonyms
● Exposed at article level to user
○ Crowdsourced QC!
26. JSTOR
Other Novel Approaches
● Document becomes the search query (!)
● Combination of taxonomy and naive classification
● Suggests related content for research, bibliography
● Experimental, successful, also very cool
labs.jstor.org
28. Implementation: Kind of Easy
● Simple things:
○ Use existing search software tuning/options
○ Enable fuzzy matching
○ Configure how Booleans are automatically applied
○ Weight fields, doc types, etc. where appropriate
○ Use dates to deliver recent results
29. Implementation: More Complex
● Next level:
○ Taxonomy and tagging
○ Knowledge graphs
○ User profiles, user behavior, other targeted means
Good afternoon. My name is Bob Kasenchak – I’m a taxonomist and director of business development at Access Innovations. Today I’d like to talk about Semantic Search, why it’s important, ways to implement it, and other related topics.
My talk will outline the topic and introduce some concepts; the subsequent talks will, I’m sure, go into more detail.
Here’s a brief outline of my talk. My goal is that by the end of the talk you have some idea of what Semantic Search is – or, at least, about the sundry approaches that people mean when they say “semantic search.” This will set up Duane and Travis to go into specifics and details. There is (in theory, at least) ample time in this block for questions and discussion at the end.
Ironically (since we’re talking about semantics), people use the term “semantic search” to refer to a variety of things – but they all have something in common: trying to extend, amplify, or improve search beyond matching keyword strings by using some method or methods to determine the context of the search. This takes a number of forms, about which more shortly. Google has popularized the tagline “things not strings” to explain its semantic search, which involves a knowledge graph (that is: an ontology) and some other machinery. First, though, I think it will be helpful to investigate the problem we’re trying to solve.
So: what problem are we trying to solve?
What’s the problem with “regular” or “basic” search? Specifically, I mean in the context of scholarly publishing. Most of the time, search is limited to whatever default search is available on the platforms used by scholarly publishers – there are almost always some options, and they are almost never used; further, some platforms have limitations. And most are based on “regular” string-based search.
When I say “regular” or “basic” search: I mean that you enter a keyword in a box, and the search application (using an inverted index) tries to match that string with documents. Sometimes fuzzy matching is used – to catch misspellings and whatnot – but in essence “regular” search looks, literally, for WORDS in DOCUMENTS. That is to say: it matches text strings.
The problem for publishers is that in large, specialized content sets “basic” search fails because (1) it’s simply looking for text strings, and does not have the detailed kinds of indexing that e.g. Google does on a constant basis; (2) because language is ambiguous; and (3) because specialized repositories tend to have very, very large content sets with extraordinarily detailed and specialized vocabularies that change over time.
In essence, simple search just looks for the words in the query. In this example from Google Scholar, I have searched for the word “horse”. Two things are noteworthy here: (1) I got over 3 million results, and (2) the logic of the algorithm must prioritize author names -- since, as you can see (I hope), the first paper listed is not about horses; rather, the author’s name is “Red-Horse.”
This is, in my analysis, sub-optimal. Why not have a place to specifically search for Author – or omit Author names from search?
Further -- without using any fancy synonymy or acronyms or other semantic trickery -- if I search instead for the string “horses” I get 1.7 million results. This seems to be triggered by the form of the word in the title. In other words: Google Scholar doesn’t even recognize simple English plurals as the same string -- and by extension, doesn’t return the same result set. It’s literally – and merely – looking for instances of the exact string typed in the box.
Again: the search simply tries to match the word (or words) in the box somewhere in a very large set of documents, with some priority seemingly given to the field (title, author, abstract) in which the word appears.
Google Search -- not Google Scholar -- seems to have figured this out long ago, but for some reason Google Scholar search has not. Some simple NLP would go a long way here, such as recognizing plurals -- not to mention other types of synonyms -- instead of just matching words.
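The kind of simple normalization I mean can be sketched in a few lines. This is an illustrative toy, not a production stemmer (real systems use full stemmers or lemmatizers, and every rule below has exceptions), but it shows how “horse” and “horses” can be folded into one index key:

```python
def normalize(term: str) -> str:
    """Crudely fold simple English plurals into a single form."""
    t = term.lower()
    if t.endswith("ies") and len(t) > 4:                        # "studies" -> "study"
        return t[:-3] + "y"
    if t.endswith(("xes", "zes", "ches", "shes", "sses")):      # "boxes" -> "box"
        return t[:-2]
    if t.endswith("s") and not t.endswith(("ss", "us", "is")):  # "horses" -> "horse"
        return t[:-1]
    return t

# Singular and plural now map to the same index entry:
print(normalize("horse"), normalize("horses"), normalize("studies"))  # -> horse horse study
```

If both queries and indexed text pass through a function like this, “horse” and “horses” return one result set instead of two.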
Another example, this time from a specialized repository (content from a scholarly publisher) -- the same concept can be expressed in multiple ways (besides morphological variants of words like plurals and adjectives); this problem is most acutely expressed with acronyms. While acronyms can have a great number of meanings, within many fields some acronyms are so ubiquitous that they should be accounted for in search.
This example shows search results for “unmanned aerial vehicles” in the library of a mid-sized academic publisher...for which, as you can see, 171 results were found…
...whereas a search for “uav” (the acronym for the same concept) returns over three times as many results.
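One common fix for the UAV problem is a synonym ring: a set of surface forms that the search engine treats as one concept, expanding the query before it hits the index. A minimal sketch (the terms listed are illustrative, not taken from any real thesaurus):

```python
# Each ring is one concept; every member is an equivalent surface form.
SYNONYM_RINGS = [
    {"uav", "uavs", "unmanned aerial vehicle", "unmanned aerial vehicles", "drone"},
]

def expand_query(query: str) -> set:
    """Expand a query term to all equivalent forms before searching."""
    q = query.lower().strip()
    for ring in SYNONYM_RINGS:
        if q in ring:
            return ring          # search on every variant in the ring
    return {q}                   # no ring: fall back to the literal string

print(sorted(expand_query("UAV")))
```

Searching on the union of these forms would merge the 171-result and 500-plus-result sets from the slides into a single answer.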
Again, it’s obvious from the results listings of the respective queries that the way the concept is expressed in the title has a major bearing on the content returned.
Let me say that one more time: The way the author chooses to represent the main topic of an article in the title has a major bearing on the search results.
This, again, to me, seems to be sub-optimal.
So. Why does search fail?
It fails because simple string matching is not adequate for large, specialized repositories of content with technical language that evolves over time.
Also: it goes without saying I think that language is ambiguous.
So: the basic idea is that Semantic Search goes beyond simple keyword text-string matching. This takes a variety of forms, the common denominator of which, I think, is that:
Semantic Search does not merely try to find keywords but examines the semantic context of the search query to drive relevance. This can include synonyms and lexical variants as well as things like location of the user, or involve graph databases and related concepts as well as a host of other methods. Some of these are quite simple, and others quite complex and involved. Accordingly, some are simple to implement – and some are, well, not.
So. Some examples will I think help.
One basic kind of semantic search is designed to help people find what they’re looking for without knowing the exact or specific language or terminology in the content. The idea is to allow the search engine to match near-matches instead of just exact strings. This can cause noise – so use caution. The upside is that someone might not know how to spell “Gastrointestinal stromal neoplasms” – but if they can get close, it might still match.
Levenshtein distance (speaking of things I have to look up to spell correctly) is a commonly used metric to determine the “distance” between words that essentially depends on how many changes you’d have to make to get from Word A to Word B. So, for example, “Bob” and “Rob” are very close in Levenshtein terms, while “Bob” and “Antidisestablishmentarianism” are far, far apart. In short, it’s a kind of “fuzzy” matching algorithm.
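For the curious, the metric itself is a short dynamic-programming computation. This is a minimal sketch; production search engines use optimized or index-backed variants rather than comparing every pair of strings:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (ca != cb), # substitution (0 if chars match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("Bob", "Rob"))  # -> 1
```

A fuzzy-matching search engine accepts candidates whose distance from the query falls under some threshold, which is exactly where the noise mentioned above can creep in.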
Another use of the knowledge graph stems from user behavior. We are no longer trained to use old-fashioned library searches with keywords and Boolean operators; instead, when we use Google, we often type in a full query – not just a keyword. Answering such queries also relies on the knowledge graph.
For example, Googling “Harrison Ford” and “When is Harrison Ford’s Birthday” brings up two different result sets. The first resembles a keyword search; the second is a natural language query, which requires parsing. Once parsed, the Knowledge Graph can deliver a targeted result with the exact information sought. Note that the info box on the right (the knowledge graph results) is the same in both cases – it’s referencing the graph to pull out information based on the parsed query.
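The parsing step can be sketched as: match the question against a pattern, extract the entity and the attribute asked about, and look both up in the graph. Everything here is a toy stand-in (the pattern, the graph contents, and the lookup are not Google’s actual machinery):

```python
import re

# Toy knowledge graph: one node with typed facts attached.
KNOWLEDGE_GRAPH = {
    "harrison ford": {"type": "Person", "birthday": "July 13, 1942"},
}

def answer(query: str):
    """Parse a 'when is X's birthday' question and answer from the graph."""
    m = re.match(r"when is (.+?)'s birthday\??$", query.lower())
    if m:
        node = KNOWLEDGE_GRAPH.get(m.group(1))
        if node:
            return node["birthday"]
    return None  # no parse or no node: fall back to ordinary keyword search

print(answer("When is Harrison Ford's birthday?"))
```

A bare keyword query like “harrison ford” skips the pattern entirely and goes to ordinary search, while the sidebar draws on the same node either way.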
Another set of methods falls into a category called “contextual search”. This moniker applies to a variety of techniques that use information gathered about the user, location, or recent searches to drive relevance.
For example, some applications – notably Google, but also other map-based applications – use your location to deliver relevant local results. This is usually done using either your IP address (which contains location information) or by allowing an application to access GPS data, usually from your phone or other devices. For example, if I Google “pizza” the first results aren’t definitions and Wikipedia pages about what pizza is; rather, I get suggestions for “pizza near me”.
In fact, all of the first page of Google results comprises pizza restaurants.
Note, however, that the Google Knowledge Graph in the right sidebar does provide the generic definition (Wikipedia, again) as well as – interestingly – information about pizza stocks to invest in.
Also from Google (can you see a trend here?) we can see another example of contextual search, based on recent activity or user profiles. This works particularly well if you’re logged into Google, of course.
If I search for “jaguar” – what do I expect: cars or cats? *I* get results for cars – although I can’t quite say why. Someone who regularly searches for animals would get results for large cats. So Google stores logged-in user searches and delivers results based on previous activity.
This can be useful for publishers and societies – if, and only if, users log in when they come to search your site – or they persistently “stay” logged in (the browser remembers them or whatever). In this way, if you’re, say, a cancer research organization and your members have some kind of indication about whether they’re doctors or researchers or patients or pharma reps – they can get relevant content delivered based on their member profile. Naturally, there’s some work to do to set this up. But it can be very valuable.
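A minimal sketch of the profile-driven re-ranking just described: boost results whose subject tags overlap the logged-in user’s declared interests, so a researcher and a patient see different orderings of the same hit list. The field names and profile values are hypothetical:

```python
def rerank(results: list, profile_interests: set) -> list:
    """Order hits by profile-tag overlap first, base relevance second."""
    def score(doc):
        overlap = len(set(doc["tags"]) & profile_interests)
        return (overlap, doc["base_score"])
    return sorted(results, key=score, reverse=True)

hits = [
    {"title": "Clinical trial design",  "tags": ["research"], "base_score": 0.9},
    {"title": "Coping with diagnosis",  "tags": ["patient"],  "base_score": 0.8},
]
print(rerank(hits, {"patient"})[0]["title"])   # -> Coping with diagnosis
print(rerank(hits, {"research"})[0]["title"])  # -> Clinical trial design
```

The member-profile field (doctor, researcher, patient, pharma rep) is the setup work mentioned above; once it exists, the re-ranking itself is cheap.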
Incidentally, I also Googled “jaguars” plural – expecting to get the big cats. However, probably since I regularly read NFL content, my results were for the Jacksonville Jaguars.
(I would like to note that I use the same Google profile across my devices, so this does not mean I’m reading NFL news AT WORK.)
The Google Knowledge Graph is an example that’s becoming very commonplace; I’ll show a screengrab in a minute. Briefly, when a query strikes some node in the graph – in addition to providing search results of web pages – a sidebar appears with other information related to the search. This works particularly well with entities, less well with concepts. Here’s an example:
So. Searching for “Empire State Building” brings up, predictably, the website, twitter feed, and (below, as we shall see) the Wikipedia page. But over on the right side is a bunch of information from the Google Knowledge Graph. Let’s take a closer look:
(I grabbed the material further down the screen and put it side-by-side.) So we get some pictures of the Empire State Building – with a link to more – and links to the website, Google Maps for directions, a blurb (which comes from Wikipedia), address, statistics, questions and answers, reviews, popular times, information about the movie of the same name, links to social media, and “people also search for” suggestions. Pretty cool – and a much richer experience than just the web page results.
It is also pretty intuitively presented, which is nice.
How does this work?
Basically, there’s a ginormous ontology – specifically, a knowledge graph – and if your query hits a node, it returns a bunch of other information it has associated with that node using the semantic web.
Here I’ve shown a totally made up but plausible facsimile of what the knowledge graph behind the information box in the previous slides must look like.
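In the same spirit as the slide, here is a made-up but plausible facsimile of one such node as a plain data structure: hit the node, get back its typed facts. (The keys are illustrative; the values are publicly known facts about the building.)

```python
# Toy knowledge graph: a node with typed facts and relationships.
GRAPH = {
    "Empire State Building": {
        "type": "Building",
        "located_in": "New York City",
        "opened": 1931,
        "height_to_tip_m": 443,
        "architect": "Shreve, Lamb & Harmon",
    },
}

def info_box(query: str) -> dict:
    """Return the facts attached to a node, or an empty dict on a miss."""
    return GRAPH.get(query, {})

print(info_box("Empire State Building")["located_in"])  # -> New York City
```

The real graph adds typed edges between nodes (architect as its own entity, “people also search for” links, and so on), which is where the power – and the labor – comes in.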
It is a lot of work to build knowledge graphs like this – but they are extraordinarily powerful. And the industry is moving in this direction.
A good way to approach semantic search is to, basically, try to control the semantics. Taxonomies can help in a number of ways: tuning the search to prioritize tags over free text, allowing users to browse a taxonomy of subjects, using taxonomy terms to drive type-ahead or “did you mean”-style redirection, and using synonymy to drive the same relevant results for a number of string inputs. This allows for both improvements in search and improvements in interfaces – how users interact with the data.
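“Search tags before free text” can be as simple as weighting a match in curated subject metadata above a match in the title or body. A toy scoring sketch, with weights and field names that are purely illustrative:

```python
# Illustrative weights: curated tags beat titles, titles beat body text.
TAG_WEIGHT, TITLE_WEIGHT, BODY_WEIGHT = 5.0, 2.0, 1.0

def score(doc: dict, term: str) -> float:
    """Score one document for one query term, preferring tag matches."""
    term = term.lower()
    s = 0.0
    if term in (t.lower() for t in doc.get("tags", [])):
        s += TAG_WEIGHT
    if term in doc.get("title", "").lower():
        s += TITLE_WEIGHT
    if term in doc.get("body", "").lower():
        s += BODY_WEIGHT
    return s

tagged = {"title": "A study of equine locomotion", "tags": ["Horses"], "body": "…"}
untagged = {"title": "Unrelated paper", "tags": [], "body": "mentions horses once"}
print(score(tagged, "horses"), score(untagged, "horses"))  # -> 5.0 1.0
```

Note the tagged paper wins even though the string “horses” never appears in its title or body – which is exactly the point of applying subject metadata.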
On tagging: the irony of document categorization is that we’re not interested in words -- we’re interested in concepts; but the only window into the concepts we have available are the words. This is why we use subject metadata to describe things (using a taxonomy or other controlled vocabulary), and then make that metadata available to the search engine. This is not news, of course.
But it illustrates why good subject categorization with a controlled vocabulary helps to organize the data to make it more efficient for search.
The previous examples showing failed searches using abbreviations, synonyms, acronyms, and other lexical variants can be solved with a robust, well-formed taxonomy and document tagging program.
This is not new to the industry – but it merits discussion in a talk about semantic search.
What can good document categorization achieve? As an example, let’s take a look at the PLOS ONE platform (full disclosure -- PLOS is a client and I have worked with their taxonomy team). PLOS uses a very large thesaurus and automatically tags each article with up to 8 terms from the vocabulary for search; they also expose the full hierarchy for browsing.
Here’s the browse interface; in addition to the hierarchy, you can see how many articles are attached to each term (and, of course, launch a search by clicking).
And here’s a sample PLOS ONE article; in the lower right-hand corner you can see the terms applied -- and, moreover, you can click on one to launch a search on that topic. This is great -- you don’t have to guess how they phrase the term you’re interested in! Even cooler -- the little buttons next to each yellow bar let you flag the article if you think a term has been misapplied.
These fairly standard applications of metadata, tagging, taxonomy, and search technologies greatly improve the user experience. The same principle applies to tagging content for machine learning!
To illustrate the direction semantic search is heading, next I want to say a little bit about the work JSTOR -- and in particular, JSTOR Labs -- has been doing around their search experience, which relies on a combination of traditional taxonomy-based tagging and inferential (that is, naive) topic modeling to create new search experiences.
In the JSTOR Labs Text Analyzer, any document you can OCR, upload, or take a picture of with your phone becomes your search. The document is analyzed both in terms of the massive JSTOR taxonomy (again, full disclosure -- I worked on this project!) as well as other words and entities that appear in the document -- and recommends related content in JSTOR (some 7 million articles, last I checked)! You can curate the results to make the algorithms more accurate.
This very cool beta project takes a new perspective on search using both traditional metadata, taxonomy, and tagging applications as well as ML-based technologies.
So, practically speaking, how do we get this done? If you want to implement semantic search: where do you start?
Any existing search platform has a back-end – which you may need a developer to access and change – that can be configured to take advantage of built-in features. This can include using fuzzy matching (or not), changing other settings like automatic Booleans in strings, using dates to rank relevancy of results, or prioritizing certain fields (for structured content, of course) to, say, prefer keyword strings found in titles first.
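As a concrete (if hypothetical) illustration, here is what several of those knobs look like in an Elasticsearch-style query body, written out as a Python dict: field boosts preferring title over abstract over body, AUTO fuzziness to catch misspellings, and a date-decay function favoring recent papers. The field names (`title`, `abstract`, `body`, `pub_date`) are assumptions about your schema, and other platforms express the same options differently:

```python
# Sketch of back-end search tuning in Elasticsearch-style query DSL.
query = {
    "query": {
        "function_score": {
            "query": {
                "multi_match": {
                    "query": "unmanned aerial vehicles",
                    "fields": ["title^3", "abstract^2", "body"],  # field weighting
                    "fuzziness": "AUTO",                          # fuzzy matching
                }
            },
            "functions": [
                # Recency boost: results decay as pub_date falls away from now.
                {"gauss": {"pub_date": {"origin": "now", "scale": "365d"}}}
            ],
        }
    }
}

print(query["query"]["function_score"]["query"]["multi_match"]["fields"])
```

None of this requires a taxonomy or a knowledge graph – it is the “kind of easy” tier: configuration of features the platform already ships with.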
Beyond configuring search to improve results, consider taxonomies and tagging – both for retrieval and interface options like type-ahead and “did you mean”. The next level is a knowledge graph – which is a considerable effort, but a very powerful tool. And understanding the user – whether considering previous searches or some kind of user profile with useful information – is yet another avenue to pursue.