This presentation was provided by Bob Kasenchak of Access Innovations, during the NISO Webinar "Discovery and Online Search, Part Two: Personalized Content, Personal Data," which was held on June 19, 2019.
1. What Is Semantic Search?
And Why Is It Important?
Bob Kasenchak
Access Innovations
@taxobob
bob_kasenchak@accessinn.com
NISO Webinar
“Discovery and Online Search:
Personalized Content, Personal Data”
2. Outline
Semantic Search
● What Is It? (Basics)
● Why Do We Need It? (Why Does Search Fail?)
● So…What Is It? (Specifics)
● Examples and Implementations
3. What Is Semantic Search?
Semantic Search goes beyond
keyword searches
to examine
Context
Google says “Things Not Strings”
4. Why “Basic” Search Fails
Why does search fail?
● Simple search simply matches text strings
● Language is ambiguous
● There is a *lot* of content
9. Why “Basic” Search Fails
Search fails because simple string matching is not adequate
for large, specialized repositories of content with technical
language that evolves over time.
(Also, language is ambiguous.)
10. What Is Semantic Search?
Semantic Search goes beyond
keyword string-matching
to examine
Context
using a variety of means
Google says “Things Not Strings”
11. What Is Semantic Search?
Semantic Search
Examines the semantic context of the search query to drive
relevant results.
This can include: taxonomies, lexical variants, location, your
previous searches, previous similar searches, ontologies,
knowledge graphs, and other strategies.
12. Allow Lexical Variants
“Fuzzy Matching” and Similar Techniques
● Use Levenshtein distance (or similar) to match
misspellings and variants
● Stem words for search
● Instead of exact string matches
● This can cause noise; be careful!
17. Google Knowledge Graph
Google (again): Things not Strings
The Google Knowledge Graph connects search with e.g.
known facts about entities
(Driven by a big old ontology)
21. Taxonomies and Tagging
Controlled Vocabularies
● Search tags before free text (search engine tuning)
● Allow users to browse (in addition to querying)
● Suggest topics using type-ahead or “did you mean”
● Leverage synonymy to deliver same relevant results from
various inputs
22. Taxonomies and Tagging
The Irony of Document Categorization
● We’re interested in concepts
● Words are ambiguous
● But words in the text are all we have to go on
○ Unless we apply good subject metadata
23. PLOS
Taxonomies and Tagging
● 9000+ term thesaurus
● And ~4000 synonyms (!)
● Applied to documents
● Exposed in browse (!)
● Used to redirect search queries for synonyms
● Exposed at article level to user
○ Crowdsourced QC!
26. JSTOR
Other Novel Approaches
● Document becomes the search query (!)
● Combination of taxonomy and naive classification
● Suggests related content for research, bibliography
● Experimental, successful, also very cool
labs.jstor.org
28. Implementation: Kind of Easy
● Simple things:
○ Use existing search software tuning/options
○ Enable fuzzy matching
○ Configure how Booleans are automatically applied
○ Weight fields, doc types, etc. where appropriate
○ Use dates to deliver recent results
29. Implementation: More Complex
● Next level:
○ Taxonomy and tagging
○ Knowledge graphs
○ User profiles, user behavior, other targeted means
Good afternoon. My name is Bob Kasenchak – I’m a taxonomist and director of business development at Access Innovations. Today I’d like to talk about Semantic Search, why it’s important, ways to implement it, and other related topics.
My talk will outline the topic and introduce some concepts; the subsequent talks will, I’m sure, go into more detail.
Here’s a brief outline of my talk. My goal is that by the end of the talk you have some idea of what Semantic Search is – or, at least, about the sundry approaches that people mean when they say “semantic search.” This will set up Duane and Travis to go into specifics and details. There is (in theory, at least) ample time in this block for questions and discussion at the end.
Ironically (since we’re talking about semantics), people use the term “semantic search” to refer to a variety of things – but they all have something in common: trying to extend, amplify, or improve search beyond matching keyword strings by using some method or methods to determine the context of the search. This takes a number of forms, about which more shortly. Google has popularized the tagline “things not strings” to explain its semantic search, which involves a knowledge graph (that is: an ontology) and some other machinery. First, though, I think it will be helpful to investigate the problem we’re trying to solve.
So: what problem are we trying to solve?
What’s the problem with “regular” or “basic” search? Specifically, I mean in the context of scholarly publishing. Most of the time, search is limited to whatever default search is available on the platforms used by scholarly publishers – there are almost always some options, and they are almost never used; further, some platforms have limitations. And most are based on “regular” string-based search.
When I say “regular” or “basic” search: I mean that you enter a keyword in a box, and the search application (using an inverted index) tries to match that string with documents. Sometimes fuzzy matching is used – to catch misspellings and whatnot – but in essence “regular” search looks, literally, for WORDS in DOCUMENTS. That is to say: it matches text strings.
The problem for publishers is that in large, specialized content sets “basic” search fails because (1) it’s simply looking for text strings, and does not have the detailed kinds of indexing that e.g. Google does on a constant basis; (2) because language is ambiguous; and (3) because specialized repositories tend to have very, very large content sets with extraordinarily detailed and specialized vocabularies that change over time.
In essence, simple search just looks for the words in the query. In this example from Google Scholar, I have searched for the word “horse”. Two things are noteworthy here: (1) I got over 3 million results, and (2) the logic of the algorithm must prioritize author names -- since, as you can see (I hope), the first paper listed is not about horses; rather, the author’s name is “Red-Horse.”
This is, in my analysis, sub-optimal. Why not have a place to specifically search for Author – or omit Author names from search?
Further -- without using any fancy synonymy or acronyms or other semantic trickery -- if I search instead for the string “horses” I get 1.7 million results. This seems to be triggered by the form of the word in the title. In other words: Google Scholar doesn’t even recognize simple English plurals as the same string -- and by extension, doesn’t return the same result set. It’s literally – and merely – looking for instances of the exact string typed in the box.
Again: the search simply tries to match the word (or words) in the box somewhere in a very large set of documents, with some priority seemingly given to the field (title, author, abstract) in which the word appears.
Google Search -- not Google Scholar -- seems to have figured this out long ago, but for some reason Google Scholar search has not. Some simple NLP would go a long way here, such as recognizing plurals -- not to mention other types of synonyms -- instead of just matching words.
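The kind of simple normalization I mean can be sketched in a few lines. This is an illustrative toy, not a production stemmer (real systems use full stemmers or lemmatizers, and every rule below has exceptions), but it shows how “horse” and “horses” can be folded into one index key:

```python
def normalize(term: str) -> str:
    """Crudely fold simple English plurals into a single form."""
    t = term.lower()
    if t.endswith("ies") and len(t) > 4:                        # "studies" -> "study"
        return t[:-3] + "y"
    if t.endswith(("xes", "zes", "ches", "shes", "sses")):      # "boxes" -> "box"
        return t[:-2]
    if t.endswith("s") and not t.endswith(("ss", "us", "is")):  # "horses" -> "horse"
        return t[:-1]
    return t

# Singular and plural now map to the same index entry:
print(normalize("horse"), normalize("horses"), normalize("studies"))  # -> horse horse study
```

If both queries and indexed text pass through a function like this, “horse” and “horses” return one result set instead of two.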
Another example, this time from a specialized repository (content from a scholarly publisher) -- the same concept can be expressed in multiple ways (besides morphological variants of words like plurals and adjectives); this problem is most acutely expressed with acronyms. While acronyms can have a great number of meanings, within many fields some acronyms are so ubiquitous that they should be accounted for in search.
This example shows search results for “unmanned aerial vehicles” in the library of a mid-sized academic publisher...for which, as you can see, 171 results were found…
...whereas a search for “uav” (the acronym for the same concept) returns over three times as many results.
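One common fix for the UAV problem is a synonym ring: a set of surface forms that the search engine treats as one concept, expanding the query before it hits the index. A minimal sketch (the terms listed are illustrative, not taken from any real thesaurus):

```python
# Each ring is one concept; every member is an equivalent surface form.
SYNONYM_RINGS = [
    {"uav", "uavs", "unmanned aerial vehicle", "unmanned aerial vehicles", "drone"},
]

def expand_query(query: str) -> set:
    """Expand a query term to all equivalent forms before searching."""
    q = query.lower().strip()
    for ring in SYNONYM_RINGS:
        if q in ring:
            return ring          # search on every variant in the ring
    return {q}                   # no ring: fall back to the literal string

print(sorted(expand_query("UAV")))
```

Searching on the union of these forms would merge the 171-result and 500-plus-result sets from the slides into a single answer.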
Again, it’s obvious from the results listings of the respective queries that the way the concept is expressed in the title has a major bearing on the content returned.
Let me say that one more time: The way the author chooses to represent the main topic of an article in the title has a major bearing on the search results.
This, again, to me, seems to be sub-optimal.
So. Why does search fail?
It fails because simple string matching is not adequate for large, specialized repositories of content with technical language that evolves over time.
Also: it goes without saying I think that language is ambiguous.
So: the basic idea is that Semantic Search goes beyond simple keyword text-string matching. This takes a variety of forms, the common denominator of which, I think, is that:
Semantic Search does not merely try to find keywords but examines the semantic context of the search query to drive relevance. This can include synonyms and lexical variants as well as things like location of the user, or involve graph databases and related concepts as well as a host of other methods. Some of these are quite simple, and others quite complex and involved. Accordingly, some are simple to implement – and some are, well, not.
So. Some examples will I think help.
One basic kind of semantic search is designed to help people find what they’re looking for without knowing the exact or specific language or terminology in the content. The idea is to allow the search engine to match near-matches instead of just exact strings. This can cause noise – so use caution. The upside is that someone might not know how to spell “Gastrointestinal stromal neoplasms” – but if they can get close, it might still match.
Levenshtein distance (speaking of things I have to look up to spell correctly) is a commonly used metric to determine the “distance” between words that essentially depends on how many changes you’d have to make to get from Word A to Word B. So, for example, “Bob” and “Rob” are very close in Levenshtein terms, while “Bob” and “Antidisestablishmentarianism” are far, far apart. In short, it’s a kind of “fuzzy” matching algorithm.
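For the curious, the metric itself is a short dynamic-programming computation. This is a minimal sketch; production search engines use optimized or index-backed variants rather than comparing every pair of strings:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (ca != cb), # substitution (0 if chars match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("Bob", "Rob"))  # -> 1
```

A fuzzy-matching search engine accepts candidates whose distance from the query falls under some threshold, which is exactly where the noise mentioned above can creep in.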
Another use of the knowledge graph stems from user behavior. We are no longer trained to use old-fashioned library searches with keywords and Boolean operators; instead, when we use Google, we often type in a full query – not just a keyword. Answering such queries also relies on the knowledge graph.
For example, Googling “Harrison Ford” and “When is Harrison Ford’s Birthday” brings up two different result sets. The first resembles a keyword search; the second is a natural language query, which requires parsing. Once parsed, the Knowledge Graph can deliver a targeted result with the exact information sought. Note that the info box on the right (the knowledge graph results) is the same in both cases – it’s referencing the graph to pull out information based on the parsed query.
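The parsing step can be sketched as: match the question against a pattern, extract the entity and the attribute asked about, and look both up in the graph. Everything here is a toy stand-in (the pattern, the graph contents, and the lookup are not Google’s actual machinery):

```python
import re

# Toy knowledge graph: one node with typed facts attached.
KNOWLEDGE_GRAPH = {
    "harrison ford": {"type": "Person", "birthday": "July 13, 1942"},
}

def answer(query: str):
    """Parse a 'when is X's birthday' question and answer from the graph."""
    m = re.match(r"when is (.+?)'s birthday\??$", query.lower())
    if m:
        node = KNOWLEDGE_GRAPH.get(m.group(1))
        if node:
            return node["birthday"]
    return None  # no parse or no node: fall back to ordinary keyword search

print(answer("When is Harrison Ford's birthday?"))
```

A bare keyword query like “harrison ford” skips the pattern entirely and goes to ordinary search, while the sidebar draws on the same node either way.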
Another set of methods falls into a category called “contextual search”. This moniker applies to a variety of techniques that use information gathered about the user, location, or recent searches to drive relevance.
For example, some applications – notably Google, but also other map-based applications – use your location to deliver relevant local results. This is usually done using either your IP address (which contains location information) or by allowing an application to access GPS data, usually from your phone or other devices. For example, if I Google “pizza” the first results aren’t definitions and Wikipedia pages about what pizza is; rather, I get suggestions for “pizza near me”.
In fact, all of the first page of Google results comprises pizza restaurants.
Note, however, that the Google Knowledge Graph in the right sidebar does provide the generic definition (Wikipedia, again) as well as – interestingly – information about pizza stocks to invest in.
Also from Google (can you see a trend here?) we can see another example of contextual search, based on recent activity or user profiles. This works particularly well if you’re logged into Google, of course.
If I search for “jaguar” – what do I expect: cars or cats? *I* get results for cars – although I can’t quite say why. Someone who regularly searches for animals would get results for large cats. So Google stores logged-in user searches and delivers results based on previous activity.
This can be useful for publishers and societies – if, and only if, users log in when they come to search your site – or they persistently “stay” logged in (the browser remembers them or whatever). In this way, if you’re, say, a cancer research organization and your members have some kind of indication about whether they’re doctors or researchers or patients or pharma reps – they can get relevant content delivered based on their member profile. Naturally, there’s some work to do to set this up. But it can be very valuable.
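A minimal sketch of the profile-driven re-ranking just described: boost results whose subject tags overlap the logged-in user’s declared interests, so a researcher and a patient see different orderings of the same hit list. The field names and profile values are hypothetical:

```python
def rerank(results: list, profile_interests: set) -> list:
    """Order hits by profile-tag overlap first, base relevance second."""
    def score(doc):
        overlap = len(set(doc["tags"]) & profile_interests)
        return (overlap, doc["base_score"])
    return sorted(results, key=score, reverse=True)

hits = [
    {"title": "Clinical trial design",  "tags": ["research"], "base_score": 0.9},
    {"title": "Coping with diagnosis",  "tags": ["patient"],  "base_score": 0.8},
]
print(rerank(hits, {"patient"})[0]["title"])   # -> Coping with diagnosis
print(rerank(hits, {"research"})[0]["title"])  # -> Clinical trial design
```

The member-profile field (doctor, researcher, patient, pharma rep) is the setup work mentioned above; once it exists, the re-ranking itself is cheap.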
Incidentally, I also Googled “jaguars” plural – expecting to get the big cats. However, probably since I regularly read NFL content, my results were for the Jacksonville Jaguars.
(I would like to note that I use the same Google profile across my devices, so this does not mean I’m reading NFL news AT WORK.)
The Google Knowledge Graph is an example that’s becoming very commonplace; I’ll show a screengrab in a minute. Briefly, when a query strikes some node in the graph – in addition to providing search results of web pages – a sidebar appears with other information related to the search. This works particularly well with entities, less well with concepts. Here’s an example:
So. Searching for “Empire State Building” brings up, predictably, the website, twitter feed, and (below, as we shall see) the Wikipedia page. But over on the right side is a bunch of information from the Google Knowledge Graph. Let’s take a closer look:
(I grabbed the material further down the screen and put it side-by-side.) So we get some pictures of the Empire State Building – with a link to more – and links to the website, Google Maps for directions, a blurb (which comes from Wikipedia), address, statistics, questions and answers, reviews, popular times, information about the movie of the same name, links to social media, and “people also search for” suggestions. Pretty cool – and a much richer experience than just the web page results.
It is also pretty intuitively presented, which is nice.
How does this work?
Basically, there’s a ginormous ontology – specifically, a knowledge graph – and if your query hits a node, it returns a bunch of other information it has associated with that node using the semantic web.
Here I’ve shown a totally made up but plausible facsimile of what the knowledge graph behind the information box in the previous slides must look like.
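In the same spirit as the slide, here is a made-up but plausible facsimile of one such node as a plain data structure: hit the node, get back its typed facts. (The keys are illustrative; the values are publicly known facts about the building.)

```python
# Toy knowledge graph: a node with typed facts and relationships.
GRAPH = {
    "Empire State Building": {
        "type": "Building",
        "located_in": "New York City",
        "opened": 1931,
        "height_to_tip_m": 443,
        "architect": "Shreve, Lamb & Harmon",
    },
}

def info_box(query: str) -> dict:
    """Return the facts attached to a node, or an empty dict on a miss."""
    return GRAPH.get(query, {})

print(info_box("Empire State Building")["located_in"])  # -> New York City
```

The real graph adds typed edges between nodes (architect as its own entity, “people also search for” links, and so on), which is where the power – and the labor – comes in.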
It is a lot of work to build knowledge graphs like this – but they are extraordinarily powerful. And the industry is moving in this direction.
A good way to approach semantic search is to, basically, try to control the semantics. Taxonomies can help in a number of ways: tuning the search to prioritize tags over free text, allowing users to browse a taxonomy of subjects, using taxonomy terms to drive type-ahead or “did you mean”-style redirection, and using synonymy to drive the same relevant results for a number of string inputs. This allows for both improvements in search and improvements in interfaces – how users interact with the data.
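“Search tags before free text” can be as simple as weighting a match in curated subject metadata above a match in the title or body. A toy scoring sketch, with weights and field names that are purely illustrative:

```python
# Illustrative weights: curated tags beat titles, titles beat body text.
TAG_WEIGHT, TITLE_WEIGHT, BODY_WEIGHT = 5.0, 2.0, 1.0

def score(doc: dict, term: str) -> float:
    """Score one document for one query term, preferring tag matches."""
    term = term.lower()
    s = 0.0
    if term in (t.lower() for t in doc.get("tags", [])):
        s += TAG_WEIGHT
    if term in doc.get("title", "").lower():
        s += TITLE_WEIGHT
    if term in doc.get("body", "").lower():
        s += BODY_WEIGHT
    return s

tagged = {"title": "A study of equine locomotion", "tags": ["Horses"], "body": "…"}
untagged = {"title": "Unrelated paper", "tags": [], "body": "mentions horses once"}
print(score(tagged, "horses"), score(untagged, "horses"))  # -> 5.0 1.0
```

Note the tagged paper wins even though the string “horses” never appears in its title or body – which is exactly the point of applying subject metadata.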
On tagging: the irony of document categorization is that we’re not interested in words -- we’re interested in concepts; but the only window into the concepts we have available are the words. This is why we use subject metadata to describe things (using a taxonomy or other controlled vocabulary), and then make that metadata available to the search engine. This is not news, of course.
But it illustrates why good subject categorization with a controlled vocabulary helps to organize the data to make it more efficient for search.
The previous examples showing failed searches using abbreviations, synonyms, acronyms, and other lexical variants can be solved with a robust, well-formed taxonomy and document tagging program.
This is not new to the industry – but it merits discussion in a talk about semantic search.
What can good document categorization achieve? As an example, let’s take a look at the PLOS ONE platform (full disclosure -- PLOS is a client and I have worked with their taxonomy team). PLOS uses a very large thesaurus and automatically tags each article with up to 8 terms from the vocabulary for search; they also expose the full hierarchy for browsing.
Here’s the browse interface; in addition to the hierarchy, you can see how many articles are attached to each term (and, of course, launch a search by clicking).
And here’s a sample PLOS ONE article; in the lower right-hand corner you can see the terms applied -- and, moreover, you can click on one to launch a search on that topic. This is great -- you don’t have to guess how they phrase the term you’re interested in! Even cooler -- the little buttons next to each yellow bar let you flag the article if you think a term has been misapplied.
These fairly standard applications of metadata, tagging, taxonomy, and search technologies greatly improve the user experience. The same principle applies to tagging content for machine learning!
To illustrate the direction semantic search is heading, next I want to say a little bit about the work JSTOR -- and in particular, JSTOR Labs -- has been doing around their search experience, which relies on a combination of traditional taxonomy-based tagging and inferential (that is, naive) topic modeling to create new search experiences.
In the JSTOR Labs Text Analyzer, any document you can OCR, upload, or take a picture of with your phone becomes your search. The document is analyzed both in terms of the massive JSTOR taxonomy (again, full disclosure -- I worked on this project!) as well as other words and entities that appear in the document -- and recommends related content in JSTOR (some 7 million articles, last I checked)! You can curate the results to make the algorithms more accurate.
This very cool beta project takes a new perspective on search using both traditional metadata, taxonomy, and tagging applications as well as ML-based technologies.
So, practically speaking, how do we get this done? If you want to implement semantic search: where do you start?
Any existing search platform has a back-end – which you may need a developer to access and change – that can be configured to take advantage of built-in features. This can include using fuzzy matching (or not), changing other settings like automatic Booleans in strings, using dates to rank relevancy of results, or prioritizing certain fields (for structured content, of course) to, say, prefer keyword strings found in titles first.
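As a concrete (if hypothetical) illustration, here is what several of those knobs look like in an Elasticsearch-style query body, written out as a Python dict: field boosts preferring title over abstract over body, AUTO fuzziness to catch misspellings, and a date-decay function favoring recent papers. The field names (`title`, `abstract`, `body`, `pub_date`) are assumptions about your schema, and other platforms express the same options differently:

```python
# Sketch of back-end search tuning in Elasticsearch-style query DSL.
query = {
    "query": {
        "function_score": {
            "query": {
                "multi_match": {
                    "query": "unmanned aerial vehicles",
                    "fields": ["title^3", "abstract^2", "body"],  # field weighting
                    "fuzziness": "AUTO",                          # fuzzy matching
                }
            },
            "functions": [
                # Recency boost: results decay as pub_date falls away from now.
                {"gauss": {"pub_date": {"origin": "now", "scale": "365d"}}}
            ],
        }
    }
}

print(query["query"]["function_score"]["query"]["multi_match"]["fields"])
```

None of this requires a taxonomy or a knowledge graph – it is the “kind of easy” tier: configuration of features the platform already ships with.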
Beyond configuring search to improve results, consider taxonomies and tagging – both for retrieval and interface options like type-ahead and “did you mean”. The next level is a knowledge graph – which is a considerable effort, but a very powerful tool. And understanding the user – whether considering previous searches or some kind of user profile with useful information – is yet another avenue to pursue.