https://doi.org/10.6084/m9.figshare.11791443.v1
Presentation at IGeLU 2014 Oxford. Shift view on relevance in discovery tools from system to user context.
The concept of relevance in retrieving information resources in discovery tools like Primo needs to be redefined. It should take into account the wider context of queries and retrieved items, outside the central and local indexes. Content and relevance are inextricably interlinked. Relevance is only calculated for the isolated items in the indexed content. Many indexed items may have relevant connections to each other in the real world, but these are not visible within the system in any way. The starting point should be the customer’s full connected workflow instead of just the library’s collections. Linked Open Data appears to be a relevant approach. This presentation will present some real-life use cases and suggest some tentative solutions.
1. Relevance redefined
Lukas Koster
Library of the University of Amsterdam
@lukask
IGeLU 2014 - Oxford
2. Main discovery tool feedback issues
Content
Not enough
Too many
Wrong types
No ‘full text’
Relevance
Not #1
Too many
Known item!?
WTF?
The main issues reported in feedback and survey results on discovery tools concern content
and relevance.
The funny thing is that the use of facets for refining results is somehow not very popular.
3. Usual responses to feedback issues
Change the front end!
Tabs - Facets - Filters - - Font
Positions
More/less content!
More of the same same same
Improve relevance ranking algorithms!
Very shhhophisticated - Very shhhecret
Usual responses to feedback issues: front end, content, relevance (ranking).
Front-end UI changes are just about cosmetics and perception: more tabs with
specific data sources, element positioning, etc.
More or less content can have various effects: either more or fewer relevant results.
But it is usually still the same traditional content types.
The algorithm is only influenced by libraries and customers to a small extent. Most of it is in the
software, which is confidential and not transparent, for competitive reasons.
4. Before
Example: before. University of Amsterdam Primo originally offered a Google-like experience: one
box (apart from advanced search), all sources, one blended results list.
5. After
Example: after. University of Amsterdam Primo now: three tabs: All, Local catalogue,
Primo Central
6. Same old
Same old UX tricks
Same old Content types
Same old View on relevance
Basically these changes are not actual changes at all: it’s all cosmetic.
UX/UI changes affect perception, not the actual relevance of results.
Content: usually still the same resource types and search indexes.
Relevance is still viewed from a system and result set perspective.
7. iNTERLiNKED
[Slide: the words SEARCH, RANK, RELEVANCE, CONTENT and CONTEXT displayed as an interlinked crossword.]
But every aspect is dependent on all the others: search, rank, content, context, relevance.
There is no search without context. A search is executed in a specific, limited content index.
Ranking is performed on the results within this limited index. Relevance is completely
defined by the user’s context.
8. Relevance
Context
+
Content
Objectively, we can say that relevance is determined by context and content.
9. Relevance = Relative : Subjective : Contextual
Person: Context, Role, Task, Goal, Need, Workflow
System: Algorithm, Content, Index, Query, Collection, Configuration
Clash between personal context and system/collection.
Personal context, defined by a person's specific needs in a specific role for a specific
task/goal in a specific time, culminates in a specific Query, which consists of a limited
number of words in character string format.
The system does not know the personal context. It only has the indexed content, made up of
specific collections, indexed in a certain way with specific system
configurations, and the string-based query to run through that structure.
10. Relevance
Recall: the fraction of relevant instances that are retrieved =
retrieved relevant instances / total relevant instances
Precision: the fraction of retrieved instances that are relevant =
retrieved relevant instances / total retrieved instances
Basic concepts used for determining relevance of result sets: Recall and Precision.
This cannot be used to determine actual relevance of specific results! That is
dependent on context and can only be determined by the user.
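The two measures above can be sketched in a few lines of code. This is a toy illustration with made-up item sets, not anything a discovery tool actually runs:

```python
def recall(retrieved, relevant):
    """Fraction of relevant items that were retrieved."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)

def precision(retrieved, relevant):
    """Fraction of retrieved items that are relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)

# Toy example: 10 relevant items exist (0..9); the system returns
# 8 items (4..11), of which 6 (4..9) are relevant.
relevant_items = set(range(10))
retrieved_items = set(range(4, 12))

print(recall(retrieved_items, relevant_items))     # 6/10 = 0.6
print(precision(retrieved_items, relevant_items))  # 6/8  = 0.75
```

Note that both measures presuppose that we know the full set of relevant items, which, as argued above, only the user in context can determine.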
12. Relevance ranking is NOT Relevance
Relevance = Finding appropriate items
Recall, Precision
Relevance ranking = Determining most relevant
within retrieved set
Term Frequency, Inverse Document Frequency, Proximity,
Value Score
Retrieved set may not contain any relevant items at
all, but can still be ordered according to relevance.
Relevance is NOT relevance ranking!
Relevance is finding/retrieving appropriate items, using the words in the query, and if
available: context information.
Recall and Precision are used to measure the degree of relevance of a result set.
Relevance ranking is determining the most relevant items in a result set based on the
query terms and the content of retrieved items, using a number of standard
measures:
TF, IDF, Proximity.
Value score: a specific Primo algorithm that looks at the number of words, the type of
words, etc.
Also possible: local boosting. This method does not take into account any content
relevance, but just uses brute force to promote items from specific (local) data
sources.
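A minimal TF-IDF sketch makes the distinction above concrete (this is the textbook formula, not Primo's proprietary algorithm; the documents and query are made up):

```python
import math

def tf_idf_scores(query_terms, documents):
    """Score each document by the summed TF-IDF of the query terms.
    TF = term count / document length; IDF = log(N / docs containing term)."""
    n = len(documents)
    tokenised = [doc.lower().split() for doc in documents]
    scores = []
    for tokens in tokenised:
        score = 0.0
        for term in query_terms:
            df = sum(1 for t in tokenised if term in t)  # document frequency
            if df == 0:
                continue
            tf = tokens.count(term) / len(tokens)
            score += tf * math.log(n / df)
        scores.append(score)
    return scores

docs = ["the cat sat", "the dog sat on the cat", "birds fly"]
scores = tf_idf_scores(["cat"], docs)
print(scores)  # highest score first document, zero for the last
```

The point of the slide holds here too: the function will happily order any result set, whether or not a single document in it is actually relevant to the user.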
13. Primo Central search and ranking
enhancement - July 8, 2014
As part of our continuing efforts to enhance search
and ranking performance in Primo, we changed the
way Primo finds matches for keyword searches
within indexed full text. As part of this approach
Primo lowers the ranking of, or excludes, items of
low relevance from the result set that were
previously included. You may find as part of this
change that the number of results for some
searches is reduced, although result lists have
become more meaningful.
Official Ex Libris announcement July 8, 2014.
Combined with improvements to known item search/incomplete query terms in Primo
4.7.
Something changed!? This announcement conflates retrieving relevant results with
relevance ranking. Some results are actually excluded.
Only for full text resources.
Only in Primo Central.
It is not clear whether this is independent of software version/SP.
It is unclear to libraries and customers what is modified in relevance/search/ranking and how:
an example of the non-transparent nature of discovery tools' relevance algorithms.
There were a number of complaints on the Primo mailing list about this.
14. The System Perspective
Objectivizing a subjective experience
Let's look at the traditional system perspective on relevance. It's trying to make a
subjective process into an objective one.
15. Recall issues
Discovery tool index limits recall scope in advance
Relevance is calculated on:
available
selected
indexed
(scholarly) content
By vendors
By libraries
Everything
System
First let’s have a look at some recall issues in discovery tools.
Recall is limited in advance, because only a limited set of items of certain content
types are available for searching. A lot of relevant content is not considered at all.
Decided by vendors, publishers and libraries.
In Primo Central: by Ex Libris agreements with publishers, metadata vendors.
In Primo Central: libraries decide what is enabled, what is subscribed, free for search
In Primo Local: libraries decide which (part of) collections are indexed.
16. Recall issues
NOT indexed:
Not accessible
Not subscribed
Not enabled
Unusual resource types
Connections
Not digital
Not indexed, thus not searched:
Content not accessible to index vendors, libraries
Unusual resource types: theatre performances, television interviews, research project
information, historical events
Not physical, tangible content.
Connections: influenced by, collaborating with, temporal, genre
May not fit in bibliographical/proprietary format (MARC, DC, PNX)
17. Recall issues
Indexed, but NOT found:
By author name (string based)
By subject (string based, languages)
Related but unlinked items (chapter in book)
Content that IS indexed, but can't be found:
Author names: only strings, textual variations of name/pseudonyms, etc. that are
indexed. Only items with explicit author search term are found.
Subject: strings, individually indexed 'as is' from data sources, multiple languages.
Only items with explicit specific subject search term are found.
Related: a chapter may be indexed with a textual reference to the book it is a part of.
The book (relevant for delivery) is not retrieved, nor is a link to that item.
18. Author
Author name example.
Charlotte Brontë used the male pen name Currer Bell for Jane Eyre. (Left
screenshot: Wikipedia)
In this case there are no links between the two names, so the highly relevant Charlotte Brontë
material is not retrieved. (Right screenshot: University of Amsterdam Primo)
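An identifier-based author index avoids this problem by grouping all name variants under one authority identifier, so a search for either name retrieves the same items. The sketch below is a toy in-memory illustration with made-up identifiers (real systems would use e.g. VIAF URIs), not Primo's actual data model:

```python
# Name variants map to one authority identifier (identifiers are invented).
AUTHORITY = {
    "Charlotte Brontë": "authority:0001",
    "Currer Bell": "authority:0001",   # pen name, same identity
    "Emily Brontë": "authority:0002",
}

ITEMS = [
    {"title": "Jane Eyre", "author_id": "authority:0001"},
    {"title": "Villette", "author_id": "authority:0001"},
    {"title": "Wuthering Heights", "author_id": "authority:0002"},
]

def search_by_author(name):
    """Resolve the name string to an identifier, then match on the identifier."""
    author_id = AUTHORITY.get(name)
    return [item["title"] for item in ITEMS if item["author_id"] == author_id]

print(search_by_author("Currer Bell"))  # same results as "Charlotte Brontë"
```

With a purely string-based index, the "Currer Bell" query would miss both titles catalogued under "Charlotte Brontë"; with the identifier lookup, both names retrieve the same material.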
19. Subject
Subject example.
Topic/discipline “philosophy” (English) does not find items with the Dutch “filosofie”
(which also happens to be the Czech word).
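Concept-based subject indexing would solve this: each subject becomes an identifier carrying labels in several languages, so a query in any language matches items tagged with the concept. A toy sketch with invented concept identifiers (real systems would use e.g. LCSH or MACS concept URIs):

```python
# Each concept identifier carries multilingual labels (identifiers invented).
CONCEPT_LABELS = {
    "concept:philosophy": {"philosophy", "filosofie", "philosophie"},
}

SUBJECT_ITEMS = [
    {"title": "Kritik der reinen Vernunft", "subjects": {"concept:philosophy"}},
    {"title": "A Brief History of Time", "subjects": {"concept:cosmology"}},
]

def search_by_subject(term):
    """Resolve the query term to matching concepts, then match on concepts."""
    term = term.lower()
    matched = {cid for cid, labels in CONCEPT_LABELS.items() if term in labels}
    return [i["title"] for i in SUBJECT_ITEMS if i["subjects"] & matched]

print(search_by_subject("filosofie"))  # same result as "philosophy"
```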
20. Chapter
Connections example.
Chapter written by UvA researcher, in local institutional repository, harvested in local
Primo.
Book in Aleph catalogue, harvested in local Primo.
Book is not retrieved as item to present delivery options directly.
21. Precision issues
Discovery tool limits precision by ambiguous
indexing
Next: some precision issues.
Problems caused by using strings instead of identifiers/concepts
22. Precision issues
Indexed and/but erroneously found
By author name (string based)
By subject (string based, languages)
Query too broad
Indexed irrelevant items that are retrieved erroneously:
Author: common names result in items by all authors with that name.
Subject: similar terms with different/ambiguous meanings produce noise (e.g. VOC).
A broad query (few terms) produces too much noise.
23. Author
Example of author names.
J. (Johanna, Jan, Joop, etc.) de Vries is a very common Dutch name.
Results consist of all items by different authors.
24. Subject
Example of subjects.
Ambiguous/Multilingual topic VOC: physics (Volatile Organic Compounds), music
(Vocals), history (Verenigde Oostindische Compagnie, Dutch East Indies Company).
25. Too broad
Example of too broad search terms.
Way too many results with a very common search term.
26. Recall and precision issues
Content of index
Quality of search index units
Lack of connections (isolated string items)
Algorithms for retrieving and ranking not
transparent
Summary of Recall and Precision issues in discovery tools and relevance:
Content of index: resource types, connections, data
Search index units (individual search index fields): strings, isolated items
Cause: system perspective with legacy data
There is no way to determine if all relevant items have been retrieved.
27. Research cycle
http://commons.wikimedia.org/wiki/File:Research_cycle.png
Intermezzo: a closer look at Context, i.e. workflow and use case. Example: the research cycle
(Cameron Neylon). There are many different versions of this cycle. What is important is that the
nature of someone’s information need differs depending on the stage: broad, or focused in
several dimensions.
28. Context example - theatre research
Play
Author
Text
Productions
Use case: Theatre play researcher.
A theatre Play is written by Author, is represented as text, but most importantly it is
performed (or not) for an audience.
29. Context example - theatre research
Play
Author
Text
Productions
Period
Background
Connections
Influences
For the Author there is biographical information; important things are background,
connections with others (artists, funders, relatives, etc.), influences, and the period in
which they lived and worked.
Libraries/discovery tools may have some biographical information.
30. Context example - theatre research
Play
Author
Text
Productions
Period
Background
Connections
Influences
Versions
Translations
Editions
Text: there may be several versions, translations, editions etc. FRBR can be used to
model this.
This belongs to the traditional library domain.
31. Context example - theatre research
Play
Author
Text
Productions
Period
Background
Connections
Influences
Versions
Translations
Editions
Performances
Reception
Theatres
Producers
Actors
Visitor stats
Directors
Props
Posters
Recordings
Photos
Costumes
Productions and performances: a whole different world.
People involved in a number of different roles. Different Productions, various actual
performances, physical props, costumes, audio and video recordings, etc.
Reception of both the play itself and of the various productions, always related to
the period.
What’s in a discovery tool? It could be anything, but only within individual texts/items, not as
separately retrievable items, and certainly not as connections/cross-references/related
information.
Authors in authority files or individual biographical databases.
Text/Editions treated separately as individual items.
Productions: maybe, depending on types of indexed (local) databases.
Reception: individual reviews possibly.
32. Relevance - New perspective
Instead of
SYSTEM
Collections, Indexed content, Query
Context-Workflow-Goals- Environment of
USER
It is time to switch perspectives, from collection based System algorithms to context
based User needs.
33. Is this technically possible, feasible?
Extend Content?
Know Context?
Important questions:
Is it even possible or feasible to extend content without limits, and to interpret personal
context?
Can commercial vendors and publishers benefit?
34. Relevance Redefined and Primo
What is already possible?
Content
Additional content types
Additional indexed fields
Third nodes (not merged)
External links (not searchable, link out only)
Context
Discipline (for ranking, not searching)
Algorithm improvements (for current items)
Let's look at this from the Primo perspective.
What is already possible in current version of Primo?
Content
Other resource types can be added, both in Primo Central, by ExLibris; and in Primo
local, by individual libraries.
Indexed fields: extend the PNX search section with extra entries (for authors or subjects,
for instance; this needs normalization rules) plus locally defined fields.
Third nodes: external data sources via API, like distributed federated search (EBSCO,
WorldCat), but results are unclear and cannot be merged very well.
Context
Users can enter their Discipline and Degree, but this is only used for ranking, not for
retrieving.
35. Relevance Redefined and Primo
What is missing?
Content
Internal links
Integrated Primo Central/Local
External links
External indexes
Normalised/multilingual authors/subjects
Context
Context
What is still missing/not possible in Primo?
Internal links: chapter-book(s), article-journal(s), article-datasets, qualitative
relationships etc.
Primo Central-Primo Local: two separate indexes to be searched, no deduplication
etc.
External links, for instance to related content in external databases, not indexed in
Primo: theater performances, research information, etc.
External indexes: non-Primo data sources searchable (maybe with Third Nodes, but
not merged)
Normalised/multilingual indexes: there is no use of identifiers instead of string indexes
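The missing internal links (chapter-book, article-journal) are exactly what an RDF-style triple model provides. A minimal triple-store sketch, with invented item identifiers but a real Dublin Core predicate (`dcterms:isPartOf`), showing how a retrieved chapter could surface its containing book for delivery:

```python
# Toy triple store (subject, predicate, object); identifiers are made up.
TRIPLES = [
    ("item:chapter42", "dcterms:title", "A Chapter on Discovery"),
    ("item:chapter42", "dcterms:isPartOf", "item:book7"),
    ("item:book7", "dcterms:title", "Essays on Library Systems"),
]

def objects(subject, predicate):
    """All objects of triples matching the given subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]

def delivery_options(chapter_id):
    """Follow isPartOf links from a retrieved chapter to its parent book titles."""
    return [title
            for book in objects(chapter_id, "dcterms:isPartOf")
            for title in objects(book, "dcterms:title")]

print(delivery_options("item:chapter42"))
```

In the string-based PNX situation described earlier, the chapter only carries a textual reference to the book; with an explicit link the book record itself (and its delivery options) becomes reachable.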
36. Relevance Redefined and Primo
Options
Content
Universal record format: RDF!
Identifier based authorities: VIAF! MACS? (DBpedia?)
Global metadata index!
Transparent algorithms!
Context
What would Google do?
What options can we distinguish for future Primo development?
Record format not proprietary, but universal: RDF
RDF also requires identifiers + relations (triples).
Existing authorities: VIAF, LCSH, MACS etc. (RDF/Linked data).
Global metadata index: not silos for separate discovery layers, but open, global,
unified format. Could be decentralized, distributed; managed by multiple parties
Transparent algorithms: to make it clear how relevance is computed.
New features announced by Ex Libris on earlier occasions:
URIs in Primo PNX Links Section
Knowledge Graph type additional info (Wikipedia, …)
Announced during conference: Primo/third generation discovery, with related
information and serendipity, using identifiers, external sources, linked data.
Context: Google: next slide.
37. A word about Google vs Primo
Google knows
IP addresses
Account
Searches
Clicks
Location
Primo makes an educated guess
Discipline?
Query type
The difference between Google and library discovery.
Google knows a lot about the user, and can target search results at user's history,
location, email etc.
Library discovery tools do not have that knowledge. They have to guess.
38. VIAF
Example of identifier-based person authority files.
VIAF consolidates names from a large number of authoritative sources.
It also has related names.
39. MACS
Multilingual ACcess to Subjects
Since 1997
Manual linking between strings
New future?
The European Library...
http://www.nb.admin.ch/nb_professionnel/projektarbeit/00729/00733/index.html?lang=en
Example of multilingual subjects.
MACS, started in 1997, is manually maintained, with input from four national libraries. It is
used in The European Library.
Discussed at IFLA 2014 Linked Data for Libraries Satellite Meeting Paris
There are plans for extending and adjusting MACS for future, automated, linked data
concepts.
This would be a very important development.
41. WikiPedia/DBpedia
SLUB Dresden’s local Primo add-on SLUB-Semantics uses multilingual and
disambiguated topics from Wikipedia/DBpedia.
42. But, wait a minute...
RDF?
Identifiers?
Global index?
Transparency?
What are we talking about here? What would be the consequences of applying these
suggestions?
45. NISO Open Discovery Initiative
“Transparency in discovery” 2014
(http://www.niso.org/workrooms/odi/)
“... facilitate increased transparency in the content
coverage of index-based discovery services …
Full transparency will enable libraries to objectively
evaluate discovery services …”
NISO Open Discovery Initiative report 2014 objectives.
Transparency in discovery, sounds promising.
46. NISO Open Discovery Initiative
In scope:
Quantity of content
Form of content
Do not favor or disfavor items from any given
content source or material type
Specific metadata fields indexed
Whether controlled vocabularies or ontologies are
included
NISO ODI topics declared “in scope”
Most of these topics confirm suggestions made in this presentation.
47. NISO Open Discovery Initiative
Out of scope:
“Relevancy ranking” (may fall within the realm of
proprietary technologies used competitively to
differentiate commercial offerings)
APIs exposed by discovery service (initially,
reluctantly)
However: NISO ODI topics declared “out of scope”
Relevance ranking
APIs (system independent access to data, more or less)
These are exactly the things that are most important for transparency in discovery.
48. NISO Open Discovery Initiative
Nothing about:
Content linking/identifiers
Normalised/multilingual authority files
Relevancy ranking
System independent data infrastructure
NISO ODI ignores all issues that improve relevance in discovery.
49. NISO Open Discovery Initiative
Stakeholders/Working group members:
Content providers
Discovery service providers
Libraries
Who’s missing?
The most important stakeholders are missing from the NISO ODI committees: the end users.
50. Relevance redefined
Context
User needs
User input
User feedback
Content
Open connected data
infrastructure
Systems (Primo) Services
Algorithms Transparency
SOA - Service Oriented Architecture + Context
Conclusion/recommendation:
Instead of closed systems with limited content, a transition to a new 3 component
environment is required:
- content (open global data infrastructure)
- context (user needs, input, feedback)
- services, systems that access the content and context layers in transparent ways
SOA! Service Oriented Architecture + Context
How this can be achieved is still to be investigated. However, SOA is already widely
implemented elsewhere.
Linked Open Data is technically possible; we only need the will to cooperate.
Context is the hardest part to realize, but it is not impossible.