Search in Research, Let’s Make it More Complex!
Collaboratively Looking Under the Hood and Its
Consequences
Marijn Koolen
Humanities Cluster - Royal Netherlands Academy of Arts and Sciences
CLARIAH Media Studies Summer School
Netherlands Institute for Sound and Vision, 3 July 2018
Overview
1. Search in Research
a. Search as part of research process
b. Search vs. other access methods
2. Search, Retrieval and Ranking
a. Retrieval Systems, Ranking Algorithms and Relevance Models
3. Searching in Digital Collections
a. Understanding (digital) collections and their construction
b. Tool analysis through experimentation
4. Search Strategies and Corpus Building
a. Systematic searching
b. Search strategies and sampling
1. Search in Research
● Research Phases
○ Exploration, gathering, analysis, synthesis, presentation
○ Extremely non-linear (affordance of digital realm)
● Search happens throughout research process
○ Search phases: pre-focus, focus, post-focus
○ Use different types of collections and search engines
■ General purpose search engines,
■ Domain- and collection-specific (e.g. GLAMs),
■ Personal/private (offline) collections
○ Search strategies:
■ Ad hoc or systematic: berrypicking (Bates 1989), keyword harvesting (Burke 2011), …
■ Important for data and tool criticism
Research Process
● For many online materials access is limited to search interface
○ Browsing is guided by available structure
■ Drill down via facets
■ Navigate via metadata fields (if enabled)
○ Without (relevant) structure, direct search is only practical alternative
● Searching as exploration
○ How does search engine provide overview?
■ How big is collection?
■ How is collection structure communicated?
■ What (meta)data is available?
■ How are search characteristics explained?
■ How are search results summarised?
Search Engine as Mediator
● Browsing takes you past materials you did not set out to find:
○ Navigating your way to relevance
○ Impresses on you what else there is (see also Putnam 2016)
● Keyword search tends to focus on relevance
○ Pushes back related/nearby materials
○ Collection structure can be exposed as facets (restoring overview)
● Search and research methodology
○ Impact of digital keyword search needs to be reflected in methodology
○ How do you account for search process in scholarly communication?
■ Method of citation is based on analogue browse/search in archives and libraries
■ Pre-focus to focus: switch between ad hoc and systematic?
■ Non-linearity: exploration never stops, assumptions constantly challenged
Browsing vs. Keyword Searching
'To take a single example of this disconnect between research process and representation, many of us
use and cite eighteenth and nineteenth-century newspapers as simple hard-copy references without
mention of how we navigated to the specific article, page and issue. In doing so, we actively misrepresent
the limitations within which we are working.' (Hitchcock 2013, 12)
'This is not only about being explicit about our use of keyword searching - it is about moving beyond a
traditional form of scholarship to data modelling and to what Franco Moretti calls “distant reading”.'
(Hitchcock, Confronting the Digital, 2013, p. 19).
Keyword Search and “Confronting the Digital”
Information Search and Seeking
● Search takes place in context
○ Part of seeking, and overall inf. behaviour (Wilson)
○ As inf. behaviour changes (phases), so does seeking and search behaviour
● Reflection-in-action
○ When and where are choice points?
○ How do search actions relate to strategy and inf. need?
Digital Tool Criticism
Search and Accountability
● What should scholars account for?
○ Aspects of sources, tools and process
● Digital source criticism
○ How to evaluate digital sources (Fickers 2012)
○ Who made digital source, when, why, what for, how?
● Digital tool criticism
○ How to evaluate impact of digital tools (Koolen et al. 2018)
○ Reflection-in-action, experimentation
● Data Scopes
○ How to communicate research process to others (Hoekstra & Koolen 2018)
○ Discuss process of selection, modelling, normalization, linking, classification
2. Search, Retrieval and Ranking
Anatomy of Retrieval Process
Retrieval - Matching and Similarity
● Matching based on user query
○ Query: free text, controlled facet, example (doc, AV or text)
○ Matching docs returned in certain order (non-matching are not retrieved)
■ How does search engine perform matching (esp. for free text and example)?
■ Potentially many objects match query: does order matter?
● Similarity
○ Degree of matching: some match better than others (notion of similarity)
■ Retrieve most similar documents first (ranking)
○ Similar how? Does interface explain?
● Retrieval and ranking
○ Retrieval: which matching documents are returned to the user as results?
○ Ranking: in which order are the results returned?
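To make the retrieval/ranking distinction concrete, here is a minimal Python sketch over a toy in-memory collection (illustrative only; no real engine works this simply):

```python
# Toy sketch of the retrieval/ranking distinction.
docs = {
    "d1": "voetbal wedstrijd verslag",
    "d2": "verslag van het debat",
    "d3": "voetbal training",
}

def retrieve(query_terms):
    # Retrieval: which documents match? Here: contain at least one query term.
    return [doc_id for doc_id, text in docs.items()
            if any(term in text.split() for term in query_terms)]

def rank(matched, query_terms):
    # Ranking: order the matches, here simply by the number of matching terms.
    return sorted(matched,
                  key=lambda doc_id: sum(term in docs[doc_id].split()
                                         for term in query_terms),
                  reverse=True)

query = ["voetbal", "verslag"]
print(rank(retrieve(query), query))  # ['d1', 'd2', 'd3']: d1 matches both terms
```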
Retrieval, Ranking and Relevance
● Retrieval results form a set
○ Can be ordered or unordered (e.g. SQL or SPARQL query)
■ Even unordered sets need to be presented to the user in some order
○ Criteria for ordering: alphabetic, size, recency, popularity (views, likes, citations, links)
■ Ordering re-organizes materials, temporarily disrupts “original” organization
■ Provides different view on materials
● Many systems perform relevance ranking
○ Relevant to whom or what?
■ Query: document similarity scores
■ User: e.g. search history, preferences
■ Situation: user, location, time, device, query, work context (page views, annotations)
■ Other aspects: quality, diversity, controversy, polarity, exploration/exploitation, ...
● How does an algorithm understand the notion of relevance?
○ Statistical interpretation:
■ Generally: frequent words carry less signal, look for unexpected stuff
■ Many ways of scoring signal
○ TF-IDF (see sketch below):
■ Term Frequency in document (relevance of term in document)
■ Inverse of Document Frequency in collection (commonness of term across docs)
○ Probabilistic Language Model (PLM):
■ Probability of picking term from document as bag of words (relevance of term in doc)
■ Probability of picking term from collection as bag of words (commonness of term)
○ Many other relevance models, e.g. BM25, DFR, SDM, …
■ Different interpretations of relevance, hence different rankings
Algorithmic Interpretation of Relevance
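A minimal sketch of the TF-IDF and PLM scoring described above, on toy data (real systems add length normalization, tuned smoothing and many other refinements):

```python
import math
from collections import Counter

# Toy collection of three short "documents".
docs = {
    "d1": "voetbal wedstrijd verslag voetbal",
    "d2": "verslag van de wedstrijd",
    "d3": "interview met de trainer",
}
tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
N = len(docs)

df = Counter()  # document frequency: in how many documents does a term occur?
for tokens in tokenized.values():
    df.update(set(tokens))

def tf_idf(term, doc_id):
    tf = tokenized[doc_id].count(term)  # term frequency in the document
    idf = math.log(N / df[term])        # terms rare in the collection score higher
    return tf * idf

def plm(term, doc_id, lam=0.5):
    # Language model with Jelinek-Mercer smoothing: mix the probability of
    # picking the term from the document with that of the whole collection.
    p_doc = tokenized[doc_id].count(term) / len(tokenized[doc_id])
    collection = [t for tokens in tokenized.values() for t in tokens]
    p_coll = collection.count(term) / len(collection)
    return lam * p_doc + (1 - lam) * p_coll

print(tf_idf("voetbal", "d1"))  # 2 * ln(3/1) ≈ 2.20: frequent in d1, rare overall
print(tf_idf("verslag", "d1"))  # 1 * ln(3/2) ≈ 0.41: occurs in two of three docs
print(plm("voetbal", "d1"))     # 0.5 * 2/4 + 0.5 * 2/12 ≈ 0.33
```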
Ranking Issues
● Document length
○ TF-IDF doesn’t model document length, favours longer documents
○ PLM explicitly normalizes on document length, favours shorter documents
○ Upshot: Delpher API returns short documents first for short queries
● Document priors: are all documents equal or not? (see sketch below)
○ Can use document prior probability (independent of query)
○ Can favour documents that are more popular, recent, authoritative, …
○ Can favour documents that are more appropriate for situation (location, time of day, …)
● Problem: how do you know how search engine scores relevance?
○ How much should you know about it?
○ Many GLAM search engines have relatively straightforward relevance models, no doc priors
○ Google uses many hundreds of features for document, query, user and situation
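To illustrate how a query-independent prior can change a ranking, here is a small sketch; the popularity-based prior and all numbers are illustrative assumptions:

```python
# Two documents that match the query equally well; a popularity prior decides.
candidates = {
    "old_article": {"match_score": 0.8, "views": 100},
    "new_article": {"match_score": 0.8, "views": 2400},
}
total_views = sum(doc["views"] for doc in candidates.values())

def final_score(doc_id):
    prior = candidates[doc_id]["views"] / total_views  # query-independent prior
    return prior * candidates[doc_id]["match_score"]

print(sorted(candidates, key=final_score, reverse=True))
# ['new_article', 'old_article']: same match score, the prior breaks the tie
```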
Relevance in Metadata Records
● Relevance ranking of metadata records
○ Metadata records are peculiar textual representations
■ Minimal amount of text, low redundancy
■ Majority of terms occur only once
○ Which part of TF-IDF contributes more to score of metadata record?
○ Which fields are useful/used for matching?
● NISV collection
○ Search engine indexes metadata records
■ Some records have lengthy itemized descriptions, others do not
■ Some have transcripts, others do not
○ Consequences for retrieving? And for ranking?
■ How does search engine handle this?
■ How does search engine communicate this?
● Hard to match keywords against AV signal directly
○ Option: use text representation for AV document
■ E.g. metadata, description, script, speech transcript, ...
○ Option: use AV representation of query
■ E.g. example document or user recording
■ Use audio or visual similarity (again, similar how?)
Retrieving and Ranking Audiovisual Materials
● Experiment to understand search functionalities
○ How can you find out if multiple search terms are treated with Boolean AND or OR operators? (see sketch below)
○ How can you find out if terms are stemmed/normalized?
● Phrase search:
○ What happens when you use quotation marks to group terms into a phrase?
○ How do the results compare to those using no quotation marks?
● Proximity search:
○ Can you specify that terms should be near each other?
● Fuzzy search: wildcard and edit distance searches
○ Controlling lexical variation vs. uncontrolled wildcard search
○ voetbal+voetballen vs. voetbal* (matches voetbalvereniging, voetbalveld, ...)
Opaqueness of Interfaces and Experimentation
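One way to run such experiments programmatically is to compare hit counts across probe queries. The sketch below assumes a hypothetical JSON search API; the endpoint URL and the numFound response field are placeholders to adapt to the actual system under investigation:

```python
import requests

SEARCH_URL = "https://example.org/api/search"  # hypothetical endpoint

def hit_count(query):
    response = requests.get(SEARCH_URL, params={"q": query, "rows": 0})
    response.raise_for_status()
    return response.json()["numFound"]  # assumed field for the total hit count

def probe_boolean(term_a, term_b):
    a, b = hit_count(term_a), hit_count(term_b)
    both = hit_count(f"{term_a} {term_b}")
    if both <= min(a, b):
        return "multi-word queries behave like AND (intersection of hits)"
    if both >= max(a, b):
        return "multi-word queries behave like OR (union of hits)"
    return "mixed behaviour, e.g. ranked OR with an AND-style boost"

print(probe_boolean("voetbal", "wedstrijd"))
```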
● Experiment with Search and Compare tools of the CLARIAH Mediasuite
○ Find out if stopwords are removed
○ Find out if words are stemmed/normalized
○ Find out how multi-word queries are interpreted, i.e. as AND or OR
○ Find out how standard search operators work
■ Boolean AND, OR and NOT
■ Quotation marks for phrases
Exercise
3. Searching in Digital Collections
● Collections of GLAMs are often built up over decades
○ Based on aims and selection criteria
■ Rarely "complete", dependent on availability of materials
○ Digital access via digitization, or digital archiving (born-digital)
■ Some things are lost in this process (e.g. context, quality, …)
● Heterogeneity: mix of object/source types (sub-collections)
○ Different modalities, different ways of accessing and presenting
■ Text vs. Image vs. AV vs. 3D (or 4D)
Nature of Digital Collections
Nature of Metadata
● Digital access via metadata
○ Metadata: data about the object/source
○ Types: formal, structural, technical, administrative, aboutness
○ Metadata fields allow selection and search via specific fields
■ Title, description, creator, creation date, genre, …
○ Allows (seemingly) uniform access to heterogeneous collections
■ But, different materials have different aspects to describe
■ Edition is relevant for books and films, not so much for paintings
● Metadata creation process
○ Often done with limited time, information and system flexibility
○ Inherently subjective, especially content analysis
● Size matters
○ Requirements change as size of collection grows (also depends on expectations)
● Hierarchical organization
○ 4 levels
■ Series: De Wereld Draait Door
■ Season: De Wereld Draait Door 2016
■ Program: De Wereld Draait Door 21-06-2016
■ Segment: De Wereld Draait Door 21-06-2016
○ Each level has a metadata record (with overlapping fields, e.g. title)
● Follows archival standard
○ Describe aspect at highest relevant level
○ Don’t repeat at lower levels unless it deviates (e.g. main titles)
○ Fonds: aggregation of documents from same origin
Archival Structure and NISV Audiovisual Collection
● Power of the archive
○ Problem of perspective (from archive-as-source to archive-as-subject, Stoler 2002)
● History of the archive
○ Collections created over decades often go through changes in
■ selection criteria, cataloguers (human or algorithm),
■ cataloguing budgets, policies, rules, practice and vocabularies,
■ software (migrations and updates), hardware,
■ institutional mission, societal attitudes, …
○ Most of these aspects remain undocumented or partially documented
● Consequences
○ Almost inherently incomplete, inconsistent and sometimes necessarily incorrect
○ After many years, it's hard to retrace what happened
■ and how it affects access, selection and analysis
Digital Source and Data Criticism
[Figure: metadata in theory vs. metadata in practice. Source: Jaap Kamps]
Combined Collections
● Several portals combine (heterogeneous) collections
○ Examples:
■ Europeana, European Newspapers, EUscreen, Nederlab, Delpher, Online Archives of California, …
○ Worldwide aggregated collections:
■ ArchiveGrid (1000+ archives): over 5M finding aids
■ WorldCat (72,000 libraries): 400M records, 2.6B assets, 100M persons
● Huge challenge for source criticism as well as search
○ Collections vary in size, provenance, selection criteria, metadata policies, interpretation and richness
○ Heterogeneous metadata schemas have been mapped to single schema
■ Causes problems for interpretation
■ E.g. what does creator mean for paintings, films, tv series, letters, advertisements, ...?
Assessing Metadata Quality
● Questions
○ What are pitfalls in relying on metadata?
○ How can we evaluate metadata quality?
○ What are relevant aspects to consider?
● Collection inspection
○ In CLARIAH Media Suite we created a tool for inspecting metadata
■ Esp. useful for complex collections like NISV audiovisual collection
■ Somewhat ad hoc, please feel encouraged to give feedback!
○ Please go to the Media Suite and go to the Collection Inspector tool
■ Click on “select field to analyse” and let the interface load the completeness data (this will take a while)
Assessing Timelines and Other Visualizations
● Timeline visualizations give view of temporal spread
○ Very difficult to interpret properly
● Issues with absolute frequencies:
○ Collection materials not evenly distributed
○ Need to compare the query-specific distribution to the collection distribution (see sketch below)
● Issues with relative frequencies:
○ Incompleteness not evenly distributed (use collection inspector)
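The contrast between absolute and relative frequencies can be made explicit with a small calculation (all counts are made up for illustration):

```python
# Yearly collection totals vs. yearly hits for one query.
collection_per_year = {1960: 1200, 1970: 5400, 1980: 9800}
query_hits_per_year = {1960: 12, 1970: 54, 1980: 49}

for year in sorted(collection_per_year):
    absolute = query_hits_per_year.get(year, 0)
    relative = absolute / collection_per_year[year]
    print(year, absolute, f"{relative:.2%}")

# Absolute counts suggest a rise (12, 54, 49), but the relative share
# (1.00%, 1.00%, 0.50%) shows the "trend" partly reflects collection growth.
```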
Retrievability and Metadata Characteristics
● Different types of metadata fields
○ Controlled vocabulary: e.g. broadcast channel (radio or tv)
○ Number: number of episodes/seasons/segments
○ Time/date: program length, recording date
○ Free keyword/keyphrase: title, person name (tend to be non-unique)
○ Free text: description, summary, transcript, … (tend to be unique)
● Different types allow different forms of retrieval and ranking
○ Long text fields have more terms, with higher frequencies
■ Some types of programs have longer descriptions/transcript
■ These match more queries, so higher chance of being retrieved
■ Impact of long text fields on ranking depends on relevance model!
○ Repeated values allow aggregation, navigation
● Some search interfaces offer facets to narrow down search results
○ E.g. broadcaster and genre in the CLARIAH Media Suite
○ Facets provide overview, afford focusing through selection
● How do facets work?
○ Based on metadata fields: rich schema has rich options for facets
○ Types of metadata fields: controlled vocab, number, date, keyword/phrase, free text
■ Facets work for fields with a limited range of values, so not for free-text fields
○ Long tails in facets: typically, few high frequency, many low frequency values
Metadata and Search Facets
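Conceptually, a facet is just a value count over one metadata field of the current result set, as in this toy sketch (records and field names are illustrative):

```python
from collections import Counter

results = [
    {"title": "Journaal", "genre": "news"},
    {"title": "Studio Sport", "genre": "sports"},
    {"title": "De Wereld Draait Door", "genre": "talk show"},
    {"title": "Journaal", "genre": "news"},
]
genre_facet = Counter(record["genre"] for record in results)
print(genre_facet.most_common())  # [('news', 2), ('sports', 1), ('talk show', 1)]
```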
Exercise
● Experiment with the Collection Inspector of the CLARIAH Mediasuite
○ Try out the collection inspector:
■ Scroll through the list of fields to get an idea of what is available
■ Look at completeness of fields, e.g. “genre”, “keywords” and “awards”
■ Which metadata fields are relatively complete?
■ At which archival levels are they most complete?
● Explore which fields are available and which fields make good facets
○ Explore facet distributions in entire collection and for specific queries
4. Search Strategies and Corpus Building
● Importance of selection criteria
○ Do you have to hand pick each document?
○ Or can you select sets based on matching criteria?
○ Is representativeness important? If so, representativeness of what?
○ Or completeness? Why?
● Exploiting facets and dates
○ Filtering: align facets/dates with research focus
○ Sampling: compare across facets
■ Which facet types can you use?
○ Sampling strategies
■ Sample per facet/year (e.g. X items per facet/year; see sketch below)
■ Within facets, select random or not
Searching for Corpus Building
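A possible implementation of the per-facet/year sampling mentioned above, assuming the candidate records have already been gathered and carry the relevant fields:

```python
import random
from collections import defaultdict

records = [
    {"id": 1, "genre": "news", "year": 1995},
    {"id": 2, "genre": "news", "year": 1995},
    {"id": 3, "genre": "news", "year": 1996},
    {"id": 4, "genre": "sports", "year": 1995},
    {"id": 5, "genre": "sports", "year": 1996},
]

def sample_per_stratum(records, keys=("genre", "year"), n=1, seed=42):
    strata = defaultdict(list)
    for record in records:
        strata[tuple(record[key] for key in keys)].append(record)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = []
    for _, items in sorted(strata.items()):
        sample.extend(rng.sample(items, min(n, len(items))))  # n per stratum
    return sample

print([record["id"] for record in sample_per_stratum(records)])
```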
Tracking Context in Corpus Building
● Why were certain documents selected?
○ How were they selected?
○ What strategy was used?
○ Documenting helps you understand and remember these choices (see the log sketch below)
● Do research goals and questions change during collection?
○ Interacting with sources during search updates knowledge structures (Vakkari 2016)
○ Updates tend to be small and incremental, hence barely noticeable
○ Explicit reflection-in-action can bring these to the surface (Koolen et al. 2018)
○ Adding annotations can also provide context
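One lightweight way to add such context is to log each selection step in a structured form, as in this minimal sketch (all field names and values are illustrative):

```python
import datetime
import json

log_entry = {
    "timestamp": datetime.datetime.now().isoformat(),
    "collection": "NISV audiovisual",
    "query": "voetbal vrouwen",
    "filters": {"genre": "news", "years": [1990, 2000]},
    "strategy": "successive segmentation",
    "results_seen": 120,
    "items_selected": [101, 245, 733],
    "rationale": "narrowed broad exploratory query to news coverage",
}
print(json.dumps(log_entry, indent=2))  # store alongside the corpus
```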
Systematic Searching
● Systematic (comprehensive) search has two factors (Yakel 2010):
○ Search strategy (user)
○ Search functionalities (system)
○ Functionalities shape/affect strategy
● Step 1: systematic search for relevant collections online
○ Different collections/sites offer different search functionalities and levels of detail
○ Explicitly address what consequences this has for your strategy and research goals
● Step 2:
○ Explore individual collections using one or more strategies
○ "Researchers need to be flexible and creative to accommodate the vagaries of cataloging
practices." (Yakel 2010, p. 110)
○ Footnote and reference chasing: references often give an "information scent", suggesting
other collections and items to explore.
Search Strategies
● Web search strategies defined by Drabenstott (2001)
○ Discussed in archive context by Yakel (2010)
● Five strategies
○ Synonym generation
○ Chaining
○ Name collection
○ Pearl growing
○ Successive segmentation
● Somewhat related to information seeking patterns by Ellis (1989)
○ Starting, chaining, browsing, differentiating, monitoring, extracting
● Synonym generation: 1) search with relevant term, 2) close read results to
identify related terms (wordclouds, facets), 3) search via related terms for
synonyms.
● Chaining: follow references/citations (explicit or implicit), identify relevant
subset and use explicit structure to explore connected/related subset
● Name collection: search with keywords, identify relevant names, search with
names, identify related names and keywords, repeat. Similar to keyword
harvesting (Burke 2011).
Drabenstott’s Strategies (1/2)
Drabenstott’s Strategies (2/2)
● Pearl growing: start small and focused with specific search terms, slowly
expand out with additional terms to broader topics/themes
● Successive segmentation: opposite of pearl growing; start broad and
increasingly zoom in and focus; e.g. make queries increasingly specific by
adding (ANDing) keywords, replace broad terms with lower frequency terms,
or select facets
Search Strategies and Research Phases
● Research phase
○ Exploration <-> search phase pre-focus
i. Ad hoc, no need yet for systematic search
ii. Mostly pearl growing and/or successive segmentation to determine focus
○ Analysis <-> search phase focus
i. Switch to systematic, determine strategy
ii. Use chaining, name collection, synonym generation (for coverage/representation,
boundaries)
● But reality resists:
○ (Re)search process is very non-linear
○ Boundary between exploration and analysis is not always clear
○ Late discoveries can prompt or force new directions, ...
When To Stop
● Often switch from exploration to “sorta” systematic search
○ But hard to remember and explain what and how you searched
○ Moreover, difficult to determine when to stop
○ Explicit strategy allows for stopping criteria
● Stopping criteria
○ Check whole set/sample, all available facets, ...
○ Diminishing returns: you increasingly encounter items you have already seen, newly relevant items become rare (see sketch below)
○ When stopping, make explicit (at least for yourself) when and why you stopped
● Meta-strategy:
○ Change strategy/tactics
○ E.g. successive segmentation -> harvest keywords -> switch segment -> harvest keywords, ...
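The diminishing-returns criterion can even be operationalised, as in this sketch (the 5% threshold is an illustrative assumption):

```python
# Stop a search strategy when a new batch of results adds almost nothing unseen.
def should_stop(seen, batch, threshold=0.05):
    new_items = [item for item in batch if item not in seen]
    seen.update(new_items)
    return len(new_items) / max(len(batch), 1) < threshold

seen = set()
print(should_stop(seen, ["a", "b", "c", "d"]))  # False: everything is new
print(should_stop(seen, ["a", "b", "c", "d"]))  # True: nothing new any more
```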
Wrap Up
● Search in research
○ How to incorporate these processes in research methodology
● Large, heterogeneous collections introduce issues for research
○ Assessing incompleteness of materials
○ Assessing incompleteness, incorrectness and inconsistency of metadata
● Looking under the hood
○ Evaluating information access functionalities (search and browse)
○ Selecting an appropriate search strategy for research goals
○ Determining success/failure of searches
○ Understanding search for corpus building
References

Burke, T. 2011. How I Talk About Searching, Discovery and Research in Courses. May 9, 2011.
Drabenstott, K.M. 2001. Web Search Strategy Development. Online, 25(4), pp. 18-25.
Fickers, A. 2012. Towards a New Digital Historicism? Doing History in the Age of Abundance. VIEW Journal, 1(1). http://orbilu.uni.lu/bitstream/10993/7615/1/4-4-1-PB.pdf
Hitchcock, T. 2013. Confronting the Digital - Or How Academic History Writing Lost the Plot. Cultural and Social History, 10(1), pp. 9-23. https://doi.org/10.2752/147800413X13515292098070
Hoekstra, R. & Koolen, M. 2018. Data Scopes for Digital History Research. Historical Methods: A Journal of Quantitative and Interdisciplinary History, 51(2).
Koolen, M., van Gorp, J. & van Ossenbruggen, J. 2018. Lessons Learned from a Digital Tool Criticism Workshop. Digital Humanities in the Benelux 2018 Conference.
Putnam, L. 2016. The Transnational and the Text-Searchable: Digitized Sources and the Shadows They Cast. American Historical Review, 121(2), pp. 377-402.
Vakkari, P. 2016. Searching as Learning: A Systematization Based on Literature. Journal of Information Science, 42(1), pp. 7-18.
Yakel, E. 2010. Searching and Seeking in the Deep Web: Primary Sources on the Internet. In Working in the Archives: Practical Research Methods for Rhetoric and Composition, pp. 102-118.
More related content

What's hot
● Presentation Timo Kouwenhoven FIATIFTA (Timo Kouwenhoven)
● "Mass Surveillance" through Distant Reading (Shalin Hai-Jew)
● Semantic Search (sssw2012)
● Large-Scale Semantic Search (Roi Blanco)
● Text REtrieval Conference (TREC) Dynamic Domain Track 2015 (Grace Hui Yang)
● Educational Standards Webinar - Sept 2015 - Patricia Payton (BookNet Canada)
● How search engines work (Anand Saini)
● Graph Models for Deep Learning (Experfy)
● Taxonomy design best practices (voginip)
● Information searching & retrieving techniques (Khalid Mahmood)
● Capitalizing on Machine Reading to Engage Bigger Data (Shalin Hai-Jew)
● Letting the Machine Code Qualitative and Mixed Methods Data in NVivo 10 (Shalin Hai-Jew)
● Bringing semantic publishing into TEI: ideas and pointers (University of Bologna)
● Spatial Decision Support Portal - Presented at AAG 2010 (Nathan Strout)

Similar to Search in Research, Let's Make it More Complex!
● Hobby horses-and-detail-devils-transparency-in-digital-humanities-research-an... (Marijn Koolen)
● A hands-on approach to digital tool criticism: Tools for (self-)reflection (Marijn Koolen)
● Data Scopes - Towards transparent data research in digital humanities (Digita... (Marijn Koolen)
● Managing Ireland's Research Data - 3 Research Methods (Rebecca Grant)
● Starr Hoffman - Data Collection & Research Design
● Search & Recommendation: Birds of a Feather? (Toine Bogers)
● Information Retrieval Fundamentals - An introduction (Grace Hui Yang)
● Trendspotting: Helping you make sense of large information sources (Marieke Guy)
● Data Analytics.03. Data processing (Alex Rayón Jerez)
● Search, Report, Wherever You Are: A Novel Approach to Assessing User Satisfac... (Rachel Vacek)
● Requirements for Learning Analytics (Tore Hoel)
● Tutorial: CASRAI Standards Development (for a non-technology audience) - Davi... (CASRAI)
● A Research Plan to Study Impact of a Collaborative Web Search Tool on Novice'... (Karthikeyan Umapathy)
● A Framework For Effective Content Strategy Based On Heuristic Evaluation (Res... (Nim Dvir)
● QQML Panel 2014: Pratt Institute SILS (A. M. Kelleher)
● Using Qualitative Methods for Library Evaluation: An Interactive Workshop (Lynn Connaway)
● Using Qualitative Methods for Library Evaluation: An Interactive Workshop (OCLC)
● Wheatley and Hervieux "Voice-Assistants, Artificial Intelligence, and the fut...
● Survey Research Methods with Lynn Silipigni Connaway (Lynn Connaway)
● Review of search and retrieval strategies (Abid Fakhre Alam)

More from Marijn Koolen
● Recommender Systems NL Meetup
● Narrative-Driven Recommendation for Casual Leisure Needs
● Digital History - Maritieme Carrieres bij de VOC
● Facilitating reusable third-party annotations in the digital edition
● Scholary Web Annotation - HuC Live 2018
Search in Research, Let's Make it More Complex!

  • 1. Search in Research, Let’s Make it More Complex! Collaboratively Looking Under the Hood and Its Consequences Marijn Koolen Humanities Cluster - Royal Netherlands Academy of Arts and Sciences CLARIAH Media Studies Summer School Netherlands Institute for Sound and Vision, 3 July 2018
  • 2. Overview 1. Search in Research a. Search as part of research process b. Search vs. other access methods 2. Search, Retrieval and Ranking a. Retrieval Systems, Ranking Algorithms and Relevance Models 3. Searching in Digital Collections a. Understanding (digital) collections and their construction b. Tool analysis through experimentation 4. Search Strategies and Corpus Building a. Systematic searching b. Search strategies and sampling
  • 3. 1. Search in Research
  • 4. ● Research Phases ○ Exploration, gathering, analysis, synthesis, presentation ○ Extremely non-linear (affordance of digital realm) ● Search happens throughout research process ○ Search phases: pre-focus, focus, post-focus ○ Use different types of collections and search engines ■ General purpose search engines, ■ Domain- and collection-specific (e.g. GLAMS), ■ Personal/private (offline) collections ○ Search strategies: ■ Ad hoc or systematic: berrypicking (Bates 1989), keyword harvesting (Burke 2011), … ■ Important for data and tool criticism Research Process
  • 5. ● For many online materials access is limited to search interface ○ Browsing is guided by available structure ■ Drill down via facets ■ Navigate via metadata fields (if enabled) ○ Without (relevant) structure, direct search is only practical alternative ● Searching as exploration ○ How does search engine provide overview? ■ How big is collection? ■ How is collection structure communicated? ■ What (meta)data is available? ■ How are search characteristics explained? ■ How are search results summarised? Search Engine as Mediator
  • 6. ● Browsing brings you along unintended materials: ○ Navigating your way to relevance ○ Impresses on you what else there is (see also Putnam 2016) ● Keyword search tends to focus on relevance ○ Pushes back related/nearby materials ○ Collection structure can be enabled to allow faceting (overview) ● Search and research methodology ○ Impact of digital keyword search needs to be reflected in methodology ○ How do you account for search process in scholarly communication? ■ Method of citation is based on analogue browse/search in archives and libraries ■ Pre-focus to focus: switch between ad hoc and systematic? ■ Non-linearity: exploration never stops, assumptions constantly challenged Browsing vs. Keyword Searching
  • 7. 'To take a single example of this disconnect between research process and representation, many of us use and cite eighteenth and nineteenth-century newspapers as simple hard-copy references without mention of how we navigated to the specific article, page and issue. In doing so, we actively misrepresent the limitations within which we are working.' (Hitchcock 2013, 12) 'This is not only about being explicit about our use of keyword searching - it is about moving beyond a traditional form of scholarship to data modelling and to what Franco Moretti calls “distant reading”.' (Hitchcock, Confronting the Digital, 2013, p. 19). Keyword Search and “Confronting the Digital”
  • 8. Information Search and Seeking ● Search takes place in context ○ Part of seeking, and overall inf. behaviour (Wilson) ○ As inf. behaviour changes (phases), so does seeking and search behaviour ● Reflection-in-action ○ When and where are choice points? ○ How do search actions relate to strategy and inf. need?
  • 10. Search and Accountability ● What should scholars account for? ○ Aspects of sources, tools and process ● Digital source criticism ○ How to evaluate digital sources (Fickers 2012) ○ Who made digital source, when, why, what for, how? ● Digital tool criticism ○ How to evaluate impact of digital tools (Koolen et al. 2018) ○ Reflection-in-action, experimentation ● Data Scopes ○ How to communicate research process to others (Hoekstra & Koolen 2018) ○ Discuss process of selection, modelling, normalization, linking, classification
  • 11. 2. Search, Retrieval and Ranking
  • 13. Retrieval - Matching and Similarity ● Matching based on user query ○ Query: free text, controlled facet, example (doc, AV or text) ○ Matching docs returned in certain order (non-matching are not retrieved) ■ How does search engine perform matching (esp. for free text and example)? ■ Potentially many objects match query: does order matter? ● Similarity ○ Degree of matching: some match better than others (notion of similarity) ■ Retrieve most similar documents first (ranking) ○ Similar how? Does interface explain? ● Retrieval and ranking ○ Retrieval: which matching documents are returned to the user as results? ○ Ranking: in which order are the results returned?
  • 14. Retrieval, Ranking and Relevance ● Retrieval results form a set ○ Can be ordered or unordered (e.g. SQL or SPARQL query) ■ Even unordered sets need to be presented to the user in some order ○ Criteria for ordering: alphabetic, size, recency, popularity (views, likes, citations, links) ■ Ordering re-organizes materials, temporarily disrupts “original” organization ■ Provides different view on materials ● Many systems perform relevance ranking ○ Relevant to who or what? ■ Query: document similarity scores ■ User: e.g. search history, preferences ■ Situation: user, location, time, device, query, work context (page views, annotations) ■ Other aspects: quality, diversity, controversy, polarity, exploration/exploitation, ...
  • 15. ● How does an algorithm understand the notion of relevance? ○ Statistical interpretation: ■ Generally: frequent words carry less signal, look for unexpected stuff ■ Many ways of scoring signal ○ TF-IDF: ■ Term Frequency in document (relevance of term in document) ■ Inverse of Document Frequency in collection (commonness of term across docs) ○ Probabilistic Language Model (PLM): ■ Probability of picking term from document as bag of words (relevance of term in doc) ■ Probability of picking term from collection as bag of words (commonness of term) ○ Many other relevance models, e.g. BM25, DFR, SDM, … ■ Different interpretations of relevance, hence different rankings Algorithmic Interpretation of Relevance
  • 16.
  • 17.
  • 18. Ranking Issues ● Document length ○ TF-IDF doesn’t model document length, favours longer documents ○ PLM explicitly normalizes on document length, favours shorter documents ○ Upshot: Delpher API returns short documents first for short queries ● Document priors: are all documents equal or not? ○ Can use document prior probability (independent of query) ○ Can favour documents that are more popular, recent, authoritative, … ○ Can favour documents that are more appropriate for situation (location, time of day, …) ● Problem: how do you know how search engine scores relevance? ○ How much should you know about it? ○ Many GLAM search engines have relatively straightforward relevance models, no doc priors ○ Google uses many hundreds of features for document, query, user and situation
Relevance in Metadata Records
● Relevance ranking of metadata records
○ Metadata records are peculiar textual representations
■ Minimal amount of text, low redundancy
■ Majority of terms occur only once
○ Which part of TF-IDF contributes more to score of metadata record?
○ Which fields are useful/used for matching?
● NISV collection
○ Search engine indexes metadata records
■ Some records have lengthy itemized descriptions, some do not
■ Some have transcripts, some do not
○ Consequences for retrieving? And for ranking?
■ How does search engine handle this?
■ How does search engine communicate this?
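Since nearly every term in a sparse metadata record occurs exactly once, the TF component is constant and score differences come almost entirely from IDF, as this small sketch (invented records) illustrates:

```python
import math

# Invented metadata-style records: TF = 1 for almost every term,
# so per-record TF-IDF differences reduce to IDF alone.
records = [
    "journaal uitzending amsterdam".split(),
    "journaal uitzending rotterdam".split(),
    "documentaire rotterdam haven".split(),
]
N = len(records)

def idf(term):
    df = sum(1 for r in records if term in r)
    return math.log(N / df) if df else 0.0

for term in ["journaal", "haven"]:
    print(term, round(idf(term), 3))
# journaal 0.405 vs haven 1.099: the rarer term dominates the score.
```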
Retrieving and Ranking Audiovisual Materials
● Hard to match keywords against AV signal directly
○ Option: use text representation for AV document
■ E.g. metadata, description, script, speech transcript, ...
○ Option: use AV representation of query
■ E.g. example document or user recording
■ Use audio or visual similarity (again, similar how?)
Opaqueness of Interfaces and Experimentation
● Experiment to understand search functionalities (see the probing sketch below)
○ How can you find out if multiple search terms are treated with Boolean AND or OR operators?
○ How can you find out if terms are stemmed/normalized?
● Phrase search:
○ What happens when you use quotation marks to group terms into a phrase?
○ How do the results compare to those using no quotation marks?
● Proximity search:
○ Can you specify that terms should be near each other?
● Fuzzy search: wildcard and edit distance searches
○ Controlling lexical variation vs. uncontrolled wildcard search
○ voetbal+voetballen vs. voetbal* (matches voetbalvereniging, voetbalveld, ...)
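One way to run such experiments half-systematically: wrap the engine you are testing in a function that returns result ids (or just counts) and compare query variants. Everything here is a hypothetical stand-in, not the API of any particular system:

```python
# `search` is a placeholder for whatever interface or API you are probing;
# here it is assumed to return the set of result ids for a query string.
def search(query: str) -> set:
    raise NotImplementedError("replace with a call to the engine under test")

def probe_default_operator(a: str, b: str) -> str:
    both = search(f"{a} {b}")
    if both == search(f"{a} AND {b}"):
        return "default looks like AND"
    if both == search(f"{a} OR {b}"):
        return "default looks like OR"
    return "neither exactly: perhaps OR-matching with AND-matches ranked first"

def probe_stemming(singular: str, plural: str) -> bool:
    # If the engine stems/normalizes, both word forms should hit the same set.
    return search(singular) == search(plural)

def probe_phrase(a: str, b: str) -> bool:
    # Quoted-phrase results should be a subset of the unquoted results.
    return search(f'"{a} {b}"') <= search(f"{a} {b}")
```

When the interface only shows totals, comparing result counts instead of id sets already reveals a lot.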
Exercise
● Experiment with the Search and Compare tools of the CLARIAH Media Suite
○ Find out if stopwords are removed
○ Find out if words are stemmed/normalized
○ Find out how multi-word queries are interpreted, i.e. as AND or OR
○ Find out how standard search operators work
■ Boolean AND, OR and NOT
■ Quotation marks for phrases
3. Searching in Digital Collections
Nature of Digital Collections
● Collections of GLAMs are often built up over decades
○ Based on aims and selection criteria
■ Rarely "complete", dependent on availability of materials
○ Digital access via digitization, or digital archiving (born-digital)
■ Some things are lost in this process (e.g. context, quality, …)
● Heterogeneity: mix of object/source types (sub-collections)
○ Different modalities, different ways of accessing and presenting
■ Text vs. Image vs. AV vs. 3D (or 4D)
Nature of Metadata
● Digital access via metadata
○ Metadata: data about the object/source
○ Types: formal, structural, technical, administrative, aboutness
○ Metadata fields allow selection and search via specific fields
■ Title, description, creator, creation date, genre, …
○ Allows (seemingly) uniform access to heterogeneous collections
■ But, different materials have different aspects to describe
■ Edition is relevant for books and films, not so much for paintings
● Metadata creation process
○ Often done with limited time, information and system flexibility
○ Inherently subjective, especially content analysis
● Size matters
○ Requirements change as size of collection grows (also depends on expectations)
Archival Structure and NISV Audiovisual Collection
● Hierarchical organization
○ 4 levels
■ Series: De Wereld Draait Door
■ Season: De Wereld Draait Door 2016
■ Program: De Wereld Draait Door 21-06-2016
■ Segment: De Wereld Draait Door 21-06-2016
○ Each level has a metadata record (with overlap in fields, e.g. title)
● Follows archival standard
○ Describe aspect at highest relevant level
○ Don’t repeat at lower levels unless it deviates (e.g. main titles)
○ Fonds: aggregation of documents from same origin
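The describe-at-the-highest-level rule can be pictured as field inheritance down the hierarchy. A minimal sketch with invented field names (real NISV records are far richer):

```python
# Each level holds only the fields that deviate from its parent.
series  = {"title": "De Wereld Draait Door", "genre": "talkshow"}
season  = {"title": "De Wereld Draait Door 2016", "parent": series}
program = {"title": "De Wereld Draait Door 21-06-2016", "parent": season}
segment = {"parent": program}

def resolve(record, field):
    # Walk up the hierarchy until the field is found at some level.
    while record is not None:
        if field in record:
            return record[field]
        record = record.get("parent")
    return None

print(resolve(segment, "genre"))  # "talkshow", inherited from the series level
```

This also suggests why a query on an inherited field may only match the higher-level record, unless the indexing propagates such fields down to segments.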
Digital Source and Data Criticism
● Power of the archive
○ Problem of perspective (from archive-as-source to archive-as-subject, Stoler 2002)
● History of the archive
○ Collections created over decades often go through changes in
■ selection criteria, cataloguers (human or algorithm),
■ cataloguing budgets, policies, rules, practice and vocabularies,
■ software (migrations and updates), hardware,
■ institutional mission, societal attitudes, …
○ Most of these aspects remain undocumented or partially documented
● Consequences
○ Almost inherently incomplete, inconsistent and sometimes necessarily incorrect
○ After many years, it's hard to retrace what happened
■ and how it affects access, selection and analysis
[Figure: metadata in theory vs. metadata in practice. Source: Jaap Kamps]
Combined Collections
● Several portals combine (heterogeneous) collections
○ Examples:
■ Europeana, European Newspapers, EUscreen, Nederlab, Delpher, Online Archive of California, …
○ Worldwide aggregated collections:
■ ArchiveGrid (1000+ archives): over 5M finding aids
■ WorldCat (72,000 libraries): 400M records, 2.6B assets, 100M persons
● Huge challenge for source criticism as well as search
○ Collections vary in size, provenance, selection criteria, metadata policies, interpretation and richness
○ Heterogeneous metadata schemas have been mapped to single schema
■ Causes problems for interpretation
■ E.g. what does creator mean for paintings, films, tv series, letters, advertisements, ...?
Assessing Metadata Quality
● Questions
○ What are pitfalls in relying on metadata?
○ How can we evaluate metadata quality?
○ What are relevant aspects to consider?
● Collection inspection
○ In the CLARIAH Media Suite we created a tool for inspecting metadata
■ Esp. useful for complex collections like the NISV audiovisual collection
■ Somewhat ad hoc, please feel encouraged to give feedback!
○ Please go to the Media Suite and open the Collection Inspector tool
■ Click on “select field to analyse” and let the interface load the data on completeness (this will take a while)
Assessing Timelines and Other Visualizations
● Timeline visualizations give view of temporal spread
○ Very difficult to interpret properly
● Issues with absolute frequencies:
○ Collection materials not evenly distributed
○ Need to compare query-specific distribution to collection-distribution (see the sketch below)
● Issues with relative frequencies:
○ Incompleteness not evenly distributed (use collection inspector)
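A small sketch of the normalization step, with invented counts: divide a query's yearly hits by the collection's yearly totals before reading any trend into a timeline.

```python
# Invented counts: a query's yearly hits against the collection's yearly totals.
query_hits  = {1960: 20, 1970: 40, 1980: 80}
coll_totals = {1960: 1000, 1970: 2000, 1980: 8000}

for year in sorted(query_hits):
    share = query_hits[year] / coll_totals[year]
    print(year, query_hits[year], f"{share:.1%}")
# Absolute hits double every decade, yet the 1980 share (1.0%) is lower
# than in 1960/1970 (2.0%): the apparent peak mirrors collection growth.
```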
Retrievability and Metadata Characteristics
● Different types of metadata fields
○ Controlled vocabulary: e.g. broadcast channel (radio or tv)
○ Number: number of episodes/seasons/segments
○ Time/date: program length, recording date
○ Free keyword/keyphrase: title, person name (tend to be non-unique)
○ Free text: description, summary, transcript, … (tend to be unique)
● Different types allow different forms of retrieval and ranking
○ Long text fields have more terms, with higher frequencies
■ Some types of programs have longer descriptions/transcripts
■ These match more queries, so higher chance of being retrieved
■ Impact of long text fields on ranking depends on relevance model!
○ Repeated values allow aggregation, navigation
Metadata and Search Facets
● Some search interfaces offer facets to narrow down search results
○ E.g. broadcaster and genre in the CLARIAH Media Suite
○ Facets provide overview, afford focusing through selection
● How do facets work?
○ Based on metadata fields: rich schema has rich options for facets
○ Types of metadata fields: controlled vocab, number, date, keyword/phrase, free text
■ Facets work for fields with a limited range of values, so not free text fields
○ Long tails in facets: typically few high-frequency and many low-frequency values (see the sketch below)
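Facet counts are essentially value counts over a field. A sketch with invented genre values showing the typical long tail:

```python
from collections import Counter

# Invented genre values: a few dominant values plus a long tail of one-offs.
genres = (["nieuws"] * 50 + ["talkshow"] * 20 + ["sport"] * 15
          + ["quiz", "cabaret", "opera", "poppenspel"])

for value, count in Counter(genres).most_common():
    print(value, count)
# Interfaces usually show only the top of this list; the tail stays hidden
# unless you inspect the full distribution.
```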
Exercise
● Experiment with the Collection Inspector of the CLARIAH Media Suite
○ Try out the collection inspector:
■ Scroll through the list of fields to get an idea of what is available
■ Look at completeness of fields for e.g. “genre”, “keywords” and “awards”
■ Which metadata fields are relatively complete?
■ At which archival levels are they most complete?
● Explore which fields are available and which fields make good facets
○ Explore facet distributions in entire collection and for specific queries
4. Search Strategies and Corpus Building
Searching for Corpus Building
● Importance of selection criteria
○ Do you have to hand-pick each document?
○ Or can you select sets based on matching criteria?
○ Is representativeness important? If so, representativeness of what?
○ Or completeness? Why?
● Exploiting facets and dates
○ Filtering: align facets/dates with research focus
○ Sampling: compare across facets
■ Which facet types can you use?
○ Sampling strategies (see the sketch below)
■ Sample per facet/year (e.g. X items per facet/year)
■ Within facets, select random or not
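A sketch of the sample-per-facet strategy: take up to X random items per stratum. The field names "genre" and "year" are assumptions for illustration, not a real schema:

```python
import random
from collections import defaultdict

# `items` stands in for metadata records you retrieved earlier.
def sample_per_facet(items, x=5, seed=42):
    strata = defaultdict(list)
    for item in items:
        strata[(item["genre"], item["year"])].append(item)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = []
    for _, bucket in sorted(strata.items()):
        rng.shuffle(bucket)
        sample.extend(bucket[:x])
    return sample
```

Recording the seed and the per-stratum sizes is exactly the kind of selection context that needs documenting, as discussed below.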
Tracking Context in Corpus Building
● Why were certain documents selected?
○ How were they selected?
○ What strategy was used?
○ Documenting helps in understanding and remembering these choices
● Do research goals and questions change during collection?
○ Interacting with sources during search updates knowledge structures (Vakkari 2016)
○ Updates tend to be small and incremental, hence barely noticeable
○ Explicit reflection-in-action can bring these to the surface (Koolen et al. 2018)
○ Adding annotations can also provide context
Systematic Searching
● Systematic (comprehensive) search has two factors (Yakel 2010):
○ Search strategy (user)
○ Search functionalities (system)
○ Functionalities shape/affect strategy
● Step 1: systematic search for relevant collections online
○ Different collections/sites offer different search functionalities and levels of detail
○ Explicitly address what consequences this has for your strategy and research goals
● Step 2: explore individual collections using one or more strategies
○ "Researchers need to be flexible and creative to accommodate the vagaries of cataloging practices." (Yakel 2010, p. 110)
○ Footnote and reference chasing: references often give an "information scent", suggesting other collections and items to explore
Search Strategies
● Web search strategies defined by Drabenstott (2001)
○ Discussed in archive context by Yakel (2010)
● Five strategies
○ Synonym generation
○ Chaining
○ Name collection
○ Pearl growing
○ Successive segmentation
● Somewhat related to information seeking patterns by Ellis (1989)
○ Starting, chaining, browsing, differentiating, monitoring, extracting
Drabenstott’s Strategies (1/2)
● Synonym generation: 1) search with relevant term, 2) close read results to identify related terms (wordclouds, facets), 3) search via related terms for synonyms
● Chaining: follow references/citations (explicit or implicit), identify relevant subset and use explicit structure to explore connected/related subset
● Name collection: search with keywords, identify relevant names, search with names, identify related names and keywords, repeat. Similar to keyword harvesting (Burke 2011)
Drabenstott’s Strategies (2/2)
● Pearl growing: start small and focused with specific search terms, slowly expand out with additional terms to broader topics/themes
● Successive segmentation: opposite of pearl growing; start broad and increasingly zoom in and focus; e.g. make queries increasingly specific by adding (ANDing) keywords, replace broad terms with lower-frequency terms, or select facets
Search Strategies and Research Phases
● Research phase
○ Exploration <-> search phase pre-focus
i. Ad hoc, no need yet for systematic search
ii. Mostly pearl growing and/or successive segmentation to determine focus
○ Analysis <-> search phase focus
i. Switch to systematic, determine strategy
ii. Use chaining, name collection, synonym generation (for coverage/representation, boundaries)
● But reality resists:
○ (Re)search process is very non-linear
○ Boundary between exploration and analysis is not always clear
○ Late discoveries can prompt or force new directions, ...
When To Stop
● Often switch from exploration to “sorta” systematic search
○ But hard to remember and explain what and how you searched
○ Moreover, difficult to determine when to stop
○ Explicit strategy allows for stopping criteria
● Stopping criteria (see the sketch below)
○ Check whole set/sample, all available facets, ...
○ Diminishing returns: you increasingly encounter seen things, newly relevant items become rare
○ When stopping, make explicit (at least for yourself) when and why you stopped
● Meta-strategy:
○ Change strategy/tactics
○ E.g. successive segmentation -> harvest keywords -> switch segment -> harvest keywords, ...
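A diminishing-returns criterion can be made operational quite simply. A sketch, assuming you process results in batches (e.g. result pages) and track which item ids you have already seen:

```python
# Stop once a new batch of result ids contributes fewer than
# `threshold` previously unseen items.
def stopping_point(batches, threshold=2):
    seen = set()
    for i, batch in enumerate(batches, start=1):
        new = set(batch) - seen
        seen |= new
        if len(new) < threshold:
            return i, len(seen)  # batch at which to stop, total unique items
    return len(batches), len(seen)

pages = [["a", "b", "c"], ["c", "d", "e"], ["a", "e", "f"]]
print(stopping_point(pages))  # (3, 6): the third page adds only "f"
```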
Wrap Up
● Search in research
○ How to incorporate these processes in research methodology
● Large, heterogeneous collections introduce issues for research
○ Assessing incompleteness of materials
○ Assessing incompleteness, incorrectness and inconsistency of metadata
● Looking under the hood
○ Evaluating information access functionalities (search and browse)
○ Selecting an appropriate search strategy for research goals
○ Determining success/failure of searches
○ Understanding search for corpus building
References
Burke, T. 2011. How I Talk About Searching, Discovery and Research in Courses. May 9, 2011.
Drabenstott, K.M. 2001. Web Search Strategy Development. Online, 25(4), pp. 18-25.
Fickers, A. 2012. Towards a New Digital Historicism? Doing History in the Age of Abundance. VIEW Journal, 1(1). http://orbilu.uni.lu/bitstream/10993/7615/1/4-4-1-PB.pdf
Hitchcock, T. 2013. Confronting the Digital - Or How Academic History Writing Lost the Plot. Cultural and Social History, 10(1), pp. 9-23. https://doi.org/10.2752/147800413X13515292098070
Hoekstra, R. & M. Koolen. 2018. Data Scopes for Digital History Research. Historical Methods: A Journal of Quantitative and Interdisciplinary History, 51(2).
Koolen, M., J. van Gorp & J. van Ossenbruggen. 2018. Lessons Learned from a Digital Tool Criticism Workshop. Digital Humanities in the Benelux 2018 Conference.
Putnam, L. 2016. The Transnational and the Text-Searchable: Digitized Sources and the Shadows They Cast. American Historical Review, 121(2), pp. 377-402.
Vakkari, P. 2016. Searching as Learning: A Systematization Based on Literature. Journal of Information Science, 42(1), pp. 7-18.
Yakel, E. 2010. Searching and Seeking in the Deep Web: Primary Sources on the Internet. In: Working in the Archives: Practical Research Methods for Rhetoric and Composition, pp. 102-118.