Introduction
Lucene
MTAS
Tokenizer FoLiA
Search using CQL
Results
Multi Tier Annotation Search
MTAS
Matthijs Brouwer
Meertens Institute
December 8, 2015
Matthijs Brouwer Multi Tier Annotation Search
1 Introduction
2 Lucene
3 MTAS
4 Tokenizer FoLiA
5 Search using CQL
6 Results
Text and Metadata
Annotated Text
Requirements
Provide Search on Combination of Text and Metadata
Example data
Author          Eduard Douwes Dekker
Place of birth  Amsterdam
Date of birth   March 2, 1820
Pseudonym       Multatuli
Title           Max Havelaar
Published       1860
Text            Ik ben makelaar in koffie, en woon op de Lauriergracht no 37 ...
Solution based on Apache Solr
Reverse Index
Apache Solr (based on Apache Lucene)
Index on both Text and Metadata
Advantages
Search
Facets
Scalable
Custom plugin (join)
Actively developed
Search Text
'Ik ben makelaar in koffie, en woon op de Lauriergracht no 37.'
We can search for
"Makelaar"
"Makelaar in koffie"
"Makel.* in koffie"
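The three searches above can be imitated with a small sketch. It assumes a naive lowercase word tokenizer and matches each query element as a regular expression against one whole token; the helper names `tokenize` and `phrase_search` are made up for this illustration and do not reflect how Solr evaluates queries internally.

```python
import re

TEXT = "Ik ben makelaar in koffie, en woon op de Lauriergracht no 37."


def tokenize(text):
    """Lowercased word tokens; punctuation is dropped in this sketch."""
    return re.findall(r"\w+", text.lower())


def phrase_search(tokens, patterns):
    """Return start positions where consecutive tokens fully match the patterns."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    hits = []
    for start in range(len(tokens) - len(compiled) + 1):
        window = tokens[start:start + len(compiled)]
        if all(rx.fullmatch(tok) for rx, tok in zip(compiled, window)):
            hits.append(start)
    return hits


tokens = tokenize(TEXT)
single_term = phrase_search(tokens, ["Makelaar"])
phrase = phrase_search(tokens, ["Makelaar", "in", "koffie"])
phrase_with_regex = phrase_search(tokens, ["Makel.*", "in", "koffie"])
print(single_term, phrase, phrase_with_regex)
```

All three queries find the same starting token, the third word of the sentence.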
Annotations
'Ik ben makelaar in koffie, en woon op de Lauriergracht no 37.'
text      lemma     pos/features
Ik        ik        VNW(pers,pron,nomin,vol,1,ev)
ben       zijn      WW(pv,tgw,ev)
makelaar  makelaar  N(soort,ev,basis,zijd,stan)
in        in        VZ(init)
koffie    koffie    N(soort,ev,basis,zijd,stan)
,         ,         LET()
...       ...       ...
FoLiA
<text xml:id="untitled.text">
  <p xml:id="untitled.p.1">
    <s xml:id="untitled.p.1.s.1">
      <w xml:id="untitled.p.1.s.1.w.1" class="WORD">
        <t>Ik</t>
        <pos class="VNW(pers,pron,nomin,vol,1,ev)" confidence="0.999791" head="VNW">
          <feat class="pers" subset="vwtype"/>
          <feat class="pron" subset="pdtype"/>
          <feat class="nomin" subset="naamval"/>
          <feat class="vol" subset="status"/>
          <feat class="1" subset="persoon"/>
          <feat class="ev" subset="getal"/>
        </pos>
        <morphology>
          <morpheme>
            <t offset="0">ik</t>
          </morpheme>
        </morphology>
        <lemma class="ik"/>
      </w>
      ...
Required functionality
Extend current Solr solution
Search on annotations like pos, lemma, features, ...
Search on sentences, paragraphs, chapters, ...
Search on entities and chunks
Search on dependencies
Statistics, grouping, facets, ...
Important
Maintaining functionality and scalability
Upgradeable to new releases of Solr/Lucene
Tokenization
Reverse Index
Limitations
Alternatives
Tokenization
Something about Lucene internals
Focus on text
Tokenization
Text is split up into tokens
value, e.g. "koffie"
position, e.g. 4
offset, e.g. 19-24
payload, e.g. 1.000
'Ik ben makelaar in koffie, en woon op de Lauriergracht no 37.'
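As a rough illustration of the token attributes listed above, the following sketch splits the example sentence into words and punctuation marks and records value, position and offset for each. The `Token` type and the `tokenize` helper are hypothetical names. The end offset is inclusive here so that it agrees with the numbers on the slide (Lucene's own offset attribute uses an exclusive end), and the payload is left out.

```python
import re
from typing import NamedTuple

TEXT = "Ik ben makelaar in koffie, en woon op de Lauriergracht no 37."


class Token(NamedTuple):
    value: str
    position: int   # 0-based index of the token in the stream
    start: int      # character offset of the first character
    end: int        # character offset of the last character (inclusive)


def tokenize(text):
    """Split text into word and punctuation tokens with positions and offsets."""
    return [
        Token(m.group(), i, m.start(), m.end() - 1)
        for i, m in enumerate(re.finditer(r"\w+|[^\w\s]", text))
    ]


tokens = tokenize(TEXT)
print(tokens[4])
```

With punctuation counted as a token, "koffie" ends up on position 4 with offsets 19 to 24, as on the slide.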
Reverse Index
Token stream used to construct Reverse Index
text      document  position  offset  payload
ben       0         1         3-5     0.500
de        0         9         38-39   0.200
en        0         6         27-28   0.250
in        0         3         16-17   0.350
koffie    0         4         19-24   0.900
makelaar  0         2         7-14    0.800
...       ...       ...       ...     ...
This enables fast search, since the locations of matching terms can
be found very quickly.
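A minimal sketch of such a reverse index, assuming the token stream is already available as (value, position, start, end) tuples. The structure is a plain dictionary from term to a list of postings; `build_inverted_index` is an illustrative helper, not Lucene's API, and payloads are omitted.

```python
from collections import defaultdict

# First tokens of document 0: (value, position, start offset, end offset)
DOC0 = [
    ("ik", 0, 0, 1),
    ("ben", 1, 3, 5),
    ("makelaar", 2, 7, 14),
    ("in", 3, 16, 17),
    ("koffie", 4, 19, 24),
]


def build_inverted_index(docs):
    """Map every term to the list of (document, position, start, end) postings."""
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        for value, position, start, end in tokens:
            index[value].append((doc_id, position, start, end))
    # Terms are kept sorted, as in the table above.
    return dict(sorted(index.items()))


index = build_inverted_index({0: DOC0})
print(index["koffie"])
```

Looking up a term is a single dictionary access, which is why the locations of matching terms can be found quickly.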
Limitations
Limitations of this approach
Heavily based on grouping by document
Collecting statistics
Grouping results
Not possible to include
Structural information: sentences, paragraphs, ...
Annotations: pos, lemmas, ...
Relations: dependencies, chunking, ...
No real forward index
Finding all tokens for a given position
Alternatives
Alternative solutions
Graph Database
Experiments with Neo4j: problems with scalability and performance
Too general, doesn't use the sequential nature of textual data
BlackLab
Based on Lucene, no integration with Solr
Different fields for each annotation layer
General
Prefixes
Payload
Forward Indexes
Additional requirements
Extension provided by MTAS
Store multiple tokens on the same position, and use prefixes to distinguish between different layers of annotations
Use the payload to encode additional information on each token
Construct forward indexes by extending the Lucene Codec
Implementation
Extension based on the Lucene Library
Provide query handlers for extended data structures
Provide Solr Plugin using the MTAS extension
Prefixes
Store multiple tokens on the same position, and use prefixes to distinguish between different layers of annotations
text        document  position
lemma:de    0         9
lemma:zijn  0         1
...         ...       ...
pos:LID     0         9
pos:WW      0         1
...         ...       ...
t:ben       0         1
t:de        0         9
...         ...       ...
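The prefix scheme can be sketched as follows, assuming each word arrives with its text, lemma and part-of-speech head. Every layer becomes its own term of the form `prefix:value`, all stored on the position of the word; the `prefixed_terms` helper and the row layout are assumptions made for this illustration.

```python
from collections import defaultdict

# (position, text, lemma, pos) for two words of document 0
ANNOTATED = [
    (1, "ben", "zijn", "WW"),
    (9, "de", "de", "LID"),
]


def prefixed_terms(rows, doc_id=0):
    """Index each annotation layer as a separate prefixed term on the word's position."""
    index = defaultdict(list)
    for position, text, lemma, pos in rows:
        for prefix, value in (("t", text), ("lemma", lemma), ("pos", pos)):
            index[f"{prefix}:{value}"].append((doc_id, position))
    return dict(sorted(index.items()))


index = prefixed_terms(ANNOTATED)
for term, postings in index.items():
    print(term, postings)
```

Because the terms are sorted, all values of one layer sit next to each other in the term dictionary, which is the grouping visible in the table above.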
Payload
Use the payload to encode additional information on each token
mtas id      integer identifying token within a document
position     type of position: single, range or set;
             additional information for range or set
offset       start and end offset
real offset  start and end real offset
parent       reference to another token by its mtas id
payload      original payload
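The fields listed above could be modelled as a small record. This is only a sketch of the information carried per token; the class name `MtasToken`, the field names and the `classify_positions` helper are hypothetical, and the actual binary payload encoding used by MTAS is not shown on the slides. The slides do not define "real offset", so the example simply reuses the offset values.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


def classify_positions(positions):
    """Return 'single', 'range' or 'set' for a collection of token positions."""
    ps = sorted(set(positions))
    if len(ps) == 1:
        return "single"
    if ps[-1] - ps[0] + 1 == len(ps):
        return "range"   # contiguous block, e.g. a sentence
    return "set"         # discontinuous positions


@dataclass(frozen=True)
class MtasToken:
    mtas_id: int                      # identifies the token within a document
    term: str                         # prefixed value, e.g. "t:koffie"
    positions: Tuple[int, ...]
    offset: Tuple[int, int]           # start and end offset
    real_offset: Tuple[int, int]      # start and end real offset
    parent: Optional[int] = None      # mtas_id of another token
    payload: Optional[float] = None   # original payload

    @property
    def position_type(self):
        return classify_positions(self.positions)


sentence = MtasToken(0, "s", tuple(range(14)), (0, 60), (0, 60))
word = MtasToken(5, "t:koffie", (4,), (19, 24), (19, 24), parent=0, payload=0.9)
print(sentence.position_type, word.position_type)
```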
Forward Indexes
Construct forward indexes by extending the Lucene Codec
Position         Given the position within the document, return references to all objects on that position.
Parent Id        Given the mtas id, return references to all objects referring to this mtas id as parent.
Object Id        Given the id, return a reference to the object.
Prefix/Position  Given prefix and position, return the value.
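The four lookups can be sketched with in-memory dictionaries. In MTAS they are built by extending the Lucene codec; the version below only illustrates what each index answers, and the tuple layout and the `build_forward_indexes` helper are assumptions made for this example.

```python
from collections import defaultdict

# (mtas_id, prefix, value, positions, parent mtas_id)
OBJECTS = [
    (0, "s", "", (0, 1), None),
    (1, "t", "Ik", (0,), 0),
    (2, "lemma", "ik", (0,), 0),
    (3, "t", "ben", (1,), 0),
    (4, "lemma", "zijn", (1,), 0),
]


def build_forward_indexes(objects):
    """Build the position, parent, object id and prefix/position lookups."""
    by_position = defaultdict(list)   # position -> ids of objects on that position
    by_parent = defaultdict(list)     # parent id -> ids of objects referring to it
    by_id = {}                        # id -> object
    by_prefix_position = {}           # (prefix, position) -> value
    for obj in objects:
        mtas_id, prefix, value, positions, parent = obj
        by_id[mtas_id] = obj
        if parent is not None:
            by_parent[parent].append(mtas_id)
        for position in positions:
            by_position[position].append(mtas_id)
            by_prefix_position[(prefix, position)] = value
    return by_position, by_parent, by_id, by_prefix_position


by_position, by_parent, by_id, by_prefix_position = build_forward_indexes(OBJECTS)
print(by_position[1])                        # everything on position 1
print(by_prefix_position[("lemma", 1)])      # lemma of the word on position 1
```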
Usage new structure
The additions make it possible to quickly retrieve the required information for queries and results based on the annotated text.
To take advantage of these additions to the Lucene structure, we need
Tokenizer mapping the original annotated data (FoLiA) onto the new structure
Query handlers, and query language: CQL
Tokenizer FoLiA
Several elements can be distinguished:
Words : <w/>
Annotations on Words : <pos/>, <t/>, <lemma/>
Groups of Words : <p/>, <s/>, <div/>
Annotations on Groups : <lang/>
References : <wref/>
Relations : <entity/>
The configurable FoLiA tokenizer makes it possible to define these items and
map them onto the new index structure.
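A much simplified sketch of such a mapping, using Python's standard XML parser on a cut-down FoLiA fragment. The fragment omits the XML namespace that real FoLiA documents declare, the mapping of `<t>`, `<lemma>` and the `head` of `<pos>` to the prefixes `t`, `lemma` and `pos` is hard-coded rather than configurable, and `tokenize_folia` is a hypothetical helper, not the MTAS tokenizer.

```python
import xml.etree.ElementTree as ET

FOLIA_SAMPLE = """
<s xml:id="untitled.p.1.s.1">
  <w xml:id="untitled.p.1.s.1.w.1" class="WORD">
    <t>Ik</t>
    <pos class="VNW(pers,pron,nomin,vol,1,ev)" head="VNW"/>
    <lemma class="ik"/>
  </w>
  <w xml:id="untitled.p.1.s.1.w.2" class="WORD">
    <t>ben</t>
    <pos class="WW(pv,tgw,ev)" head="WW"/>
    <lemma class="zijn"/>
  </w>
</s>
"""


def tokenize_folia(xml_text):
    """Map each <w> to prefixed tokens that share the position of the word."""
    root = ET.fromstring(xml_text)
    tokens = []
    for position, word in enumerate(root.iter("w")):
        text = word.find("t")        # direct child only, not the morpheme text
        pos = word.find("pos")
        lemma = word.find("lemma")
        if text is not None:
            tokens.append((f"t:{text.text}", position))
        if pos is not None:
            tokens.append((f"pos:{pos.get('head')}", position))
        if lemma is not None:
            tokens.append((f"lemma:{lemma.get('class')}", position))
    return tokens


folia_tokens = tokenize_folia(FOLIA_SAMPLE)
for token in folia_tokens:
    print(token)
```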
Search using CQL
For new MTAS data structure
Query handlers provided
Support Corpus Query Language (CQL)
Makes it possible to define conditions on annotations
Confusion about the exact interpretation and implementation
Search using CQL
the  big  green  shiny  apple
LID  ADJ  ADJ    ADJ    N
Ambiguities illustrated by examples
[pos = "LID" | word = "the"]    (1)
[word = "b.*" | word = ".*g"]   (2)
[pos = "ADJ"]{2}                (3)
[pos = "ADJ"]? [pos = "N"]      (4)
Search using CQL
Within MTAS
Results should be considered as equal if and only if the positions of both results exactly match.
Differs from the default query interpretation of Lucene and the CQL interpretation as used in other applications
No options to refer to parts of the matched pattern to e.g. sort, group or collect statistics
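The equality rule can be illustrated on the sentence "the big green shiny apple". In this sketch a result is a (start position, end position) pair and two results are the same result exactly when these pairs are equal. The helper functions are made up for the example and are not the MTAS query handlers.

```python
# (word, pos) for "the big green shiny apple"
SENTENCE = [
    ("the", "LID"), ("big", "ADJ"), ("green", "ADJ"), ("shiny", "ADJ"), ("apple", "N"),
]


def match_alternatives(tokens, conditions):
    """Query [c1 | c2 | ...]: one raw hit per alternative that a token satisfies."""
    return [
        (i, i)
        for i, token in enumerate(tokens)
        for condition in conditions
        if condition(token)
    ]


def distinct_spans(hits):
    """Treat hits as equal if and only if start and end positions match."""
    return sorted(set(hits))


def optional_adj_noun(tokens):
    """Query [pos = "ADJ"]? [pos = "N"]: the noun alone and, if present, ADJ + noun."""
    spans = []
    for i, (_, pos) in enumerate(tokens):
        if pos == "N":
            spans.append((i, i))
            if i > 0 and tokens[i - 1][1] == "ADJ":
                spans.append((i - 1, i))
    return distinct_spans(spans)


# Example (1): "the" satisfies both alternatives, but occupies one position.
raw_hits = match_alternatives(
    SENTENCE, [lambda t: t[1] == "LID", lambda t: t[0] == "the"]
)
results = distinct_spans(raw_hits)

# Example (4): "apple" and "shiny apple" cover different positions.
optional_results = optional_adj_noun(SENTENCE)
print(raw_hits, results, optional_results)
```

Under this rule example (1) yields one result even though two alternatives match, while example (4) yields two results because the spans differ.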
Size indexes
Performance
TODO
Size indexes
Collection  # FoLiA    Zipped Size  Index  Positions
DBNL T      9,465      29GB         198GB  677,476,310
DBNL DT     131,177                 95GB   395,530,191
SONAR       2,063,880  22GB         127GB  504,393,711
Search on combined indexes using Solr sharding
# FoLiA      2,204,522
# Positions  1,577,400,212
# Sentences  92,584,655
There are approximately 10 tokens on each position.
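The combined figures follow directly from the per-collection rows, which a few lines of arithmetic confirm. The derived numbers (total tokens, positions per sentence) are rough estimates computed here from the figures above, not values stated on the slides.

```python
# collection -> (number of FoLiA documents, number of positions)
COLLECTIONS = {
    "DBNL T": (9_465, 677_476_310),
    "DBNL DT": (131_177, 395_530_191),
    "SONAR": (2_063_880, 504_393_711),
}
SENTENCES = 92_584_655
TOKENS_PER_POSITION = 10  # approximate value given above

total_docs = sum(docs for docs, _ in COLLECTIONS.values())
total_positions = sum(positions for _, positions in COLLECTIONS.values())
estimated_tokens = total_positions * TOKENS_PER_POSITION
positions_per_sentence = total_positions / SENTENCES

print(f"documents: {total_docs:,}")
print(f"positions: {total_positions:,}")
print(f"estimated tokens: {estimated_tokens:,}")
print(f"positions per sentence: {positions_per_sentence:.1f}")
```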
Performance
Virtual Machine, Ubuntu, 8 cores, 48GB (40GB Solr)
Computing stats (sum, mean, median, standard deviation, etc.) on
full set of 2,204,522 documents and 1,577,400,212 positions.
CQL                          Time        Hits         Docs
[t = "de"]                   3,023 ms    57,531,353   1,801,583
[t = "de" & pos = "LID"]     7,877 ms    56,704,921   1,799,499
[t = "de" & !pos = "LID"]    3,105 ms    826,432      132,722
<s> [t = "De"]               11,568 ms   6,085,643    1,090,127
[pos = "N"]                  6,200 ms    259,942,340  2,189,750
[pos = "ADJ"] [pos = "N"]    42,977 ms   45,366,603   1,821,716
[pos = "ADJ"]? [pos = "N"]   207,795 ms  305,308,943  2,189,750
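The hit counts in the table are internally consistent, which can be checked with simple arithmetic. This is an observation about the reported numbers rather than something the slides state: the hits for a query with a negated or optional part equal the sum of the hits of its two readings.

```python
HITS = {
    '[t = "de"]': 57_531_353,
    '[t = "de" & pos = "LID"]': 56_704_921,
    '[t = "de" & !pos = "LID"]': 826_432,
    '[pos = "N"]': 259_942_340,
    '[pos = "ADJ"] [pos = "N"]': 45_366_603,
    '[pos = "ADJ"]? [pos = "N"]': 305_308_943,
}

# Every "de" is either tagged LID or not tagged LID.
de_split = HITS['[t = "de" & pos = "LID"]'] + HITS['[t = "de" & !pos = "LID"]']

# The optional adjective matches both the bare noun and the adjective-noun pair.
optional_split = HITS['[pos = "N"]'] + HITS['[pos = "ADJ"] [pos = "N"]']

print(de_split == HITS['[t = "de"]'])
print(optional_split == HITS['[pos = "ADJ"]? [pos = "N"]'])
```

The second equality fits the rule that results count as distinct whenever their positions differ: a noun preceded by an adjective contributes two hits to the query with the optional adjective.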
TODO
Group results
Facets
Performance
...
The end
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 

MTAS Henny Brugman

  • 1. Multi Tier Annotation Search (MTAS). Matthijs Brouwer, Meertens Institute, December 8, 2015.
  • 2. Outline: 1 Introduction; 2 Lucene; 3 MTAS; 4 Tokenizer FoLiA; 5 Search using CQL; 6 Results.
  • 3. Provide search on a combination of text and metadata. Example data:
      Author:          Eduard Douwes Dekker
      Place of birth:  Amsterdam
      Date of birth:   March 2, 1820
      Pseudonym:       Multatuli
      Title:           Max Havelaar
      Published:       1860
      Text:            Ik ben makelaar in koffie, en woon op de Lauriergracht no 37 ...
  • 4. Solution based on Apache Solr. Reverse index: Apache Solr (based on Apache Lucene), with an index on both text and metadata. Advantages: search; facets; scalable; custom plugin (join); actively developed.
  • 5. Search text. Example: 'Ik ben makelaar in koffie, en woon op de Lauriergracht no 37.' We can search for "Makelaar", "Makelaar in koffie", or "Makel.* in koffie".
  • 6. Annotations for 'Ik ben makelaar in koffie, en woon op de Lauriergracht no 37.':
      text      lemma     pos/features
      Ik        ik        VNW(pers,pron,nomin,vol,1,ev)
      ben       zijn      WW(pv,tgw,ev)
      makelaar  makelaar  N(soort,ev,basis,zijd,stan)
      in        in        VZ(init)
      koffie    koffie    N(soort,ev,basis,zijd,stan)
      ,         ,         LET()
      ...       ...       ...
  • 7. FoLiA:
      <text xml:id="untitled.text">
        <p xml:id="untitled.p.1">
          <s xml:id="untitled.p.1.s.1">
            <w xml:id="untitled.p.1.s.1.w.1" class="WORD">
              <t>Ik</t>
              <pos class="VNW(pers,pron,nomin,vol,1,ev)" confidence="0.999791" head="VNW">
                <feat class="pers" subset="vwtype"/>
                <feat class="pron" subset="pdtype"/>
                <feat class="nomin" subset="naamval"/>
                <feat class="vol" subset="status"/>
                <feat class="1" subset="persoon"/>
                <feat class="ev" subset="getal"/>
              </pos>
              <morphology>
                <morpheme>
                  <t offset="0">ik</t>
                </morpheme>
              </morphology>
              <lemma class="ik"/>
            </w>
            ...
  • 8. Required functionality. Extend the current Solr solution with: search on annotations such as pos, lemma, features, ...; search on sentences, paragraphs, chapters, ...; search on entities and chunks; search on dependencies; statistics, grouping, facets, ... Important: maintain functionality and scalability; remain upgradeable to new Solr/Lucene releases.
  • 9. Tokenization. Some Lucene internals, with a focus on text. Text is split up into tokens, each with a value (e.g. "koffie"), a position (e.g. 4), an offset (e.g. 19-24), and a payload (e.g. 1.000). Example: 'Ik ben makelaar in koffie, en woon op de Lauriergracht no 37.'
  • 10. Reverse index. The token stream is used to construct the reverse index:
      text      document  position  offset  payload
      ben       0         1         3-5     0.500
      de        0         9         38-39   0.200
      en        0         6         27-28   0.250
      in        0         3         16-17   0.350
      koffie    0         4         19-24   0.900
      makelaar  0         2         7-14    0.800
      ...       ...       ...       ...     ...
    This enables fast search, since the locations of matching terms can be found very quickly.
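The construction of such a reverse index can be sketched in a few lines of Python. This is a minimal illustration, not Lucene's implementation: the regular-expression tokenizer, the lowercasing, and the names `tokenize` and `build_reverse_index` are assumptions made for the example, and payloads are left out. Punctuation is treated as a token and end offsets are inclusive, so that positions and offsets agree with the values on the slide.

```python
import re
from collections import defaultdict

SENTENCE = "Ik ben makelaar in koffie, en woon op de Lauriergracht no 37."


def tokenize(text):
    """Yield (value, position, start, end); the end offset is inclusive."""
    # Assumption: words and single punctuation marks are both tokens.
    for position, match in enumerate(re.finditer(r"\w+|[^\w\s]", text)):
        yield match.group().lower(), position, match.start(), match.end() - 1


def build_reverse_index(documents):
    """Map each term to a list of (document, position, start, end)."""
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        for value, position, start, end in tokenize(text):
            index[value].append((doc_id, position, start, end))
    return index


index = build_reverse_index([SENTENCE])
for term in ("ben", "de", "koffie", "makelaar"):
    print(term, index[term])
```

Looking up a term is a single dictionary access, which is why finding the locations of matching terms is fast. Going the other way (which terms occur at position 4?) requires scanning every entry.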
  • 11. Limitations of this approach. Heavily based on grouping by document (collecting statistics, grouping results). Not possible to include structural information (sentences, paragraphs, ...), annotations (pos, lemmas, ...), or relations (dependencies, chunking, ...). No real forward index (finding all tokens for a given position).
  • 12. Alternative solutions. Graph database: experiments with Neo4j showed scalability and performance problems; too general, does not use the sequential nature of textual data. BlackLab: based on Lucene, no integration with Solr; different fields for each annotation layer.
  • 13. Extension provided by MTAS: store multiple tokens on the same position, and use prefixes to distinguish between different layers of annotations; use the payload to encode additional information on each token; construct forward indexes by extending the Lucene codec. Implementation: extension based on the Lucene library; provide query handlers for the extended data structures; provide a Solr plugin using the MTAS extension.
  • 14. Prefixes. Store multiple tokens on the same position, and use prefixes to distinguish between different layers of annotations:
      text        document  position
      lemma:de    0         9
      lemma:zijn  0         1
      ...         ...       ...
      pos:LID     0         9
      pos:WW      0         1
      ...         ...       ...
      t:ben       0         1
      t:de        0         9
      ...         ...       ...
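A small Python sketch of the prefix idea, under stated assumptions: the names `prefixed_tokens`, `build_index` and `positions` are hypothetical, only the head of each part-of-speech tag is indexed, and the data is the first part of the annotated example sentence. Each word contributes three terms at the same position, one per annotation layer.

```python
from collections import defaultdict

# (text, lemma, pos head) per position, from the annotated example sentence
WORDS = [
    ("Ik", "ik", "VNW"),
    ("ben", "zijn", "WW"),
    ("makelaar", "makelaar", "N"),
    ("in", "in", "VZ"),
    ("koffie", "koffie", "N"),
    (",", ",", "LET"),
]


def prefixed_tokens(words):
    """Yield (term, position); several terms share one position."""
    for position, (text, lemma, pos) in enumerate(words):
        yield f"t:{text}", position
        yield f"lemma:{lemma}", position
        yield f"pos:{pos}", position


def build_index(words, doc_id=0):
    """Map each prefixed term to a list of (document, position)."""
    index = defaultdict(list)
    for term, position in prefixed_tokens(words):
        index[term].append((doc_id, position))
    return index


index = build_index(WORDS)


def positions(term):
    """Set of positions at which a prefixed term occurs."""
    return {position for _, position in index.get(term, [])}


# A condition on two layers becomes an intersection of position sets.
noun_koffie = positions("pos:N") & positions("t:koffie")
print(sorted(noun_koffie))
```

Because all layers live in one field and share positions, a condition that combines layers needs no join across fields, only a comparison of positions.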
  • 15. Payload. Use the payload to encode additional information on each token:
      mtas id:      integer identifying the token within a document
      position:     type of position (single, range or set), plus additional information for range or set
      offset:       start and end offset
      real offset:  start and end real offset
      parent:       reference to another token by its mtas id
      payload:      original payload
  • 16. Forward indexes. Construct forward indexes by extending the Lucene codec:
      Position:         given the position within the document, return references to all objects on that position.
      Parent id:        given the mtas id, return references to all objects referring to this mtas id as parent.
      Object id:        given the id, return a reference to the object.
      Prefix/position:  given prefix and position, return the value.
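The four lookups can be pictured with plain in-memory dictionaries. This is only a sketch of the access patterns, not the Lucene codec extension itself: the `Token` class, its field names, and the parent links in the sample data are hypothetical, and the prefix/position lookup keeps a single value per key for simplicity.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    mtas_id: int
    prefix: str
    value: str
    start: int                    # first position covered
    end: int                      # last position covered (inclusive)
    parent: Optional[int] = None  # mtas_id of the parent token, if any


# Hypothetical sample: one sentence spanning positions 0-5 and two words.
TOKENS = [
    Token(0, "s", "", 0, 5),
    Token(1, "t", "Ik", 0, 0, parent=0),
    Token(2, "lemma", "ik", 0, 0, parent=1),
    Token(3, "t", "ben", 1, 1, parent=0),
    Token(4, "lemma", "zijn", 1, 1, parent=3),
    Token(5, "pos", "WW", 1, 1, parent=3),
]

by_id = {}                       # object id -> object
by_position = defaultdict(list)  # position -> ids of objects on it
by_parent = defaultdict(list)    # parent id -> ids of referring objects
by_prefix_position = {}          # (prefix, position) -> value

for token in TOKENS:
    by_id[token.mtas_id] = token
    if token.parent is not None:
        by_parent[token.parent].append(token.mtas_id)
    for position in range(token.start, token.end + 1):
        by_position[position].append(token.mtas_id)
        by_prefix_position[(token.prefix, position)] = token.value

print(by_position[1])                     # all objects on position 1
print(by_parent[3])                       # objects with token 3 as parent
print(by_id[4].value)                     # object with id 4
print(by_prefix_position[("lemma", 1)])   # lemma at position 1
```

These are exactly the questions the reverse index answers poorly, since it is keyed by term rather than by position or id.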
  • 17. Usage of the new structure. The additions make it possible to quickly retrieve the required information for queries and results based on the annotated text. To take advantage of these additions to the Lucene structure, we need a tokenizer that maps the original annotated data (FoLiA) onto the new structure, and query handlers with a query language: CQL.
  • 18. FoLiA:
      <text xml:id="untitled.text">
        <p xml:id="untitled.p.1">
          <s xml:id="untitled.p.1.s.1">
            <w xml:id="untitled.p.1.s.1.w.1" class="WORD">
              <t>Ik</t>
              <pos class="VNW(pers,pron,nomin,vol,1,ev)" confidence="0.999791" head="VNW">
                <feat class="pers" subset="vwtype"/>
                <feat class="pron" subset="pdtype"/>
                <feat class="nomin" subset="naamval"/>
                <feat class="vol" subset="status"/>
                <feat class="1" subset="persoon"/>
                <feat class="ev" subset="getal"/>
              </pos>
              <morphology>
                <morpheme>
                  <t offset="0">ik</t>
                </morpheme>
              </morphology>
              <lemma class="ik"/>
            </w>
            ...
  • 19. Tokenizer FoLiA. Several elements can be distinguished:
      Words:                  <w/>
      Annotations on words:   <pos/>, <t/>, <lemma/>
      Groups of words:        <p/>, <s/>, <div/>
      Annotations on groups:  <lang/>
      References:             <wref/>
      Relations:              <entity/>
    The configurable FoLiA tokenizer makes it possible to define these items and map them onto the new index structure.
  • 20. Search using CQL. Query handlers are provided for the new MTAS data structure. They support the Corpus Query Language (CQL), which makes it possible to define conditions on annotations. There is confusion about the exact interpretation and implementation of CQL.
  • 21. Search using CQL. Ambiguities illustrated by examples on an annotated phrase:
      word:  the  big  green  shiny  apple
      pos:   LID  ADJ  ADJ    ADJ    N
      (1) [pos="LID" | word="the"]
      (2) [word="b.*" | word=".*g"]
      (3) [pos="ADJ"]{2}
      (4) [pos="ADJ"]? [pos="N"]
  • 22. Search using CQL within MTAS. Results are considered equal if and only if the positions of both results match exactly. This differs from the default query interpretation of Lucene and from the CQL interpretation used in other applications. There are no options to refer to parts of the matched pattern, e.g. to sort, group or collect statistics.
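The equality rule can be tried out on the phrase "the big green shiny apple" (tagged LID ADJ ADJ ADJ N) with a toy matcher. This is a sketch under assumptions, not the MTAS query handler: the helpers `word`, `pos` and `spans` are hypothetical, and the `|` and `?` operators are expanded by hand into alternative predicate sequences. Results are collected as a set of (start, end) position pairs, so two matches covering the same positions count once.

```python
import re

# (word, pos) per position for the example phrase
TOKENS = [("the", "LID"), ("big", "ADJ"), ("green", "ADJ"),
          ("shiny", "ADJ"), ("apple", "N")]


def word(pattern):
    """Predicate: the word matches the regular expression completely."""
    return lambda token: re.fullmatch(pattern, token[0]) is not None


def pos(tag):
    """Predicate: the part-of-speech tag equals the given tag."""
    return lambda token: token[1] == tag


def spans(alternatives):
    """Return the sorted set of (start, end) spans matched by any alternative.

    Each alternative is a list of predicates for consecutive positions.
    Using a set makes two results equal exactly when their positions match.
    """
    found = set()
    for predicates in alternatives:
        for start in range(len(TOKENS) - len(predicates) + 1):
            window = TOKENS[start:start + len(predicates)]
            if all(p(t) for p, t in zip(predicates, window)):
                found.add((start, start + len(predicates) - 1))
    return sorted(found)


q1 = spans([[pos("LID")], [word("the")]])          # [pos="LID" | word="the"]
q2 = spans([[word("b.*")], [word(".*g")]])         # [word="b.*" | word=".*g"]
q3 = spans([[pos("ADJ"), pos("ADJ")]])             # [pos="ADJ"]{2}
q4 = spans([[pos("ADJ"), pos("N")], [pos("N")]])   # [pos="ADJ"]? [pos="N"]
print(q1, q2, q3, q4)
```

Under this rule, "the" and "big" each satisfy both branches of their query but are reported once, the `{2}` query reports the two overlapping adjective pairs, and the optional adjective yields both "shiny apple" and "apple" as separate hits.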
  • 23. Size of indexes, per collection:
      DBNL T:   9,465 FoLiA files; 29GB zipped; 198GB index; 677,476,310 positions
      DBNL DT:  131,177 FoLiA files; 95GB; 395,530,191 positions
      SONAR:    2,063,880 FoLiA files; 22GB zipped; 127GB index; 504,393,711 positions
    Search on the combined indexes uses Solr sharding:
      # FoLiA:      2,204,522
      # Positions:  1,577,400,212
      # Sentences:  92,584,655
    There are approximately 10 tokens on each position.
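The combined totals follow directly from the per-collection figures, which a few lines of Python confirm. The variable names are illustrative only; the numbers are those on the slide, and the average number of positions per sentence is a derived figure not stated there.

```python
# (number of FoLiA files, number of positions) per collection
COLLECTIONS = {
    "DBNL T": (9_465, 677_476_310),
    "DBNL DT": (131_177, 395_530_191),
    "SONAR": (2_063_880, 504_393_711),
}
SENTENCES = 92_584_655

total_files = sum(files for files, _ in COLLECTIONS.values())
total_positions = sum(positions for _, positions in COLLECTIONS.values())
positions_per_sentence = total_positions / SENTENCES

print(f"{total_files:,} files, {total_positions:,} positions")
print(f"about {positions_per_sentence:.1f} positions per sentence")
```

The sums reproduce the combined counts of 2,204,522 documents and 1,577,400,212 positions, and the average sentence covers roughly 17 positions.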
  • 24. Performance. Virtual machine, Ubuntu, 8 cores, 48GB (40GB Solr). Computing stats (sum, mean, median, standard deviation, etc.) on the full set of 2,204,522 documents and 1,577,400,212 positions:
      CQL                      Time        Hits         Docs
      [t="de"]                 3,023 ms    57,531,353   1,801,583
      [t="de" & pos="LID"]     7,877 ms    56,704,921   1,799,499
      [t="de" & !pos="LID"]    3,105 ms    826,432      132,722
      <s> [t="De"]             11,568 ms   6,085,643    1,090,127
      [pos="N"]                6,200 ms    259,942,340  2,189,750
      [pos="ADJ"] [pos="N"]    42,977 ms   45,366,603   1,821,716
      [pos="ADJ"]? [pos="N"]   207,795 ms  305,308,943  2,189,750
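The hit counts in this table can be cross-checked against each other. The sketch below is a reading of the published numbers, not something stated on the slide: it checks that the two complementary `de` queries add up to the unrestricted one, and that the optional-adjective query equals the sum of the noun query and the adjective-noun query, which is what one would expect if hits are distinguished by their exact positions.

```python
# Hit counts copied from the performance table
HITS = {
    '[t="de"]': 57_531_353,
    '[t="de" & pos="LID"]': 56_704_921,
    '[t="de" & !pos="LID"]': 826_432,
    '[pos="N"]': 259_942_340,
    '[pos="ADJ"] [pos="N"]': 45_366_603,
    '[pos="ADJ"]? [pos="N"]': 305_308_943,
}

# "de" tagged LID plus "de" not tagged LID should cover every "de".
de_split = HITS['[t="de" & pos="LID"]'] + HITS['[t="de" & !pos="LID"]']

# An optional adjective gives one hit with and one hit without the adjective.
optional_split = HITS['[pos="N"]'] + HITS['[pos="ADJ"] [pos="N"]']

print(de_split == HITS['[t="de"]'])
print(optional_split == HITS['[pos="ADJ"]? [pos="N"]'])
```

Both identities hold for the figures as published.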
  • 25. TODO: group results; facets; performance; ...
  • 26. The end.