SlideShare a Scribd company logo
1 of 35
María Navas-Loro
Ontology Engineering Group
Universidad Politécnica de Madrid, Spain
Víctor Rodríguez-Doncel
Ontology Engineering Group
Universidad Politécnica de Madrid, Spain
Añotador:
a Temporal Tagger for Spanish
mnavas@fi.upm.es
https://marianavas.oeg-upm.net
29/10/2019
LKE 2019
Añotador: a Temporal Tagger for Spanish
Outline
2
Outline
 Brief Introduction
 State of the Art
o Temporal Taggers
o Corpora
 Añotador
 Corpus
 Results
 Challenges
 Conclusions
3
BRIEF INTRODUCTION
Temporal Annotation
Añotador: a Temporal Tagger for Spanish
Temporal information
4
Temporal information is everywhere:
news, medical texts, legal documents…
… and we can use it for a lot of things!
TIMELINE GENERATION
SEARCH
ENGINESUMMARIZATION
PATTERN
DETECTION
It was lodged on 5 March 2010.
It was rejected related to La
Vanguardia. No information
related to Google Inc.
Was it accepted?
When was the complaint against
Google Inc made?
QA
Nevertheless, there are
not a lot of tools to
extract temporal
information in Spanish.
Añotador: a Temporal Tagger for Spanish
The task
5
Temporal taggers:
1) Detect temporal expressions:
- DATE - DURATION
- SET - TIME
2) Normalize them.
3) Additionally, event and relation extraction
STATE OF THE ART
What has been done
6
Añotador: a Temporal Tagger for Spanish
Corpora
7
 A lot in English!
o News:
• Timebank corpus [1]
• TempEval challenges datasets [2]
• MEANTIME corpus [3]
o Medical: THYME [4]
o Historical: Wikiwars corpus [5]
o Scientific: Abstracts [6]
o Colloquial: SMS [6] and tweets [7]
 Less for Spanish…
o News: The Spanish TimeBank corpus [8], MEANTIME [3]
o Historical: The ModeS TimeBank 1.06 (texts from the 17th and
18th centuries) [9]
* TempEval 2 and TempEval 3 challenges (fragment of TimeBank)
Añotador: a Temporal Tagger for Spanish
Temporal Taggers
8
 A lot in English!
o Rule-based:
• HeidelTime [10]
• SUTime* [11]
• SynTime [12]
o Machine-Learning-based
• ClearTK-TimeML [13]
• USFD2 [14]
• UWTime [15]
• TARSQI [16]
• CAEVO [17]
• TIPSem [18]
 For Spanish, just the bold ones… (there are more in
literature, but we could not find them available!)
 No corpora, no taggers!
9
AÑOTADOR
Our tool
Añotador
10
Añotador (aka Annotador) is a temporal tagger for Spanish
and English. It is able to find and normalize dates, times,
durations and sets (temporal expressions that repeat
periodically, such as “monthly”, “twice a year” or “every
Tuesday”). There is a demo available in its webpage, and
can be called as webservice or downloaded from GitHub
(where all the code and a docker image are available).
http://annotador.oeg-upm.net/
Añotador: a Temporal Tagger for Spanish
Pipeline of Añotador
11
CoreNLP pipeline
+ IxaPipes
RULES
for each
sentence
for each
expression
reference
current
Date?
do
Calculus
type?
DATE
YES
setFormat
INPUT
text + anchor date
OUTPUT
NORMALIZATION ALGORITHM
token rules identify numbers, months,
granularities ( day )...
basic TEs rules # + granularity: two days
compound rules merge TEs: 1 month (1M)
and 2 days (2D) = P1M2D
YYYY-MM-DD
TIME DURATION SET
normalize
XX values
NO
add
lastDate
process
Durations
NO
reference
next/last
YES
NO
is
anchored
?
YES
update
lastDate
ANNOTADOR
Añotador: a Temporal Tagger for Spanish
Pipeline of Añotador
12
CoreNLP pipeline
+ IxaPipes
RULES
for each
sentence
for each
expression
reference
current
Date?
do
Calculus
type?
DATE
YES
setFormat
INPUT
text + anchor date
OUTPUT
NORMALIZATION ALGORITHM
token rules identify numbers, months,
granularities ( day )...
basic TEs rules # + granularity: two days
compound rules merge TEs: 1 month (1M)
and 2 days (2D) = P1M2D
YYYY-MM-DD
TIME DURATION SET
normalize
XX values
NO
add
lastDate
process
Durations
NO
reference
next/last
YES
NO
is
anchored
?
YES
update
lastDate
ANNOTADOR
1. Preprocessing:
 We get as input the text and the
anchor date (if none, we assume
the current day)
 We use CoreNLP for
lemmatizing, sentence splitting…
 We added IxaPipes models for
Spanish to improve the quality of
the output.
Añotador: a Temporal Tagger for Spanish
Pipeline of Añotador
13
CoreNLP pipeline
+ IxaPipes
RULES
for each
sentence
for each
expression
reference
current
Date?
do
Calculus
type?
DATE
YES
setFormat
INPUT
text + anchor date
OUTPUT
NORMALIZATION ALGORITHM
token rules identify numbers, months,
granularities ( day )...
basic TEs rules # + granularity: two days
compound rules merge TEs: 1 month (1M)
and 2 days (2D) = P1M2D
YYYY-MM-DD
TIME DURATION SET
normalize
XX values
NO
add
lastDate
process
Durations
NO
reference
next/last
YES
NO
is
anchored
?
YES
update
lastDate
ANNOTADOR
2. Rules:
More than 100 rules written in CoreNLP
TokensRegex format.
1. Token-based rules for expressions
such as numerals, granularities…
2. Basic temporal expression rules,
working on previously found basic
expressions
3. Compound expression rules, for
inheritance values or composition.
4. Literal expression rules, for specific
expressions such as “previously”,
polysemic expressions (Spanish
“mañana”) or expressions including
tokens that can appear in other
temporal expressions, requiring a
dedicated approach.
Añotador: a Temporal Tagger for Spanish
Mañana, a tricky case
14
Additionally, some expressions are really tricky…
… also syntactically!
“por la mañana” vs “en la mañana” (“in the morning”).
Añotador: a Temporal Tagger for Spanish
Pipeline of Añotador
15
CoreNLP pipeline
+ IxaPipes
RULES
for each
sentence
for each
expression
reference
current
Date?
do
Calculus
type?
DATE
YES
setFormat
INPUT
text + anchor date
OUTPUT
NORMALIZATION ALGORITHM
token rules identify numbers, months,
granularities ( day )...
basic TEs rules # + granularity: two days
compound rules merge TEs: 1 month (1M)
and 2 days (2D) = P1M2D
YYYY-MM-DD
TIME DURATION SET
normalize
XX values
NO
add
lastDate
process
Durations
NO
reference
next/last
YES
NO
is
anchored
?
YES
update
lastDate
ANNOTADOR
3. Normalization
algorithm:
 Works for each
sentence
sepparately.
 Different
approaches for
each type of
expressions.
 Can take into
account different
reference dates.
Añotador: a Temporal Tagger for Spanish
Pipeline of Añotador
16
CoreNLP pipeline
+ IxaPipes
RULES
for each
sentence
for each
expression
reference
current
Date?
do
Calculus
type?
DATE
YES
setFormat
INPUT
text + anchor date
OUTPUT
NORMALIZATION ALGORITHM
token rules identify numbers, months,
granularities ( day )...
basic TEs rules # + granularity: two days
compound rules merge TEs: 1 month (1M)
and 2 days (2D) = P1M2D
YYYY-MM-DD
TIME DURATION SET
normalize
XX values
NO
add
lastDate
process
Durations
NO
reference
next/last
YES
NO
is
anchored
?
YES
update
lastDate
ANNOTADOR
4. Output format:
- TimeML
- JSON
17
HOURGLASS
Corpus built
Añotador: a Temporal Tagger for Spanish
Hourglass corpus
While building Añotador, we noticed both the scarcity and
lack of heterogeneity of corpora in Spanish.
 We decided to create a synthetic dataset of expressions
that a temporal tagger should cover, including common
expressions and sentences that tend to produce false
positives.
 We additionally asked people foreign to the task to
provide temporal expressions. These contributors:
o Had different backgrounds (linguists, computer scientists,
translators…)
o Were from different Spanish-speaking countries and regions
(such as Venezuelan or Ecuatorians, and people from different
regions in Spain)
18
Añotador: a Temporal Tagger for Spanish
Hourglass corpus
The result was the Hourglass corpus, composed of two parts:
19
SYNTHETIC part:
o Short texts annotated with
TimeML.
o It includes additional tags in
order to facilitate the
evaluation of different
temporal expressions, such
as “Dates”, “Fractions” or
“False” (referring to false
positives like “50/2/1991” or
“February 30”).
PEOPLE part:
o Sentences from contributors
with different backgrounds
and from different Spanish-
speaking countries.
o Also tagged and annotated
following the standards, as
well as with the register of
the sentence (colloquial,
chat, latin expressions, Latin
American expressions,
phrases and one literary
sentence).
Añotador: a Temporal Tagger for Spanish
Hourglass corpus in numbers
20
21
EVALUATION
Performance
Añotador: a Temporal Tagger for Spanish
Hourglass corpus
We evaluated Añotador against the available state of the art
temporal taggers for Spanish*:
 HeidelTime
 SUTime
We used the software GATE and the measures precision (P),
recall (R) and F1-measure (F1), on the following corpora:
 Hourglass corpus.
 TempEval2 dataset**
* We also tried to compare it to TIPSem, but we were not able to run it for Spanish
despite contacting the creator. We nevertheless thank Héctor Llorens for his help.
** TempEval3 test is not available online, and TimeModeS contains historical
texts, out of our target.
22
Añotador: a Temporal Tagger for Spanish
Results against the Hourglass corpus
23
In the HourGlass corpus, Añotador always shows the best
performance.
Value means the normalization is correct
Type means that the temporal expression was correctly classified
Extent means that the temporal expression was correctly marked in the text
Añotador: a Temporal Tagger for Spanish
Results against the TempEval2 corpus
24
Results of the temporal taggers in the TempEval 2 corpus.
HeidelTime is slightly better at finding the normalized value
(0.0027 on average), but Añotador is better in the rest of
metrics. SUTime shows good Precission.
Añotador: a Temporal Tagger for Spanish
Some examples
The following examples were difficult to handle to the taggers:
25
Example Añotador SUTime HeidelTime
“1 año, 6 meses y un día”
(“1 year, 6 months and one day”)
1 año, 6
meses y un
día
1 año, 6
meses y
un día
1 año, 6
meses y
un día
“Cinco para las 11.”
(“Five to eleven.”)
Cinco para
las 11.
Cinco para
las 11.
Cinco para
las 11.
“lo vuestro dura 1h, no?”
(“your stuff lasts 1h, right?”)
lo vuestro
dura 1h, no?
lo vuestro
dura 1h,
no?
lo vuestro
dura 1h,
no?
“en cero coma”
(in a short amount of time)
en cero coma en cero
coma
en cero
coma
26
CHALLENGES
To do
Añotador: a Temporal Tagger for Spanish
Context-dependent temporal expressions
 Most temporal expressions are context-free, so we
require no extra information to detect or normalize them,
or depend just on an anchor date.
 Therefore, temporal taggers tend to cover them
easily.
 But we additionally find context-dependent ones, that can
be challenging.
 After some analysis, we detected three types of
general dependencies:
o TE dependent on the register
o TE dependent on temporal information
o TE dependent on geographical information
27
Añotador: a Temporal Tagger for Spanish
Context dependent temporal expressions
TE dependent on the register.
 Non-temporal words used in a temporal sense:
o “Él tiene 37 castañas/tacos” (“He has 37 chestnuts/tacos”),
both meaning “He is 37 years old”.
 Meronyms:
o “Tiene 30 primaveras/abriles ya” (“He already has 30
springs/Aprils”), meaning “He is 30 years old”.
 Idioms involving temporal expressions:
o “Hasta el 40 de mayo no te quites el sayo” (“Until 40th May,
do not take off the jacket”), meaning that the beginning of
June can still be chilly.
 Latin America:
o “la hora del moro” (“the hour of the moor”) means “lunch
time” in the Dominican Republic.
28
Añotador: a Temporal Tagger for Spanish
Context dependent temporal expressions
TE dependent on temporal information
 We usually asume the Gregorian calendar and the
reference date as the current one, but this is not always
that simple.
 There are other calendars, both historical and
contemporary, such as:
o The Julian calendar
o The French Republican calendar
o The Japanese era system.
o The Chinese Calendar
o Persian Solar Hijri calendar
o …
29
Añotador: a Temporal Tagger for Spanish
Context dependent temporal expressions
TE dependent on geographical information
 Geographical information infludes in how to normalize:
o Temponyms such as:
• International Children’s Day (15/04 Spain, 30/04 Mexico)
• National Day (12/10 Spain, 14/07 France)
o Date formats:
• Some countries have their own separators (eg, Germany uses
DD.MM.YYYY)
• Europe tends to DD/MM/YYYY
• EEUU tends to MM/DD/YYYY
• Japan tends to YYYY/MM/DD (when using our calendar!)
• eg: 11/09 is 11th September in Spain, but 9th November in
EEUU.
Can be a combination (eg, Emperor’s birthday in Japan)
30
31
CONCLUSIONS
Future work
Añotador: a Temporal Tagger for Spanish
Conclusions
 Results:
o Añotador:
• Able to process Spanish (and English) texts and extract
temporal expressions.
• Improves the state of the art both in our corpus and in
TempEval2.
o Hourglass corpus: new corpus with two parts.
• Synthetic dataset for systematic testing of temporal taggers,
including new registers and non-Castilian Spanish texts.
• People contributions of real common temporal expressions.
 Future work:
o Improve the system to make better cover the examples in
HourGlass.
o Deal with Context Dependent Temporal Expressions.
32
María Navas-Loro
Ontology Engineering Group
Universidad Politécnica de Madrid, Spain
Víctor Rodríguez-Doncel
Ontology Engineering Group
Universidad Politécnica de Madrid, Spain
Añotador:
a Temporal Tagger for Spanish
mnavas@fi.upm.es
https://marianavas.oeg-upm.net
29/10/2019
LKE 2019
Añotador: a Temporal Tagger for Spanish
References - Corpora
34
1. J. Pustejovsky et al., “The timebank corpus,” in Corpus linguistics, vol. 2003, p. 40,
Lancaster, UK, 2003.
2. http://semeval2.fbk.eu/semeval2.php?location=tasks&taskid=5
3. L. Minard et al., “Meantime, the newsreader multilingual event and time corpus,” in
Proc. LREC 2016, 2016.
4. W. Styler IV et al., “Temporal annotation in the clinical domain,” Transactions of
ACL, vol. 2, pp. 143–154, 2014.
5. P. Mazur et al., “Wikiwars: A new corpus for research on temporal expressions,” in
Proc. EMNLP 2010, pp. 913–922, 2010.
6. J. Strötgen et al., “Temporal tagging on different domains: Challenges, strategies,
and gold standards,” in LREC, vol. 12, pp. 3746–3753, 2012.
7. J. Tabassum et al., “Tweetime: A minimally supervised method for recognizing and
normalizing time expressions in twitter,” arXiv preprint arXiv:1608.02904, 2016.
8. https://catalog.ldc.upenn.edu/LDC2012T12
9. https://catalog.ldc.upenn.edu/LDC2012T01
Añotador: a Temporal Tagger for Spanish
References - Taggers
35
10. J. Strötgen and M. Gertz, “Multilingual and cross-domain temporal tagging,” LREv,
vol. 47, no. 2, pp. 269–298, 2013.
11. A. X. Chang and C. D. Manning, “Sutime: A library for recognizing and normalizing
time expressions,” in LREC, vol. 2012, pp. 3735–3740, 2012.
12. X. Zhong et al., “Time expression analysis and recognition using syntactic token
types and general heuristic rules,” in Proc. the 55th Annual Meeting of the ACL, vol.
1, pp. 420–429, 2017.
13. S. Bethard, “ClearTK-TimeML: A minimalist approach to TempEval 2013,” in Proc.
the Workshop SemEval 2013, pp. 10–14, ACL, 2013.
14. L. Derczynski et al., “Usfd2: Annotating temporal expresions and tlinks for
tempeval-2,” in Proc. the Workshop SemEval, pp. 337–340, ACL, 2010.
15. K. Lee et al., “Context-dependent semantic parsing for time expressions,” in Proc.
the 52nd Annual Meeting of the ACL, vol. 1, pp. 1437–1447, 2014.
16. M. Verhagen et al., “Automating Temporal Annotation with TARSQI,” in Proc. the
ACL 2005 on Interactive Poster and Demonstration Sessions, pp. 81–84, ACL,
2005.
17. N. Chambers et al., “Dense event ordering with a multi-pass architecture,”
Transactions of ACL, vol. 2, pp. 273–284, 2014.
18. H. Llorens et al., “Tipsem (english and spanish): Evaluating crfs and semantic roles
in tempeval-2,” in Proc. the Workshop SemEval, pp. 284–291, ACL, 2010.

More Related Content

Similar to Añotador: a Temporal Tagger for Spanish

EXPERT_Malaga-ESR03
EXPERT_Malaga-ESR03EXPERT_Malaga-ESR03
EXPERT_Malaga-ESR03
hpcosta
 
12. Gloria Corpas, Jorge Leiva, Miriam Seghiri (UMA) Human Translation & Tran...
12. Gloria Corpas, Jorge Leiva, Miriam Seghiri (UMA) Human Translation & Tran...12. Gloria Corpas, Jorge Leiva, Miriam Seghiri (UMA) Human Translation & Tran...
12. Gloria Corpas, Jorge Leiva, Miriam Seghiri (UMA) Human Translation & Tran...
RIILP
 

Similar to Añotador: a Temporal Tagger for Spanish (10)

DSDT Meetup Septembre 2021
DSDT Meetup Septembre 2021DSDT Meetup Septembre 2021
DSDT Meetup Septembre 2021
 
ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015
ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015
ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015
 
EXPERT_Malaga-ESR03
EXPERT_Malaga-ESR03EXPERT_Malaga-ESR03
EXPERT_Malaga-ESR03
 
Applying Linked Open Data to a digital library: best practices and lessons le...
Applying Linked Open Data to a digital library: best practices and lessons le...Applying Linked Open Data to a digital library: best practices and lessons le...
Applying Linked Open Data to a digital library: best practices and lessons le...
 
Diachronic Analysis of Language exploiting Google Ngram
Diachronic Analysis of Language exploiting Google NgramDiachronic Analysis of Language exploiting Google Ngram
Diachronic Analysis of Language exploiting Google Ngram
 
OpenWordnet-PT: A Project Report
OpenWordnet-PT: A Project ReportOpenWordnet-PT: A Project Report
OpenWordnet-PT: A Project Report
 
Time and computers
Time and computersTime and computers
Time and computers
 
12. Gloria Corpas, Jorge Leiva, Miriam Seghiri (UMA) Human Translation & Tran...
12. Gloria Corpas, Jorge Leiva, Miriam Seghiri (UMA) Human Translation & Tran...12. Gloria Corpas, Jorge Leiva, Miriam Seghiri (UMA) Human Translation & Tran...
12. Gloria Corpas, Jorge Leiva, Miriam Seghiri (UMA) Human Translation & Tran...
 
Mapping Early Modern News Networks
Mapping Early Modern News NetworksMapping Early Modern News Networks
Mapping Early Modern News Networks
 
Anil timeline construction
Anil timeline constructionAnil timeline construction
Anil timeline construction
 

Recently uploaded

introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
VishalKumarJha10
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
mohitmore19
 

Recently uploaded (20)

5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfAzure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 

Añotador: a Temporal Tagger for Spanish

  • 1. María Navas-Loro Ontology Engineering Group Universidad Politécnica de Madrid, Spain Víctor Rodríguez-Doncel Ontology Engineering Group Universidad Politécnica de Madrid, Spain Añotador: a Temporal Tagger for Spanish mnavas@fi.upm.es https://marianavas.oeg-upm.net 29/10/2019 LKE 2019
  • 2. Añotador: a Temporal Tagger for Spanish Outline 2 Outline  Brief Introduction  State of the Art o Temporal Taggers o Corpora  Añotador  Corpus  Results  Challenges  Conclusions
  • 4. Añotador: a Temporal Tagger for Spanish Temporal information 4 Temporal information is everywhere: news, medical texts, legal documents… … and we can use it for a lot of things! TIMELINE GENERATION SEARCH ENGINESUMMARIZATION PATTERN DETECTION It was lodged on 5 March 2010. It was rejected related to La Vanguardia. No information related to Google Inc. Was it accepted? When was the complaint against Google Inc made? QA Nevertheless, there are not a lot of tools to extract temporal information in Spanish.
  • 5. Añotador: a Temporal Tagger for Spanish The task 5 Temporal taggers: 1) Detect temporal expressions: - DATE - DURATION - SET - TIME 2) Normalize them. 3) Additionally, event and relation extraction
  • 6. STATE OF THE ART What has been done 6
  • 7. Añotador: a Temporal Tagger for Spanish Corpora 7  A lot in English! o News: • Timebank corpus [1] • TempEval challenges datasets [2] • MEANTIME corpus [3] o Medical: THYME [4] o Historical: Wikiwars corpus [5] o Scientific: Abstracts [6] o Colloquial: SMS [6] and tweets [7]  Less for Spanish… o News: The Spanish TimeBank corpus [8], MEANTIME [3] o Historical: The ModeS TimeBank 1.06 (texts from the 17th and 18th centuries) [9] * TempEval 2 and TempEval 3 challenges (fragment of TimeBank)
  • 8. Añotador: a Temporal Tagger for Spanish Temporal Taggers 8  A lot in English! o Rule-based: • HeidelTime [10] • SUTime* [11] • SynTime [12] o Machine-Learning-based • ClearTK-TimeML [13] • USFD2 [14] • UWTime [15] • TARSQI [16] • CAEVO [17] • TIPSem [18]  For Spanish, just the bold ones… (there are more in literature, but we could not find them available!)  No corpora, no taggers!
  • 10. Añotador 10 Añotador (aka Annotador) is a temporal tagger for Spanish and English. It is able to find and normalize dates, times, durations and sets (temporal expressions that repeat periodically, such as “monthly”, “twice a year” or “every Tuesday”). There is a demo available in its webpage, and can be called as webservice or downloaded from GitHub (where all the code and a docker image are available). http://annotador.oeg-upm.net/
  • 11. Añotador: a Temporal Tagger for Spanish Pipeline of Añotador 11 CoreNLP pipeline + IxaPipes RULES for each sentence for each expression reference current Date? do Calculus type? DATE YES setFormat INPUT text + anchor date OUTPUT NORMALIZATION ALGORITHM token rules identify numbers, months, granularities ( day )... basic TEs rules # + granularity: two days compound rules merge TEs: 1 month (1M) and 2 days (2D) = P1M2D YYYY-MM-DD TIME DURATION SET normalize XX values NO add lastDate process Durations NO reference next/last YES NO is anchored ? YES update lastDate ANNOTADOR
  • 12. Añotador: a Temporal Tagger for Spanish Pipeline of Añotador 12 CoreNLP pipeline + IxaPipes RULES for each sentence for each expression reference current Date? do Calculus type? DATE YES setFormat INPUT text + anchor date OUTPUT NORMALIZATION ALGORITHM token rules identify numbers, months, granularities ( day )... basic TEs rules # + granularity: two days compound rules merge TEs: 1 month (1M) and 2 days (2D) = P1M2D YYYY-MM-DD TIME DURATION SET normalize XX values NO add lastDate process Durations NO reference next/last YES NO is anchored ? YES update lastDate ANNOTADOR 1. Preprocessing:  We get as input the text and the anchor date (if none, we assume the current day)  We use CoreNLP for lemmatizing, sentence splitting…  We added IxaPipes models for Spanish to improve the quality of the output.
  • 13. Añotador: a Temporal Tagger for Spanish Pipeline of Añotador 13 CoreNLP pipeline + IxaPipes RULES for each sentence for each expression reference current Date? do Calculus type? DATE YES setFormat INPUT text + anchor date OUTPUT NORMALIZATION ALGORITHM token rules identify numbers, months, granularities ( day )... basic TEs rules # + granularity: two days compound rules merge TEs: 1 month (1M) and 2 days (2D) = P1M2D YYYY-MM-DD TIME DURATION SET normalize XX values NO add lastDate process Durations NO reference next/last YES NO is anchored ? YES update lastDate ANNOTADOR 2. Rules: More than 100 rules written in CoreNLP TokensRegex format. 1. Token-based rules for expressions such as numerals, granularities… 2. Basic temporal expression rules, working on previously found basic expressions 3. Compound expression rules, for inheritance values or composition. 4. Literal expression rules, for specific expressions such as “previously”, polysemic expressions (Spanish “mañana”) or expressions including tokens that can appear in other temporal expressions, requiring a dedicated approach.
  • 14. Añotador: a Temporal Tagger for Spanish Mañana, a tricky case 14 Additionally, some expressions are really tricky… … also syntactically! “por la mañana” vs “en la mañana” (“in the morning”).
  • 15. Añotador: a Temporal Tagger for Spanish Pipeline of Añotador 15 CoreNLP pipeline + IxaPipes RULES for each sentence for each expression reference current Date? do Calculus type? DATE YES setFormat INPUT text + anchor date OUTPUT NORMALIZATION ALGORITHM token rules identify numbers, months, granularities ( day )... basic TEs rules # + granularity: two days compound rules merge TEs: 1 month (1M) and 2 days (2D) = P1M2D YYYY-MM-DD TIME DURATION SET normalize XX values NO add lastDate process Durations NO reference next/last YES NO is anchored ? YES update lastDate ANNOTADOR 3. Normalization algorithm:  Works for each sentence sepparately.  Different approaches for each type of expressions.  Can take into account different reference dates.
  • 16. Añotador: a Temporal Tagger for Spanish Pipeline of Añotador 16 CoreNLP pipeline + IxaPipes RULES for each sentence for each expression reference current Date? do Calculus type? DATE YES setFormat INPUT text + anchor date OUTPUT NORMALIZATION ALGORITHM token rules identify numbers, months, granularities ( day )... basic TEs rules # + granularity: two days compound rules merge TEs: 1 month (1M) and 2 days (2D) = P1M2D YYYY-MM-DD TIME DURATION SET normalize XX values NO add lastDate process Durations NO reference next/last YES NO is anchored ? YES update lastDate ANNOTADOR 4. Output format: - TimeML - JSON
  • 18. Añotador: a Temporal Tagger for Spanish Hourglass corpus While building Añotador, we noticed both the scarcity and lack of heterogeneity of corpora in Spanish.  We decided to create a synthetic dataset of expressions that a temporal tagger should cover, including common expressions and sentences that tend to produce false positives.  We additionally asked people foreign to the task to provide temporal expressions. These contributors: o Had different backgrounds (linguists, computer scientists, translators…) o Were from different Spanish-speaking countries and regions (such as Venezuelan or Ecuatorians, and people from different regions in Spain) 18
  • 19. Añotador: a Temporal Tagger for Spanish Hourglass corpus The result was the Hourglass corpus, composed of two parts: 19 SYNTHETIC part: o Short texts annotated with TimeML. o It includes additional tags in order to facilitate the evaluation of different temporal expressions, such as “Dates”, “Fractions” or “False” (referring to false positives like “50/2/1991” or “February 30”). PEOPLE part: o Sentences from contributors with different backgrounds and from different Spanish- speaking countries. o Also tagged and annotated following the standards, as well as with the register of the sentence (colloquial, chat, latin expressions, Latin American expressions, phrases and one literary sentence).
  • 20. Añotador: a Temporal Tagger for Spanish Hourglass corpus in numbers 20
  • 22. Añotador: a Temporal Tagger for Spanish Hourglass corpus We evaluated Añotador against the available state of the art temporal taggers for Spanish*:  HeidelTime  SUTime We used the software GATE and the measures precision (P), recall (R) and F1-measure (F1), on the following corpora:  Hourglass corpus.  TempEval2 dataset** * We also tried to compare it to TIPSem, but we were not able to run it for Spanish despite contacting the creator. We nevertheless thank Héctor Llorens for his help. ** TempEval3 test is not available online, and TimeModeS contains historical texts, out of our target. 22
  • 23. Añotador: a Temporal Tagger for Spanish Results against the Hourglass corpus 23 In the HourGlass corpus, Añotador always shows the best performance. Value means the normalization is correct Type means that the temporal expression was correctly classified Extent means that the temporal expression was correctly marked in the text
  • 24. Añotador: a Temporal Tagger for Spanish Results against the TempEval2 corpus 24 Results of the temporal taggers in the TempEval 2 corpus. HeidelTime is slightly better at finding the normalized value (0.0027 on average), but Añotador is better in the rest of metrics. SUTime shows good Precission.
  • 25. Añotador: a Temporal Tagger for Spanish Some examples The following examples were difficult to handle to the taggers: 25 Example Añotador SUTime HeidelTime “1 año, 6 meses y un día” (“1 year, 6 months and one day”) 1 año, 6 meses y un día 1 año, 6 meses y un día 1 año, 6 meses y un día “Cinco para las 11.” (“Five to eleven.”) Cinco para las 11. Cinco para las 11. Cinco para las 11. “lo vuestro dura 1h, no?” (“your stuff lasts 1h, right?”) lo vuestro dura 1h, no? lo vuestro dura 1h, no? lo vuestro dura 1h, no? “en cero coma” (in a short amount of time) en cero coma en cero coma en cero coma
  • 27. Añotador: a Temporal Tagger for Spanish Context-dependent temporal expressions  Most temporal expressions are context-free, so we require no extra information to detect or normalize them, or depend just on an anchor date.  Therefore, temporal taggers tend to cover them easily.  But we additionally find context-dependent ones, that can be challenging.  After some analysis, we detected three types of general dependencies: o TE dependent on the register o TE dependent on temporal information o TE dependent on geographical information 27
  • 28. Añotador: a Temporal Tagger for Spanish Context dependent temporal expressions TE dependent on the register.  Non-temporal words used in a temporal sense: o “Él tiene 37 castañas/tacos” (“He has 37 chestnuts/tacos”), both meaning “He is 37 years old”.  Meronyms: o “Tiene 30 primaveras/abriles ya” (“He already has 30 springs/Aprils”), meaning “He is 30 years old”.  Idioms involving temporal expressions: o “Hasta el 40 de mayo no te quites el sayo” (“Until 40th May, do not take off the jacket”), meaning that the beginning of June can still be chilly.  Latin America: o “la hora del moro” (“the hour of the moor”) means “lunch time” in the Dominican Republic. 28
  • 29. Añotador: a Temporal Tagger for Spanish Context dependent temporal expressions TE dependent on temporal information  We usually asume the Gregorian calendar and the reference date as the current one, but this is not always that simple.  There are other calendars, both historical and contemporary, such as: o The Julian calendar o The French Republican calendar o The Japanese era system. o The Chinese Calendar o Persian Solar Hijri calendar o … 29
  • 30. Añotador: a Temporal Tagger for Spanish Context dependent temporal expressions TE dependent on geographical information  Geographical information infludes in how to normalize: o Temponyms such as: • International Children’s Day (15/04 Spain, 30/04 Mexico) • National Day (12/10 Spain, 14/07 France) o Date formats: • Some countries have their own separators (eg, Germany uses DD.MM.YYYY) • Europe tends to DD/MM/YYYY • EEUU tends to MM/DD/YYYY • Japan tends to YYYY/MM/DD (when using our calendar!) • eg: 11/09 is 11th September in Spain, but 9th November in EEUU. Can be a combination (eg, Emperor’s birthday in Japan) 30
  • 32. Añotador: a Temporal Tagger for Spanish Conclusions  Results: o Añotador: • Able to process Spanish (and English) texts and extract temporal expressions. • Improves the state of the art both in our corpus and in TempEval2. o Hourglass corpus: new corpus with two parts. • Synthetic dataset for systematic testing of temporal taggers, including new registers and non-Castilian Spanish texts. • People contributions of real common temporal expressions.  Future work: o Improve the system to make better cover the examples in HourGlass. o Deal with Context Dependent Temporal Expressions. 32
  • 33. María Navas-Loro Ontology Engineering Group Universidad Politécnica de Madrid, Spain Víctor Rodríguez-Doncel Ontology Engineering Group Universidad Politécnica de Madrid, Spain Añotador: a Temporal Tagger for Spanish mnavas@fi.upm.es https://marianavas.oeg-upm.net 29/10/2019 LKE 2019
  • 34. Añotador: a Temporal Tagger for Spanish References - Corpora 34 1. J. Pustejovsky et al., “The timebank corpus,” in Corpus linguistics, vol. 2003, p. 40, Lancaster, UK, 2003. 2. http://semeval2.fbk.eu/semeval2.php?location=tasks&taskid=5 3. L. Minard et al., “Meantime, the newsreader multilingual event and time corpus,” in Proc. LREC 2016, 2016. 4. W. Styler IV et al., “Temporal annotation in the clinical domain,” Transactions of ACL, vol. 2, pp. 143–154, 2014. 5. P. Mazur et al., “Wikiwars: A new corpus for research on temporal expressions,” in Proc. EMNLP 2010, pp. 913–922, 2010. 6. J. Strötgen et al., “Temporal tagging on different domains: Challenges, strategies, and gold standards,” in LREC, vol. 12, pp. 3746–3753, 2012. 7. J. Tabassum et al., “Tweetime: A minimally supervised method for recognizing and normalizing time expressions in twitter,” arXiv preprint arXiv:1608.02904, 2016. 8. https://catalog.ldc.upenn.edu/LDC2012T12 9. https://catalog.ldc.upenn.edu/LDC2012T01
  • 35. Añotador: a Temporal Tagger for Spanish References - Taggers 35 10. J. Strötgen and M. Gertz, “Multilingual and cross-domain temporal tagging,” LREv, vol. 47, no. 2, pp. 269–298, 2013. 11. A. X. Chang and C. D. Manning, “Sutime: A library for recognizing and normalizing time expressions,” in LREC, vol. 2012, pp. 3735–3740, 2012. 12. X. Zhong et al., “Time expression analysis and recognition using syntactic token types and general heuristic rules,” in Proc. the 55th Annual Meeting of the ACL, vol. 1, pp. 420–429, 2017. 13. S. Bethard, “ClearTK-TimeML: A minimalist approach to TempEval 2013,” in Proc. the Workshop SemEval 2013, pp. 10–14, ACL, 2013. 14. L. Derczynski et al., “Usfd2: Annotating temporal expresions and tlinks for tempeval-2,” in Proc. the Workshop SemEval, pp. 337–340, ACL, 2010. 15. K. Lee et al., “Context-dependent semantic parsing for time expressions,” in Proc. the 52nd Annual Meeting of the ACL, vol. 1, pp. 1437–1447, 2014. 16. M. Verhagen et al., “Automating Temporal Annotation with TARSQI,” in Proc. the ACL 2005 on Interactive Poster and Demonstration Sessions, pp. 81–84, ACL, 2005. 17. N. Chambers et al., “Dense event ordering with a multi-pass architecture,” Transactions of ACL, vol. 2, pp. 273–284, 2014. 18. H. Llorens et al., “Tipsem (english and spanish): Evaluating crfs and semantic roles in tempeval-2,” in Proc. the Workshop SemEval, pp. 284–291, ACL, 2010.