SlideShare ist ein Scribd-Unternehmen logo
1 von 1
Downloaden Sie, um offline zu lesen
Cardinal Virtues: Extracting Relation Cardinalities from Text
Paramita Mirza1, Simon Razniewski2, Fariz Darari2 and Gerhard Weikum1
1 Max Planck Institute for Informatics, Germany
2 Free University of Bozen-Bolzano, Italy
1. Overview
• IE has largely focused on answering “Who has won which award?”
• However, some facts are never fully mentioned and no IE method has perfect recall
• Sentences like “John lives with his spouse and 5 children on a farm in Alabama” are
much more frequent in texts.
• We focus instead on answering “How many awards has someone won?”
• Useful for aggregate query answering, e.g., “Who won the most awards?”
• Contributions:
• We introduce the problem of Relation Cardinality Extraction
• We present a distant supervision method using Conditional Random Fields
• We discuss specific challenges that set it apart from standard IE
Relation Cardinality
a mention that expresses relation cardinality
is
a cardinal number that states the number of objects
that stand in a specific relation with a certain subject
“Barack and Michelle Obama have two children, which are currently ….”
Acknowledgment
This work has been partially supported by the projects “TCFR - The Call for Recall”, funded by the Free University of Bozen-Bolzano.
2. Motivation A: Knowledge Base (KB) curation
KB recall is highly variant and mostly unknown 
“Barack and Michelle Obama have two children, which are currently ….”
4. Relation Cardinality Extraction
“Given a well defined relation/predicate p, a subject s and a correspondingtext about s,
we try to estimate the relation cardinality,
i.e., the count of <s, p, *> triples”
Methodology
• Sequence labelling problem:
Barack and Michelle Obama have two children , which are currently ….
Barack and Michelle Obama have _num_ child , which be currently …  lemma
O O O O O CHILD O O O O O
• Conditional Random Fields (CRF) model using CRF++ (Kudo, 2005)
• Feature set: lemma of observed token t, context lemmas (windows size = 5),
bigrams and trigrams containing t
• Distant supervision for generatingtraining data
• Given an <s, p> pair we identify:
‐ the triple count |<s, p, *>| from Wikidata(Vrandečić and Krötzsch, 2014);
and
‐ candidatesentences from English Wikipedia article of s
‐ candidatenumbers (not labelled as TEMPORAL, MONEY or PERCENT) in each
sentence (if any)
• We generatetraining examples by labellinga candidatenumber n with p if
n = |<s, p, *>|, otherwise, it is labelled as O, like the rest of non-number tokens
• Prediction
• Having the annotated sentences by the CRF-based model,
• Relation cardinality for a given <s, p> pair is the candidatenumber labelled with
p, which has the highest confidencescore (i.e., marginal probability of a token
labelled as such, resulting from forward-backward inference)
Experiments
• Evaluation on manually annotated randomly sampled subjects for 4 Wikidata properties:
20 (has part), 100 (contains admin.) and 200 (child and spouse)
• baseline: randomly select a number from a pool of numbers in text
• only nummod: consider only candidate numbers that modify a noun
KB: 0 KB: 1 KB: 2
Recall: 0% Recall: 50% Recall: 100%
3. Motivation B: Disregarded by state-of-the-art (Open) IE systems
Despite its frequency 
• Open IE (Mausam et al. 2012; Del Corro and Gemulla, 2013)
• No way to interpretthe numeric expression in the Object slot , e.g., <Obama, has,
two children>
• KB-population IE, e.g., NELL (Mitchellet al., 2015)
• Knows 13 relations about the number of casualties and injuries in disasters, e.g.,
<Berlin2016attack, hasNumOfVictims, 32>
• Contains only seed facts and no learned facts
Stanford Named Entity (NE) tagger on cardinal numbers in 10K Wikipedia articles
July 30-August 4, 2017 ∙ Vancouver, Canada
DBpedia contains currently
only 6 out of 35 Dijkstra
Prize winners 
According to YAGO, the average number of
children per person is 0.02 
167 out of 199 Nobel laureates
in Physics are in DBpedia ☺
2 out of 2 children of Obama are in
Wikidata ☺
5. Challenges in Relation Cardinality Extraction
Quality of Training Data
• Distant supervision from highly incompleteKB
• e.g., manual annotation on child evaluation set  Wikidata is only ±50% accurate.
• Unlike in classical IE, missing ground truth may lead to false positives as well.
• Possible approaches:
• Filtering ground truth  consider only popular entities for training.
• Incompleteness-resilient distant supervision  label all numbers equal or higher
than the KB count as positive examples.
Compositionality
• “They have two sons and one daughter together; he has four children from his first wife.”
• 16% of false positives in extracting child cardinalities
• Possible approaches:
• Aggregating numbers  in training data generation, label a sequence of numbers
as correct cardinalitiesif the sum is equal to the KB count; in prediction step, sum
up all consecutivecardinalities.
• Learning composition rules  e.g., children are composed of sons and daughters.
Linguistic Variance
• Ordinals are quite common to express lower bounds, e.g., John’s first wife, Mary, …”.
• Relation cardinalities are sometimes expressed with non-numerals, e.g., “He never married”,
“They have a daughter together”, “The book is a trilogy”.
• Possible approaches:
• Translation to numbers  translate certain kinds of negation and indefinitearticles
into expressions containing 0 and 1.
• Word similarity with cardinals  consider words bear high similaritywith cardinal
numbers, possibly in other language such as Latin or Greek.
p #s train
baseline vanilla only nummod
P P R F1 P R F1
has part (creative work series) 261 .050 .333 .316 .324 .353 .316 .333
contains admin 18,000 .034 .390 .188 .254 .548 .200 .293
spouse 45,917 0 .014 .011 .013 .028 .017 .021
child 35,057 .112 .151 .129 .139 .320 .219 .260
child (manual ground truth) 6,408 .374 .309 .338 .452 .315 .317
Further Reading
• Predicting Completeness in Knowledge Bases, Luis Galárraga, Simon Razniewski, Antoine Amarilli,
Fabian M. Suchanek, WSDM, Cambridge, UK, 2017
• Expanding Wikidata’s Parenthood Information by 178%, or How To Mine Relation Cardinalities,
Paramita Mirza, Simon Razniewski, Werner Nutt, ISWC Poster, Osaka, Japan, 2016
• But What Do We Actually Know?, Simon Razniewski, Fabian Suchanek, Werner Nutt, AKBC
workshop at NAACL, San Diego, USA, 2016
• Identifying the Extent of Completeness of Query Answers over Partially Complete Databases, Simon
Razniewski, Flip Korn, Werner Nutt, Divesh Srivastava, SIGMOD, Melbourne, Australia, 2015
• A tool for crowdsourced completeness annotations for Wikidata: http://cool-wd.inf.unibz.it/
18.86%

Weitere ähnliche Inhalte

Ähnlich wie Paramita Mirza - 2017 - Cardinal Virtues: Extracting Relation Cardinalities from Text

Logic programming (1)
Logic programming (1)Logic programming (1)
Logic programming (1)
Nitesh Singh
 
Please review this assignment tutorial for help filling out this w.docx
Please review this assignment tutorial for help filling out this w.docxPlease review this assignment tutorial for help filling out this w.docx
Please review this assignment tutorial for help filling out this w.docx
mattjtoni51554
 
LiteracyOutcomesLADT_Poster_FINAL
LiteracyOutcomesLADT_Poster_FINALLiteracyOutcomesLADT_Poster_FINAL
LiteracyOutcomesLADT_Poster_FINAL
Samantha Kienemund
 

Ähnlich wie Paramita Mirza - 2017 - Cardinal Virtues: Extracting Relation Cardinalities from Text (20)

Conceptual Plurality in Japanese EFL Learners' Online Sentence Processing: A ...
Conceptual Plurality in Japanese EFL Learners' Online Sentence Processing: A ...Conceptual Plurality in Japanese EFL Learners' Online Sentence Processing: A ...
Conceptual Plurality in Japanese EFL Learners' Online Sentence Processing: A ...
 
Second and foreign language data
Second and foreign language dataSecond and foreign language data
Second and foreign language data
 
Logic programming (1)
Logic programming (1)Logic programming (1)
Logic programming (1)
 
Algebra__Day1.ppt
Algebra__Day1.pptAlgebra__Day1.ppt
Algebra__Day1.ppt
 
The Science and Art of Qualitative Writing and Analysis UWI 2023
The Science and Art of Qualitative Writing and Analysis UWI 2023The Science and Art of Qualitative Writing and Analysis UWI 2023
The Science and Art of Qualitative Writing and Analysis UWI 2023
 
Quantitative analysis in language research
Quantitative analysis in language researchQuantitative analysis in language research
Quantitative analysis in language research
 
IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
Harnessing Linked Knowledge Sources for Topic Classification in Social Media
Harnessing Linked Knowledge Sources for Topic Classification in Social MediaHarnessing Linked Knowledge Sources for Topic Classification in Social Media
Harnessing Linked Knowledge Sources for Topic Classification in Social Media
 
Meeting the needs of slife de capua sc 09 03-15
Meeting the needs of slife de capua sc 09 03-15 Meeting the needs of slife de capua sc 09 03-15
Meeting the needs of slife de capua sc 09 03-15
 
Please review this assignment tutorial for help filling out this w.docx
Please review this assignment tutorial for help filling out this w.docxPlease review this assignment tutorial for help filling out this w.docx
Please review this assignment tutorial for help filling out this w.docx
 
The Triplex Approach for Recognizing Semantic Relations from Noun Phrases, Ap...
The Triplex Approach for Recognizing Semantic Relations from Noun Phrases, Ap...The Triplex Approach for Recognizing Semantic Relations from Noun Phrases, Ap...
The Triplex Approach for Recognizing Semantic Relations from Noun Phrases, Ap...
 
Babak Rasolzadeh: The importance of entities
Babak Rasolzadeh: The importance of entitiesBabak Rasolzadeh: The importance of entities
Babak Rasolzadeh: The importance of entities
 
Big Data LDN 2017: Machine Learning on Structured Data. Why Is Learning Rules...
Big Data LDN 2017: Machine Learning on Structured Data. Why Is Learning Rules...Big Data LDN 2017: Machine Learning on Structured Data. Why Is Learning Rules...
Big Data LDN 2017: Machine Learning on Structured Data. Why Is Learning Rules...
 
DATA641 Lecture 3 - Word meaning.pptx
DATA641 Lecture 3 - Word meaning.pptxDATA641 Lecture 3 - Word meaning.pptx
DATA641 Lecture 3 - Word meaning.pptx
 
Measuring massive multitask language understanding
Measuring massive multitask language understandingMeasuring massive multitask language understanding
Measuring massive multitask language understanding
 
LiteracyOutcomesLADT_Poster_FINAL
LiteracyOutcomesLADT_Poster_FINALLiteracyOutcomesLADT_Poster_FINAL
LiteracyOutcomesLADT_Poster_FINAL
 
Modeling health related topics in an online forum designed for the deaf & har...
Modeling health related topics in an online forum designed for the deaf & har...Modeling health related topics in an online forum designed for the deaf & har...
Modeling health related topics in an online forum designed for the deaf & har...
 
Mathematics: skills, understanding or both?
Mathematics: skills, understanding or both?Mathematics: skills, understanding or both?
Mathematics: skills, understanding or both?
 
Are children with_specific_language_impairment_competent_with_the_pragmatics_...
Are children with_specific_language_impairment_competent_with_the_pragmatics_...Are children with_specific_language_impairment_competent_with_the_pragmatics_...
Are children with_specific_language_impairment_competent_with_the_pragmatics_...
 

Mehr von Association for Computational Linguistics

Mehr von Association for Computational Linguistics (20)

Muis - 2016 - Weak Semi-Markov CRFs for NP Chunking in Informal Text
Muis - 2016 - Weak Semi-Markov CRFs for NP Chunking in Informal TextMuis - 2016 - Weak Semi-Markov CRFs for NP Chunking in Informal Text
Muis - 2016 - Weak Semi-Markov CRFs for NP Chunking in Informal Text
 
Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...
Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...
Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...
 
Castro - 2018 - A Crowd-Annotated Spanish Corpus for Humour Analysis
Castro - 2018 - A Crowd-Annotated Spanish Corpus for Humour AnalysisCastro - 2018 - A Crowd-Annotated Spanish Corpus for Humour Analysis
Castro - 2018 - A Crowd-Annotated Spanish Corpus for Humour Analysis
 
Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...
Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...
Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...
 
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future DirectionsDaniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
 
Elior Sulem - 2018 - Semantic Structural Evaluation for Text Simplification
Elior Sulem - 2018 - Semantic Structural Evaluation for Text SimplificationElior Sulem - 2018 - Semantic Structural Evaluation for Text Simplification
Elior Sulem - 2018 - Semantic Structural Evaluation for Text Simplification
 
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future DirectionsDaniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
 
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...
 
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...
 
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...
 
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 WorkshopSatoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
 
Chenchen Ding - 2015 - NICT at WAT 2015
Chenchen Ding - 2015 - NICT at WAT 2015Chenchen Ding - 2015 - NICT at WAT 2015
Chenchen Ding - 2015 - NICT at WAT 2015
 
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
 
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
 
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
 
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
 
Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015
Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015
Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015
 
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 WorkshopSatoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
 
Chenchen Ding - 2015 - NICT at WAT 2015
Chenchen Ding - 2015 - NICT at WAT 2015Chenchen Ding - 2015 - NICT at WAT 2015
Chenchen Ding - 2015 - NICT at WAT 2015
 
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
 

Kürzlich hochgeladen

The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
heathfieldcps1
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
heathfieldcps1
 

Kürzlich hochgeladen (20)

Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural ResourcesEnergy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 

Paramita Mirza - 2017 - Cardinal Virtues: Extracting Relation Cardinalities from Text

  • 1. Cardinal Virtues: Extracting Relation Cardinalities from Text Paramita Mirza1, Simon Razniewski2, Fariz Darari2 and Gerhard Weikum1 1 Max Planck Institute for Informatics, Germany 2 Free University of Bozen-Bolzano, Italy 1. Overview • IE has largely focused on answering “Who has won which award?” • However, some facts are never fully mentioned and no IE method has perfect recall • Sentences like “John lives with his spouse and 5 children on a farm in Alabama” are much more frequent in texts. • We focus instead on answering “How many awards has someone won?” • Useful for aggregate query answering, e.g., “Who won the most awards?” • Contributions: • We introduce the problem of Relation Cardinality Extraction • We present a distant supervision method using Conditional Random Fields • We discuss specific challenges that set it apart from standard IE Relation Cardinality a mention that expresses relation cardinality is a cardinal number that states the number of objects that stand in a specific relation with a certain subject “Barack and Michelle Obama have two children, which are currently ….” Acknowledgment This work has been partially supported by the projects “TCFR - The Call for Recall”, funded by the Free University of Bozen-Bolzano. 2. Motivation A: Knowledge Base (KB) curation KB recall is highly variant and mostly unknown  “Barack and Michelle Obama have two children, which are currently ….” 4. Relation Cardinality Extraction “Given a well defined relation/predicate p, a subject s and a correspondingtext about s, we try to estimate the relation cardinality, i.e., the count of <s, p, *> triples” Methodology • Sequence labelling problem: Barack and Michelle Obama have two children , which are currently …. Barack and Michelle Obama have _num_ child , which be currently …  lemma O O O O O CHILD O O O O O • Conditional Random Fields (CRF) model using CRF++ (Kudo, 2005) • Feature set: lemma of observed token t, context lemmas (windows size = 5), bigrams and trigrams containing t • Distant supervision for generatingtraining data • Given an <s, p> pair we identify: ‐ the triple count |<s, p, *>| from Wikidata(Vrandečić and Krötzsch, 2014); and ‐ candidatesentences from English Wikipedia article of s ‐ candidatenumbers (not labelled as TEMPORAL, MONEY or PERCENT) in each sentence (if any) • We generatetraining examples by labellinga candidatenumber n with p if n = |<s, p, *>|, otherwise, it is labelled as O, like the rest of non-number tokens • Prediction • Having the annotated sentences by the CRF-based model, • Relation cardinality for a given <s, p> pair is the candidatenumber labelled with p, which has the highest confidencescore (i.e., marginal probability of a token labelled as such, resulting from forward-backward inference) Experiments • Evaluation on manually annotated randomly sampled subjects for 4 Wikidata properties: 20 (has part), 100 (contains admin.) and 200 (child and spouse) • baseline: randomly select a number from a pool of numbers in text • only nummod: consider only candidate numbers that modify a noun KB: 0 KB: 1 KB: 2 Recall: 0% Recall: 50% Recall: 100% 3. Motivation B: Disregarded by state-of-the-art (Open) IE systems Despite its frequency  • Open IE (Mausam et al. 2012; Del Corro and Gemulla, 2013) • No way to interpretthe numeric expression in the Object slot , e.g., <Obama, has, two children> • KB-population IE, e.g., NELL (Mitchellet al., 2015) • Knows 13 relations about the number of casualties and injuries in disasters, e.g., <Berlin2016attack, hasNumOfVictims, 32> • Contains only seed facts and no learned facts Stanford Named Entity (NE) tagger on cardinal numbers in 10K Wikipedia articles July 30-August 4, 2017 ∙ Vancouver, Canada DBpedia contains currently only 6 out of 35 Dijkstra Prize winners  According to YAGO, the average number of children per person is 0.02  167 out of 199 Nobel laureates in Physics are in DBpedia ☺ 2 out of 2 children of Obama are in Wikidata ☺ 5. Challenges in Relation Cardinality Extraction Quality of Training Data • Distant supervision from highly incompleteKB • e.g., manual annotation on child evaluation set  Wikidata is only ±50% accurate. • Unlike in classical IE, missing ground truth may lead to false positives as well. • Possible approaches: • Filtering ground truth  consider only popular entities for training. • Incompleteness-resilient distant supervision  label all numbers equal or higher than the KB count as positive examples. Compositionality • “They have two sons and one daughter together; he has four children from his first wife.” • 16% of false positives in extracting child cardinalities • Possible approaches: • Aggregating numbers  in training data generation, label a sequence of numbers as correct cardinalitiesif the sum is equal to the KB count; in prediction step, sum up all consecutivecardinalities. • Learning composition rules  e.g., children are composed of sons and daughters. Linguistic Variance • Ordinals are quite common to express lower bounds, e.g., John’s first wife, Mary, …”. • Relation cardinalities are sometimes expressed with non-numerals, e.g., “He never married”, “They have a daughter together”, “The book is a trilogy”. • Possible approaches: • Translation to numbers  translate certain kinds of negation and indefinitearticles into expressions containing 0 and 1. • Word similarity with cardinals  consider words bear high similaritywith cardinal numbers, possibly in other language such as Latin or Greek. p #s train baseline vanilla only nummod P P R F1 P R F1 has part (creative work series) 261 .050 .333 .316 .324 .353 .316 .333 contains admin 18,000 .034 .390 .188 .254 .548 .200 .293 spouse 45,917 0 .014 .011 .013 .028 .017 .021 child 35,057 .112 .151 .129 .139 .320 .219 .260 child (manual ground truth) 6,408 .374 .309 .338 .452 .315 .317 Further Reading • Predicting Completeness in Knowledge Bases, Luis Galárraga, Simon Razniewski, Antoine Amarilli, Fabian M. Suchanek, WSDM, Cambridge, UK, 2017 • Expanding Wikidata’s Parenthood Information by 178%, or How To Mine Relation Cardinalities, Paramita Mirza, Simon Razniewski, Werner Nutt, ISWC Poster, Osaka, Japan, 2016 • But What Do We Actually Know?, Simon Razniewski, Fabian Suchanek, Werner Nutt, AKBC workshop at NAACL, San Diego, USA, 2016 • Identifying the Extent of Completeness of Query Answers over Partially Complete Databases, Simon Razniewski, Flip Korn, Werner Nutt, Divesh Srivastava, SIGMOD, Melbourne, Australia, 2015 • A tool for crowdsourced completeness annotations for Wikidata: http://cool-wd.inf.unibz.it/ 18.86%