Scientific Paper Analysis
Yuji Matsumoto
Computational Linguistics Lab
Graduate School of Information Science
March 6, 2015
Big Data Symposium at NAIST
Large Scale Text Data
Data on the Web
SNS: Twitter, blogs
Wikipedia
News, …
Scientific/Technical documents
Scientific Papers
Legal documents: law reports, casebooks
Patent documents
Knowledge Bases
Constructed manually
WordNet, Domain ontologies
Constructed by community
(Wikipedia)
Freebase
Constructed automatically
NELL: Never-Ending Language Learning
MindNet
Structures of KB
Linked structure
entities and relations
Entity: person, country, product, etc.
Relation: born_in(Barack Obama, Honolulu)
locates_in(Honolulu, Hawaii)
state_of(Hawaii, USA)
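The linked structure above can be sketched in a few lines of Python. This is only an illustration, not how any of the knowledge bases named earlier is actually stored: the `Triple` type and the `build_index` and `follow_links` helpers are hypothetical names introduced here, and the data are just the three relation instances from the slide.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One relation instance, written relation(subject, obj) on the slide."""
    relation: str
    subject: str
    obj: str


# The three relation instances from the slide.
TRIPLES = [
    Triple("born_in", "Barack Obama", "Honolulu"),
    Triple("locates_in", "Honolulu", "Hawaii"),
    Triple("state_of", "Hawaii", "USA"),
]


def build_index(triples):
    """Map each subject entity to its outgoing (relation, object) links."""
    index = defaultdict(list)
    for t in triples:
        index[t.subject].append((t.relation, t.obj))
    return index


def follow_links(index, start):
    """Follow outgoing links from `start` and return the triples visited.

    Simplifying assumption: only the first outgoing link of each entity
    is followed, and the walk stops if it would revisit an entity.
    """
    path, seen, current = [], {start}, start
    while index.get(current):
        relation, obj = index[current][0]
        path.append(Triple(relation, current, obj))
        if obj in seen:
            break
        seen.add(obj)
        current = obj
    return path


if __name__ == "__main__":
    index = build_index(TRIPLES)
    for t in follow_links(index, "Barack Obama"):
        print(f"{t.relation}({t.subject}, {t.obj})")
```

Because entities are shared between relations, following the links from "Barack Obama" reaches "USA" even though no single relation connects the two directly.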
Natural Language Analysis
How text is analyzed
Word segmentation, Part-of-speech tagging
Named entity recognition
Syntactic parsing
Semantic disambiguation
Semantic parsing
Discourse analysis
Linked Knowledge Extraction
Named entity recognition
Extraction of entities, concepts
Syntactic dependency parsing
direct dependency between entities
Semantic parsing
predicate argument structure analysis
subject-predicate-object, relation between entities
Discourse analysis
co-reference: the same entity referred to by different mentions
relation between facts: temporal, causal
Semantic Parsing: Example
We analyzed the effect on the binding and the activity of transcription factors at a regulatory element.
TPA induction inhibits the binding of the transcription factor NF-E2 to this transcriptional control element.
TPA induction increases the binding of AP-1 factors to this element.
[Figure: the three sentences (S1 to S3) annotated with Cause and Theme arcs linking events to their arguments]
Katsumasa Yoshikawa, Sebastian Riedel, Tsutomu Hirao, Masayuki Asahara, Yuji Matsumoto,
"Coreference Based Event-Argument Relation Extraction on Biomedical Text,"
Journal of Biomedical Semantics, Volume 2, Supplement 5, S6, October 2011
Co-reference analysis
"this element" in S2 is coreferent to "a regulatory element" in S1
[Figure: the same three sentences (S1 to S3) with their Cause and Theme arcs, plus a Corefer arc from "this element" to "a regulatory element"]
Information conflation
The true argument (Theme) of binding is "a regulatory element" and "this element" is just an anaphor of it
Transitivity enables us to conflate the information
(A) Theme & (B) Corefer => (C) Theme
[Figure: the same three sentences (S1 to S3), with (A) the Theme arc from "binding" to "this element", (B) the Corefer arc from "this element" to "a regulatory element", and (C) the inferred Theme arc from "binding" to "a regulatory element"]
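The rule "(A) Theme & (B) Corefer => (C) Theme" can be sketched as a small fixed-point computation. This is a deterministic illustration only; the notes describe an approach based on Markov Logic, which this sketch does not reproduce. The function name `conflate_arguments` and the tuple layouts are assumptions made for the example.

```python
def conflate_arguments(role_edges, corefer_pairs):
    """Apply role(event, anaphor) & corefer(anaphor, antecedent)
    => role(event, antecedent) until nothing new can be inferred.

    role_edges:    set of (role, event, mention) tuples, e.g. Theme arcs
    corefer_pairs: set of (anaphor, antecedent) tuples
    Returns only the newly inferred edges.
    """
    known = set(role_edges)
    changed = True
    while changed:
        changed = False
        for role, event, mention in list(known):
            for anaphor, antecedent in corefer_pairs:
                if mention != anaphor:
                    continue
                new_edge = (role, event, antecedent)
                if new_edge not in known:
                    known.add(new_edge)
                    changed = True
    return known - set(role_edges)


if __name__ == "__main__":
    # (A) "this element" is a Theme of "binding".
    edges = {("Theme", "binding", "this element")}
    # (B) "this element" is coreferent to "a regulatory element".
    corefer = {("this element", "a regulatory element")}
    # (C) is inferred: "a regulatory element" is a Theme of "binding".
    print(conflate_arguments(edges, corefer))
```

Repeating the loop until no edge is added lets the rule pass through longer coreference chains, where an antecedent is itself an anaphor of an earlier mention.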
Discourse analysis
[Figure: the same three sentences (S1 to S3), now with Corefer arcs linking "this element" and "this transcriptional control element" to "a regulatory element", and Theme arcs carried over to "a regulatory element" along these links]
What we can do with Scientific Papers
Knowledge extraction (domain knowledge)
New fact discovery
Content-aware paper search
Summarization
Automatic generation of abstracts
Keyword generation
Survey generation
Recommendation of related papers
Similar article/case search
Structural similarity: papers, law reports, patents
Related Project
Big Mechanism (2014.07-, by DARPA)
http://www.darpa.mil/Our_Work/I2O/Programs/Big_Mechanism.aspx
The Big Mechanism program aims to develop technology to read research abstracts and papers to extract pieces of causal mechanisms, assemble these pieces into more complete causal models, and reason over these models to produce explanations. The domain of the program is cancer biology with an emphasis on signaling pathways.
Deep Language Analysis
Complex sentence structure analysis
Robust Semantic Parsing
Discourse Analysis
Co-reference
Causal / Temporal relation
Representation and Reasoning
Explanation / Anticipation
Confidence/credibility (of extracted facts / what is written in documents)
Language Processing and Document Analysis Layers
Large-scale Text Data
syntactic dependency structure
argument structure, coreference
rhetorical / document structure
POS tags, phrase/NE chunking
relations (temporal, causal, entailment)
Knowledge
Ontology
Document Analysis (Document Understanding, Similarity-based Search, Knowledge Discovery/Assembling)
We may be able to do more
Research Trend Survey
Research (paper) Evaluation
Content-aware citation analysis
Innovation Foresight
E.g.: Foresight and Understanding from Scientific Exposition (FUSE) Project
http://www.iarpa.gov/index.php/research-programs/fuse
Collaboration with people in application areas who need to read/understand documents
Editor's Notes
OK, let's see a typical example of event-argument relations that includes coreference information.
Probably most people here know biomedical event extraction much better than I do.
Actually, I'm a stranger to bioinformatics. I'm an NLP researcher.
OK, anyway, here, we can see events and arguments in S2.
E-A relations in S2 are perfectly labeled at least under the intra-sentential constraints.
However, arguments are often related to the other mentions through coreference relations.
So, when considering the contexts from forward and backward sentences...
We can see "this element" in S2 is coreferent to "a regulatory element" in S1
Corefer means that two or more mentions refer to the same entity.
In that case, the true Theme of "binding" should be "a regulatory element".
On the other hand, "this element" is just an anaphor of it.
Here, we find transitivity.
I'll show you that.
Now, “this element” is a Theme of “binding”.
And “this element” is coreferent to the mention “a regulatory element”.
Then “a regulatory element” is also a Theme of “binding”.
It’s very simple but pretty effective for identifying this kind of event-argument relation.
This is transitivity.
OK, let's move on to another strategy.
If we look at the third sentence, another phrase, “this transcriptional control element”, is coreferent to “a regulatory element”.
So, the entity described by “a regulatory element” is mentioned several times, over and over again, right?
This red line is sometimes called an anaphoric chain, and the arguments in such a long chain have higher salience in the discourse.
They are valuable in discourse structure and can help our document understanding.
So, we want to extract such arguments aggressively.
Our approach with Markov Logic can implement this idea in a very direct fashion.
Moreover, such arguments are more likely to be arguments of events, and this information can improve the performance of event-argument extraction.
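As a rough illustration of the anaphoric chain idea, the sketch below groups mentions into chains from pairwise Corefer links and ranks the chains by length. It is not the Markov Logic approach mentioned in the notes: it uses a plain union-find, and treating chain length as a stand-in for salience is an assumption made only for this example. The function name `coreference_chains` is hypothetical.

```python
def coreference_chains(corefer_pairs):
    """Group mentions into chains from pairwise coreference links.

    corefer_pairs: iterable of (mention_a, mention_b) tuples.
    Returns a list of chains (sorted lists of mentions), longest first.
    """
    parent = {}

    def find(mention):
        # Union-find lookup with path halving.
        parent.setdefault(mention, mention)
        while parent[mention] != mention:
            parent[mention] = parent[parent[mention]]
            mention = parent[mention]
        return mention

    for a, b in corefer_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    chains = {}
    for mention in list(parent):
        chains.setdefault(find(mention), []).append(mention)

    # Longer chains first; chain length is used here as a crude salience proxy.
    return sorted((sorted(c) for c in chains.values()),
                  key=lambda c: (-len(c), c))


if __name__ == "__main__":
    pairs = [
        ("this element", "a regulatory element"),
        ("this transcriptional control element", "a regulatory element"),
    ]
    for chain in coreference_chains(pairs):
        print(len(chain), chain)
```

With the two Corefer links from the example, all three mentions fall into a single chain of length three, which is the kind of frequently mentioned entity the notes suggest extracting aggressively.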