Material of the Natural Language Processing (NLP) Workshop with STIC-Asia representatives and the Nepal team.
August 30-31, 2007.
Institution: Institut de Recherche en Informatique de Toulouse (IRIT)
Patan Dhoka, Lalitpur, Nepal.
Text Processing for Procedural Question Answering
1. Text Processing for Procedural
Question Answering
Ongoing work for the TextCoop project
ILPL group, presentation by Estelle Delpech
2. Text Processing for Procedural
Question Answering
I. INTRODUCTION : GLOBAL ARCHITECTURE
II. CLUES TO IDENTIFY TITLES / INSTRUCTIONAL COMPOUNDS
III. THE WHOLE PROCESS
IV. MAIN ISSUES
V. DEMO
5. TEXT PROCESSING for Procedural QA :
Identification of task structure
[Diagram: processing chain from the .html page to the DATABASE]
PRE-PROCESSING : HTML cleaning, MS tagging
SEGMENTER : identification of terminal symbols (Title, Instructional Compound)
TEXT GRAMMAR : Xbar analysis of task structure
TASK : Xbar tree with the labels spec, G', Pre-requisite, Goal, complement
6. II. CORPUS OBSERVATION :
WHAT CLUES TO IDENTIFY
- INSTRUCTIONAL COMPOUNDS ?
- TITLES ?
7. 1. Clues for Instructional Compounds
Identification
Definition : kernel instructions linked to various clauses by rhetorical
or logical relations.
Identification in two steps :
Detect presence of instructions : expression of obligation
Find instructional compound boundaries, e.g. connectors…
Fixing the first wall plate (or shelf bracket)
We are going to mark the first wall plate (or bracket) for drilling.
First, position the face plate so one screw lines up with the mark on the wall you made in the last step and place the level on top of the face plate to ensure it is level.
Second, you should mark the wall in the next screw hole, again by turning the screw until it bites into the wall (see fig 1.3).
It is advised that you mark any remaining screw holes while keeping the wall plate firmly in position.
Now you have to choose a suitable drill bit (masonry or the right type for the surface). It should be the same width as the wall plug to be used.
Get to hand one of the wall plugs, and place it against the tip of the drill bit (see fig 1.4).
Finally, place a piece of masking tape on the drill bit to use as a guide, this will ensure you don't drill too deep.
8. 1. Clues for Instructional Compounds
Identification
Presence of instructions :
Morpho-lexical patterns
shall Adv* base form verb
e.g. You should pre-heat the oven
have to Adv* base form verb
e.g. You have to pre-heat the oven
## Op? Adv* base form verb
e.g. Do not pre-heat the oven
it be Adv* (necessary|compulsory) that
e.g. It is better that you pre-heat the oven
Compound boundaries :
Morpho-lexical patterns
## to Adv* base form verb .* ,
e.g. [To cook the cake, pre-heat the oven]
(##|Conj) (if|then|after)
e.g. [and then start peeling …
e.g. [If you want to cook the cake, pre-heat the oven.] [If you don’t want to cook …
HTML tags (typo-disposition) : <p> </p> <li> </li>
e.g. <li> [ Pre-heat the oven … ]</li>
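The obligation patterns listed on this slide can be approximated with plain regular expressions. This is only a minimal sketch: the slides work on MS-tagged text, whereas this version assumes untagged sentences and replaces the "base form verb" tag with a small hypothetical verb list. Reading "Op?" as an optional negation, and accepting "better" alongside "necessary|compulsory", are assumptions drawn from the slide's own examples.

```python
import re

# Hypothetical stand-in for the "base form verb" tag: the slides work on
# MS-tagged text, which this sketch replaces with a small verb list.
BASE_VERB = r"(?:pre-heat|cook|peel|place|mark|choose|position)"
ADV = r"(?:\w+ly\s+)*"  # crude approximation of Adv*

INSTRUCTION_PATTERNS = [
    # shall/should Adv* base form verb
    re.compile(rf"\b(?:shall|should)\s+{ADV}{BASE_VERB}\b", re.I),
    # have to Adv* base form verb
    re.compile(rf"\bhave\s+to\s+{ADV}{BASE_VERB}\b", re.I),
    # ## Op? Adv* base form verb (sentence start, optional negation: assumption)
    re.compile(rf"^(?:do\s+not\s+)?{ADV}{BASE_VERB}\b", re.I),
    # it be Adv* (necessary|compulsory) that; "better" taken from the example
    re.compile(rf"\bit\s+is\s+{ADV}(?:necessary|compulsory|better)\s+that\b", re.I),
]


def has_instruction(sentence: str) -> bool:
    """Return True if any obligation pattern matches the sentence."""
    sentence = sentence.strip()
    return any(p.search(sentence) for p in INSTRUCTION_PATTERNS)
```

With real MS tags, the verb list and the adverb heuristic would be replaced by tests on the tagger output.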
9. 2. Titles identification :
About the HTML encoding of titles
The <hn> tag cannot be used as a single clue for
title identification
HTML encoding is free, the code can be
underspecified (css)
Corpus observation :
80 % of titles are encoded with <b>
57 % of <b> tags encode titles
64 % of <h> tags encode titles
the coding varies from one web site to another
We had to find some other clues …
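The percentages above read in two directions: 80 % is the share of titles carried by <b> (recall of the tag), while 57 % and 64 % are the shares of <b> and <h> tags that really mark a title (precision of the tag). A minimal sketch of how such figures could be computed from a hand-labelled corpus follows; the function name and the toy chunks are hypothetical and are not the workshop's data.

```python
def tag_precision_recall(chunks, tag):
    """chunks: list of (tags, is_title) pairs, where tags is a set of tag names.

    Returns (precision, recall) of `tag` as a title clue.
    """
    # Title flags of every chunk that carries the tag
    tagged = [is_title for tags, is_title in chunks if tag in tags]
    # Tag sets of every chunk that is a title
    titles = [tags for tags, is_title in chunks if is_title]
    precision = sum(tagged) / len(tagged) if tagged else 0.0
    recall = sum(tag in tags for tags in titles) / len(titles) if titles else 0.0
    return precision, recall


# Toy, made-up corpus: three <b> chunks (two are titles), one <h> title,
# one plain paragraph.
toy_chunks = [
    ({"b"}, True),
    ({"b"}, True),
    ({"b"}, False),
    ({"h"}, True),
    (set(), False),
]
precision_b, recall_b = tag_precision_recall(toy_chunks, "b")
```

A tag with high recall but middling precision, as reported for <b>, cannot decide on its own, which is why other clues are needed.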
10. 2. Clues for Title Identification
Some helpful visual Clues :
Short sequence of words
Emphasized
Spaced from the rest of the text
[Figure callouts on an example page: emphasized ; not short ; not a title]
11. 2. Clues for Title Identification
Linguistic Clues :
Rarely contains a tensed verb
Can be a single question
Textual environment clues :
Occurs between two paragraphs of text
Occurs between a title and a paragraph of text
No single clue, but a bundle of clues
12. III. THE WHOLE PROCESS
PRE-PROCESSING : HTML cleaning, MS tagging
SEGMENTER : identification of terminal symbols (Title, Instructional Compound)
13. 1. HTML Cleaning module
Input : raw HTML code
Tags kept by the HTML Cleaning module :
Text chunks tags : <p> <div> <ol> <ul>
Subdivision tags : <br> <li>
Emphasis tags : <h> <b> <u> <i>
The output of the HTML Cleaning module is :
a list of text chunks, corresponding more or less to paragraph breaks
Their corresponding typo-dispositional structure
[Diagram: main typo-dispositional information, shown as a sequence of tags: <p> <b> <p> <li> <li> <p> <b> <p> <b> <br> <br> <p> <b> <b> <br>]
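A minimal sketch of what such a cleaning step could look like, using Python's standard html.parser. The class and function names are hypothetical, <h> is read here as the heading tags h1 to h6, and the way chunks are cut (a new chunk at each text-chunk tag) is an assumption rather than the project's actual implementation.

```python
from html.parser import HTMLParser

CHUNK_TAGS = {"p", "div", "ol", "ul"}   # start a new text chunk
SUBDIVISION_TAGS = {"br", "li"}         # subdivide the current chunk
EMPHASIS_TAGS = {"b", "u", "i", "h1", "h2", "h3", "h4", "h5", "h6"}


class ChunkCollector(HTMLParser):
    """Collects text chunks together with the tags seen inside each one."""

    def __init__(self):
        super().__init__()
        self.chunks = []   # list of {"text": str, "tags": list of str}
        self._text = []
        self._tags = []

    def _flush(self):
        # Normalise whitespace and store the chunk if it holds any text.
        text = " ".join("".join(self._text).split())
        if text:
            self.chunks.append({"text": text, "tags": self._tags})
        self._text, self._tags = [], []

    def handle_starttag(self, tag, attrs):
        if tag in CHUNK_TAGS:
            self._flush()
            self._tags.append(tag)
        elif tag in SUBDIVISION_TAGS:
            self._tags.append(tag)
            self._text.append(" ")  # keep subdivided pieces apart
        elif tag in EMPHASIS_TAGS:
            self._tags.append(tag)

    def handle_endtag(self, tag):
        if tag in CHUNK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._text.append(data)


def clean_html(html):
    """Return the list of text chunks with their typo-dispositional tags."""
    parser = ChunkCollector()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep any trailing text outside a chunk tag
    return parser.chunks
```

For example, `clean_html("<p><b>Fixing the plate</b></p>")` yields one chunk whose tags are `["p", "b"]`. All other tags are ignored, which drops most layout markup while keeping the three tag families listed above.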
14. 2. Clues Collection module
[Diagram: each chunk's STRUCTURE and TEXT ; MS tagging of the TEXT with TreeTagger gives the TAGS ; STRUCTURE and TAGS give the CLUES]
The output of the Clues Collection module is :
the list of text chunks with :
Their corresponding typo-dispositional structure
Text with tagged instructions, goals, connectors
Linguistic information :
Nb of instructions
Instructions types
Nb of goals
Nb of words
Nb of sentences
Nb of questions
Nb of tensed verbs
This information is used for :
Titles identification
Instructional compounds identification
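A minimal sketch of the clue collection for one chunk. It assumes the tagger output is available as (token, tag) pairs and that tensed verb tags start with "V" and end in D, Z or P (as in VVD, VVZ, VVP); that tagset convention, the function name and the dictionary keys are assumptions. Counting instructions and goals is left out because it depends on the instruction patterns.

```python
import re

EMPHASIS_TAGS = {"b", "u", "i", "h1", "h2", "h3", "h4", "h5", "h6"}
SUBDIVISION_TAGS = {"br", "li"}
TENSED_SUFFIXES = ("D", "Z", "P")  # assumed tagset: VVD, VVZ, VVP, VBD, ...


def collect_clues(text, tagged_tokens, structure_tags):
    """Gather surface and linguistic clues for a single text chunk."""
    # Split after sentence-final punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return {
        "nb_words": len(text.split()),
        "nb_sentences": len(sentences),
        "nb_questions": sum(s.endswith("?") for s in sentences),
        "nb_tensed_verbs": sum(
            tag.startswith("V") and tag.endswith(TENSED_SUFFIXES)
            for _, tag in tagged_tokens
        ),
        "emphasized": any(t in EMPHASIS_TAGS for t in structure_tags),
        "subdivided": any(t in SUBDIVISION_TAGS for t in structure_tags),
    }
```

Modal verbs are not counted as tensed in this sketch; whether they should be is a design choice the slides do not settle.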
15. 3. Processing each chunk : text or title ?
[Diagram: every text chunk starts with the TYPE "unknown" ; two passes then label chunks as title, text or ambiguous]
Identification of unambiguous titles :
Short chunk
spaced from the rest of the text
with emphasis
a single question
Identification of unambiguous paragraphs of text :
Long chunk
No emphasis
Subdivided
More than 1 instruction
presence of tensed verbs
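A minimal sketch of this first labelling pass, working on a dictionary of clues per chunk. The slide does not say how the clues combine, so the combination used here (a single question, or a short emphasized chunk without instruction, is a title; a long, subdivided, multi-instruction or tensed chunk is text) is an assumption, as are the key names and the 8-word threshold.

```python
def classify_chunk(clues, short_max_words=8):
    """Label a chunk 'title', 'text' or 'ambiguous' from its clues.

    short_max_words is a hypothetical threshold, not a value from the slides.
    """
    short = clues["nb_words"] <= short_max_words
    single_question = clues["nb_sentences"] == 1 and clues["nb_questions"] == 1

    # Unambiguous titles: a single question, or a short emphasized chunk
    # that does not look like an instruction.
    if single_question or (
        short and clues["emphasized"] and clues["nb_instructions"] == 0
    ):
        return "title"

    # Unambiguous paragraphs of text: long, subdivided, several
    # instructions, or containing tensed verbs.
    if (
        not short
        or clues["subdivided"]
        or clues["nb_instructions"] > 1
        or clues["nb_tensed_verbs"] > 0
    ):
        return "text"

    # Everything else is left for the environment-based pass.
    return "ambiguous"
```

Short chunks without emphasis and short instruction-like chunks fall through to "ambiguous", which matches the two ambiguous cases named on the next slide.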
16. 3. Ambiguous chunks : text or title ?
Short chunks with no
emphasis
Instruction-like short chunks
Use of textual environment clues :
1. Identify unambiguous titles/paragraphs of text
2. Disambiguate the remaining chunks
17. 3. Ambiguous chunks : text or title ?
Disambiguation using textual environment clues :
a series of ambiguous paragraphs become text
an ambiguous paragraph between two paragraphs of text becomes a title
[Diagram: labels of the TEXT CHUNKS before and after disambiguation]
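The two environment rules can be sketched as a pass over the sequence of chunk labels. The function name is hypothetical, and cases the slides do not cover (for example a single ambiguous chunk at the start of the page) are left "ambiguous" here rather than guessed.

```python
def disambiguate(labels):
    """Resolve 'ambiguous' labels from their neighbours.

    labels: list of 'title', 'text' or 'ambiguous', in page order.
    """
    result = list(labels)
    i = 0
    while i < len(labels):
        if labels[i] != "ambiguous":
            i += 1
            continue
        # Find the end of the current run of ambiguous chunks.
        j = i
        while j < len(labels) and labels[j] == "ambiguous":
            j += 1
        before = labels[i - 1] if i > 0 else None
        after = labels[j] if j < len(labels) else None
        if j - i >= 2:
            # A series of ambiguous paragraphs becomes text.
            result[i:j] = ["text"] * (j - i)
        elif before == "text" and after == "text":
            # A lone ambiguous paragraph between two texts becomes a title.
            result[i] = "title"
        i = j
    return result
```

For instance, `disambiguate(["text", "ambiguous", "text"])` returns `["text", "title", "text"]`.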
19. IV. Main issues : noise in web pages
« noise » of web pages : advertisements,
lists of links, navigation help...
interferes with compounds / title identification :
short sequence
emphasis
linguistic form : base form verb at the beginning of a sentence
typical of a title or an instruction
but it is a list of links !!
[Figure callouts on an example page: titles ; instruction ; titles]
20. IV. Main issues : refining goal/titles
identification
only sub-goal / sub-task relations are
identified
what about the hierarchy task/sub-task(s) ?
what about the head title / main goal ?
the head title is not always the 1st
identified title (noise)
sometimes there is no head title
what if the action is implicit ?
ex : the room and the bed
implicit : how to clean the room and the
bed
some ideas :
choose a title that has vocabulary in
common with instructions
identify action verbs in relation with the
nouns of the title
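The first idea, choosing the title that shares vocabulary with the instructions, can be sketched as a simple word-overlap score. The function names and the small stop-word list are hypothetical, and a real version would more likely compare lemmas from the tagger than raw word forms.

```python
import re

# Hypothetical, deliberately tiny stop-word list.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "on", "and", "or", "for"}


def content_words(text):
    """Lower-cased alphabetic words of a text, minus stop words."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def pick_head_title(titles, instructions):
    """Return the title sharing the most vocabulary with the instructions."""
    if not titles:
        return None
    instruction_vocab = set()
    for instruction in instructions:
        instruction_vocab |= content_words(instruction)
    # max() keeps the first title in case of a tie.
    return max(titles, key=lambda t: len(content_words(t) & instruction_vocab))
```

A noisy title such as a link-list header shares no words with the instructions and is passed over even when it is the first identified title.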