Table of Contents
1 INTRODUCTION..........................................................................................................3
2 HISTORY.........................................................................................................................6
2.1 Early systems...........................................................................................................6
3 NATURAL LANGUAGE PARSING...........................................................................7
3.1 Rule-Based Syntactic Parsing ...............................................................................7
3.2 Terminal Symbols...................................................................................................7
3.3 Non-terminal symbols............................................................................................7
3.4 Production Rules.....................................................................................................7
3.4.1 Grammar...........................................................................................................7
3.4.2 Parse tree...........................................................................................................8
3.4.2.1 Top down..................................................................................................8
3.4.2.2 Bottom up ..................................................................................................9
3.5 Probabilistic Parsing.............................................................................................10
3.5.1 Disambiguation..............................................................................................11
3.5.2 Training...........................................................................................................11
3.5.2.1 Treebank...................................................................................................12
3.5.2.2 Incremental learning...............................................................................12
3.6 Semantic Parsing...................................................................................................13
3.6.1 Semantic Data Models...................................................................................13
3.6.2 Case Based Reasoning...................................................................................14
3.6.3 Semantic Representation...............................................................................15
3.6.4 Actions of the Parser......................................................................................15
4 NLIDB ARCHITECTURE...........................................................................................17
4.1 Pattern-matching systems....................................................................................17
4.2 Parsing based systems..........................................................................................17
4.2.1 Semantic grammar based parsing...............................................................18
4.2.2 Translation......................................................................................................19
5 MARKET TEST.............................................................................................................23
5.1 Goals.......................................................................................................................23
5.2 Tests........................................................................................................................23
5.3 Results.....................................................................................................................23
5.3.1 Impressions.....................................................................................................23
5.3.1.1 Microsoft English Query........................................................................23
5.3.1.2 Elfsoft........................................................................................................24
5.3.2 Query results..................................................................................................25
6 FUTURE........................................................................................................................27
6.1 Language challenges............................................................................................27
6.2 Portability challenges...........................................................................................27
6.3 Competing systems...............................................................................................27
6.4 Possible avenues....................................................................................................27
6.4.1 Adaptation techniques..................................................................28
6.4.2 Speech-based techniques..............................................................................28
6.4.3 Learning algorithms......................................................................................28
6.4.3.1 User Dialogue..........................................................................................28
6.4.3.2 Neural Networks.....................................................................................29
6.4.3.3 Genetic Algorithms.................................................................................29
7 CONCLUSIONS...........................................................................................................30
8 BIBLIOGRAPHY..........................................................................................................33
9 CONTRIBUTIONS.......................................................................................................36
APPENDIX A..................................................................................................................37
Evaluating Systems....................................................................................................37
Introduction............................................................................................................37
Why is there a need?..............................................................................................37
Current Marketing.................................................................................................37
Problems .................................................................................................................37
Black box metrics...................................................................................................38
Proposed black box evaluation scheme..............................................................38
Overall Characteristics..........................................................................................38
Vocabulary..............................................................................................................38
Ease of Interaction..................................................................................................39
Accuracy based on input complexity..................................................................39
APPENDIX B..................................................................................................................40
Test Protocol...............................................................................................................40
1 INTRODUCTION
The ability to use language to convey different thoughts and feelings
differentiates human beings from animals. Natural Language Processing can be
defined as the capability of a machine to understand human language about a
particular topic fully enough that unstated assumptions and general
knowledge are taken into account. “Thus if the machine is able to achieve
this, it has come close to the notion of artificial intelligence itself”1.
One may find interacting with a foreigner who speaks no English difficult
and frustrating; a translator has to come into the picture before the two
can communicate. Companies have related this problem to extracting data from
a database management system (DBMS) such as MS Access or Oracle. A person
with no knowledge of Structured Query Language (SQL) may find himself or
herself handicapped in communicating with the database. Therefore, companies
like Microsoft and Elfsoft (English Language Frontend Software) have drawn
on Natural Language Processing to develop products that let people interact
with a database in simple English: the user simply enters queries in English
into the natural language database interface. This kind of application is
known as a Natural Language Interface to a DataBase (NLIDB).
The system works by combining syntactic knowledge with the knowledge it has
been provided about the relevant database.2 It can therefore map the natural
language input onto the structure, scope and contents of the database. The
program translates the whole query into the standard query language to
extract the relevant information from the database. Thus, these products
have created a revolution in extracting information from databases: they
discard the fuss of learning SQL and save the time that would otherwise be
spent learning that query language.
This report will look at the performance of each database interface
connected to a standard database. The Northwind database has been chosen as
the default database to work on. Several companies offer such products in
the market. Our group found several of them, including English Query,
Elfsoft, EasyAsk and NLBean, the last created by Mr Mark Watson. We asked
these companies for permission to test their products for our research. We
received positive responses from Elfsoft and NLBean, but had to settle for
tests on Microsoft English Query and Elfsoft only. We also contacted EasyAsk
via email, but the company provided minimal assistance in our research.
1 Manas Tungare
2 Manas Tungare
In order to produce accurate conclusions on the different interpretations of
each product, we have listed over thirty questions to test the products.
Each product will be asked the same questions in the same order. The
questions have been carefully planned to probe the strengths and weaknesses
of each product.
These questions include:
• Listing the specific columns and rows
• Counting
• Calculations
• Cross-referencing across more than one table
• Ordinal positions
• Follow-ups
• Conclusions
• Semantics
• Grammar mistakes
• Spelling mistakes
• Out-of-context questions
There are three components in a natural language dialog system: analysis,
evaluation and generation.3 The analysis component translates the query as
entered by the user into a semantic representation, transcribed in the
knowledge representation language. There may be several communication
sessions between the natural language access system, the user interface
system and the user in order to carry out the action and derive the result.
The evaluation component allows information to be absorbed by the dialog
system when queries have to be satisfied or the system needs to alert the
user about any major state changes. The generation component gathers the
information that the user wants to see, as specified in the query, and
generates text, graphs, a query or any other response according to the
situational context of the query.4
The knowledge-based database assistant (KDA) is a practical development of
an intelligent database front-end that assists novice users in retrieving
desirable information from an unfamiliar database system.5 This component
exists in both Microsoft English Query and Elfsoft. It directs the novice
user to the relevant results by helping them enter an accurate query, or by
prompting the user when insufficient information has been entered to get the
appropriate answer. This component can be seen at work in both programs
later in this report.
3 Dialog-Oriented Use of Natural Language
4 Dialog-Oriented Use of Natural Language
5 Manas Tungare
In addition, “the KDA's responding functionality, which could change the
user's knowledge state, is called query guidance”.6 It can detect a user’s
scope of knowledge about the relevant database by studying the query entered
by the user. If it senses that the user has limited awareness of the
database and could not retrieve his or her desired answer, the query
guidance jumps into action and provides similar queries that allow the user
to gather the appropriate facts from the database, or presents the most
relevant query based on the user’s perceived intention. Such a component
allows novices to get familiar with the database quickly and lets them learn
about its scope from the prompt messages and the queries generated by the
KDA, without the expense of studying the massive databases stored in most
organizations.
6 Manas Tungare
2 HISTORY
As the use of databases for data storage spread during the 1970s, the user
interface to these systems represented a burden for designers worldwide. At
this point, both the relational database model and the SQL interface
language had yet to be developed, which meant that the task of inserting and
querying data was tedious and difficult.
It was therefore a logical step for programmers to attempt to develop more
user-friendly and “human” interfaces to the databases. One of these
approaches was the use of natural language processing, where the user would
be allowed to interrogate the stored information interactively.
2.1 Early systems
The most well-known historical natural language database interface systems are:
• LUNAR, interfacing a database with information on rocks collected
during American moon expeditions. It was originally published in 1972.
When evaluated in 1977, it answered 78 % of questions correctly. Based on
syntactic parsing, it tended to build several parse trees for the same query,
and was deemed inefficient7, too domain-specific and inflexible.
• LADDER, the first semantic grammar-based system, interfacing a
database with information on US Navy ships.
• CHAT-80, probably the most famous example. It interfaced a database of
world geography facts. The entire application (both the database and the
user interface) was developed in Prolog. As the source code was freely
distributed, it is still used and cited, and an online version remains
available8.
7 Hafner, C. D. and Gooden, K. pp 141-164
8 ECL I Vertiefung: natürlichsprachliche Zugangssysteme: chat80
3 NATURAL LANGUAGE PARSING
3.1 Rule-Based Syntactic Parsing
Syntax means ways that words can fit together to form higher-level units such as
phrases, clauses, and sentences. Therefore syntactically driven parsing means
interpretations of larger groups of words are built up out of the interpretation of
their syntactic constituent words or phrases. In a way this is the opposite of
pattern matching as here the interpretation of the input is done as a whole.
Syntactic analyses are obtained by application of a grammar that determines
what sentences are legal in the language that is being parsed.
Syntactic parsing operates through the translation of the natural language query
into a parse tree, which is then converted to a SQL query. There are a number of
fundamental concepts in the theory of syntactic parsing.
3.2 Terminal Symbols
A terminal symbol is the basic building block of the language, i.e. words and
delimiters. Together, the set of terminal symbols form the “dictionary of words”9
recognised by the system, i.e. the range of the vocabulary that it can read and
interpret.
3.3 Non-terminal symbols
Non-terminal symbols are higher-level language terms describing concepts and
connections in the syntax of the language. Examples of non-terminal symbols
include sentence, noun phrase, verb phrase, noun, and verb.
3.4 Production Rules
As the query is analysed, a number of production rules fire to identify and
classify the context of each word read. In analogy with a production system (such
as the one used in PROLOG), a production rule in a context-free grammar10
converts a left-hand non-terminal symbol to a sequence of symbols, which can be
either terminal or non-terminal. Examples of production rules:
• Sentence := Noun phrase verb phrase
• Verb phrase := verb
These rules are also commonly referred to as rewrite rules.
3.4.1 Grammar
The combination of the set of terminal symbols, set of non-terminal symbols, the
production rules and an assigned start symbol (the highest-level construct in the
system, usually sentence) form the grammar of the syntax. The role of the
grammar is to define:
• What category each word belongs to;
9 Luger, G.F. and Stubblefield, W.A.
10 This paper will be restricted to the treatment of context-free grammars
and will not deal with the more complex set of syntaxes known as
context-sensitive.
• What expressions are legal and syntactically correct;
• How sentences are generated.
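Concretely, such a grammar can be written down as a small data structure.
The following Python sketch (names and layout are ours, purely illustrative)
encodes the terminal symbols, non-terminal symbols, production rules and
start symbol for a tiny fragment of English:

```python
# A tiny context-free grammar for sentences like "the girl forgot the boy".
# Keys are non-terminal symbols; each maps to its possible right-hand sides.
GRAMMAR = {
    "Sentence":    [["NounPhrase", "VerbPhrase"]],
    "NounPhrase":  [["Det", "Noun"]],
    "VerbPhrase":  [["Verb", "NounPhrase"], ["Verb"]],
    "Det":         [["the"]],
    "Noun":        [["girl"], ["boy"]],
    "Verb":        [["forgot"]],
}
START = "Sentence"                             # the assigned start symbol
NONTERMINALS = set(GRAMMAR)                    # higher-level language terms
TERMINALS = {"the", "girl", "boy", "forgot"}   # the "dictionary of words"
```

Everything the grammar has to define (which category each word belongs to,
which expressions are legal, and how sentences are generated) is contained in
these three sets plus the start symbol.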
3.4.2 Parse tree
The system analyses the sentence by reading the terminal symbols in order
and identifying which production rule to fire. As it does so, it gradually
builds a representation of the sentence referred to as a parse tree. The
term has been coined from the tree-like graph that is produced, where the
root is the top-level symbol (e.g. sentence), the children of each node are
the symbols on the right-hand side of the fired rule and the leaves are the
terminal symbols (the words). The parse tree can be built in two
fundamentally different ways.
3.4.2.1 Top down
A top down parser starts at the root and gradually builds the tree downwards by
matching the read terminal symbols with symbols on the right-hand side of
possible production rules. Terminal or non-terminal symbols on the right hand
side are added at the level below the current symbol.
This is similar to the goal-driven approach of a production system. The basic
architecture of a top down parser is illustrated in figure 1.
Figure 1 Top down parsing of the sentence "the girl forgot the boy"11
In many situations, the first token alone does not provide enough information to
make the decision on what production rule should be fired. In order to overcome
this, there are two basic methods.
3.4.2.1.1 Recursive Descent
The system starts by firing the first of the candidate production rules that
the given terminal symbol could fit, and builds the initial sub tree from
this information. If this choice further down the tree results in an
inconsistency or syntactic error, the parser reverts to the point where the
decision was made, removes all the nodes on the way back up and selects
another of the possible productions. This procedure is very similar to
depth-first searching and backtracking in production systems.
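A recursive-descent parser with backtracking can be sketched in a few lines.
The toy grammar and code below are illustrative only, not taken from any of
the systems discussed; the parser tries each candidate production in turn
and abandons a sub tree as soon as it fails to match:

```python
# Toy grammar: non-terminals map to alternative right-hand sides;
# anything not in the table is a terminal symbol (a word).
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"], ["V"]],
    "Det": [["the"]], "N": [["girl"], ["boy"]], "V": [["forgot"]],
}

def parse(symbol, tokens, pos):
    """Try to derive `symbol` from tokens[pos:]; return (tree, new_pos) or None."""
    if symbol not in GRAMMAR:                    # terminal symbol
        if pos < len(tokens) and tokens[pos] == symbol:
            return symbol, pos + 1
        return None                              # mismatch: caller backtracks
    for production in GRAMMAR[symbol]:           # depth-first over alternatives
        children, p = [], pos
        for rhs_symbol in production:
            result = parse(rhs_symbol, tokens, p)
            if result is None:
                break                            # discard this sub tree
            subtree, p = result
            children.append(subtree)
        else:                                    # whole right-hand side matched
            return (symbol, children), p
    return None                                  # no production fits

tree, end = parse("S", "the girl forgot the boy".split(), 0)
```

A caller would also check that `end` equals the sentence length, i.e. that
the whole input was consumed. Note how the inner loop's `break` implements
the removal of nodes on the way back up: the partially built `children` list
is simply discarded before the next alternative is tried.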
3.4.2.1.2 Look Ahead
A look ahead system is not content with reading just one token. Rather, it
reads as many tokens as are necessary to identify the given right-hand side
beyond any ambiguity before firing any production rules.
Grammars are characterised by the maximum number of terminal symbols that
must be read before all possible conflicts in the choice of production rule
can be resolved. If this number is k, the grammar is referred to as an
LL(k) grammar12. The look ahead procedure is more analogous to a
breadth-first search technique.
3.4.2.2 Bottom up
A bottom up parser, on the other hand, works from the leaves upward by
“tagging” the tokens, i.e. starting from the right-hand sides of the
production rules and associating each word read with its category. When a
full right-hand side has been identified, the production rule fires and the
left-hand side non-terminal symbol is added as a branch in the level above.
This methodology corresponds to the data-driven technique of production
systems. The bottom up parsing technique is illustrated in figure 2.
11 Dougherty, R.C.
12 Eriksson, G.
Figure 2 Bottom up parsing of the sentence "the girl forgot the boy"13
In some cases, the sentence is ambiguous in itself and multiple production
rules match it, in which case the parser has to choose between the potential
interpretations. One strategy for dealing with these situations is referred
to as probabilistic parsing.
3.5 Probabilistic Parsing
Probabilistic parsing takes an empirical approach to the difficult task of
disambiguation, i.e. identifying which of several mutually exclusive alternate
syntactic parse trees should be generated.
13 Dougherty, R.C.
For example, consider the sentence “One morning I shot an elephant in my
pyjamas”14. There are two possible syntactic parses for this sentence15. One
implies that the person was wearing the pyjamas, while the opposing view
would claim that the elephant was wearing them (hence the joke). Although
the selection between these two interpretations is obvious to a human, how is
this knowledge automated in a computer?
One option, used in so-called attribute grammars, is to encode information
for each verb as a parameter to each production rule. However, as the
dictionary grows, this approach may become too selective and require every
different case to be specifically added to the production rules.
Probabilistic parsing, on the other hand, works by augmenting the rules with
assigned probabilities, representing the chance of the particular expansion
(production rule) being the correct one.
For example, a probabilistic grammar would introduce the following
enhancements to the possible regular syntactic production rules for the
expansion of the non-terminal symbol sentence [15]:
• Sentence:= Nounphrase Verbphrase, P = 0.8
• Sentence:= Auxiliary Nounphrase Verbphrase, P = 0.15
• Sentence:= Verbphrase, P = 0.05
Note that the probabilities for the expansions of any given non-terminal symbol
always add up to 1.
3.5.1 Disambiguation
How does probabilistic parsing choose a parse tree from two possible
interpretations? In most systems, it simply computes, for each competing
parse, the product of the probabilities of all the productions required to
build it, and selects the parse with the highest product.
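The computation can be sketched as follows; the production probabilities are
invented for illustration and do not come from any real grammar:

```python
from functools import reduce
from operator import mul

def parse_probability(rule_probs):
    # The probability of a parse tree is the product of the probabilities
    # of every production rule used to derive it.
    return reduce(mul, rule_probs, 1.0)

def disambiguate(parses):
    # parses: list of (label, [rule probabilities]) for the competing trees.
    return max(parses, key=lambda parse: parse_probability(parse[1]))

# Two hypothetical competing parses of the elephant sentence, each listed
# with the probabilities of the productions its tree uses:
competing = [
    ("PP attaches to the verb phrase", [0.8, 0.3, 0.6, 0.05]),
    ("PP attaches to the noun phrase", [0.8, 0.3, 0.2, 0.05]),
]
best = disambiguate(competing)
```

With these made-up numbers the verb-phrase attachment wins, because its
product of rule probabilities is the larger of the two.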
3.5.2 Training
One important task concerns how to set the probabilities. There are two
fundamentally different techniques for this task [15].
14 Groucho Marx
15 Jurafsky, D. & Martin, J.
3.5.2.1 Treebank
A large database of sentences with their correct parses (parsed by knowledgeable
humans) is entered into the system. The respective probabilities are then
calculated as the relative frequencies of each possible parse. For more details, see
[15].
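The relative-frequency calculation itself is straightforward. In the sketch
below the "treebank" is a tiny invented list of the productions observed in
the human-made parses; each rule's probability is its count divided by the
total count for the same left-hand symbol:

```python
from collections import Counter

# Productions observed in a (tiny, hypothetical) treebank of correct parses.
observed = [
    ("Sentence", ("NounPhrase", "VerbPhrase")),
    ("Sentence", ("NounPhrase", "VerbPhrase")),
    ("Sentence", ("NounPhrase", "VerbPhrase")),
    ("Sentence", ("Aux", "NounPhrase", "VerbPhrase")),
    ("VerbPhrase", ("Verb", "NounPhrase")),
    ("VerbPhrase", ("Verb",)),
]

rule_counts = Counter(observed)
lhs_totals = Counter(lhs for lhs, _ in observed)

# P(rule) = count(rule) / count(all expansions of the same non-terminal),
# so the probabilities for each non-terminal sum to 1.
probabilities = {rule: count / lhs_totals[rule[0]]
                 for rule, count in rule_counts.items()}
```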
The largest known treebank is the Penn Treebank16. The latest version,
Treebank 3, contains parses of17:
• One million words of 1989 Wall Street Journal material;
• A small sample of ATIS-3 transcripts. The Air Travel Information Service
is a joint project of DARPA (Defence Advanced Research Projects Agency)
and SRI International, handling voice-based queries and requests about
flights. More information can be found at18;
• A fully parsed tagged version of the Brown Corpus, consisting of one
million words from 500 different sources (novels, academic books,
newspapers, non-fiction books etc. [15]);
• Parsed and tagged text from a set of 560 transcripts of telephone
conversations (a.k.a. the Switchboard-1 corpus).
This is a widely used “training set” (in analogy with an artificial neural
network), enabling the parser to learn what classes of speech a given word
can belong to and how frequently a particular expression is to be
interpreted in different ways.
3.5.2.2 Incremental learning
The other technique is a “trial and error” method, in which the parsing
system, much like an artificial neural network, learns as it is used.
The initial probabilities can be assigned randomly or by the user. After
that, the system adjusts these probabilities according to the following
rules [15]:
• If the sentence was unambiguous, the count of its parse is increased by 1,
i.e. pi := pi + 1;
• If the sentence was ambiguous, each of the possible parses has its count
incremented by its respective probability, i.e. pi := pi + P(pi).
The algorithm for this computation is referred to as the Inside-Outside
Algorithm. It was originally proposed in19 and is described in detail in20.
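The bookkeeping behind these update rules might look like the sketch below
(the Inside-Outside Algorithm itself, which computes the parse probabilities
efficiently, is not shown, and the rule names are hypothetical):

```python
# Running counts per production rule, initialised by the user (or randomly).
counts = {("S", ("NP", "VP")): 1.0, ("S", ("VP",)): 1.0}

def update(counts, parses):
    """parses: [(rules_used, parse_probability)] for one observed sentence."""
    if len(parses) == 1:                  # unambiguous: full count of 1
        for rule in parses[0][0]:
            counts[rule] += 1.0
    else:                                 # ambiguous: fractional counts
        for rules_used, prob in parses:
            for rule in rules_used:
                counts[rule] += prob

def probabilities(counts, lhs):
    # Renormalise one non-terminal's expansion counts into probabilities.
    total = sum(n for rule, n in counts.items() if rule[0] == lhs)
    return {rule: n / total for rule, n in counts.items() if rule[0] == lhs}

# An unambiguous sentence parsed with S -> NP VP:
update(counts, [([("S", ("NP", "VP"))], 1.0)])
```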
16 Penn Treebank Project.
17 Quoted by the LDC office of the University of Pennsylvania in an email
dated 10/7-2001.
18 Language Reference
19 Baker, J.K. pp. 547-550.
20 Manning, C.D. and Schutze, H.
3.6 Semantic Parsing
The syntactic structure of a sentence is not enough to express its meaning. For
instance, the noun phrase the catch can have different meanings depending on
whether one is talking about a baseball game or a fishing expedition. To talk
about different possible readings of the phrase the catch, one therefore has to
define each specific sense of the phrase. The representation of the context-
independent meaning of a sentence is called its logical form.21 Natural language
analysis based on semantic grammar is similar to syntactically driven parsing
except that in semantic grammar the categories used are defined semantically.
Database items can be ambiguous when the same item is listed under more than
one attribute. For example, the term “Mississippi” is ambiguous between being a
river name or a state name, in other words, two different logical forms. The two
different meanings have to be represented distinctly for an interpretation of a
user query.
3.6.1 Semantic Data Models
Semantic data models (SDM) are widely researched in the database community.
They are closely related to semantic networks used in artificial intelligence,
which were originally developed to support natural language processing. Hence,
as database management systems they are capable of supporting large amounts
of information, while still offering the potential of advanced inferencing
capabilities including NLP, machine learning, and query processing.
“SDMs can be seen as formalising many of the relationships expressed in an
ad hoc manner in conventional hypermedia systems.”22 SDMs support a variety of
formalised links and relationships. An example of a small network on insects is
shown in figure 3. The links in this graph express generalisation relationships or
"ISA" (beneficial insect IS-A insect), part/whole (Abdomen is part of an Insect),
association (Ladybugs eat Aphids), and class/instance (Ladybug is an
instance of Beneficial Insect).23
21 Tang, R. L. p5
22 Beck, H., Mobini, A., Kadambari, V
23 Beck, H., Mobini, A., Kadambari, V
Figure 3 Semantic Data Model describing insects24
In figure 3, solid lines are ISA relationships, diamonds are part/whole,
circles are associations, and instances are underlined.
Since concepts in SDMs are described by structured graphs expressing the
relationships among symbols, rather than connections between text files as
in conventional hypertext, SDMs can be manipulated to produce a number of
desirable functions. Foremost is search, or query processing. [8] suggests
query processing based on graph matching techniques, by which the query is
expressed as a small semantic network. This query graph is then matched
against the larger database graph to find connections. This gives a much
more precise search capability than is possible with Boolean keyword
searches over text files.
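As a rough sketch of this idea, the database graph can be stored as
subject-relation-object triples and the query expressed as a small graph of
the same shape; the relation names below follow figure 3, but the matching
code is ours and heavily simplified:

```python
# The semantic network as subject-relation-object triples (after figure 3).
database = {
    ("beneficial insect", "isa", "insect"),
    ("ladybug", "instance_of", "beneficial insect"),
    ("abdomen", "part_of", "insect"),
    ("ladybug", "eats", "aphid"),
}

def match(query_graph, db):
    """Match a query graph whose unknown node is written '?x'; return bindings."""
    bindings = []
    for qs, qr, qo in query_graph:
        for s, r, o in db:
            if qr != r:
                continue                     # relation labels must agree
            if qs in ("?x", s) and qo in ("?x", o):
                bindings.append(s if qs == "?x" else o)
    return bindings

# "What eats aphids?" expressed as a one-edge query graph:
answers = match([("?x", "eats", "aphid")], database)
```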
3.6.2 Case Based Reasoning
In order to construct an NLP system, one must construct a large dictionary.
Much of the recent progress in text understanding systems can be attributed
to advances in the design and construction of large lexicons. But that
presupposes that word meaning is easily represented. A case-based reasoning
approach to meaning takes a different view: words obtain meaning from how
they are used. A particular word is used in many different situations and
contexts, and each occurrence of the word is treated as one case.
Similarities among cases can be observed, and cases with similar usage can
be clustered together into categories. When a word is used in a new
situation, similar cases are retrieved from the case-based memory in order
to apply what happened before to the new context. The meaning of a
particular word is established by a large case base, and thus a single word
may be "worth 1,000 cases".25
24 http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/beck/fig1.gif
25 Beck, H., Mobini, A., Kadambari, V.
3.6.3 Semantic Representation
The most basic constructs of the representation language are the terms used to
describe objects in the database and the basic relations between them. Database
objects bear relationships to each other or can be related to other objects of
interest to a user who is requesting information from it. For instance, in a user
query like “What is the capital of Texas?”, the data of interest is a city with certain
relationship to a state called Texas, or more precisely its capital. The capital/2
relation, or predicate, is therefore defined to handle questions that require them.
Predicate         Description
city(C)           C is a city
capital(S,C)      C is the capital of S
density(S,D)      D is the population density of state S
loc(X,Y)          X is located in Y
len(R,L)          L is the length of river R
next_to(S1,S2)    State S1 borders S2
traverse(R,S)     River R traverses state S
Table 1 Sample of predicates26
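Rendering a few of these predicates as a simple fact base makes their use
concrete; the data values below are invented for illustration:

```python
# A handful of facts for the predicates of table 1 (illustrative data only).
facts = {
    "city": [("austin",), ("houston",)],
    "capital": [("texas", "austin")],
    "next_to": [("texas", "oklahoma")],
}

def query(predicate, pattern):
    """Return the tuples matching `pattern`; None marks a free variable."""
    return [row for row in facts[predicate]
            if all(p is None or p == v for p, v in zip(pattern, row))]

# "What is the capital of Texas?" asks for the C in capital(texas, C):
result = query("capital", ("texas", None))
```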
3.6.4 Actions of the Parser
We will discuss the workings of the parser using the parser actions of
CHILL [8], which performs shift-reduce parsing. The parser actions are
generated from templates given by a logical query; an action template is
instantiated to form a specific parsing action. Recall that the parser also
requires a lexicon to interpret the meaning of phrases as specific logical
forms. Consider the following example27:
Sentence: What is the capital of Texas?
Logical Query: answer(C,(capital(C,S),const(S,stateid(Texas)))).
A very simple lexicon will map ‘capital’ to ‘capital(_,_)’ and ‘Texas’ to
‘const(_,stateid(texas))’. The parser begins with an initial stack and a buffer
holding the input sentence, which is the initial parse state. Each predicate on the
parse stack has an attached buffer to hold the context in which it was introduced.
Words from the input sentence are shifted onto the stack buffer during parsing.
The initial parse state is as follows:
Parse Stack: [answer(_,_):[]]
Input Buffer: [what,is,the,capital,of,texas,?]
26 Lappoon R. T. p6
27 Tang, R.L.
Since the first three words in the input buffer do not map to any logical
forms, the next sequence of steps pushes the three words from the input
buffer onto the parse stack. The process has the following result:
Parse Stack: [answer(_,_):[the,is,what]]
Input Buffer: [capital,of,Texas,?]
Now, ‘capital’ is at the head of the input buffer and is mapped to ‘capital(_,_)’ in
the lexicon. The next action is to push this logical form onto the parse
stack. The resulting parse state is as follows:
Parse Stack: [capital(_,_):[],answer(_,_):[the,is,what]]
Input Buffer: [capital,of,Texas,?]
The parser then binds two arguments of two different logical forms to the same
variable, resulting in the following parse state:
Parse Stack: [capital(C,_):[],answer(C,_):[the,is,what]]
Input Buffer: [capital,of,Texas,?]
The sequence repeats itself, producing the parse state:
Parse Stack: [const(S,stateid(Texas)):[?,Texas],capital(C,S):[of,capital],
answer(C,_):[the,is,what]]
Input Buffer: []
The final step is to take the logical form on the parse stack and put it into one of
the arguments of the meta-predicate resulting in:
Parse Stack: [answer(C,(capital(C,S), const(S,stateid(Texas)))):
[?,Texas,of,capital,the,is,what]]
Input Buffer: []
As this is the final parse state, the logical query is then constructed from the
parse stack.
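The whole trace can be reproduced by a heavily simplified sketch. Here the
lexicon entries are written with their variables already bound (the real
CHILL parser binds arguments through separate actions) and unification is
reduced to string substitution:

```python
# Hypothetical pre-bound lexicon: words that introduce logical forms.
LEXICON = {
    "capital": "capital(C,S)",
    "texas": "const(S,stateid(texas))",
}

def parse(sentence):
    # The parse stack: each entry is [logical form, context buffer].
    stack = [["answer(C,_)", []]]                 # the meta-predicate
    for token in sentence.lower().split():
        word = token.rstrip("?") or "?"           # treat '?' as its own token
        if word in LEXICON:
            stack.append([LEXICON[word], []])     # push the logical form
        else:
            stack[-1][1].insert(0, word)          # shift word into top buffer
    # Final step: fold the stacked forms into the meta-predicate's argument.
    body = ",".join(form for form, _ in stack[1:])
    return stack[0][0].replace("_", "(%s)" % body)

logical_query = parse("what is the capital of texas ?")
```

`logical_query` comes out as
`answer(C,(capital(C,S),const(S,stateid(texas))))`, matching the final parse
state in the trace above (modulo letter case).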
4 NLIDB ARCHITECTURE
4.1 Pattern-matching systems
The first NLIDBs were based on pattern-matching techniques. As a simple
illustration of the pattern-matching technique, consider the following database:
Countries_Table
Country Capital Language
France Paris French
Italy Rome Italian
… … …
Table 2 Sample Database Table28
A primitive pattern-matching system, according to [8], may use rules such as:
Pattern: … ”capital” … <country>
Action: Report CAPITAL of row where COUNTRY = <country>
Pattern: … “capital” … “country”
Action: Report CAPITAL and COUNTRY of each row
If the user asked “What is the capital of France?”, using the first pattern rule the
system would report “Paris”. The system would also use the same rule to handle
questions such as “Print the capital of Italy”, “Could you please tell me what is the
capital of France?” etc.
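The rule above can be sketched directly in Python. The in-memory table and the rule encoding below are invented for illustration; a real pattern-matching NLIDB would hold many such rules and query the DBMS for the action:

```python
# Toy version of the rule: Pattern ... "capital" ... <country>
# Action: report CAPITAL of the row where COUNTRY = <country>.
# Table contents and rule encoding are illustrative assumptions.

COUNTRIES_TABLE = [
    {"Country": "France", "Capital": "Paris", "Language": "French"},
    {"Country": "Italy", "Capital": "Rome", "Language": "Italian"},
]

def answer(question):
    q = question.lower()
    if "capital" in q:
        for row in COUNTRIES_TABLE:
            if row["Country"].lower() in q:
                return row["Capital"]
    return None

print(answer("What is the capital of France?"))                          # Paris
print(answer("Could you please tell me what is the capital of Italy?"))  # Rome
```

Because the rule only scans for keywords, any phrasing that happens to contain "capital" and a country name is handled, which is exactly the strength and the weakness discussed next.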
The main advantages of this approach are its simplicity and the fact that it
requires no complicated parsing or interpretation modules, which makes it easy
to implement. However, the shallowness of the approach often leads to bad
failures. For example, when a pattern-matching NLIDB was asked "TITLES OF
EMPLOYEES IN LOS ANGELES", the system reported the state where each employee
worked, interpreting "IN" as the postal abbreviation for Indiana and assuming
that the question was about employees and states.29
4.2 Parsing based systems
In general, as [8] suggests, the system architecture of many NLIDBs can be seen
as consisting of two major modules. The first module handles the natural
language: a question is submitted and successively transformed, and at the end
of this process one or more intermediate logical query expressions are
obtained. Given the size of the domain and the flexibility of natural
language, there usually exist several interpretations of the same question. The
28 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P., p. 14
29 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P., pp. 14-15
second component is in charge of the connection with the database, translating
the expressions to structured query language (SQL) expressions (using mapping)
and sending them to the Data Base Management System (DBMS) to produce the
answers.30
For a graphical explanation of the structure, examine Figure 4.
Figure 4 NLIDB Architecture31
As described in the previous section, the source language sentence is first parsed,
producing a parse tree. The two methods often found of parsing are the syntax
based and semantic grammar based.
4.2.1 Semantic grammar based parsing
Using this technique, the grammar’s categories do not necessarily correspond to
syntactic concepts. Examine the following figure:
30 Reis, P., Matias, J. and Mamede, N., pp. 3-4
31 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P., p. 18
Figure 5 Semantic grammar based parsing tree32
Notice that some categories of the grammar (e.g.: Substance, Magnesium,
Specimen_question) do not correspond to syntactic constituents (e.g.: Noun-
Phrase, Noun, Sentence). This is because the semantic information about the
knowledge domain (e.g.: a question may either refer to specimens or spacecraft)
is hard-wired into the semantic grammar.33
Because the semantic grammar approach contains hard-wired knowledge about a
specific knowledge domain, it is very difficult to transfer to other knowledge
domains: a new semantic grammar has to be written whenever the NLIDB is
configured for a new knowledge domain.34
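To make the idea concrete, here is a toy semantic grammar in the spirit of Figure 5, with categories such as Specimen_question and Substance standing for domain concepts rather than syntactic ones. The rules and vocabulary are invented for illustration:

```python
# Toy semantic grammar: the category names encode domain knowledge
# (specimens, substances), not syntactic roles. Rules are invented.

GRAMMAR = {
    "Specimen_question": [["which", "Specimen", "contains", "Substance"]],
    "Specimen": [["rock"], ["sample"]],
    "Substance": [["magnesium"], ["silicon"]],
}

def matches(symbol, words):
    """True if the word list can be derived from the given symbol."""
    if symbol not in GRAMMAR:                 # terminal symbol
        return words == [symbol]
    return any(split_matches(rhs, words) for rhs in GRAMMAR[symbol])

def split_matches(rhs, words):
    """True if `words` can be split to match the right-hand-side symbols."""
    if not rhs:
        return not words
    return any(
        matches(rhs[0], words[:i]) and split_matches(rhs[1:], words[i:])
        for i in range(len(words) + 1)
    )

print(matches("Specimen_question", "which rock contains magnesium".split()))
print(matches("Specimen_question", "which spacecraft contains magnesium".split()))
```

The second question fails not for syntactic reasons but because "spacecraft" is not listed under the domain category Specimen, which is exactly the hard-wiring discussed above.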
4.2.2 Translation
The translation is usually based on several mapping tables. Figure 6 illustrates
this process for both the addition of new information based on an input sentence
and the processing of a related query. The query is represented by a small graph,
which initiates the mapping to the semantic hierarchy. The small graph is
mapped to the semantic network by creating a link from each node in the smaller
graph to the corresponding nodes in the network starting with the most general
concept (the root) and ending with the most specific. This will create a unique
instance, which is the intersection of all of the nodes involved in the query and
may be used to narrow down a neighbourhood based on the requested
information. 35
The mapping process is bounded by rules, and completely based on the
information of the parse tree. As an example of mapping rules, consider the
previous query of “which rock contains magnesium” taken from [1]:
• The mapping of “which” is for_every X.
32 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P., p. 17
33 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P., p. 17
34 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P., p. 17
35 Beck, H., Mobini, A., Kadambari, V. [online]
• The mapping of “rock” is (is_rock X).
• The mapping of an NP is Det’ N’, where Det’ and N’ are the mappings of
the determiner and the noun respectively. Thus resulting in for_every X
(is_rock X).
• The mapping of “contains” is contains.
• The mapping of “magnesium” is magnesium.
• The mapping of a VP is (V’ X N’), where V’ and N’ are the mappings of the
verb and the noun respectively. Thus resulting in (contains X magnesium).
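Composing these word-level mappings mechanically yields the full logic form. The string encoding of logical forms below is an assumption made for illustration:

```python
# Composing the mappings listed above into a logic form for
# "which rock contains magnesium". The string encoding is an assumption.

WORD_MAP = {
    "which": "for_every X",
    "rock": "(is_rock X)",
    "contains": "contains",
    "magnesium": "magnesium",
}

def map_np(det, noun):
    # NP -> Det' N'
    return f"{WORD_MAP[det]} {WORD_MAP[noun]}"

def map_vp(verb, obj):
    # VP -> (V' X N')
    return f"({WORD_MAP[verb]} X {WORD_MAP[obj]})"

query = f"{map_np('which', 'rock')} {map_vp('contains', 'magnesium')}"
print(query)  # for_every X (is_rock X) (contains X magnesium)
```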
Figure 6 Mapping and Query Processing Model36
Figure 7 demonstrates a user query on how John spent his leisure time and
displays how the answer to the query is produced by exploiting the
relationship between "spending leisure time" and "having a chance to go fishing"
(both are "doing").
Figure 7 Query processing model37
In many systems the syntax rules linking non-leaf nodes and the semantic rules
are domain independent, and can be used in any application domain. The
information describing the possible words (leaf nodes) and the logic expressions
is domain dependent and has to be declared in the lexicon.38
As an example, consider the lexicon used in MASQUE [8] listing the possible
words, “capital”, “capitals”, “border”, “borders”, “bordering”, “bordered”.
• The logic expression of “capital”, “capitals” could be
capital_of(Capital,Country).
• The logic expression of “border”, “borders”, “bordering”, “bordered” could be
borders(Country1,Country2).
• The logic expression of “country” could be is_country(Country).
36 http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/beck/fig2.gif
37 http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/beck/fig3.gif
38 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P., p. 19
Then the question, “What is the capital of each country bordering Greece?” would be
mapped to this query:
answer([Capital, Country]) :-
    is_country(Country),
    borders(Country, greece),
    capital_of(Capital, Country).
The meaning of the logic query above is to find all pairs [Capital, Country], such that
Country is a country, Country borders Greece, and Capital is the capital of Country.
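Evaluating such a conjunctive query amounts to finding all variable bindings that satisfy every goal. Here is a sketch against a tiny hand-made fact base; the facts themselves are invented for illustration:

```python
# Evaluating the conjunctive query above against an invented fact base.

is_country = {"Greece", "Bulgaria", "Albania", "Turkey"}
borders = {("Bulgaria", "Greece"), ("Albania", "Greece"), ("Turkey", "Greece")}
capital_of = {"Bulgaria": "Sofia", "Albania": "Tirana",
              "Turkey": "Ankara", "Greece": "Athens"}

def answer():
    # All [Capital, Country] pairs satisfying the three goals in turn:
    # is_country(Country), borders(Country, Greece), capital_of(Capital, Country).
    return sorted(
        [capital_of[country], country]
        for country in is_country
        if (country, "Greece") in borders
    )

print(answer())
# [['Ankara', 'Turkey'], ['Sofia', 'Bulgaria'], ['Tirana', 'Albania']]
```

A Prolog engine performs essentially this search by backtracking over the goals, rather than by explicit iteration.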
The interpreter also needs to consult a world model that describes the structure of
the surrounding world as shown by the figure below. Typically, the model
contains a hierarchy of classes of world objects, and constraints on the types of
arguments each logic predicate may have.39
Figure 8 Hierarchy in world model40
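The type constraints mentioned above can be checked mechanically against the class hierarchy. The hierarchy and the predicate signatures below are assumptions made for illustration:

```python
# Checking predicate argument types against a world-model class hierarchy.
# The hierarchy and signatures are invented for illustration.

PARENT = {"country": "entity", "city": "entity", "capital_city": "city"}
SIGNATURES = {"capital_of": ("city", "country"),
              "borders": ("country", "country")}

def is_a(cls, ancestor):
    # Walk up the hierarchy from cls to the root.
    while cls is not None:
        if cls == ancestor:
            return True
        cls = PARENT.get(cls)
    return False

def well_typed(predicate, argument_types):
    expected = SIGNATURES[predicate]
    return all(is_a(a, e) for a, e in zip(argument_types, expected))

print(well_typed("capital_of", ("capital_city", "country")))  # True
print(well_typed("borders", ("city", "country")))             # False
```

This is how the interpreter can reject nonsensical readings (a city does not border a country in this model) before any SQL is generated.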
39 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P., pp. 18-19
40 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P., p. 19
5 MARKET TEST
In order to get a good estimate of the current state of the technology, the
applications presented in the previous chapter were subjected to a neutral test.
5.1 Goals
The goals of the tests were:
• To get a thorough understanding of contemporary market applications;
• To get an estimate of the relevance and importance of this type of system;
• To get some insight into what features are more and less important.
5.2 Tests
The tests were carried out on the Northwind database, a sample database with
information on a shipping company. The database comes as a demo with all
distributed copies of Microsoft Access.
A number of different queries of different types were posed to the respective
natural language front ends. The questions were classified as simple (S), average
(A), or complex (C).
For a more comprehensive explanation of the considerations behind the testing
procedures, see Appendix A.
5.3 Results
5.3.1 Impressions
5.3.1.1 Microsoft English Query
English Query is a development environment that enables programmers to
produce natural language front ends for SQL 2000 databases. The product is
included with SQL 2000. The tests were performed on a demo of English Query,
developed by Microsoft to interface with the Northwind database.
The user interface has five fields, with the following functionalities:
• Query (user input)
• Interpretation of query
• Required operations
• Produced SQL statement
• Results
A screen shot from one of the queries is presented in Figure 9.
Figure 9 Microsoft English Query.
5.3.1.2 Elfsoft
Elfsoft works together with either VB or Access. Queries are entered in a query
window (see Figure 10) and can be output either as database tables (see Figure
11) or in a graphical format.
Figure 10 Elfsoft query window.
Figure 11 Elfsoft answer output.
Elfsoft also includes several other features for enhanced portability, including:
• An automatic analyser for any Access database
• Letting the user teach the program the meanings of phrases
• Explaining to the user why a query failed (what was missing and/or wrong)
• Letting the user edit the dictionary
• Logging of queries for statistics
5.3.2 Query results
The results are summarised in Table 3. A full record of the questions asked is
presented in Appendix B.
Table 3 Accuracy percentages.
Type of query    English Query    Elfsoft
Simple           71               23
Average          50               40
6 FUTURE
During the mid-eighties it was believed that natural language processing
systems would become a universal interface to databases worldwide41. However,
due to the emergence of graphical interfaces to databases, the relative simplicity
of SQL, and the inherent problems of natural language processing, they have
never really caught on commercially42.
The current position of NLIDBs is probably best described by “it’s a great idea,
but…” Although their usefulness is appreciated, they are still at a research stage.
There are several reasons why their usage has not taken off on a broader
scale.
6.1 Language challenges
It is still very hard to encode the vast scope, complexity and ambiguity of a
human language into a computer. The formalisms for representing language
patterns are still not comprehensive enough to capture all the different ways that
expressions and terms can be constructed and given meaning depending on the
context.
6.2 Portability challenges
Although several systems for communication with individual databases have
been successfully implemented and used, a general technique, which would
allow the user to specify the database and use a system with any database
management system (whether it be Access, SQL 2000, Oracle or any type), is still
rather elusive. This would require the system to be able to recognize the fields
and attributes of the new storage source seamlessly.
An even bigger hurdle to portability is the nature and scope of language
understanding. Language use in different domains is very dissimilar, which
means that any portable system has to have a huge vocabulary with terms from
many different application domains and be able to recognize expressions from
users of a wide variety of professions.
6.3 Competing systems
Graphical and form-based interfaces have become the de facto standard for
database front ends. Because of the challenges presented above, these other types
of systems are generally possible to develop in shorter time and at a lower cost.
6.4 Possible avenues
There is still a lot of research going on in this area. Having explored the
application of Natural Language Processing as database interfaces, the authors
can see a number of different scenarios.
41 Johnson, T.
42 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P., pp. 29-81
6.4.1 Adaptation techniques
There is a need for methodologies that would enable the user to specify the data
source in a general descriptive language and to supply a given set of terms used
within the domain. This would make the application portable from database to
database.
This need has been recognised in [8], where a solution based on the general
Resource Description Framework (RDF) is proposed. The system outlined in [8]
learns the patterns and domain vocabulary of any given database automatically
and also contains an interface that allows the user to change the database
model (classes, properties, tables etc.).
6.4.2 Speech-based techniques
Certain authors [8] believe that natural language keyboard interfaces will be
superseded by speech recognition systems. However, as such systems are of an
even more complex nature, some of the linguistic challenges will have to be
solved first. Research on NLIDBs can therefore be a base for the development of
voice-based systems [8].
6.4.3 Learning algorithms
Every person has his or her own vocabulary and way of using language. No
program can contain all the words in a language or all the different meanings
that a term may take on.
Further, the use of language changes over time, which means that the semantics
and vocabulary of a system may become obsolete after a certain time of use.
An important challenge for a natural language database front end (or any natural
language processing system in general) is to possess an ability to learn, as it is
used, evolve with the user and adapt to new users. This ability is after all one of
the definitions of artificial intelligence.
There are several ways in which this could potentially be achieved. Note that
these are suggestions and not based on in-depth research.
6.4.3.1 User Dialogue
One way to achieve learning would be to include a lexical editor, where the user
could enter language terms and link them to their synonyms. They should also
be able to specify the different forms of the word, e.g. noun plurals, adjective
comparative forms, verb tenses etc.
This ability is present in Elfsoft.
6.4.3.2 Neural Networks
By use of probabilistic techniques, a system might be able to adjust probabilities
of different parses based on training texts and test texts, which have been parsed
and tagged by the user or obtained from linguists. By continuously retraining the
network with parsed texts from the database-specific domain, the neural
network would be able to pick up language patterns and learn incrementally.
6.4.3.3 Genetic Algorithms
Another way would be for the system to obtain feedback from the user on the
accuracy (e.g. ask the user whether queries were answered correctly) and adjust
its language processing structure (production rules) by the use of genetic
algorithms.
7 CONCLUSIONS
The project has focused on two main topics:
• The techniques of translating a question in natural language into a
database query, extracting the results that the user is looking for;
• The leading contemporary applications on the market.
The underlying methods belong to the general natural language processing area,
and each system has to select among several different techniques involving
different degrees of syntactic analysis, semantic processing, or a combination. A
general feature seems to be the translation of the query in two steps, first to an
intermediate language and then to a database query language, e.g. SQL.
The topic integrates approaches from several other facets of artificial intelligence, e.g.
production systems, neural networks, expert systems, and machine learning.
Two of the leading commercial software packages were tested with mixed
results. Some rather complex queries were handled well, while the systems
tended to have problems with rather easy tasks. The sample sizes involved are
too small to base any general conclusions on, however, as the configuration of
the university computers at our disposal prevented the programs from being
tested there.
Many companies have overestimated the use of natural language processing in
the database interface, interpreting such a system as being able to understand
the significance of a query accurately. However, a system is not able to fully
comprehend human language and jargon unless it has been given the definitions
of the terms relating to the relevant database.43 The difficulty mainly lies in
the semantic analysis: a syntactically well-formed sentence may have various
meanings, which may not even be similar to one another, and this produces
undesirable results in the database queries. This is one main reason why many
systems tend to fail, and it explains why most companies would still rather
rely on SQL programmers for their database processing.
Although these kinds of applications are rather unpopular, the authors enjoyed
using them and encourage their further development. From the experience of the
performed tests, such systems have the potential to make the task of searching
for information far less tedious and time-consuming.
The eventual success for natural language front ends will depend on how well
they can adapt to new environments, both regarding databases and users’ way of
using language. Two proposed benchmarks for these types of systems could be:
43 Honkela, T.
• It has to be able to learn and understand the database faster than the user;
• It has to learn natural language faster and easier than the user can learn a
programming language.
ACKNOWLEDGEMENTS
The authors wish to extend their appreciation to the following people for their
support during the course of the project:
• Jon Greenblatt, President of English Language Frontend Software Co.
• Girish Mohata, Teaching Fellow, IT School, Bond University
8 BIBLIOGRAPHY
1. Androutsopoulos, I., Ritchie, G.D., and Thanisch, P.: Natural Language
Interfaces to Databases - An Introduction. Journal of Natural Language
Engineering, vol. 1, no. 1. Cambridge University Press 1995.
2. Baker, J.K.: Trainable grammars for speech recognition, Speech
Communication Papers for the 97th Meeting of the Acoustical Society of
America, Acoustical Society of America 1979.
3. Beck, H., Mobini, A., Kadambari, V. A Word is Worth 1000 Pictures:
Natural Language Access to Digital Libraries. University of Florida.
http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/beck/b
eckmain.html
4. Dialog-Oriented Use of Natural Language. http://www.dfki.uni-
sb.de/vitra/papers/ro-man94/node5.html. Accessed 31/7-2001.
5. Dougherty, R.C.: Natural Language Computing An English Generative
Grammar in Prolog. Erlbaum, Lawrence Associates 1994.
6. EasyAsk - Applications Overview.
http://www.englishwizard.com/applications/index.cfm. Accessed 19/7-2001.
7. ECL I Vertiefung: natürlichsprachliche Zugangssysteme: chat80.
http://www.ifi.unizh.ch/cl/broder/chat/chat80.htm. Accessed
12/7-2001
8. Eriksson, G.: Översättarteknik. KFS AB 1984.
9. Groucho Marx in the movie Animal Crackers.
10. Hafner, C. D. and Gooden, K.: Portability of Syntax and Semantics in
Datalog. ACM Transactions on Information Systems, vol. 3. Association
for Computing Machinery 1985.
11. Honkela, T.: The WWW Version of Self-Organizing Maps in Natural
Language Processing. Helsinki University of Technology.
http://www.cis.hut.fi/~tho/thesis/. Accessed 22/7-2001.
12. Johnson, T.: Natural Language Computing: The Commercial Applications.
Ovum 1985.
13. Jurafsky, D. and Martin J. H.: Speech and Language Processing, An
Introduction to Natural Language Processing, Computational Linguistic,
and Speech Recognition. Prentice-Hall 2000
14. Language Reference http://www.darpa.mil/ito/psum2000/h165-0.html.
Accessed 14/7-2001.
15. Luger, G.F. and Stubblefield, W.A.: Artificial Intelligence. Structures and
Strategies for Complex Problem Solving. Third Edition. Addison-Wesley
1999.
16. Manas Tungare – Natural Language Processing
http://www.manastungare.com/articles/nlp/natural-language-
processing.asp. Accessed 30/07/01
17. Manning, C.D. and Schutze, H.: Foundations of Statistical Natural
Language Processing. MIT Press 1999.
18. Natural-Language Database Interfaces from ELF Software Co.
http://www.elfsoft.com/ns/FAQ.htm. Accessed 19/7-2001.
19. Palmer, M. and Finin, T.: Workshop on the Evaluation of Natural
Language Processing Systems. Computational Linguistics, vol. 16, pp.
175-181. MIT Press 1990.
20. Penn Treebank Project http://www.cis.upenn.edu/~treebank/. Accessed
10/7 – 2001.
21. Reis, P., Matias, J., Mamede, N.: Edite – A Natural Language Interface to
Databases, A new dimension for an old approach.
http://digitais.ist.utl.pt/cstc/le/Papers/CSTCLE-12.PDF
22. Sharoff, S. and Zhigalov, V.: Register-domain separation as a
Methodology for Development of Natural Language Interfaces to
Databases. Proceedings of the IFIP TC.13 International Conference on
Human-Computer Interaction. International Federation for Information
Processing 1999.
23. Tang, R. L.: Integrating Statistical and Relational Learning for Semantic
Parsing: Applications to Learning Natural Language Interfaces for
Databases. University of Texas May 2000.
9 CONTRIBUTIONS
The respective chapters were produced by the following group members:
Chapter 1: Jun
Chapter 2: Hakan
Chapter 3: Aris and Hakan
Chapter 4: Aris
Chapter 5: All
Chapter 6: Hakan
Chapter 7: Hakan and Jun
Bibliography and report compilation: Aris
Appendices: Hakan
APPENDIX A
Evaluating Systems
Introduction
How good is a natural language database interface? The answer to this question
is hard to give: a survey conducted during the course of this project revealed
that no formal evaluation techniques exist. As long as this situation
remains, an unambiguous answer to the question will elude all stakeholders in
this area.
Why is there a need?
The need for formal evaluation schemes in this field, as in any other, arises
out of the desires of several stakeholders:
• Users want a guide for choosing between systems;
• Companies want benchmarks for product development and
improvement;
• Companies need metrics for proving the capabilities of their products.
Current Marketing
The companies behind contemporary techniques market their products with
some of the following arguments:
• Ease of setup and integration with new databases. It is often mentioned
[6,18] that end users will be relieved of the task of having to learn and
understand the internal workings of the Database Management System
(DBMS)
• Money saved on searching
• Price
• Ease of integration across different DBMSs (Access, SQL Server, Oracle
etc.)
• Accuracy
• The possibility to perform searches on several data stores simultaneously
Problems
There have been some attempts to define general formal metrics for natural
language processing systems [19]. In [19], it was concluded that this is a difficult
task for a number of reasons:
• Systems are built using a variety of techniques;
• They are used in many different domains, where users’ needs are varying;
• There is a lack of funding for research in this area.
However, it was also concluded that database front ends constitute one type of
system for which metrics could potentially be developed and adopted.
Black box metrics
In [19], a strong distinction is made between black box and glass box metrics. A
black box approach only looks at the output generated by a certain input and
does not take into account the architecture of the system or the efficiency of
individual components.
Advantages
• It takes the user’s view;
• It can be applied across platforms, on systems with different
implementation details;
• It is not tied to a specific implementation technique;
• It can be used over time, regardless of trends in database and
programming methodologies.
Disadvantages:
• It doesn’t give a good indication to programmers of what is actually
wrong;
• It is badly suited for testing individual components of a system.
Proposed black box evaluation scheme
The proposed evaluation scheme takes into account several different aspects of
the program in question.
Evaluation can be based on the following characteristics:
Overall Characteristics
• User Friendliness: Is the application easy to understand and use? Are help
files accessible and explanatory? Are error messages clear?
• Portability: Can it be used in conjunction with only a specific database? If
no, how easy is it to integrate it with other databases?
• Speed: How fast are answers extracted?
• Fault Tolerance: Can the system recognize off-topic questions (queries on
information that is not in the database) and give an informative response
within a reasonable time frame?
• Accessibility: Can it be used over the web?
Vocabulary
Can the system accurately understand the following expressions44:
• What?
• Which?
• How many?
• How much?
• Show
• List
44 This list is arbitrary and may have to be expanded/contracted.
• Tell
• Count
Ease of Interaction
• Linguistic Flexibility: How many spelling errors in a word can the system
tolerate and understand? Can it suggest alternative spellings45?
• Probing questions: Are “follow-up” questions (questions referring to the
previous answer) allowed?
• Can the system adjust for bad grammar and still understand the question?
Accuracy based on input complexity
The system is asked a number of different questions. These questions are ranked
as simple, average or complex. The accuracy (percentage of questions answered
correctly) in each of the three categories is noted.
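The accuracy figure used in the result tables reduces to a simple ratio; a one-line sketch (the outcome labels match those used in Appendix B):

```python
# Accuracy as used in the result tables: percentage of questions in a
# class answered correctly, rounded to a whole percent.

def accuracy(outcomes):
    return round(100 * sum(o == "Correct" for o in outcomes) / len(outcomes))

print(accuracy(["Correct", "Wrong", "Correct", "No answer"]))  # 50
```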
The evaluation scheme formed the basis of the market tests of chapter 5.
However, because of the small sample size of tested applications, no attempt to
formalize the scheme or develop a metric based on it was made.
45 For an example of this capability, please try a search on http://www.google.com with a
word containing a slight spelling error, e.g. elpheants.
APPENDIX B
Test Protocol
The questions asked, their respective classifications, and the outcomes for the
tested programs are presented in Table 4. In the classification column, S stands for
Simple, A for Average, and C for Complex.
Table 4. Test Protocol.
Question | Class | Microsoft English Query outcome | Elfsoft outcome | Comments
Who is the oldest employee? | S | Correct | Correct | English Query gave the oldest person, Elfsoft the one who had worked the longest at Northwind.
Which supplier (currently) supplies the most products (which are not discontinued)? | C | Correct | Correct |
Which employee has handled the most orders? | A | No answer | Correct | Elfsoft gave too much information
What product is the most frequently ordered? | S | Correct | No answer |
List the country that has a supplier that ships tofu. | A | No answer | Correct |
Name the third most ordered product. | S | No answer | No answer |
What is the least ordered product? | S | Wrong | No answer |
How much is 1kg of Queso Cabrales? | S | Correct | No answer |
How much tofu have been ordered? | A | No answer | Correct | Elfsoft gave too much information
Show the phone number of united package. | S | Correct | Correct |
Tell me the names of the sales representatives | S | Correct | No answer |
Tell me the age of these people. | A | Correct | No answer |
And their phone numbers? | A | Correct | Correct |
Count the customers in Germany. | S | Correct | Correct |
What is the average age of the employees? | A | Correct | Wrong |
Name the employees that are older than average | A | Correct | No answer |
Give the name of the sales manager. | S | Correct | No answer |
Where is Around the Horn from? | S | Correct | No answer |
What is the median of the age of the employees? | A | No answer | Wrong |
List the names of the people working currently in the company. | S | No answer | Wrong |
Who is older than Janet? | S | Correct | No answer |
What can you tell me about Ernst Handel? | S | Too little information | No answer |
Which supplier supplies tofu but not longlife tofu? | C | Correct | Correct |
What are the contact names and phone numbers of customers that have received products sent with Federal Shipping? | C | No answer | Wrong |
What are the products that federal shipping ships | A | Correct | Correct | Microsoft English Query had the wrong interpretation.
What customers received these shipments? | A | No answer | Wrong |