SlideShare ist ein Scribd-Unternehmen logo
1 von 42
Table of Contents
1 INTRODUCTION..........................................................................................................3
2 HISTORY.........................................................................................................................6
   2.1 Early systems...........................................................................................................6
3 NATURAL LANGUAGE PARSING...........................................................................7
   3.1 Rule-Based Syntactic Parsing ...............................................................................7
   3.2 Terminal Symbols...................................................................................................7
   3.3 Non-terminal symbols............................................................................................7
   3.4 Production Rules.....................................................................................................7
      3.4.1 Grammar...........................................................................................................7
      3.4.2 Parse tree...........................................................................................................8
         3.4.2.1 Top down..................................................................................................8
         3.4.2.2 Bottom up ..................................................................................................9
   3.5 Probabilistic Parsing.............................................................................................10
      3.5.1 Disambiguation..............................................................................................11
      3.5.2 Training...........................................................................................................11
         3.5.2.1 Treebank...................................................................................................12
         3.5.2.2 Incremental learning...............................................................................12
   3.6 Semantic Parsing...................................................................................................13
      3.6.1 Semantic Data Models...................................................................................13
      3.6.2 Case Based Reasoning...................................................................................14
      3.6.3 Semantic Representation...............................................................................15
      3.6.4 Actions of the Parser......................................................................................15
4 NLIDB ARCHITECTURE...........................................................................................17
   4.1 Pattern-matching systems....................................................................................17
   4.2 Parsing based systems..........................................................................................17
      4.2.1 Semantic grammar based parsing...............................................................18
      4.2.2 Translation......................................................................................................19
5 MARKET TEST.............................................................................................................23
   5.1 Goals.......................................................................................................................23
   5.2 Tests........................................................................................................................23
   5.3 Results.....................................................................................................................23
      5.3.1 Impressions.....................................................................................................23
         5.3.1.1 Microsoft English Query........................................................................23
         5.3.1.2 Elfsoft........................................................................................................24
      5.3.2 Query results..................................................................................................25
6 FUTURE........................................................................................................................27
   6.1 Language challenges............................................................................................27
   6.2 Portability challenges...........................................................................................27
   6.3 Competing systems...............................................................................................27
   6.4 Possible avenues....................................................................................................27

                                                                  1
6.4.1 Adaptation techniques..................................................................................28
     6.4.2 Speech-based techniques..............................................................................28
     6.4.3 Learning algorithms......................................................................................28
        6.4.3.1 User Dialogue..........................................................................................28
        6.4.3.2 Neural Networks.....................................................................................29
        6.4.3.3 Genetic Algorithms.................................................................................29
7 CONCLUSIONS...........................................................................................................30
8 BIBLIOGRAPHY..........................................................................................................33
9 CONTRIBUTIONS.......................................................................................................36
 APPENDIX A..................................................................................................................37
   Evaluating Systems....................................................................................................37
      Introduction............................................................................................................37
      Why is there a need?..............................................................................................37
      Current Marketing.................................................................................................37
      Problems .................................................................................................................37
      Black box metrics...................................................................................................38
      Proposed black box evaluation scheme..............................................................38
      Overall Characteristics..........................................................................................38
      Vocabulary..............................................................................................................38
      Ease of Interaction..................................................................................................39
      Accuracy based on input complexity..................................................................39
 APPENDIX B..................................................................................................................40
   Test Protocol...............................................................................................................40




                                                                2
1      INTRODUCTION
The ability to exercise language to convey different thoughts and feelings
differentiates human beings from animals. The definition of Natural Language
Processing is the capability of a machine to understand the full context of human
language about a particular topic so that the unspecified guess and general
knowledge can be understood. “Thus if the machine is able to achieve this, it has
come close to the notion of artificial intelligence itself”1.

One may find interacting with a foreign person with no knowledge of English
intricate and frustrating. Thus, a translator will have to come into the picture to
allow one to communicate with the foreigner. Companies have related this
problem to extracting data from a database management system (DBMS) such as
MS Access, Oracle and others. A person with no knowledge of Structured Query
Language (SQL) may find himself or herself handicapped in corresponding with
the database. Therefore, companies like Microsoft and Elfsoft (English Language
Frontend Software) have analysed the abilities of Natural Language Processing
to develop products for people to interact with the database in simple English.
This enables a user to simply enter queries in English to the Natural Language
database interface. This kind of application is known as a Natural Language
Interface to a DataBase (NLIDB).

The system works by utilizing the use of syntactic knowledge and the knowledge
it has been provided about the relevant database.2 Hence, it is able to implement
the natural language input to the structure, scope and contents of the database.
The program translates the whole query into the standard query language to
extract the relevant information from the database. Thus, these products have
created a revolution in extracting information from databases. They have
discarded the fuss of learning SQL and time is also saved in learning this query
language.

This report will look at the performance of each database interface connected to a
standard database. The Northwind database has been chosen as the default
database to work on. There are several companies that are offering such products
in the market. Our group has found several of them, which include English
Query, Elfsoft, EasyAsk and NLBean created by Mr Mark Watson. We have
requested for these companies for their permissions to test their products in
regards to our research. We received positive responses from Elfsoft and
NLBean, but had to settle for tests on Microsoft English Query and Elfsoft only.
We have also contacted EasyAsk via email but the company has provided
minimal assistance in our research.

1
    Manas Tungare
2
    Manas Tungare

                                         3
In order to produce accurate conclusions on the different interpretations of each
software, we have listed out over thirty questions to test the products. Each
product will be asked the same questions in the same order. The questions have
been carefully planned to test the pros and the cons of each product.

These questions include:
   • Listing the specific columns and rows
   • Counting
   • Calculations
   • Cross referencing from more than one tables
   • Ordinal positions
   • Followed-ups
   • Conclusions
   • Semantics
   • Grammar mistakes
   • Spelling mistakes
   • Out-of-context questions

There are three components in a natural language dialog system: analysis,
evaluation and generation.3 The analysis component translates the query as
entered by the user into a semantic representation which is transcripted in the
knowledge representation language. There may be several communication
sessions between the natural language access system and user interface system to
the user in order to carry out the action to derive the result. The evaluation
component allows information to be absorbed by the dialog system when queries
have to be satisfied or the system needs to alert the user about any major state
changes. The generation component gathers the intended information that the
user wants to see as provided in the query. This component will generate text,
graphs, query or any other responses according to the situational context of the
query.4

The knowledge-based database assistant (KDA) as stated, is a practical
development of an intelligent database front-end to assist novice users in
retrieving desirable information from an unfamiliar database system.5 This
component exists in both Microsoft English Query and Elfsoft. Thus, this useful
program directs the novice user to get the relevant results by entering the
accurate query or by prompting the user when insufficient information is entered
to get the appropriate answer. This component can be seen in the later part in
this report in both programs.

3
  Dialog-Oriented Use of Natural Language
4
  Dialog-Oriented Use of Natural Language
5
  Manas Tungare

                                            4
In addition, “the KDA's responding functionality, which could change the user's
knowledge state, is called query guidance”.6 It can detect a user’s scope of
knowledge about the relevant database by studying the query entered by the
user. If it sensed that the user has limited awareness about the database and
could not retrieve his or her desired answer, the query guidance will jump into
action and provide similar queries to allow the user to gather the appropriate
facts from the database or present the most relevant query to the user based on
the user’s perceived intention. Such a component allows the novice to get
familiar with the database fast and enables the user to learn about the scope of
the database based on the prompt messages and the queries generated from the
KDA without the expense of learning those mass databases stored in most
organizations.




6
    Manas Tungare

                                        5
2      HISTORY
As the use of databases for data storage spread during the 1970’s, the user
interface to these systems represented a burden for designers worldwide. At this
point, both the relational database model and the SQL interface language were
yet to be developed, which means that the task of inserting and querying data
was tedious and difficult.

It was therefore a logical step for programmers to attempt to develop more user-
friendly and “human” interfaces to the databases. One of these approaches was
the use of natural language processing, where the user interactively would be
allowed to interrogate the stored information.
2.1 Early systems
The most well-known historical natural language database interface systems are:
   • LUNAR, interfacing a database with information on rocks collected
     during American moon expeditions. It was originally published in 1972.
     When evaluated in 1977, it answered 78 % of questions correctly. Based on
     syntactic parsing, it tended to build several parse trees for the same query,
     and was deemed as inefficient7 and too domain-specific and inflexible.
   • LADDER, the first semantic grammar-based system, interfacing a
     database with information on US Navy ships.
   • CHAT-80, probably the most famous example. It interfaced a database of
     world geography facts. The entire application (both the database and the
     user interface) was developed in Prolog. As the source code was freely
     distributed, it is still used and cited. An online version can be found at8.




7
    Hafner, C. D. and Gooden, K. pp 141-164
8
    ECL I Vertiefung: natürlichsprachliche Zugangssysteme: chat80

                                                 6
3    NATURAL LANGUAGE PARSING
3.1 Rule-Based Syntactic Parsing
Syntax means ways that words can fit together to form higher-level units such as
phrases, clauses, and sentences. Therefore syntactically driven parsing means
interpretations of larger groups of words are built up out of the interpretation of
their syntactic constituent words or phrases. In a way this is the opposite of
pattern matching as here the interpretation of the input is done as a whole.

Syntactic analyses are obtained by application of a grammar that determines
what sentences are legal in the language that is being parsed.
Syntactic parsing operates through the translation of the natural language query
into a parse tree, which is then converted to a SQL query. There are a number of
fundamental concepts in the theory of syntactic parsing.
3.2 Terminal Symbols
A terminal symbol is the basic building block of the language, i.e. words and
delimiters. Together, the set of terminal symbols form the “dictionary of words”9
recognised by the system, i.e. the range of the vocabulary that it can read and
interpret.
3.3 Non-terminal symbols
Non-terminal symbols are higher-level language terms describing concepts and
connections in the syntax of the language. Examples of non-terminal symbols
include1 sentence, noun phrase, verb phrase, noun, and verb.
3.4 Production Rules
As the query is analysed, a number of production rules fires to identify and
classify the context of the read word. In analogy with a production system (such
as the one used in PROLOG), a production rule in a context-free grammar10
converts a left-hand non-terminal symbol to a sequence of symbols, which can be
either terminal or non-terminal. Examples of production rules:
    • Sentence := Noun phrase verb phrase
    • Verb phrase := verb
These rules are also commonly referred to as rewrite rules.
3.4.1 Grammar
The combination of the set of terminal symbols, set of non-terminal symbols, the
production rules and an assigned start symbol (the highest-level construct in the
system, usually sentence) form the grammar of the syntax. The role of the
grammar is to define:
   • What category each word belongs to;

9
 Luger, G.F. and Stubblefield, W.A.
 This paper will be constricted to the treatment of context-free grammars and not deal with the
10

more complex set of syntaxes known as context-sensitive.

                                               7
•   What expressions are legal and syntactically correct;
   •   How sentences are generated.
3.4.2 Parse tree
The system analyses the sentence by reading the non-terminal symbols in order
and identifying what production rule to fire. As it does so, it gradually builds a
representation of the sentence referred to as a parse tree. The term has been
coined from the tree-like graph that is produced, where the root is the top-level
symbol (e.g. sentence), the children of each node are the right-hand non-terminal
symbols and the leaves are the terminal symbols (the words). The parse tree can
be built in two fundamentally different ways.
3.4.2.1 Top down
A top down parser starts at the root and gradually builds the tree downwards by
matching the read terminal symbols with symbols on the right-hand side of
possible production rules. Terminal or non-terminal symbols on the right hand
side are added at the level below the current symbol.
This is similar to the goal-driven approach of a production system. The basic
architecture of a top down parser is illustrated in figure 1.




                                        8
Figure 1 Top down parsing of the sentence "the girl forgot the boy"11

In many situations, the first token alone does not provide enough information to
make the decision on what production rule should be fired. In order to overcome
this, there are two basic methods.

3.4.2.1.1 Recursive Descent
The system starts by firing the first production rule of the candidates for which
the given terminal symbol could fit and builds the initial sub tree from this
information. If this further downwards in the tree results in an inconsistency or
syntactic error, it reverts to the point where the decision was made, removes all
the nodes on the way back up and selects another of the possible productions.
This is a procedure very similar to depth-first searching and backtracking in
production systems.

3.4.2.1.2 Look Ahead
Look Ahead systems will not be contented by just reading one token. Rather, it
reads the number of tokens necessary to identify the given right-hand side
beyond any ambiguities before firing any production rules.

Grammars are characterised by the maximum number of terminal symbols
required to read before all possible conflicts in the choice of production rule can
be resolved. If this number is k, the grammar is referred to as an LL (k)
grammar12. The look ahead procedure is more in analogy with a breadth-first
search technique.
3.4.2.2 Bottom up
A bottom up parser, on the other hand, works from the leaf upward by
“tagging” the tokens, i.e. starting from the right-hand side of the production
rules and associating the read word with its category. When a full right-hand
side has been identified, the production rule fires and the left-hand side non-
terminal symbol is added as a branch in the level above. This methodology
corresponds to the data-driven technique of production systems. The bottom up
parsing technique is illustrated in figure 2.




11
     Dougherty, R.C.
12
     Eriksson, G.

                                              9
Figure 2 Bottom up parsing of the sentence "the girl forgot the boy"13

In some cases, the sentence is ambiguous in itself and there are multiple
production rules that match a given sentence, in which case the parser has to
make a choice between the two potential interpretations. One strategy for
dealing with these situations is referred to as probabilistic parsing.
3.5 Probabilistic Parsing
Probabilistic parsing takes an empirical approach to the difficult task of
disambiguation, i.e. identifying which of several mutually exclusive alternate

13
     Dougherty, R.C.

                                              10
syntactic parse trees should be generated.

For example, consider the sentence “One morning I shot an elephant in my
pyjamas”14. There are two possible syntactic parses for this sentence15. One
implies that the person was wearing the pyjamas, while the opposing view
would claim that the elephant was in the underwear (hence the joke). Although
the selection between these two interpretations is obvious to a human, how is
this knowledge automated in a computer?

One option, used in a.k.a. attribute grammars, is to encode information for each
verb as a parameter to each production rule. However, as the dictionary grows,
this approach may be too selective and require every different case to be
specifically added to the production rules.

Probabilistic parsing, on the other hand, works by augmenting the rules with
assigned probabilities, representing the chance of the particular expansion
(production rule) being the correct one.

For example, a probabilistic grammar would introduce the following
enhancements to the possible regular syntactic production rules for the
expansion of the non-terminal symbol sentence [15]:

       •   Sentence:= Nounphrase Verbphrase, P = 0.8
       •   Sentence:= Auxiliary Nounphrase Verbphrase, P = 0.15
       •   Sentence:= Verbphrase, P = 0.05

Note that the probabilities for the expansions of any given non-terminal symbol
always add up to 1.


3.5.1 Disambiguation
How does probabilistic parsing choose a parse tree from two possible
interpretations? In most systems, it simply compares the products of all the
probabilities involved in every production required for the competing parses and
selects the one representing the highest of these probabilities.
3.5.2 Training
One important task concerns how to set the probabilities. There are two
fundamentally different techniques for this task [15].




14
     Groucho Marx
15
     Jurafsky, D. & Martin, J.

                                         11
3.5.2.1 Treebank
A large database of sentences with their correct parses (parsed by knowledgeable
humans) is entered into the system. The respective probabilities are then
calculated as the relative frequencies of each possible parse. For more details, see
[15].

The largest known treebank is known as the Penn Treebank16. The latest version,
Treebank 3 contains parses of17:
   • One million words of 1989 Wall Street Journal material;
   • A small sample of ATIS-3 transcripts. The Air Travel Information Service
       is a joint project of DARPA (Defence Advanced Research Projects Agency)
       and SRI International, handling voice-based queries and requests about
       flights. More information can be found at18;
   • A fully parsed tagged version of the Brown Corpus, consisting of one
       million words from 500 different sources (novels, academic books,
       newspapers, non-fiction books etc. [15]);
   • Parsed and tagged text from a set of 560 transcripts of telephone
       conversation (a.k.a. the Switchboard-1 corpus).
   • This is a widely used “training set” (in analogy with an artificial neural
       network) enabling the parser to learn what classes of speech a given word
       can belong to and how frequently a particular expression is to be
       interpreted in different ways.

3.5.2.2 Incremental learning
The other technique is a “trial and error” method, in which the parsing system
much like an artificial neural network learns as it is used.

The initial probabilities can be assigned randomly or by the user. After that, the
system adjusts these probabilities according to the following rules [15]:

     •   If the sentence was unambiguous, its parse count is increased by 1, i.e. pi:
         = pi +1;
     •   If the sentence was ambiguous, each of the possible parses have their
         counts incremented by their respective probabilities, i.e. pi: = pi + P (p i ).

The algorithm for this computation is referred to as the Inside - Outside
Algorithm. It was originally proposed in19 and is described in detail in20.

16
   Penn Treebank Project.
17
   Quoted by the LDC office of the University of Pennsylvania in an email dated 10/7-2001.
18
   Language Reference
19
   Baker, J.K. pp. 547-550.
20
   Manning, C.D. and Schutze, H.

                                               12
3.6 Semantic Parsing
The syntactic structure of a sentence is not enough to express its meaning. For
instance, the noun phrase the catch can have different meanings depending on
whether one is talking about a baseball game or a fishing expedition. To talk
about different possible readings of the phrase the catch, one therefore has to
define each specific sense of the phrase. The representation of the context-
independent meaning of a sentence is called its logical form.21 Natural language
analysis based on semantic grammar is similar to syntactically driven parsing
except that in semantic grammar the categories used are defined semantically.

Database items can be ambiguous when the same item is listed under more than
one attribute. For example, the term “Mississippi” is ambiguous between being a
river name or a state name, in other words, two different logical forms. The two
different meanings have to be represented distinctly for an interpretation of a
user query.

3.6.1 Semantic Data Models
Semantic data models (SDM) are widely researched in the database community.
They are closely related to semantic networks used in artificial intelligence,
which were originally developed to support natural language processing. Hence,
as database management systems they are capable of supporting large amounts
of information, while still offering the potential of advanced inferencing
capabilities including NLP, machine learning, and query processing.

“SDMs can be seen as formalising many of the relationships, expressed in an ad
hoc manor in conventional hypermedia systems.”22 SDMs support a variety of
formalised links and relationships. An example of a small network on insects is
shown in figure 3. The links in this graph express generalisation relationships or
"ISA" (beneficial insect IS-A insect), part/whole (Abdomen is part of an Insect),
association (Ladybugs eat Aphids), and class/instance (Ladybug is an instance of
Beneficial Insect). 23




21
     Tang, R. L. p5
22
     Beck, H., Mobini, A., Kadambari, V
23
     Beck, H., Mobini, A., Kadambari, V
                                          13
Figure 3 Semantic Data Model describing insects24

In figure 3, solid lines are ISA relationships, diamonds are part/whole, circles are
associations, and Instances are underlined.

Since concepts in SDMs are described by structured graphs expressing the
relationships among symbols rather than connections between text files as in
conventional hypertext, there exists the capability for manipulation of SDMs to
produce a number of desirable functions. Foremost is that of search or query
processing. [8] Suggests query processing based on graph matching techniques
by which the query is expressed as a small semantic network. This query graph
is then matched against the larger database graph to find connections. This gives
a much more precise search capability than is possible with Boolean keyword
searches over text files.


3.6.2 Case Based Reasoning
In order to construct an NLP system, one must construct a large dictionary.
Much of the recent advances in text understanding systems can be attributed to
advances in design and construction of large lexicons. But that presupposes that
word meaning is easily represented and a case-based reasoning approach to
meaning is used. Words obtain meaning by how they are used. A particular
word is used in many different situations and contexts. Each occurrence of the
word is treated as one case. Similarities among cases can be observed, and cases
with similar usage can be clustered together into categories. When a word is
used in a new situation, similar cases are retrieved from the case-based memory
in order to apply what happened before to the new context. The meaning of a
particular word is established by a large case base, and thus a single word may
be "worth 1,000 cases". 25




24
     http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/beck/fig1.gif
25
     Beck, H., Mobini, A., Kadambari, V.

                                              14
3.6.3 Semantic Representation
The most basic constructs of the representation language are the terms used to
describe objects in the database and the basic relations between them. Database
objects bear relationships to each other or can be related to other objects of
interest to a user who is requesting information from it. For instance, in a user
query like “What is the capital of Texas?”, the data of interest is a city with certain
relationship to a state called Texas, or more precisely its capital. The capital/2
relation, or predicate, is therefore defined to handle questions that require them.

                        Predicates         Description
                        city (C)           C is a city
                        capital (S,C)      C is the capital of S
                        density (S,D)      D is the population density of
                                           state S
                        loc (X,Y)          X is located in Y
                        len (R,L)          L is the length of river R
                        next_to (S1,S2)    State S1 borders S2
                        traverse (R,S)     River R traverses state S
                                 Table 1 Sample of predicates26


3.6.4 Actions of the Parser
Using the parser actions in CHILL [8] known as shift-reduce parsing we will
discuss the working of the parser. The parser actions are generated from
templates given by a logical query. An action template will be instantiated to
form a specific parsing action. Recall that the parser also requires a lexicon to
interpret meaning of phrases into specific logical forms. Consider the following
example27:


Sentence: What is the capital of Texas?
Logical Query: answer(C,(capital(C,S),const(S,stateid(Texas)))).

A very simple lexicon will map ‘capital’ to ‘capital(_,_)’ and ‘Texas’ to
‘const(_,stateid(texas))’. The parser begins with an initial stack and a buffer
holding the input sentence, which is the initial parse state. Each predicate on the
parse stack has an attached buffer to hold the context in which it was introduced.
Words from the input sentence are shifted onto the stack buffer during parsing.
The initial parse is as follow:

Parse Stack: [answer(_,_):[]]
Input Buffer: [what,is,the,capital,of,texas,?]
26
     Lappoon R. T. p6
27
     Tang, R.L.

                                             15
Since the first three words in the input buffer do not map to any logical forms,
the next sequence of steps is to push the three words from the input buffer onto
the parse stack. The process has the following result:

Parse Stack: [answer(_,_):[the,is,what]]
Input Buffer: [capital,of,Texas,?]

Now, ‘capital’ is at the head of the input buffer and is mapped to ‘capital(_,_)’ in
the lexicon. The next action is to push the logical form onto the parse stack. The
resulting parse state is as followed:

Parse Stack: [capital(_,_):[],answer(_,_):[the,is,what]]
Input Buffer: [capital,of,Texas,?]

The parser then binds two arguments of two different logical forms to the same
variable, resulting in the following parse state:

Parse Stack: [capital(C,_):[],answer(C,_):[the,is,what]]
Input Buffer: [capital,of,Texas,?]

The sequence repeats itself producing a parse state:

Parse Stack: [const(S,stateid(Texas)):[?,Texas]capital(C,S):[of,capital],answer(C,_):
[the,is,what]]
Input Buffer: []

The final step is to take the logical form on the parse stack and put it into one of
the arguments of the meta-predicate resulting in:

Parse Stack: [answer(C,(capital(C,S), const(S,stateid(Texas)))):
[?,Texas,of,capital,the,is,what]]
Input Buffer: []

As this is the final parse state, the logical query is then constructed from the
parse stack.




                                           16
4       NLIDB ARCHITECTURE
4.1 Pattern-matching systems
The first NLIDBs were based on pattern-matching techniques. As a simple
illustration of pattern matching technique, consider the following database:

                  Countries_Table
                  Country           Capital             Language
                  France            Paris               French
                  Italy             Rome                Italian
                  …                 …                   …
                              Table 2 Sample Database Table28

A primitive pattern-matching system according to [8] may use rules as:

Pattern: … ”capital” … <country>
    Action: Report CAPITAL of row where COUNTRY = <country>

Pattern: … “capital” … “country”
    Action: Report CAPITAL and COUNTRY of each row

If the user asked “What is the capital of France?”, using the first pattern rule the
system would report “Paris”. The system would also use the same rule to handle
questions such as “Print the capital of Italy”, “Could you please tell me what is the
capital of France?” etc.

Some advantages of this approach are that it requires no complicated parsing or
interpretation modules, and that it is easy to implement. But the main advantage
of this approach is its simplicity. However the shallowness of this approach often
lead to bad failures. An example is when a pattern-matching NLIDB was asked
“TITLES OF EMPLOYEES IN LOS ANGELES.” the system reported the state
where each employee worked, assuming the “IN” to denote the post code of
Indiana, and assumed that the question was about employees and states.29


4.2 Parsing based systems
In general as [8] suggests, the system architectures of some NLIDBs can be seen
as being made of two major modules. The first module controls the natural
language, where a question is submitted and successively transformed. At the
end of this process one or more intermediate logical query expressions is
obtained. Given the dimension of the domain and the flexibility of the natural
language, there usually exist several interpretations of the same question. The
28
     Androutsopoulos, I., Ritchie, G.D., and Thanisch, P. p.14
29
     Androutsopoulos, I., Ritchie, G.D., and Thanisch, P., pp.14-15

                                                   17
second component is in charge of the connection with the database, translating
the expressions to structured query language (SQL) expressions (using mapping)
and sending them to the Data Base Management System (DBMS) to produce the
answers.30
For a graphical explanation of the structure, examine Figure 4.




                                  Figure 4 NLIDB Architecture31

As described in the previous section, the source language sentence is first parsed,
producing a parse tree. The two methods often found of parsing are the syntax
based and semantic grammar based.

4.2.1 Semantic grammar based parsing
Using this technique, the grammar’s categories do not necessarily correspond to
syntactic concepts. Examine the following figure:




30
     Reis, P., Matias, J. and Mamede N. p.3-4
31
     Androutsopoulos, I., Ritchie, G.D., and Thanisch, P. p.18

                                                   18
Figure 5 Semantic base parsing tree32

Notice that some categories of the grammar (e.g.: Substance, Magnesium,
Specimen_question) do not correspond to syntactic constituents (e.g.: Noun-
Phrase, Noun, Sentence). This is because the semantic information about the
knowledge domain (e.g.: a question may either refer to specimens or spacecraft)
is hared-wired into the semantic grammar.33

Because the semantic grammar approach contains hard-wired knowledge about
a specific knowledge domain, it is very difficult to transfer it to other knowledge
domain. A new semantic grammar has to be written whenever the NLIDB is
configured for a new knowledge domain.34


4.2.2 Translation
The translation is usually based on several mapping tables. Figure 6 illustrates
this process for both the addition of new information based on an input sentence
and the processing of a related query. The query is represented by a small graph,
which initiates the mapping to the semantic hierarchy. The small graph is
mapped to the semantic network by creating a link from each node in the smaller
graph to the corresponding nodes in the network starting with the most general
concept (the root) and ending with the most specific. This will create a unique
instance, which is the intersection of all of the nodes involved in the query and
may be used to narrow down a neighbourhood based on the requested
information. 35

The mapping process is bounded by rules, and completely based on the
information of the parse tree. As an example of mapping rules, consider the
previous query of “which rock contains magnesium” taken from [1]:
    • The mapping of “which” is for_every X.

32
   Androutsopoulos, I., Ritchie, G.D., and Thanisch, P. p.17
33
   Androutsopoulos, I., Ritchie, G.D., and Thanisch, P. p.17
34
   Androutsopoulos, I., Ritchie, G.D., and Thanisch, P. p.17
35
   Beck, H., Mobini, A., Kadambari, V. [online]

                                                19
•   The mapping of “rock” is (is_rock X).
•   The mapping of an NP is Det’ N’, where Det’ and N’ are the mappings of
    the determiner and the noun respectively. Thus resulting in for_every X
    (is_rock X).
•   The mapping of “contains’ is contains.
•   The mapping of “magnesium” is magnesium.
•   The mapping of a VP (V’ X N’). Thus resulting in (contains X magnesium).




                                   20
Figure 6 Mapping and Query Processing Model36

Figure 7 demonstrates when the user ask a query on how John spent his leisure
time and displays how the answer to the query is produced by exploiting the
relationship between "spending leisure time" and "having a chance to go fishing"
(both are "doing").




                           Figure 7 Query processing model37

In many systems the syntax rules linking non-leaf nodes and the semantic rules
are domain independent, and can be used in any application domain. The
information describing the possible words (leaf nodes) and the logic expressions
is domain dependent and has to be declared in the lexicon.38

As an example, consider the lexicon used in MASQUE [8] listing the possible
words, “capital”, “capitals”, “border”, “borders”, “bordering”, “bordered”.
   • The logic expression of “capital”, “capitals” could be
      capital_of(Capital,Country).
   • The logic expression of “border”, “borders”, “bordering”, “bordered” could be
      borders(Country1,Country2).
   • The logic expression of “country” could be is_country(Country).
36
   http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/beck/fig2.gif
37
   http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/beck/fig3.gif
38
   Androutsopoulos, I., Ritchie, G.D., and Thanisch, P. p.19

                                           21
Then the question, “What is the capital of each country bordering Greece?” would be
mapped to this query:
      answer([Capital, Country]):-
      is_country(Country),
      borders(Country, Greece),
      capital_of(Capital, Country).

The meaning of the logic query above is to find all pairs [Capital, Country], such that
Country is a country, Country borders Greece, and Capital is the capital of Country.
The interpreter also needs to consult a world model that describes the structure of
the surrounding world as shown by the figure below. Typically, the model
contains a hierarchy of classes of world objects, and constraints on the types of
arguments each logic predicate may have.39




                                Figure 8 Hierarchy in world model40




39
     Androutsopoulos, I., Ritchie, G.D., and Thanisch, P. p.18-19
40
     Androutsopoulos, I., Ritchie, G.D., and Thanisch, P. p.19

                                                   22
5   MARKET TEST
In order to get a good estimate of the current state of the technology, the
applications presented in the previous chapter were subjected to a neutral test.
5.1 Goals
The goals of the tests were:
   • To get a thorough understanding of contemporary market applications;
   • To get an estimate of the relevance and importance of this type of systems;
   • To get some insight into what features are more and less important.
5.2 Tests
The tests were carried out on the Northwind database, a sample database with
information on a shipping company. The database comes as a demo with all
distributed copies of Microsoft Access.

A number of different queries of different types were posed to the respective
natural language front ends. The questions were classified as simple (S), average
(A), or complex (C).

For a more comprehensive explanation of the considerations behind the testing
procedures, see Appendix A.
5.3 Results
5.3.1 Impressions
5.3.1.1 Microsoft English Query
English Query is a development environment that enables programmers to
produce natural language front ends for SQL 2000 databases. The product is
included with SQL 2000. The tests were performed on a demo of English Query,
developed by Microsoft to interface with the Northwind database.

The user interface has five fields, with the following functionalities:
   • Query (user input)
   • Interpretation of query
   • Required operations
   • Produced SQL statement
   • Results
A screen shot from one of the queries is presented in Figure 9.




                                         23
Figure 9 Microsoft English Query.
5.3.1.2 Elfsoft
Elfsoft works together with either VB or Access. Queries are entered in a query
window (see Figure 10) and can be output either as database tables (see Figure
11) or in a graphical format.




                           Figure 10 Elfsoft query window.




                                         24
Figure 11 Elfsoft answer output.

Elfsoft also includes several other options for enhanced portability, including:
    • Automatic analyser of any Access database
    • Enabling the user to teach program meanings of phrases
    • Allowing the user to explain why a query failed (what was missing
        and/or wrong)
    • Permitting the user to edit the dictionary
    • Logging of queries for statistics
5.3.2 Query results
The results are summarised in Table 3. A full recollection of the questions asked
is presented in Appendix B.
                            Table 3 Accuracy percentages.
                    Type of query                 English      Elfsoft
                                                   Query
                       Simple                       71           23


                       Average                       50          40




                                         25
Complex        67   100




          26
6       FUTURE
During the mid-eighties it was believed that natural language processing
systems would become a universal interface to databases worldwide41. However,
due to the emergence of graphical interfaces to databases, the relative simplicity
of SQL and the inherent problems of natural language processing they have
never really caught on commercially42.

The current position of NLIDBs is probably best described by “it’s a great idea,
but…” Although their usefulness is appreciated, they are still at a research stage.
There are several reasons as to why their usage is not taking off on a broader
scale.
6.1 Language challenges
It is still very hard to encode the vast source, complexity and ambiguity of a
human language into a computer. The formalisms for representing language
patterns are still not comprehensive enough to capture all the different ways that
expressions and terms can be constructed and given meaning depending on the
context.
6.2 Portability challenges
Although several systems for communication with individual databases have
been successfully implemented and used, a general technique, which would
allow the user to specify the database and use a system with any database
management system (whether it be Access, SQL 2000, Oracle or any type), is still
rather elusive. This would require the system to be able to recognize the fields
and attributes of the new storage source seamlessly.

An even bigger hurdle to portability is the nature and scope of language
understanding. Language use in different domains is very dissimilar, which
means that any portable system has to have a huge vocabulary with terms from
many different application domains and be able to recognize expressions from
users of a wide variety of professions.
6.3 Competing systems
Graphical and form-based interfaces have become the de facto standard for
database front ends. Because of the challenges presented above, these other types
of systems are generally possible to develop in shorter time and at a lower cost.
6.4 Possible avenues
There is still a lot of research going on in this area. Having explored the
application of Natural Language Processing as database interfaces, the authors
can see a number of different scenarios.


41
     Johnson, T.
42
     Androutsopoulos, I., Ritchie, G.D., and Thanisch, P. pp.29-81

                                                  27
6.4.1 Adaptation techniques
There is a need for methodologies that would enable the user to specify the data
source in a general descriptive language and to supply a given set of terms used
within the domain. This would make the application portable from database to
database.

This need has been recognised in [8], where a solution based on the general
Resource Description Framework (RDF). The system outlined in [8] learns the
pattern and domain vocabulary of any given database automatically and also
contains an interface that allows the user to change the database model (classes,
properties, tables etc.).
6.4.2 Speech-based techniques
Certain authors [8] believe that natural language keyboard interfaces will be
superseded by speech recognition systems. However, as such systems are of an
even more complex nature, some of the linguistic challenges will have to be
solved first. Research on NLIDBs can therefore be a base for the development of
voice-based systems [8].
6.4.3 Learning algorithms
Every person has its own vocabulary and way of using language. There is
absolutely no way that a program can contain all words in a language or all
different meanings that a term may take on.

Further, the use of language changes over time, which means that the semantics
and vocabulary of a system may become obsolete after a certain time of use.

An important challenge for a natural language database front end (or any natural
language processing system in general) is to possess an ability to learn, as it is
used, evolve with the user and adapt to new users. This ability is after all one of
the definitions of artificial intelligence.

There are several ways in which this could potentially be achieved. Note that
these are suggestions and not based on in-depth research.
6.4.3.1 User Dialogue
One way to achieve learning would be to include a lexical editor, where the user
could enter language terms and link them to their synonyms. They should also
be able to specify the different forms of the word, e.g. noun plurals, adjective
comparative forms, verb tenses etc.

This ability is present in Elfsoft.




                                        28
6.4.3.2 Neural Networks
By use of probabilistic techniques, a system might be able to adjust probabilities
of different parses based on training texts and test texts, which have been parsed
and tagged by the user or obtained from linguists. By continuously retraining the
network with parsed texts from the database-specific domain, the neural
network would be able to pick up language patterns and learn incrementally.

6.4.3.3 Genetic Algorithms
Another way would be for the system to obtain feedback from the user on the
accuracy (e.g. ask the user whether queries were answered correctly) and adjust
its language processing structure (production rules) by the use of genetic
algorithms.




                                       29
7      CONCLUSIONS
The project has focused on two main topics:
   • The techniques of translating a question in natural language into a
      database query, extracting the results that the user is looking for;
   • The leading contemporary applications on the market.

The underlying methods belong in the general natural language processing area,
while any system has to select among several different techniques involving
different degrees of syntactic analysis, semantic processing or a combination. A
general feature seems to be the translation of the query in two steps, first to an
intermediate language and then to a database query language, e.g. SQL.

The topic integrates approaches several other facets of artificial intelligence, e.g.
production systems, neural networks, expert systems, and machine learning.

Two of the leading commercial software packages were tested with mixed
results. Some rather complex queries were handled well, while the systems
tended to have problems handling rather easy tasks. The sample sizes involved
are too small to base any general conclusions on, however. The reason for this is
that the configuration of the university computers at our disposal could not be
used for testing the programs.

Many companies have overestimated the use of natural language processing in
the database interface. Their interpretation of the system is that it is able to
understand the significance of the query accurately. However, the system is not
able to fully comprehend the human language and jargon unless it has been
given the definitions for these terms relating to the relevant database.43 This
mainly involves the semantic analysis. A sentence that is syntactically structured
may have lead to various meanings, which may not even be similar to one
another. Thus, as a result, this will produce undesirable conclusions in the
database queries. This is one main reason why many systems tend to fail and
explains why most companies would still rather rely on SQL programmers for
their database processing.

Although these kinds of applications are rather unpopular, the authors enjoyed
using them and encourage their future evolvement. From the experiences of the
performed tests, systems have the potential to make the task of searching for
information a lot less tedious and time-consuming.

The eventual success for natural language front ends will depend on how well
they can adapt to new environments, both regarding databases and users’ way of
using language. Two proposed benchmarks for these types of systems could be:
43
     Timo Honkela

                                          30
•   It has to be able to learn and understand the database faster than the user;
•   It has to learn natural language faster and easier than the user can learn a
    programming language.




                                     31
ACKNOWLEDGEMENTS
The authors wish to extend their appreciations to the following people for their
support during the course of the project:
   • Jon Greenblatt, President of English Language Frontend Software Co.
   • Girish Mohata, Teaching Fellow, IT School, Bond University




                                        32
8   BIBLIOGRAPHY

    1. Androutsopoulos, I., Ritchiey G.D., and Thanischz, P.: Natural Language
       Interfaces to Databases - An Introduction. Journal of Natural Language
       Engineering, vol. 1, No. 1. Cambridge University Press 1995

    2. Baker, J.K.: Trainable grammars for speech recognition, Speech
       Communication Papers for the 97th Meeting of the Acoustical Society of
       America, Acoustical Society of America 1979.

    3. Beck, H., Mobini, A., Kadambari, V. A Word is Worth 1000 Pictures:
       Natural Language Access to Digital Libraries. University of Florida.
       http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/beck/b
       eckmain.html

    4. Dialog-Oriented Use of Natural Language http://www.dfki.uni-
       sb.de/vitra/papers/ro-man94/node5.html. Accessed on 310701

    5. Dougherty, R.C.: Natural Language Computing An English Generative
       Grammar in Prolog. Erlbaum, Lawrence Associates 1994.

    6. EasyAsk - Applications Overview
       http://www.englishwizard.com/applications/index.cfm -. Accessed
       19/7-2001

    7. ECL I Vertiefung: natürlichsprachliche Zugangssysteme: chat80.
       http://www.ifi.unizh.ch/cl/broder/chat/chat80.htm. Accessed
       12/7-2001

    8. Eriksson, G.: Översättarteknik. KFS AB 1984.

    9. Groucho Marx in the movie Animal Cracker.

    10. Hafner, C. D. and Gooden, K.: Portability of Syntax and Semantics in
        Datalog. ACM Transactions on Information Systems, vol. 3. Association
        for Computing Machinery 1985.

    11. Honkela, T., The Www Version Of Self-Organizing Maps In Natural
        Language Processing of Helsinki University of Technology – viewed on
        22/07/01
        http://www.cis.hut.fi/~tho/thesis/



                                       33
12. Johnson, T.: Natural Language Computing: The Commercial Applications.
    Ovum 1985.

13. Jurafsky, D. and Martin J. H.: Speech and Language Processing, An
    Introduction to Natural Language Processing, Computational Linguistic,
    and Speech Recognition. Prentice-Hall 2000

14. Language Reference http://www.darpa.mil/ito/psum2000/h165-0.html.
    Accessed 14/7-2001.

15. Luger, G.F. and Stubblefield, W.A.: Artificial Intelligence. Structures and
    Strategies for Complex Problem Solving. Third Edition. Addison-Wesley
    1999.

16. Manas Tungare – Natural Language Processing
    http://www.manastungare.com/articles/nlp/natural-language-
    processing.asp. Accessed 30/07/01

17. Manning, C.D. and Schutze, H.: Foundations of Statistical Natural
    Language Processing. MIT Press 1999.

18. Natural-Language Database Interfaces from ELF Software Co
    http://www.elfsoft.com/ns/FAQ.htm -. Accessed 19/7 – 2001.

19. Palmer, M. and Finin, T.: Workshop on the Evaluation of Natural
    Language Processing Systems. Computational Linguistics, vol. 16, pp.
    175-181. MIT Press 1990.

20. Penn Treebank Project http://www.cis.upenn.edu/~treebank/. Accessed
    10/7 – 2001.

21. Reis, P., Matias, J., Mamede, N.: Edite – A Natural Language Interface to
    Databases, A new dimension for an old approach.
    http://digitais.ist.utl.pt/cstc/le/Papers/CSTCLE-12.PDF

22. Sharoff, S. and Zhigalov, V.: Register-domain separation as a
    Methodology for Development of Natural Language Interfaces to
    Databases. Proceedings of the IFIP TC.13 International Conference on
    Human-Computer Interaction. International Federation for Information
    Processing 1999.




                                     34
23. Tang, R. L.: Integrating Statistical and Relational Learning for Semantic
    Parsing: Applications to Learning Natural Language Interfaces for
    Databases. University of Texas May 2000.




                                     35
9   CONTRIBUTIONS
The respective chapters were produced by the following group members:

Chapter 1: Jun
Chapter 2: Hakan
Chapter 3: Aris and Hakan
Chapter 4: Aris
Chapter 5: All
Chapter 6: Hakan
Chapter 7: Hakan and Jun
Bibliography and report compilation: Aris
Appendices: Hakan




                                      36
APPENDIX A
     Evaluating Systems
       Introduction
How good is a natural language database interface? The answer to this question
is hard to define. A survey conducted during the course of this project revealed
the existence of no formal evaluation techniques. As long as this situation
remains, an unambiguous answer to the question will elude all stakeholders in
this area.
       Why is there a need?
The need for formal evaluation schemes in this field, as in any other arises out of
several stakeholders’ desires:
   • Users want a guide for choosing between systems;
   • Companies want benchmarks for product development and
       improvement;
   • Companies need metrics for proving the capabilities of their products.
       Current Marketing
The companies behind contemporary techniques market their products with
some of the following arguments:
   • Ease of set up and integration with new databases. It is often mentioned
      [6,18] that end users will be relieved of the task of having to learn and
      understand the internal workings of the DataBase Management System
      (DBMS)
   • Money saved on searching
   • Price
   • Ease of integration across different DBMSs (Access, SQL Server, Oracle
      etc.)
   • Accuracy
   • The possibility to perform searches on several data stores simultaneously
       Problems
There have been some attempts to define general formal metrics for natural
language processing systems [19]. In [19], it was concluded that this is a difficult
task for a number of reasons:
   • Systems are built using a variety of techniques;
   • They are used in many different domains, where users’ needs are varying;
   • There is a lack of funding for research in this area.
   • However, it is also concluded that database front ends constitutes one of
       the type of systems where metrics potentially could be developed and
       adopted.
Black box metrics
In [19], a strong distinction is made between black box and glass box metrics. A
black box approach only looks at the output generated by a certain input and
does not take into account the architecture of system, or the efficiency of
individual components.

Advantages
   • It takes the user’s view;
   • It can be applied across platforms, on systems with different
      implementation details;
   • It doesn’t tie to a specific implementation technique;
   • It can be used over time, regardless of trends in database and
      programming methodologies.
Disadvantages:
   • It doesn’t give a good indication to programmers of what is actually
      wrong;
   • It is badly suited for testing individual components of a system.
           Proposed black box evaluation scheme
The proposed evaluation scheme takes into account several different aspects of
the program in question.

Evaluation can be based on the following characteristics:
           Overall Characteristics
       •   User Friendliness: Is the application easy to understand and use? Are help
           files accessible and explanatory? Are error messages clear?
       •   Portability: Can it be used in conjunction with only a specific database? If
           no, how easy is it to integrate it with other databases?
       •   Speed: How fast are answers extracted?
       •   Fault Tolerance: Can the system recognize off-topic questions (queries on
           information that is not in the database) and give an informative response
           within a reasonable time frame?
       •   Accessibility: Can it be used over the web?
           Vocabulary
Can the system accurately understand the following expressions44:
   • What?
   • Which?
   • How many?
   • How much?
   • Show
   • List
44
     This list is arbitrary and may have to be expanded/contracted.
•   Tell
     •   Count
         Ease of Interaction
     •   Linguistic Flexibility: How many spelling errors in a word can the system
         tolerate and understand? Can it suggest alternative spellings45?
     •   Probing questions: Are “follow-up” questions (questions referring to the
         previous answer) allowed?
     •   Can the system adjust for bad grammar and still understand the question?
         Accuracy based on input complexity
The system is asked a number of different questions. These questions are ranked
as simple, average or complex. The accuracy (percentage of questions answered
correctly) in each of the three categories is noted.

The evaluation scheme formed the basis of the market tests of chapter 5.
However, because of the small sample size of tested applications, no attempt to
formalize the scheme or develop a metric based on it was made.




45
  For an example of this capability, please try a search on http://www.google.com with a word
containing a slight spelling error, e.g. elpheants.
APPENDIX B
     Test Protocol
The questions asked, their respective classifications, and the outcomes for the
tested programs are presented in table 4. In the classification column, S stands for
Simple, A for Average, and C for Complex.


                               Table 4. Test Protocol.

    Question         Class   Microsoft English       Elfsoft outcome        Comments
                              Query outcome
Who is the oldest     S          Correct                  Correct      English Query gave
  employee?                                                             the oldest person,
                                                                       Elfsoft the one who
                                                                         had worked the
                                                                            longest at
                                                                           Northwind.
 Which supplier       C          Correct                  Correct
    (currently)
supplies the most
 products (which
       are not
  discontinued)?
Which employee        A        No answer                  Correct       Elfsoft gave too
 has handled the                                                       much information
   most orders?
 What product is      S          Correct                 No answer
      the most
    frequently
     ordered?
 List the country     A        No answer                  Correct
     that has a
   supplier that
    ships tofu.
 Name the third       S        No answer                 No answer
   most ordered
      product.
 What is the least    S           Wrong                  No answer
ordered product?
   How much is        S          Correct                 No answer
   1kg of Queso
     Cabrales?
Question        Class   Microsoft English   Elfsoft outcome       Comments
                             Query outcome
 How much tofu       A        No answer            Correct         Elfsoft gave too
     have been                                                    much information
      ordered?
 Show the phone      S          Correct            Correct
number of united
      package.
    Tell me the      S          Correct          No answer
   names of the
         sales
 representatives
 Tell me the age     A          Correct          No answer
 of these people.
And their phone      A          Correct            Correct
     numbers?
     Count the       S          Correct            Correct
   customers in
     Germany.
    What is the      A          Correct            Wrong
  average age of
 the employees?
     Name the        A          Correct          No answer
 employees that
  are older than
       average
Give the name of     S          Correct          No answer
       the sales
      manager.
Where is Around      S          Correct          No answer
 the Horn from?
    What is the      A        No answer            Wrong
  median of the
      age of the
    employees?
List the names of    S        No answer            Wrong
     the people
       working
 currently in the
     company.
   Who is older      S          Correct          No answer
    than Janet?
Question         Class   Microsoft English   Elfsoft outcome       Comments
                              Query outcome
What can you tell     S         Too little        No answer
 me about Ernst               information
     Handel?
 Which supplier       C          Correct            Correct
supplies tofu but
not longlife tofu?
  What are the        C        No answer            Wrong
 contact names
   and phone
   numbers of
 customers that
  have received
  products sent
  with Federal
    Shipping?
  What are the        A          Correct            Correct        Microsoft English
  products that                                                     Query had the
federal shipping                                                        wrong
      ships                                                         interpretation.
What customers        A        No answer            Wrong
 received these
   shipments?

Weitere ähnliche Inhalte

Was ist angesagt?

Data Export 2010 for MySQL
Data Export 2010 for MySQLData Export 2010 for MySQL
Data Export 2010 for MySQLwebhostingguy
 
Xi3 ds administrators_guide_en
Xi3 ds administrators_guide_enXi3 ds administrators_guide_en
Xi3 ds administrators_guide_enSarat Reddy
 
Parallels Plesk Panel 9 Client's Guide
Parallels Plesk Panel 9 Client's GuideParallels Plesk Panel 9 Client's Guide
Parallels Plesk Panel 9 Client's Guidewebhostingguy
 
Rafal_Malanij_MSc_Dissertation
Rafal_Malanij_MSc_DissertationRafal_Malanij_MSc_Dissertation
Rafal_Malanij_MSc_DissertationRafał Małanij
 
html-css-bootstrap-javascript-and-jquery
html-css-bootstrap-javascript-and-jqueryhtml-css-bootstrap-javascript-and-jquery
html-css-bootstrap-javascript-and-jqueryMD. NURUL ISLAM
 
Ibm watson analytics
Ibm watson analyticsIbm watson analytics
Ibm watson analyticsLeon Henry
 
Excel fundamentals-manual
Excel fundamentals-manualExcel fundamentals-manual
Excel fundamentals-manualRAMAVATHSRINU3
 
Codendi 4.0 User Guide
Codendi 4.0 User GuideCodendi 4.0 User Guide
Codendi 4.0 User GuideCodendi
 
Outlook 2007 Tips and Tricks
Outlook 2007 Tips and TricksOutlook 2007 Tips and Tricks
Outlook 2007 Tips and Tricksoutlookbill
 
Plesk Sitebuilder 4.5 for Linux/Unix Wizard User's Guide
Plesk Sitebuilder 4.5 for Linux/Unix Wizard User's GuidePlesk Sitebuilder 4.5 for Linux/Unix Wizard User's Guide
Plesk Sitebuilder 4.5 for Linux/Unix Wizard User's Guidewebhostingguy
 
Protel 99 se_traning_manual_pcb_design
Protel 99 se_traning_manual_pcb_designProtel 99 se_traning_manual_pcb_design
Protel 99 se_traning_manual_pcb_designhoat6061
 
Lync Product Guide from Microsoft and Atidan
Lync Product Guide from Microsoft and AtidanLync Product Guide from Microsoft and Atidan
Lync Product Guide from Microsoft and AtidanDavid J Rosenthal
 
The MySQL Cluster API Developer Guide
The MySQL Cluster API Developer GuideThe MySQL Cluster API Developer Guide
The MySQL Cluster API Developer Guidewebhostingguy
 
Parallels Plesk Panel 9 Reseller's Guide
Parallels Plesk Panel 9 Reseller's GuideParallels Plesk Panel 9 Reseller's Guide
Parallels Plesk Panel 9 Reseller's Guidewebhostingguy
 

Was ist angesagt? (20)

Data Export 2010 for MySQL
Data Export 2010 for MySQLData Export 2010 for MySQL
Data Export 2010 for MySQL
 
Xi3 ds administrators_guide_en
Xi3 ds administrators_guide_enXi3 ds administrators_guide_en
Xi3 ds administrators_guide_en
 
Parallels Plesk Panel 9 Client's Guide
Parallels Plesk Panel 9 Client's GuideParallels Plesk Panel 9 Client's Guide
Parallels Plesk Panel 9 Client's Guide
 
c
cc
c
 
Rafal_Malanij_MSc_Dissertation
Rafal_Malanij_MSc_DissertationRafal_Malanij_MSc_Dissertation
Rafal_Malanij_MSc_Dissertation
 
PSA user manual
PSA user manualPSA user manual
PSA user manual
 
Tommy Marker
Tommy MarkerTommy Marker
Tommy Marker
 
html-css-bootstrap-javascript-and-jquery
html-css-bootstrap-javascript-and-jqueryhtml-css-bootstrap-javascript-and-jquery
html-css-bootstrap-javascript-and-jquery
 
IBM Watson Content Analytics Redbook
IBM Watson Content Analytics RedbookIBM Watson Content Analytics Redbook
IBM Watson Content Analytics Redbook
 
Ibm watson analytics
Ibm watson analyticsIbm watson analytics
Ibm watson analytics
 
0931 excel-fundamentals
0931 excel-fundamentals0931 excel-fundamentals
0931 excel-fundamentals
 
Excel fundamentals-manual
Excel fundamentals-manualExcel fundamentals-manual
Excel fundamentals-manual
 
Codendi 4.0 User Guide
Codendi 4.0 User GuideCodendi 4.0 User Guide
Codendi 4.0 User Guide
 
Outlook 2007 Tips and Tricks
Outlook 2007 Tips and TricksOutlook 2007 Tips and Tricks
Outlook 2007 Tips and Tricks
 
Plesk Sitebuilder 4.5 for Linux/Unix Wizard User's Guide
Plesk Sitebuilder 4.5 for Linux/Unix Wizard User's GuidePlesk Sitebuilder 4.5 for Linux/Unix Wizard User's Guide
Plesk Sitebuilder 4.5 for Linux/Unix Wizard User's Guide
 
Protel 99 se_traning_manual_pcb_design
Protel 99 se_traning_manual_pcb_designProtel 99 se_traning_manual_pcb_design
Protel 99 se_traning_manual_pcb_design
 
Lync Product Guide from Microsoft and Atidan
Lync Product Guide from Microsoft and AtidanLync Product Guide from Microsoft and Atidan
Lync Product Guide from Microsoft and Atidan
 
The MySQL Cluster API Developer Guide
The MySQL Cluster API Developer GuideThe MySQL Cluster API Developer Guide
The MySQL Cluster API Developer Guide
 
User manual
User manualUser manual
User manual
 
Parallels Plesk Panel 9 Reseller's Guide
Parallels Plesk Panel 9 Reseller's GuideParallels Plesk Panel 9 Reseller's Guide
Parallels Plesk Panel 9 Reseller's Guide
 

Ähnlich wie report.doc

Specification of the Linked Media Layer
Specification of the Linked Media LayerSpecification of the Linked Media Layer
Specification of the Linked Media LayerLinkedTV
 
Report on e-Notice App (An Android Application)
Report on e-Notice App (An Android Application)Report on e-Notice App (An Android Application)
Report on e-Notice App (An Android Application)Priyanka Kapoor
 
Capturing Knowledge Of User Preferences With Recommender Systems
Capturing Knowledge Of User Preferences With Recommender SystemsCapturing Knowledge Of User Preferences With Recommender Systems
Capturing Knowledge Of User Preferences With Recommender SystemsMegaVjohnson
 
Soa In The Real World
Soa In The Real WorldSoa In The Real World
Soa In The Real Worldssiliveri
 
PostgreSQL 14 Beta 1 New Features with Examples (English)
PostgreSQL 14 Beta 1 New Features with Examples (English)PostgreSQL 14 Beta 1 New Features with Examples (English)
PostgreSQL 14 Beta 1 New Features with Examples (English)Noriyoshi Shinoda
 
Tivoli business systems manager v2.1 end to-end business impact management sg...
Tivoli business systems manager v2.1 end to-end business impact management sg...Tivoli business systems manager v2.1 end to-end business impact management sg...
Tivoli business systems manager v2.1 end to-end business impact management sg...Banking at Ho Chi Minh city
 
First7124911 visual-cpp-and-mfc-programming
First7124911 visual-cpp-and-mfc-programmingFirst7124911 visual-cpp-and-mfc-programming
First7124911 visual-cpp-and-mfc-programmingxmeszeus
 
MXIE Phone User's Manual
MXIE Phone User's ManualMXIE Phone User's Manual
MXIE Phone User's ManualMatthew Rathbun
 
Fundamentals of HDL (first 4 chapters only) - Godse
Fundamentals of HDL (first 4 chapters only) - GodseFundamentals of HDL (first 4 chapters only) - Godse
Fundamentals of HDL (first 4 chapters only) - GodseHammam
 

Ähnlich wie report.doc (20)

Specification of the Linked Media Layer
Specification of the Linked Media LayerSpecification of the Linked Media Layer
Specification of the Linked Media Layer
 
Report on e-Notice App (An Android Application)
Report on e-Notice App (An Android Application)Report on e-Notice App (An Android Application)
Report on e-Notice App (An Android Application)
 
Capturing Knowledge Of User Preferences With Recommender Systems
Capturing Knowledge Of User Preferences With Recommender SystemsCapturing Knowledge Of User Preferences With Recommender Systems
Capturing Knowledge Of User Preferences With Recommender Systems
 
Fraser_William
Fraser_WilliamFraser_William
Fraser_William
 
Expert_Programming_manual.pdf
Expert_Programming_manual.pdfExpert_Programming_manual.pdf
Expert_Programming_manual.pdf
 
Programming
ProgrammingProgramming
Programming
 
Liebman_Thesis.pdf
Liebman_Thesis.pdfLiebman_Thesis.pdf
Liebman_Thesis.pdf
 
Soa In The Real World
Soa In The Real WorldSoa In The Real World
Soa In The Real World
 
PostgreSQL 14 Beta 1 New Features with Examples (English)
PostgreSQL 14 Beta 1 New Features with Examples (English)PostgreSQL 14 Beta 1 New Features with Examples (English)
PostgreSQL 14 Beta 1 New Features with Examples (English)
 
sg248293
sg248293sg248293
sg248293
 
Software guide 3.20.0
Software guide 3.20.0Software guide 3.20.0
Software guide 3.20.0
 
Tivoli business systems manager v2.1 end to-end business impact management sg...
Tivoli business systems manager v2.1 end to-end business impact management sg...Tivoli business systems manager v2.1 end to-end business impact management sg...
Tivoli business systems manager v2.1 end to-end business impact management sg...
 
Tools Users Guide
Tools Users GuideTools Users Guide
Tools Users Guide
 
dissertation
dissertationdissertation
dissertation
 
First7124911 visual-cpp-and-mfc-programming
First7124911 visual-cpp-and-mfc-programmingFirst7124911 visual-cpp-and-mfc-programming
First7124911 visual-cpp-and-mfc-programming
 
MXIE Phone User's Manual
MXIE Phone User's ManualMXIE Phone User's Manual
MXIE Phone User's Manual
 
Lesson 1...Guide
Lesson 1...GuideLesson 1...Guide
Lesson 1...Guide
 
Fundamentals of HDL (first 4 chapters only) - Godse
Fundamentals of HDL (first 4 chapters only) - GodseFundamentals of HDL (first 4 chapters only) - Godse
Fundamentals of HDL (first 4 chapters only) - Godse
 
By d ui_styleguide_2012_fp35
By d ui_styleguide_2012_fp35By d ui_styleguide_2012_fp35
By d ui_styleguide_2012_fp35
 
test6
test6test6
test6
 

Mehr von butest

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEbutest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jacksonbutest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer IIbutest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.docbutest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1butest
 
Facebook
Facebook Facebook
Facebook butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTbutest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docbutest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docbutest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.docbutest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!butest
 

Mehr von butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
 
PPT
PPTPPT
PPT
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

report.doc

  • 1. Table of Contents 1 INTRODUCTION..........................................................................................................3 2 HISTORY.........................................................................................................................6 2.1 Early systems...........................................................................................................6 3 NATURAL LANGUAGE PARSING...........................................................................7 3.1 Rule-Based Syntactic Parsing ...............................................................................7 3.2 Terminal Symbols...................................................................................................7 3.3 Non-terminal symbols............................................................................................7 3.4 Production Rules.....................................................................................................7 3.4.1 Grammar...........................................................................................................7 3.4.2 Parse tree...........................................................................................................8 3.4.2.1 Top down..................................................................................................8 3.4.2.2 Bottom up ..................................................................................................9 3.5 Probabilistic Parsing.............................................................................................10 3.5.1 Disambiguation..............................................................................................11 3.5.2 Training...........................................................................................................11 3.5.2.1 Treebank...................................................................................................12 3.5.2.2 Incremental learning...............................................................................12 3.6 Semantic Parsing...................................................................................................13 3.6.1 Semantic Data Models...................................................................................13 3.6.2 Case Based Reasoning...................................................................................14 3.6.3 Semantic Representation...............................................................................15 3.6.4 Actions of the Parser......................................................................................15 4 NLIDB ARCHITECTURE...........................................................................................17 4.1 Pattern-matching systems....................................................................................17 4.2 Parsing based systems..........................................................................................17 4.2.1 Semantic grammar based parsing...............................................................18 4.2.2 Translation......................................................................................................19 5 MARKET TEST.............................................................................................................23 5.1 Goals.......................................................................................................................23 5.2 Tests........................................................................................................................23 5.3 Results.....................................................................................................................23 5.3.1 Impressions.....................................................................................................23 5.3.1.1 Microsoft English Query........................................................................23 5.3.1.2 Elfsoft........................................................................................................24 5.3.2 Query results..................................................................................................25 6 FUTURE........................................................................................................................27 6.1 Language challenges............................................................................................27 6.2 Portability challenges...........................................................................................27 6.3 Competing systems...............................................................................................27 6.4 Possible avenues....................................................................................................27 1
  • 2. 6.4.1 Adaptation techniques..................................................................................28 6.4.2 Speech-based techniques..............................................................................28 6.4.3 Learning algorithms......................................................................................28 6.4.3.1 User Dialogue..........................................................................................28 6.4.3.2 Neural Networks.....................................................................................29 6.4.3.3 Genetic Algorithms.................................................................................29 7 CONCLUSIONS...........................................................................................................30 8 BIBLIOGRAPHY..........................................................................................................33 9 CONTRIBUTIONS.......................................................................................................36 APPENDIX A..................................................................................................................37 Evaluating Systems....................................................................................................37 Introduction............................................................................................................37 Why is there a need?..............................................................................................37 Current Marketing.................................................................................................37 Problems .................................................................................................................37 Black box metrics...................................................................................................38 Proposed black box evaluation scheme..............................................................38 Overall Characteristics..........................................................................................38 Vocabulary..............................................................................................................38 Ease of Interaction..................................................................................................39 Accuracy based on input complexity..................................................................39 APPENDIX B..................................................................................................................40 Test Protocol...............................................................................................................40 2
  • 3. 1 INTRODUCTION The ability to exercise language to convey different thoughts and feelings differentiates human beings from animals. The definition of Natural Language Processing is the capability of a machine to understand the full context of human language about a particular topic so that the unspecified guess and general knowledge can be understood. “Thus if the machine is able to achieve this, it has come close to the notion of artificial intelligence itself”1. One may find interacting with a foreign person with no knowledge of English intricate and frustrating. Thus, a translator will have to come into the picture to allow one to communicate with the foreigner. Companies have related this problem to extracting data from a database management system (DBMS) such as MS Access, Oracle and others. A person with no knowledge of Structured Query Language (SQL) may find himself or herself handicapped in corresponding with the database. Therefore, companies like Microsoft and Elfsoft (English Language Frontend Software) have analysed the abilities of Natural Language Processing to develop products for people to interact with the database in simple English. This enables a user to simply enter queries in English to the Natural Language database interface. This kind of application is known as a Natural Language Interface to a DataBase (NLIDB). The system works by utilizing the use of syntactic knowledge and the knowledge it has been provided about the relevant database.2 Hence, it is able to implement the natural language input to the structure, scope and contents of the database. The program translates the whole query into the standard query language to extract the relevant information from the database. Thus, these products have created a revolution in extracting information from databases. They have discarded the fuss of learning SQL and time is also saved in learning this query language. This report will look at the performance of each database interface connected to a standard database. The Northwind database has been chosen as the default database to work on. There are several companies that are offering such products in the market. Our group has found several of them, which include English Query, Elfsoft, EasyAsk and NLBean created by Mr Mark Watson. We have requested for these companies for their permissions to test their products in regards to our research. We received positive responses from Elfsoft and NLBean, but had to settle for tests on Microsoft English Query and Elfsoft only. We have also contacted EasyAsk via email but the company has provided minimal assistance in our research. 1 Manas Tungare 2 Manas Tungare 3
  • 4. In order to produce accurate conclusions on the different interpretations of each software, we have listed out over thirty questions to test the products. Each product will be asked the same questions in the same order. The questions have been carefully planned to test the pros and the cons of each product. These questions include: • Listing the specific columns and rows • Counting • Calculations • Cross referencing from more than one tables • Ordinal positions • Followed-ups • Conclusions • Semantics • Grammar mistakes • Spelling mistakes • Out-of-context questions There are three components in a natural language dialog system: analysis, evaluation and generation.3 The analysis component translates the query as entered by the user into a semantic representation which is transcripted in the knowledge representation language. There may be several communication sessions between the natural language access system and user interface system to the user in order to carry out the action to derive the result. The evaluation component allows information to be absorbed by the dialog system when queries have to be satisfied or the system needs to alert the user about any major state changes. The generation component gathers the intended information that the user wants to see as provided in the query. This component will generate text, graphs, query or any other responses according to the situational context of the query.4 The knowledge-based database assistant (KDA) as stated, is a practical development of an intelligent database front-end to assist novice users in retrieving desirable information from an unfamiliar database system.5 This component exists in both Microsoft English Query and Elfsoft. Thus, this useful program directs the novice user to get the relevant results by entering the accurate query or by prompting the user when insufficient information is entered to get the appropriate answer. This component can be seen in the later part in this report in both programs. 3 Dialog-Oriented Use of Natural Language 4 Dialog-Oriented Use of Natural Language 5 Manas Tungare 4
  • 5. In addition, “the KDA's responding functionality, which could change the user's knowledge state, is called query guidance”.6 It can detect a user’s scope of knowledge about the relevant database by studying the query entered by the user. If it sensed that the user has limited awareness about the database and could not retrieve his or her desired answer, the query guidance will jump into action and provide similar queries to allow the user to gather the appropriate facts from the database or present the most relevant query to the user based on the user’s perceived intention. Such a component allows the novice to get familiar with the database fast and enables the user to learn about the scope of the database based on the prompt messages and the queries generated from the KDA without the expense of learning those mass databases stored in most organizations. 6 Manas Tungare 5
  • 6. 2 HISTORY As the use of databases for data storage spread during the 1970’s, the user interface to these systems represented a burden for designers worldwide. At this point, both the relational database model and the SQL interface language were yet to be developed, which means that the task of inserting and querying data was tedious and difficult. It was therefore a logical step for programmers to attempt to develop more user- friendly and “human” interfaces to the databases. One of these approaches was the use of natural language processing, where the user interactively would be allowed to interrogate the stored information. 2.1 Early systems The most well-known historical natural language database interface systems are: • LUNAR, interfacing a database with information on rocks collected during American moon expeditions. It was originally published in 1972. When evaluated in 1977, it answered 78 % of questions correctly. Based on syntactic parsing, it tended to build several parse trees for the same query, and was deemed as inefficient7 and too domain-specific and inflexible. • LADDER, the first semantic grammar-based system, interfacing a database with information on US Navy ships. • CHAT-80, probably the most famous example. It interfaced a database of world geography facts. The entire application (both the database and the user interface) was developed in Prolog. As the source code was freely distributed, it is still used and cited. An online version can be found at8. 7 Hafner, C. D. and Gooden, K. pp 141-164 8 ECL I Vertiefung: natürlichsprachliche Zugangssysteme: chat80 6
  • 7. 3 NATURAL LANGUAGE PARSING 3.1 Rule-Based Syntactic Parsing Syntax means ways that words can fit together to form higher-level units such as phrases, clauses, and sentences. Therefore syntactically driven parsing means interpretations of larger groups of words are built up out of the interpretation of their syntactic constituent words or phrases. In a way this is the opposite of pattern matching as here the interpretation of the input is done as a whole. Syntactic analyses are obtained by application of a grammar that determines what sentences are legal in the language that is being parsed. Syntactic parsing operates through the translation of the natural language query into a parse tree, which is then converted to a SQL query. There are a number of fundamental concepts in the theory of syntactic parsing. 3.2 Terminal Symbols A terminal symbol is the basic building block of the language, i.e. words and delimiters. Together, the set of terminal symbols form the “dictionary of words”9 recognised by the system, i.e. the range of the vocabulary that it can read and interpret. 3.3 Non-terminal symbols Non-terminal symbols are higher-level language terms describing concepts and connections in the syntax of the language. Examples of non-terminal symbols include1 sentence, noun phrase, verb phrase, noun, and verb. 3.4 Production Rules As the query is analysed, a number of production rules fires to identify and classify the context of the read word. In analogy with a production system (such as the one used in PROLOG), a production rule in a context-free grammar10 converts a left-hand non-terminal symbol to a sequence of symbols, which can be either terminal or non-terminal. Examples of production rules: • Sentence := Noun phrase verb phrase • Verb phrase := verb These rules are also commonly referred to as rewrite rules. 3.4.1 Grammar The combination of the set of terminal symbols, set of non-terminal symbols, the production rules and an assigned start symbol (the highest-level construct in the system, usually sentence) form the grammar of the syntax. The role of the grammar is to define: • What category each word belongs to; 9 Luger, G.F. and Stubblefield, W.A. This paper will be constricted to the treatment of context-free grammars and not deal with the 10 more complex set of syntaxes known as context-sensitive. 7
  • 8. What expressions are legal and syntactically correct; • How sentences are generated. 3.4.2 Parse tree The system analyses the sentence by reading the non-terminal symbols in order and identifying what production rule to fire. As it does so, it gradually builds a representation of the sentence referred to as a parse tree. The term has been coined from the tree-like graph that is produced, where the root is the top-level symbol (e.g. sentence), the children of each node are the right-hand non-terminal symbols and the leaves are the terminal symbols (the words). The parse tree can be built in two fundamentally different ways. 3.4.2.1 Top down A top down parser starts at the root and gradually builds the tree downwards by matching the read terminal symbols with symbols on the right-hand side of possible production rules. Terminal or non-terminal symbols on the right hand side are added at the level below the current symbol. This is similar to the goal-driven approach of a production system. The basic architecture of a top down parser is illustrated in figure 1. 8
  • 9. Figure 1 Top down parsing of the sentence "the girl forgot the boy"11 In many situations, the first token alone does not provide enough information to make the decision on what production rule should be fired. In order to overcome this, there are two basic methods. 3.4.2.1.1 Recursive Descent The system starts by firing the first production rule of the candidates for which the given terminal symbol could fit and builds the initial sub tree from this information. If this further downwards in the tree results in an inconsistency or syntactic error, it reverts to the point where the decision was made, removes all the nodes on the way back up and selects another of the possible productions. This is a procedure very similar to depth-first searching and backtracking in production systems. 3.4.2.1.2 Look Ahead Look Ahead systems will not be contented by just reading one token. Rather, it reads the number of tokens necessary to identify the given right-hand side beyond any ambiguities before firing any production rules. Grammars are characterised by the maximum number of terminal symbols required to read before all possible conflicts in the choice of production rule can be resolved. If this number is k, the grammar is referred to as an LL (k) grammar12. The look ahead procedure is more in analogy with a breadth-first search technique. 3.4.2.2 Bottom up A bottom up parser, on the other hand, works from the leaf upward by “tagging” the tokens, i.e. starting from the right-hand side of the production rules and associating the read word with its category. When a full right-hand side has been identified, the production rule fires and the left-hand side non- terminal symbol is added as a branch in the level above. This methodology corresponds to the data-driven technique of production systems. The bottom up parsing technique is illustrated in figure 2. 11 Dougherty, R.C. 12 Eriksson, G. 9
  • 10. Figure 2 Bottom up parsing of the sentence "the girl forgot the boy"13 In some cases, the sentence is ambiguous in itself and there are multiple production rules that match a given sentence, in which case the parser has to make a choice between the two potential interpretations. One strategy for dealing with these situations is referred to as probabilistic parsing. 3.5 Probabilistic Parsing Probabilistic parsing takes an empirical approach to the difficult task of disambiguation, i.e. identifying which of several mutually exclusive alternate 13 Dougherty, R.C. 10
  • 11. syntactic parse trees should be generated. For example, consider the sentence “One morning I shot an elephant in my pyjamas”14. There are two possible syntactic parses for this sentence15. One implies that the person was wearing the pyjamas, while the opposing view would claim that the elephant was in the underwear (hence the joke). Although the selection between these two interpretations is obvious to a human, how is this knowledge automated in a computer? One option, used in a.k.a. attribute grammars, is to encode information for each verb as a parameter to each production rule. However, as the dictionary grows, this approach may be too selective and require every different case to be specifically added to the production rules. Probabilistic parsing, on the other hand, works by augmenting the rules with assigned probabilities, representing the chance of the particular expansion (production rule) being the correct one. For example, a probabilistic grammar would introduce the following enhancements to the possible regular syntactic production rules for the expansion of the non-terminal symbol sentence [15]: • Sentence:= Nounphrase Verbphrase, P = 0.8 • Sentence:= Auxiliary Nounphrase Verbphrase, P = 0.15 • Sentence:= Verbphrase, P = 0.05 Note that the probabilities for the expansions of any given non-terminal symbol always add up to 1. 3.5.1 Disambiguation How does probabilistic parsing choose a parse tree from two possible interpretations? In most systems, it simply compares the products of all the probabilities involved in every production required for the competing parses and selects the one representing the highest of these probabilities. 3.5.2 Training One important task concerns how to set the probabilities. There are two fundamentally different techniques for this task [15]. 14 Groucho Marx 15 Jurafsky, D. & Martin, J. 11
  • 12. 3.5.2.1 Treebank A large database of sentences with their correct parses (parsed by knowledgeable humans) is entered into the system. The respective probabilities are then calculated as the relative frequencies of each possible parse. For more details, see [15]. The largest known treebank is known as the Penn Treebank16. The latest version, Treebank 3 contains parses of17: • One million words of 1989 Wall Street Journal material; • A small sample of ATIS-3 transcripts. The Air Travel Information Service is a joint project of DARPA (Defence Advanced Research Projects Agency) and SRI International, handling voice-based queries and requests about flights. More information can be found at18; • A fully parsed tagged version of the Brown Corpus, consisting of one million words from 500 different sources (novels, academic books, newspapers, non-fiction books etc. [15]); • Parsed and tagged text from a set of 560 transcripts of telephone conversation (a.k.a. the Switchboard-1 corpus). • This is a widely used “training set” (in analogy with an artificial neural network) enabling the parser to learn what classes of speech a given word can belong to and how frequently a particular expression is to be interpreted in different ways. 3.5.2.2 Incremental learning The other technique is a “trial and error” method, in which the parsing system much like an artificial neural network learns as it is used. The initial probabilities can be assigned randomly or by the user. After that, the system adjusts these probabilities according to the following rules [15]: • If the sentence was unambiguous, its parse count is increased by 1, i.e. pi: = pi +1; • If the sentence was ambiguous, each of the possible parses have their counts incremented by their respective probabilities, i.e. pi: = pi + P (p i ). The algorithm for this computation is referred to as the Inside - Outside Algorithm. It was originally proposed in19 and is described in detail in20. 16 Penn Treebank Project. 17 Quoted by the LDC office of the University of Pennsylvania in an email dated 10/7-2001. 18 Language Reference 19 Baker, J.K. pp. 547-550. 20 Manning, C.D. and Schutze, H. 12
  • 13. 3.6 Semantic Parsing The syntactic structure of a sentence is not enough to express its meaning. For instance, the noun phrase the catch can have different meanings depending on whether one is talking about a baseball game or a fishing expedition. To talk about different possible readings of the phrase the catch, one therefore has to define each specific sense of the phrase. The representation of the context- independent meaning of a sentence is called its logical form.21 Natural language analysis based on semantic grammar is similar to syntactically driven parsing except that in semantic grammar the categories used are defined semantically. Database items can be ambiguous when the same item is listed under more than one attribute. For example, the term “Mississippi” is ambiguous between being a river name or a state name, in other words, two different logical forms. The two different meanings have to be represented distinctly for an interpretation of a user query. 3.6.1 Semantic Data Models Semantic data models (SDM) are widely researched in the database community. They are closely related to semantic networks used in artificial intelligence, which were originally developed to support natural language processing. Hence, as database management systems they are capable of supporting large amounts of information, while still offering the potential of advanced inferencing capabilities including NLP, machine learning, and query processing. “SDMs can be seen as formalising many of the relationships, expressed in an ad hoc manor in conventional hypermedia systems.”22 SDMs support a variety of formalised links and relationships. An example of a small network on insects is shown in figure 3. The links in this graph express generalisation relationships or "ISA" (beneficial insect IS-A insect), part/whole (Abdomen is part of an Insect), association (Ladybugs eat Aphids), and class/instance (Ladybug is an instance of Beneficial Insect). 23 21 Tang, R. L. p5 22 Beck, H., Mobini, A., Kadambari, V 23 Beck, H., Mobini, A., Kadambari, V 13
  • 14. Figure 3 Semantic Data Model describing insects24 In figure 3, solid lines are ISA relationships, diamonds are part/whole, circles are associations, and Instances are underlined. Since concepts in SDMs are described by structured graphs expressing the relationships among symbols rather than connections between text files as in conventional hypertext, there exists the capability for manipulation of SDMs to produce a number of desirable functions. Foremost is that of search or query processing. [8] Suggests query processing based on graph matching techniques by which the query is expressed as a small semantic network. This query graph is then matched against the larger database graph to find connections. This gives a much more precise search capability than is possible with Boolean keyword searches over text files. 3.6.2 Case Based Reasoning In order to construct an NLP system, one must construct a large dictionary. Much of the recent advances in text understanding systems can be attributed to advances in design and construction of large lexicons. But that presupposes that word meaning is easily represented and a case-based reasoning approach to meaning is used. Words obtain meaning by how they are used. A particular word is used in many different situations and contexts. Each occurrence of the word is treated as one case. Similarities among cases can be observed, and cases with similar usage can be clustered together into categories. When a word is used in a new situation, similar cases are retrieved from the case-based memory in order to apply what happened before to the new context. The meaning of a particular word is established by a large case base, and thus a single word may be "worth 1,000 cases". 25 24 http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/beck/fig1.gif 25 Beck, H., Mobini, A., Kadambari, V. 14
  • 15. 3.6.3 Semantic Representation The most basic constructs of the representation language are the terms used to describe objects in the database and the basic relations between them. Database objects bear relationships to each other or can be related to other objects of interest to a user who is requesting information from it. For instance, in a user query like “What is the capital of Texas?”, the data of interest is a city with certain relationship to a state called Texas, or more precisely its capital. The capital/2 relation, or predicate, is therefore defined to handle questions that require them. Predicates Description city (C) C is a city capital (S,C) C is the capital of S density (S,D) D is the population density of state S loc (X,Y) X is located in Y len (R,L) L is the length of river R next_to (S1,S2) State S1 borders S2 traverse (R,S) River R traverses state S Table 1 Sample of predicates26 3.6.4 Actions of the Parser Using the parser actions in CHILL [8] known as shift-reduce parsing we will discuss the working of the parser. The parser actions are generated from templates given by a logical query. An action template will be instantiated to form a specific parsing action. Recall that the parser also requires a lexicon to interpret meaning of phrases into specific logical forms. Consider the following example27: Sentence: What is the capital of Texas? Logical Query: answer(C,(capital(C,S),const(S,stateid(Texas)))). A very simple lexicon will map ‘capital’ to ‘capital(_,_)’ and ‘Texas’ to ‘const(_,stateid(texas))’. The parser begins with an initial stack and a buffer holding the input sentence, which is the initial parse state. Each predicate on the parse stack has an attached buffer to hold the context in which it was introduced. Words from the input sentence are shifted onto the stack buffer during parsing. The initial parse is as follow: Parse Stack: [answer(_,_):[]] Input Buffer: [what,is,the,capital,of,texas,?] 26 Lappoon R. T. p6 27 Tang, R.L. 15
  • 16. Since the first three words in the input buffer do not map to any logical forms, the next sequence of steps is to push the three words from the input buffer onto the parse stack. The process has the following result: Parse Stack: [answer(_,_):[the,is,what]] Input Buffer: [capital,of,Texas,?] Now, ‘capital’ is at the head of the input buffer and is mapped to ‘capital(_,_)’ in the lexicon. The next action is to push the logical form onto the parse stack. The resulting parse state is as followed: Parse Stack: [capital(_,_):[],answer(_,_):[the,is,what]] Input Buffer: [capital,of,Texas,?] The parser then binds two arguments of two different logical forms to the same variable, resulting in the following parse state: Parse Stack: [capital(C,_):[],answer(C,_):[the,is,what]] Input Buffer: [capital,of,Texas,?] The sequence repeats itself producing a parse state: Parse Stack: [const(S,stateid(Texas)):[?,Texas]capital(C,S):[of,capital],answer(C,_): [the,is,what]] Input Buffer: [] The final step is to take the logical form on the parse stack and put it into one of the arguments of the meta-predicate resulting in: Parse Stack: [answer(C,(capital(C,S), const(S,stateid(Texas)))): [?,Texas,of,capital,the,is,what]] Input Buffer: [] As this is the final parse state, the logical query is then constructed from the parse stack. 16
  • 17. 4 NLIDB ARCHITECTURE 4.1 Pattern-matching systems The first NLIDBs were based on pattern-matching techniques. As a simple illustration of pattern matching technique, consider the following database: Countries_Table Country Capital Language France Paris French Italy Rome Italian … … … Table 2 Sample Database Table28 A primitive pattern-matching system according to [8] may use rules as: Pattern: … ”capital” … <country> Action: Report CAPITAL of row where COUNTRY = <country> Pattern: … “capital” … “country” Action: Report CAPITAL and COUNTRY of each row If the user asked “What is the capital of France?”, using the first pattern rule the system would report “Paris”. The system would also use the same rule to handle questions such as “Print the capital of Italy”, “Could you please tell me what is the capital of France?” etc. Some advantages of this approach are that it requires no complicated parsing or interpretation modules, and that it is easy to implement. But the main advantage of this approach is its simplicity. However the shallowness of this approach often lead to bad failures. An example is when a pattern-matching NLIDB was asked “TITLES OF EMPLOYEES IN LOS ANGELES.” the system reported the state where each employee worked, assuming the “IN” to denote the post code of Indiana, and assumed that the question was about employees and states.29 4.2 Parsing based systems In general as [8] suggests, the system architectures of some NLIDBs can be seen as being made of two major modules. The first module controls the natural language, where a question is submitted and successively transformed. At the end of this process one or more intermediate logical query expressions is obtained. Given the dimension of the domain and the flexibility of the natural language, there usually exist several interpretations of the same question. The 28 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P. p.14 29 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P., pp.14-15 17
  • 18. second component is in charge of the connection with the database, translating the expressions to structured query language (SQL) expressions (using mapping) and sending them to the Data Base Management System (DBMS) to produce the answers.30 For a graphical explanation of the structure, examine Figure 4. Figure 4 NLIDB Architecture31 As described in the previous section, the source language sentence is first parsed, producing a parse tree. The two methods often found of parsing are the syntax based and semantic grammar based. 4.2.1 Semantic grammar based parsing Using this technique, the grammar’s categories do not necessarily correspond to syntactic concepts. Examine the following figure: 30 Reis, P., Matias, J. and Mamede N. p.3-4 31 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P. p.18 18
  • 19. Figure 5 Semantic base parsing tree32 Notice that some categories of the grammar (e.g.: Substance, Magnesium, Specimen_question) do not correspond to syntactic constituents (e.g.: Noun- Phrase, Noun, Sentence). This is because the semantic information about the knowledge domain (e.g.: a question may either refer to specimens or spacecraft) is hared-wired into the semantic grammar.33 Because the semantic grammar approach contains hard-wired knowledge about a specific knowledge domain, it is very difficult to transfer it to other knowledge domain. A new semantic grammar has to be written whenever the NLIDB is configured for a new knowledge domain.34 4.2.2 Translation The translation is usually based on several mapping tables. Figure 6 illustrates this process for both the addition of new information based on an input sentence and the processing of a related query. The query is represented by a small graph, which initiates the mapping to the semantic hierarchy. The small graph is mapped to the semantic network by creating a link from each node in the smaller graph to the corresponding nodes in the network starting with the most general concept (the root) and ending with the most specific. This will create a unique instance, which is the intersection of all of the nodes involved in the query and may be used to narrow down a neighbourhood based on the requested information. 35 The mapping process is bounded by rules, and completely based on the information of the parse tree. As an example of mapping rules, consider the previous query of “which rock contains magnesium” taken from [1]: • The mapping of “which” is for_every X. 32 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P. p.17 33 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P. p.17 34 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P. p.17 35 Beck, H., Mobini, A., Kadambari, V. [online] 19
  • 20. The mapping of “rock” is (is_rock X). • The mapping of an NP is Det’ N’, where Det’ and N’ are the mappings of the determiner and the noun respectively. Thus resulting in for_every X (is_rock X). • The mapping of “contains’ is contains. • The mapping of “magnesium” is magnesium. • The mapping of a VP (V’ X N’). Thus resulting in (contains X magnesium). 20
  • 21. Figure 6 Mapping and Query Processing Model36 Figure 7 demonstrates when the user ask a query on how John spent his leisure time and displays how the answer to the query is produced by exploiting the relationship between "spending leisure time" and "having a chance to go fishing" (both are "doing"). Figure 7 Query processing model37 In many systems the syntax rules linking non-leaf nodes and the semantic rules are domain independent, and can be used in any application domain. The information describing the possible words (leaf nodes) and the logic expressions is domain dependent and has to be declared in the lexicon.38 As an example, consider the lexicon used in MASQUE [8] listing the possible words, “capital”, “capitals”, “border”, “borders”, “bordering”, “bordered”. • The logic expression of “capital”, “capitals” could be capital_of(Capital,Country). • The logic expression of “border”, “borders”, “bordering”, “bordered” could be borders(Country1,Country2). • The logic expression of “country” could be is_country(Country). 36 http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/beck/fig2.gif 37 http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/beck/fig3.gif 38 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P. p.19 21
  • 22. Then the question, “What is the capital of each country bordering Greece?” would be mapped to this query: answer([Capital, Country]):- is_country(Country), borders(Country, Greece), capital_of(Capital, Country). The meaning of the logic query above is to find all pairs [Capital, Country], such that Country is a country, Country borders Greece, and Capital is the capital of Country. The interpreter also needs to consult a world model that describes the structure of the surrounding world as shown by the figure below. Typically, the model contains a hierarchy of classes of world objects, and constraints on the types of arguments each logic predicate may have.39 Figure 8 Hierarchy in world model40 39 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P. p.18-19 40 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P. p.19 22
  • 23. 5 MARKET TEST In order to get a good estimate of the current state of the technology, the applications presented in the previous chapter were subjected to a neutral test. 5.1 Goals The goals of the tests were: • To get a thorough understanding of contemporary market applications; • To get an estimate of the relevance and importance of this type of systems; • To get some insight into what features are more and less important. 5.2 Tests The tests were carried out on the Northwind database, a sample database with information on a shipping company. The database comes as a demo with all distributed copies of Microsoft Access. A number of different queries of different types were posed to the respective natural language front ends. The questions were classified as simple (S), average (A), or complex (C). For a more comprehensive explanation of the considerations behind the testing procedures, see Appendix A. 5.3 Results 5.3.1 Impressions 5.3.1.1 Microsoft English Query English Query is a development environment that enables programmers to produce natural language front ends for SQL 2000 databases. The product is included with SQL 2000. The tests were performed on a demo of English Query, developed by Microsoft to interface with the Northwind database. The user interface has five fields, with the following functionalities: • Query (user input) • Interpretation of query • Required operations • Produced SQL statement • Results A screen shot from one of the queries is presented in Figure 9. 23
  • 24. Figure 9 Microsoft English Query. 5.3.1.2 Elfsoft Elfsoft works together with either VB or Access. Queries are entered in a query window (see Figure 10) and can be output either as database tables (see Figure 11) or in a graphical format. Figure 10 Elfsoft query window. 24
  • 25. Figure 11 Elfsoft answer output. Elfsoft also includes several other options for enhanced portability, including: • Automatic analyser of any Access database • Enabling the user to teach program meanings of phrases • Allowing the user to explain why a query failed (what was missing and/or wrong) • Permitting the user to edit the dictionary • Logging of queries for statistics 5.3.2 Query results The results are summarised in Table 3. A full recollection of the questions asked is presented in Appendix B. Table 3 Accuracy percentages. Type of query English Elfsoft Query Simple 71 23 Average 50 40 25
  • 26. Complex 67 100 26
  • 27. 6 FUTURE During the mid-eighties it was believed that natural language processing systems would become a universal interface to databases worldwide41. However, due to the emergence of graphical interfaces to databases, the relative simplicity of SQL and the inherent problems of natural language processing they have never really caught on commercially42. The current position of NLIDBs is probably best described by “it’s a great idea, but…” Although their usefulness is appreciated, they are still at a research stage. There are several reasons as to why their usage is not taking off on a broader scale. 6.1 Language challenges It is still very hard to encode the vast source, complexity and ambiguity of a human language into a computer. The formalisms for representing language patterns are still not comprehensive enough to capture all the different ways that expressions and terms can be constructed and given meaning depending on the context. 6.2 Portability challenges Although several systems for communication with individual databases have been successfully implemented and used, a general technique, which would allow the user to specify the database and use a system with any database management system (whether it be Access, SQL 2000, Oracle or any type), is still rather elusive. This would require the system to be able to recognize the fields and attributes of the new storage source seamlessly. An even bigger hurdle to portability is the nature and scope of language understanding. Language use in different domains is very dissimilar, which means that any portable system has to have a huge vocabulary with terms from many different application domains and be able to recognize expressions from users of a wide variety of professions. 6.3 Competing systems Graphical and form-based interfaces have become the de facto standard for database front ends. Because of the challenges presented above, these other types of systems are generally possible to develop in shorter time and at a lower cost. 6.4 Possible avenues There is still a lot of research going on in this area. Having explored the application of Natural Language Processing as database interfaces, the authors can see a number of different scenarios. 41 Johnson, T. 42 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P. pp.29-81 27
  • 28. 6.4.1 Adaptation techniques There is a need for methodologies that would enable the user to specify the data source in a general descriptive language and to supply a given set of terms used within the domain. This would make the application portable from database to database. This need has been recognised in [8], where a solution based on the general Resource Description Framework (RDF). The system outlined in [8] learns the pattern and domain vocabulary of any given database automatically and also contains an interface that allows the user to change the database model (classes, properties, tables etc.). 6.4.2 Speech-based techniques Certain authors [8] believe that natural language keyboard interfaces will be superseded by speech recognition systems. However, as such systems are of an even more complex nature, some of the linguistic challenges will have to be solved first. Research on NLIDBs can therefore be a base for the development of voice-based systems [8]. 6.4.3 Learning algorithms Every person has its own vocabulary and way of using language. There is absolutely no way that a program can contain all words in a language or all different meanings that a term may take on. Further, the use of language changes over time, which means that the semantics and vocabulary of a system may become obsolete after a certain time of use. An important challenge for a natural language database front end (or any natural language processing system in general) is to possess an ability to learn, as it is used, evolve with the user and adapt to new users. This ability is after all one of the definitions of artificial intelligence. There are several ways in which this could potentially be achieved. Note that these are suggestions and not based on in-depth research. 6.4.3.1 User Dialogue One way to achieve learning would be to include a lexical editor, where the user could enter language terms and link them to their synonyms. They should also be able to specify the different forms of the word, e.g. noun plurals, adjective comparative forms, verb tenses etc. This ability is present in Elfsoft. 28
  • 29. 6.4.3.2 Neural Networks By use of probabilistic techniques, a system might be able to adjust probabilities of different parses based on training texts and test texts, which have been parsed and tagged by the user or obtained from linguists. By continuously retraining the network with parsed texts from the database-specific domain, the neural network would be able to pick up language patterns and learn incrementally. 6.4.3.3 Genetic Algorithms Another way would be for the system to obtain feedback from the user on the accuracy (e.g. ask the user whether queries were answered correctly) and adjust its language processing structure (production rules) by the use of genetic algorithms. 29
  • 30. 7 CONCLUSIONS The project has focused on two main topics: • The techniques of translating a question in natural language into a database query, extracting the results that the user is looking for; • The leading contemporary applications on the market. The underlying methods belong in the general natural language processing area, while any system has to select among several different techniques involving different degrees of syntactic analysis, semantic processing or a combination. A general feature seems to be the translation of the query in two steps, first to an intermediate language and then to a database query language, e.g. SQL. The topic integrates approaches several other facets of artificial intelligence, e.g. production systems, neural networks, expert systems, and machine learning. Two of the leading commercial software packages were tested with mixed results. Some rather complex queries were handled well, while the systems tended to have problems handling rather easy tasks. The sample sizes involved are too small to base any general conclusions on, however. The reason for this is that the configuration of the university computers at our disposal could not be used for testing the programs. Many companies have overestimated the use of natural language processing in the database interface. Their interpretation of the system is that it is able to understand the significance of the query accurately. However, the system is not able to fully comprehend the human language and jargon unless it has been given the definitions for these terms relating to the relevant database.43 This mainly involves the semantic analysis. A sentence that is syntactically structured may have lead to various meanings, which may not even be similar to one another. Thus, as a result, this will produce undesirable conclusions in the database queries. This is one main reason why many systems tend to fail and explains why most companies would still rather rely on SQL programmers for their database processing. Although these kinds of applications are rather unpopular, the authors enjoyed using them and encourage their future evolvement. From the experiences of the performed tests, systems have the potential to make the task of searching for information a lot less tedious and time-consuming. The eventual success for natural language front ends will depend on how well they can adapt to new environments, both regarding databases and users’ way of using language. Two proposed benchmarks for these types of systems could be: 43 Timo Honkela 30
  • 31. It has to be able to learn and understand the database faster than the user; • It has to learn natural language faster and easier than the user can learn a programming language. 31
  • 32. ACKNOWLEDGEMENTS The authors wish to extend their appreciations to the following people for their support during the course of the project: • Jon Greenblatt, President of English Language Frontend Software Co. • Girish Mohata, Teaching Fellow, IT School, Bond University 32
  • 33. 8 BIBLIOGRAPHY 1. Androutsopoulos, I., Ritchiey G.D., and Thanischz, P.: Natural Language Interfaces to Databases - An Introduction. Journal of Natural Language Engineering, vol. 1, No. 1. Cambridge University Press 1995 2. Baker, J.K.: Trainable grammars for speech recognition, Speech Communication Papers for the 97th Meeting of the Acoustical Society of America, Acoustical Society of America 1979. 3. Beck, H., Mobini, A., Kadambari, V. A Word is Worth 1000 Pictures: Natural Language Access to Digital Libraries. University of Florida. http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/beck/b eckmain.html 4. Dialog-Oriented Use of Natural Language http://www.dfki.uni- sb.de/vitra/papers/ro-man94/node5.html. Accessed on 310701 5. Dougherty, R.C.: Natural Language Computing An English Generative Grammar in Prolog. Erlbaum, Lawrence Associates 1994. 6. EasyAsk - Applications Overview http://www.englishwizard.com/applications/index.cfm -. Accessed 19/7-2001 7. ECL I Vertiefung: natürlichsprachliche Zugangssysteme: chat80. http://www.ifi.unizh.ch/cl/broder/chat/chat80.htm. Accessed 12/7-2001 8. Eriksson, G.: Översättarteknik. KFS AB 1984. 9. Groucho Marx in the movie Animal Cracker. 10. Hafner, C. D. and Gooden, K.: Portability of Syntax and Semantics in Datalog. ACM Transactions on Information Systems, vol. 3. Association for Computing Machinery 1985. 11. Honkela, T., The Www Version Of Self-Organizing Maps In Natural Language Processing of Helsinki University of Technology – viewed on 22/07/01 http://www.cis.hut.fi/~tho/thesis/ 33
  • 34. 12. Johnson, T.: Natural Language Computing: The Commercial Applications. Ovum 1985. 13. Jurafsky, D. and Martin J. H.: Speech and Language Processing, An Introduction to Natural Language Processing, Computational Linguistic, and Speech Recognition. Prentice-Hall 2000 14. Language Reference http://www.darpa.mil/ito/psum2000/h165-0.html. Accessed 14/7-2001. 15. Luger, G.F. and Stubblefield, W.A.: Artificial Intelligence. Structures and Strategies for Complex Problem Solving. Third Edition. Addison-Wesley 1999. 16. Manas Tungare – Natural Language Processing http://www.manastungare.com/articles/nlp/natural-language- processing.asp. Accessed 30/07/01 17. Manning, C.D. and Schutze, H.: Foundations of Statistical Natural Language Processing. MIT Press 1999. 18. Natural-Language Database Interfaces from ELF Software Co http://www.elfsoft.com/ns/FAQ.htm -. Accessed 19/7 – 2001. 19. Palmer, M. and Finin, T.: Workshop on the Evaluation of Natural Language Processing Systems. Computational Linguistics, vol. 16, pp. 175-181. MIT Press 1990. 20. Penn Treebank Project http://www.cis.upenn.edu/~treebank/. Accessed 10/7 – 2001. 21. Reis, P., Matias, J., Mamede, N.: Edite – A Natural Language Interface to Databases, A new dimension for an old approach. http://digitais.ist.utl.pt/cstc/le/Papers/CSTCLE-12.PDF 22. Sharoff, S. and Zhigalov, V.: Register-domain separation as a Methodology for Development of Natural Language Interfaces to Databases. Proceedings of the IFIP TC.13 International Conference on Human-Computer Interaction. International Federation for Information Processing 1999. 34
  • 35. 23. Tang, R. L.: Integrating Statistical and Relational Learning for Semantic Parsing: Applications to Learning Natural Language Interfaces for Databases. University of Texas May 2000. 35
  • 36. 9 CONTRIBUTIONS The respective chapters were produced by the following group members: Chapter 1: Jun Chapter 2: Hakan Chapter 3: Aris and Hakan Chapter 4: Aris Chapter 5: All Chapter 6: Hakan Chapter 7: Hakan and Jun Bibliography and report compilation: Aris Appendices: Hakan 36
  • 37. APPENDIX A Evaluating Systems Introduction How good is a natural language database interface? The answer to this question is hard to define. A survey conducted during the course of this project revealed the existence of no formal evaluation techniques. As long as this situation remains, an unambiguous answer to the question will elude all stakeholders in this area. Why is there a need? The need for formal evaluation schemes in this field, as in any other arises out of several stakeholders’ desires: • Users want a guide for choosing between systems; • Companies want benchmarks for product development and improvement; • Companies need metrics for proving the capabilities of their products. Current Marketing The companies behind contemporary techniques market their products with some of the following arguments: • Ease of set up and integration with new databases. It is often mentioned [6,18] that end users will be relieved of the task of having to learn and understand the internal workings of the DataBase Management System (DBMS) • Money saved on searching • Price • Ease of integration across different DBMSs (Access, SQL Server, Oracle etc.) • Accuracy • The possibility to perform searches on several data stores simultaneously Problems There have been some attempts to define general formal metrics for natural language processing systems [19]. In [19], it was concluded that this is a difficult task for a number of reasons: • Systems are built using a variety of techniques; • They are used in many different domains, where users’ needs are varying; • There is a lack of funding for research in this area. • However, it is also concluded that database front ends constitutes one of the type of systems where metrics potentially could be developed and adopted.
  • 38. Black box metrics In [19], a strong distinction is made between black box and glass box metrics. A black box approach only looks at the output generated by a certain input and does not take into account the architecture of system, or the efficiency of individual components. Advantages • It takes the user’s view; • It can be applied across platforms, on systems with different implementation details; • It doesn’t tie to a specific implementation technique; • It can be used over time, regardless of trends in database and programming methodologies. Disadvantages: • It doesn’t give a good indication to programmers of what is actually wrong; • It is badly suited for testing individual components of a system. Proposed black box evaluation scheme The proposed evaluation scheme takes into account several different aspects of the program in question. Evaluation can be based on the following characteristics: Overall Characteristics • User Friendliness: Is the application easy to understand and use? Are help files accessible and explanatory? Are error messages clear? • Portability: Can it be used in conjunction with only a specific database? If no, how easy is it to integrate it with other databases? • Speed: How fast are answers extracted? • Fault Tolerance: Can the system recognize off-topic questions (queries on information that is not in the database) and give an informative response within a reasonable time frame? • Accessibility: Can it be used over the web? Vocabulary Can the system accurately understand the following expressions44: • What? • Which? • How many? • How much? • Show • List 44 This list is arbitrary and may have to be expanded/contracted.
  • 39. Tell • Count Ease of Interaction • Linguistic Flexibility: How many spelling errors in a word can the system tolerate and understand? Can it suggest alternative spellings45? • Probing questions: Are “follow-up” questions (questions referring to the previous answer) allowed? • Can the system adjust for bad grammar and still understand the question? Accuracy based on input complexity The system is asked a number of different questions. These questions are ranked as simple, average or complex. The accuracy (percentage of questions answered correctly) in each of the three categories is noted. The evaluation scheme formed the basis of the market tests of chapter 5. However, because of the small sample size of tested applications, no attempt to formalize the scheme or develop a metric based on it was made. 45 For an example of this capability, please try a search on http://www.google.com with a word containing a slight spelling error, e.g. elpheants.
  • 40. APPENDIX B Test Protocol The questions asked, their respective classifications, and the outcomes for the tested programs are presented in table 4. In the classification column, S stands for Simple, A for Average, and C for Complex. Table 4. Test Protocol. Question Class Microsoft English Elfsoft outcome Comments Query outcome Who is the oldest S Correct Correct English Query gave employee? the oldest person, Elfsoft the one who had worked the longest at Northwind. Which supplier C Correct Correct (currently) supplies the most products (which are not discontinued)? Which employee A No answer Correct Elfsoft gave too has handled the much information most orders? What product is S Correct No answer the most frequently ordered? List the country A No answer Correct that has a supplier that ships tofu. Name the third S No answer No answer most ordered product. What is the least S Wrong No answer ordered product? How much is S Correct No answer 1kg of Queso Cabrales?
  • 41. Question Class Microsoft English Elfsoft outcome Comments Query outcome How much tofu A No answer Correct Elfsoft gave too have been much information ordered? Show the phone S Correct Correct number of united package. Tell me the S Correct No answer names of the sales representatives Tell me the age A Correct No answer of these people. And their phone A Correct Correct numbers? Count the S Correct Correct customers in Germany. What is the A Correct Wrong average age of the employees? Name the A Correct No answer employees that are older than average Give the name of S Correct No answer the sales manager. Where is Around S Correct No answer the Horn from? What is the A No answer Wrong median of the age of the employees? List the names of S No answer Wrong the people working currently in the company. Who is older S Correct No answer than Janet?
  • 42. Question Class Microsoft English Elfsoft outcome Comments Query outcome What can you tell S Too little No answer me about Ernst information Handel? Which supplier C Correct Correct supplies tofu but not longlife tofu? What are the C No answer Wrong contact names and phone numbers of customers that have received products sent with Federal Shipping? What are the A Correct Correct Microsoft English products that Query had the federal shipping wrong ships interpretation. What customers A No answer Wrong received these shipments?