Script to Sentiment: On the Future of Language Technology
Jaganadh G
jaganadhg@gmail.com
Different Dimensions of Language Technology
Central Institute of Hindi
Mysore
Feb. 25-26 2010
Abstract
Human Language Technology (HLT) is no longer confined to being a
subject for classroom teaching. Revolutionary developments are
occurring in the field of HLT, and they are capable of changing human
life. Information and Communication Technology (ICT) has become an
inevitable component of our day-to-day life. Directly or indirectly,
we are all consumers of ICT-based products. Over the last few years
the ICT revolution has reached our native languages too. As a result,
HLT has become a direct or indirect component of ICT products and
services. HLT is expected to permeate all areas of our life in the
future. Whether you are a doctor, an engineer, a farmer or a lover,
irrespective of your profile, we are all going to become addicted to
HLT-based ICT products. The present paper discusses developments in
the field of HLT and its future.
1 Introduction
Human Language Technology (HLT) is no longer confined to being a subject
for classroom teaching. Revolutionary developments are occurring in the
field of HLT, and they are capable of changing human life. Information and
Communication Technology (ICT) has become an inevitable component of our
day-to-day life. Directly or indirectly, we are all consumers of ICT-based
products. Over the last few years the ICT revolution has reached our native
languages too. As a result, HLT has become a direct or indirect component
of ICT products and services. HLT is expected to permeate all areas of our
life in the future. Whether you are a doctor, an engineer, a farmer or a
lover, irrespective of your profile, we are all going to become addicted to
HLT-based ICT products.
The history of HLT goes back to the early days of digital computing.
From the early 1950s, researchers and scientists have been trying to
develop computer programs that can handle human languages the way a human
does. The earliest Research and Development (R&D) in this field was related
to the development of Machine Translation (MT) systems. As of now, we can
say that significant developments have occurred in the field and working
systems are available. Some are readily accepted; some are imperfect, but
there are no alternatives to them. So we are still not in a position to say,
'Yes! We have cracked the language challenge and are now able to provide
smart engineering solutions.' Path-breaking R&D activities are happening in
this field. In this scenario it is quite interesting to investigate where
we stand in the field of Human Language Technology.
The present paper is a compilation of the developments in the field of
HLT. The paper also discusses some future technologies in HLT. Recent
developments in Indian Language Technology are also discussed, with a
special focus on the issues involved.
2 Where Are We Now?
R&D activities in HLT can be broadly classified into two major categories:
1) Text Processing and 2) Speech Processing. Activities under text
processing range from the development of spell checkers to discourse
analysis systems. Speech processing ranges from text to speech (TTS)
conversion to speech-to-speech translation. For almost all tasks in both
fields, Free and Open Source Software (FOSS)[1] and proprietary solutions
are available. Internet-based solutions such as Google Translate and other
services[2] are also available. FOSS-based as well as public domain
solutions have played a vital role in the rapid development of HLT,
including for Indian Languages. This section is a brief survey of the
present status of R&D activities in the field.
2.1 Language and Scripts in Computers
In the early days of HLT, representing vernacular languages in computers
was a challenge. ASCII[3] was the character encoding scheme[4] that existed
in those days. It was designed to represent the English alphabet and was
not sufficient to represent other languages. Some workarounds were devised
to achieve this. Most of these workarounds were purely font[5] based
solutions. In India, a solution called ISCII[6] was developed for
representing Indian Languages. The introduction of Unicode[7] was a
remarkable development in this field. Unicode made the task of representing
vernacular languages in computers much easier, and it became the de facto
standard. Suitable font[8] technology was developed alongside it.
[1] http://en.wikipedia.org/wiki/Free_and_open_source_software
[2] http://translate.google.com/#, www.google.com/transliterate/,
    www.google.com/dictionary, etc.
[3] http://en.wikipedia.org/wiki/ASCII (accessed 01-01-2010)
[4] http://en.wikipedia.org/wiki/Character_encoding (accessed 01-01-2010)
The arrival of the Unicode standard boosted the penetration of local
language content on the Internet. All living languages that received
encoding space in Unicode got the opportunity to establish themselves in
the Information Technology (IT) world. This led to an information overflow.
As a result, there is an increasing demand for information processing
tools, from keyboard drivers to search engines and decision support
systems[9].
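
The difference between ASCII and Unicode described above can be illustrated with a minimal Python sketch. The sample word and the helper name `fits_in_ascii` are assumptions chosen for illustration; they do not come from any of the systems cited in this paper.

```python
# Illustrative sample: a Hindi word in Devanagari script, written with
# explicit Unicode escapes (BHA, vowel sign AA, SSA, vowel sign AA).
word = "\u092D\u093E\u0937\u093E"

# Every character has a Unicode code point that is independent of any font.
code_points = [f"U+{ord(ch):04X}" for ch in word]

# UTF-8 is one encoding of Unicode; it stores each of these characters
# in three bytes.
utf8_bytes = word.encode("utf-8")


def fits_in_ascii(text: str) -> bool:
    """Return True if every character lies in the 7-bit ASCII range."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(code_points)
print(len(word), "characters,", len(utf8_bytes), "bytes in UTF-8")
print(fits_in_ascii("language"), fits_in_ascii(word))
```

The English word fits in ASCII while the Devanagari word does not, which is the gap that font-based workarounds tried to cover before Unicode assigned these characters their own code points.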
2.2 Developments in Text Processing
This section is a brief survey of the developments in text processing
technologies. A wide variety of text processing systems is available now,
such as spell checkers, grammar checkers, MT systems and search engines.
People who use computers for preparing documents are familiar with tools
like spell checkers. They know that life is not easy without such tools,
because human beings tend to commit errors and are lazy too! When computers
reached the desks of hard-core language people like translators, they were
interested in electronic dictionaries as well as machine translation. When
computers came into the life of business people, they had different
intentions. But whoever the users may be, and whatever their profile, their
demands were directly or indirectly related to HLT, because everybody uses
language and nobody can live without it. These demands gave rise to new
methodologies and technologies in HLT itself. Those developments are
discussed here.
Spellcheckers
In computing, a spell checker (spell check) is an application program that
flags words in a document that may not be spelled correctly[10]. The
technology is very much advanced now. Spell checkers are available for
almost all languages in the world, and most popular word processing
software has the feature. Spell checker systems are available for Indian
Languages too. The language software collection CDs distributed by the
TDIL[11] program contain spell checker applications for almost all Indian
Languages.
[5] http://en.wikipedia.org/wiki/Font (accessed 01-01-2010)
[6] http://en.wikipedia.org/wiki/Indian_Script_Code_for_Information_Interchange
    (accessed 01-01-2010)
[7] http://unicode.org/, http://en.wikipedia.org/wiki/Unicode
    (accessed 01-01-2010)
[8] http://en.wikipedia.org/wiki/Font (accessed 01-02-10)
[9] http://en.wikipedia.org/wiki/Decision_support_system (accessed 01-02-10)
[10] http://en.wikipedia.org/wiki/Spell_checker
The FOSS movement in India is very active in spell checker dictionary
development for Indian Languages[12]. The FOSS frameworks[13] available for
spell checker systems are widely used by these FOSS communities. The
development of Indian Language spell checker dictionaries needs more
volunteers.
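
The basic idea of a dictionary-driven spell checker can be sketched in a few lines of Python. This is only an illustration: the tiny word list and the function name `check_spelling` are hypothetical, and the suggestion step uses the standard library's generic string similarity rather than the affix rules that frameworks such as Hunspell rely on.

```python
import difflib
import re

# Hypothetical miniature word list; a real checker would load a full
# dictionary file for the language in question.
WORD_LIST = {
    "language", "technology", "speech", "translation",
    "machine", "is", "the", "of", "future",
}


def check_spelling(text, word_list=WORD_LIST):
    """Return {unknown_word: [suggestions]} for words not in word_list."""
    report = {}
    for token in re.findall(r"[a-z]+", text.lower()):
        if token not in word_list:
            # Rank dictionary words by string similarity to the unknown token.
            report[token] = difflib.get_close_matches(
                token, sorted(word_list), n=3, cutoff=0.7
            )
    return report


result = check_spelling(
    "Machine translaton is the future of langauge technology"
)
for wrong, suggestions in result.items():
    print(wrong, "->", suggestions)
```

The quality of such a checker depends almost entirely on the coverage of the word list, which is why dictionary development is where volunteer effort is needed most.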
Machine Translation
MT is one of the oldest and still active tasks in HLT. R&D activities in
this field have been in progress for more than 50 years. Some systems are
available for use, but the majority cannot be considered perfect solutions.
Divergent methodologies are available for MT, such as statistical,
rule-based and hybrid approaches[14]. But fully automated high quality MT
remains a target to be achieved. Among the available MT systems and
services, Google Translate and Babel Fish[15] are the most famous. Google
Translate supports translation from English to Hindi and vice versa.
MT research in Indian Languages (IL) has been very active since the early
1970s. AnglaBharati[16] and Anusaaraka[17] are two major approaches
developed in the early days and still in active development. Other systems
such as Sampark[18] and UNL-based machine translation systems[19] are also
available. The TDIL program of the Govt. of India provides extensive
support to MT research in India. Apart from the above-mentioned systems,
there are some other IL MT initiatives.
Some FOSS-based solutions are also available for MT system development.
There are two famous frameworks called Moses[20] and Apertium[21]. Moses
follows the statistical paradigm of MT, while Apertium is a rule-based
(shallow-transfer) system. MT researchers in India have also come forward
to work with these two frameworks. It is hoped that this will boost MT
research in India too.
[11] www.tdil.mit.gov.in
[12] http://indlinux.org/, http://smc.org.in/,
     http://wiki.services.openoffice.org/wiki/Dictionaries
[13] http://hunspell.sourceforge.net, http://en.wikipedia.org/wiki/MySpell
[14] http://www.hutchinsweb.me.uk/IntroMT-TOC.htm
[15] http://babelfish.yahoo.com/
[16] http://www.cse.iitk.ac.in/users/langtech/anglabharti.htm
[17] http://ltrc.iiit.ac.in/~anusaaraka/
[18] http://sampark.iiit.ac.in/
[19] http://www.springerlink.com/content/t1005w166746727l/fulltext.pdf
[20] www.statmt.org/moses
[21] www.apertium.org
Search and IR
Search Engines (SE) and Information Retrieval (IR) systems are the HLT
tools most widely used by the general public. Google[22], Yahoo[23] and
Bing[24] are the three most famous search engine giants in the world.
Revolutionary developments are occurring in this field. Domain-based search
like 'patent search', content-based search like 'video search', localized
search like 'movie timing' and cross-lingual search are the recent trends.
The latest development in the field is Semantic Search, which is discussed
in a later section of the paper. All the major search engines are now
capable of handling local language search requests too. Cross Lingual
Information Retrieval (CLIR) systems for Indian Languages are in
development.
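
At the core of most keyword search systems is an inverted index that maps each term to the documents containing it. The following is a minimal sketch of that idea, assuming whitespace tokenization and boolean AND queries; the sample documents and the function names are hypothetical, and production search engines add ranking, stemming and language-specific tokenization on top.

```python
from collections import defaultdict


def build_index(documents):
    """Map every lower-cased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical sample collection.
documents = {
    1: "movie timing in Mysore",
    2: "patent search for speech technology",
    3: "video search and movie reviews",
}
index = build_index(documents)
print(search(index, "movie"))
print(search(index, "movie search"))
```

Because Python strings are Unicode, the same index works unchanged for queries written in Indian scripts, although real systems would still need script-aware tokenization and normalization.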
3 Speech Processing
This section gives a brief survey of the developments in Speech Processing.
The main technologies discussed are Text to Speech (TTS) systems and
Automatic Speech Recognition (ASR).
TTS
A Text to Speech (TTS) system is software that converts electronic text to
the corresponding speech. The field involves both text processing and
signal processing techniques. R&D activities in this direction have
produced promising and acceptable solutions. FOSS-based as well as
proprietary solutions are available now. The major FOSS frameworks
available for TTS system development are Festival[25] and Festvox[26]. The
introduction of both frameworks boosted the development of TTS in various
languages, including Indian Languages. The most remarkable development in
Indian Language TTS under FOSS is the Dhvani project[27].
Even though we can say that we have achieved significant growth in TTS
development, more challenges remain. These include providing more
naturalness to the synthesized voice, intonation, and emotion-based TTS.
ASR
Automatic Speech Recognition (ASR) is a technology that allows a computer
program to identify and transcribe the words that a person speaks into a
microphone. Like TTS, ASR involves both text processing and signal
processing techniques. It is one of the most challenging and interesting
tasks in HLT. Significant developments have taken place in this field too.
ASR systems are available for some Indian Languages like Hindi[28] and
Telugu. The most widely used FOSS framework for ASR development is CMU
Sphinx[29]. The introduction of CMU Sphinx opened a new direction in the
R&D of ASR. Apart from CMU Sphinx, some other FOSS as well as proprietary
frameworks are available for ASR development.
[22] http://www.google.co.in/
[23] www.yahoo.co.in
[24] http://www.bing.com/
[25] http://www.cstr.ed.ac.uk/projects/festival/
[26] http://festvox.org/festival/
[27] http://dhvani.sourceforge.net/
4 Future of HLT
Over the past few decades colossal progress has been made in the field of
HLT. Systems ranging from simple ones that can understand numbers to text
understanding and summarization systems were developed within this period.
Many challenges remain to be addressed in the future. Hopefully we can
build complex systems from the existing HLT systems. These developments are
the result of a long journey from lab experiments to deployment in
real-world work environments. The wide range of tools and technologies
developed as part of R&D in HLT is capable of making a deep impact on human
life. These tools have great relevance and impact in a market-oriented
society.
What will the future be? Can we imagine it? Yes! Imagine that you ask your
car to show the route to the Central Institute of Hindi from the Mysore bus
stand, and it tells you the directions or gives a detailed printout
describing the route. In fact, it is not a dream technology. It is possible
by combining Speech Processing with other technologies like GPS (Global
Positioning System). Suppose that a judge analyzes the arguments related to
a case with software and reaches a judgment. Or consider a legislative
assembly that publishes some draft bills on its website and receives
comments on them. After receiving the comments and before proceeding to
further action, it analyzes them to find how many are positive and how many
are negative! It is already possible. The technology that analyzes opinion
is called 'Sentiment Analysis'. There is no end to imagination, but these
imaginations will become reality very soon. This section highlights some of
the future technologies and R&D areas in HLT.
Semantic Web/Search
Semantics is a branch of modern linguistics that studies the structure of
meaning. The Semantic Web (SeW) is an evolving development of the World
Wide Web in which the meaning (semantics) of information and services on
the web is defined, making it possible for the web to "understand" and
satisfy the requests of people and machines to use the web content[30].
Tim Berners-Lee, the father of the WWW[31], is the inventor of this
technology. The World Wide Web Consortium (W3C) is the authority that
publishes and maintains standards and recommendations on SeW. Semantic web
based HLT implementations are going to bring a big revolution in the coming
years. Semantic Search is one such technology that HLT people are
discussing nowadays. SeW search engines already exist[32], but they are not
widely accepted as of now. Semantic Search will bring revolutionary changes
in the fields of online publishing, e-governance and e-commerce.
[28] http://sourceforge.net/projects/hindiasr/
[29] http://www.speech.cs.cmu.edu/sphinx/
Sentiment Analysis
Sentiment analysis, or opinion mining, refers to a broad (definitionally
challenged) area of natural language processing, computational linguistics
and text mining[33]. The basic task in sentiment analysis is classifying
the polarity of a given text at the document, sentence, or feature/aspect
level: whether the opinion expressed in a document, a sentence or an entity
feature/aspect is positive, negative or neutral[34]. The rise of social
media like blogs, Twitter, Facebook and LinkedIn has fueled great interest
in the field of Sentiment Analysis. Publishers, movie companies and fast
moving consumer goods (FMCG) companies are the main consumers of this
technology. The technology is already present in the market. Very soon it
will find its own place in politics, governance, etc.
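
The polarity classification task can be sketched, under strong simplifying assumptions, as counting words from a positive and a negative word list. The two word lists, the sample comments and the function name `polarity` are hypothetical; practical systems use trained classifiers, handle negation, and need language-specific lexicons.

```python
import re
from collections import Counter

# Hypothetical miniature opinion lexicons.
POSITIVE = {"good", "great", "excellent", "support", "welcome"}
NEGATIVE = {"bad", "poor", "terrible", "oppose", "unfair"}


def polarity(text):
    """Classify text as positive, negative or neutral by lexicon counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Each positive word adds one to the score, each negative word subtracts one.
    score = sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Hypothetical public comments on a draft bill.
comments = [
    "I support this excellent bill",
    "The bill is unfair and poor",
    "The bill was published today",
]
summary = Counter(polarity(c) for c in comments)
print(dict(summary))
```

This corresponds to document-level classification; sentence-level or aspect-level analysis would apply the same decision to smaller units of text.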
Future of MT
In a previous section we discussed the developments in MT research.
Remarkable achievements have been made in this direction. But we still have
to address many issues to achieve the goal of Fully Automated High Quality
Machine Aided Translation (FAHQMAT). Another expectation is to build
efficient speech-to-speech translation systems. I think that within a few
years our researchers will provide revolutionary solutions in this field.
HLT in Education
Computer Assisted Teaching (CAT) is already in practice throughout the
globe. It is considered one of the best ways to achieve effective and
interactive teaching. HLT techniques like ASR, TTS, morphological
synthesis, parsing and MT can be used for interactive language teaching,
especially second language teaching. With the help of HLT we can build
online systems that teach a second language and evaluate the progress made
by the student without the intervention of a human instructor.
[30] Berners-Lee, Tim; James Hendler and Ora Lassila (May 17, 2001). "The
     Semantic Web". Scientific American Magazine.
     http://www.sciam.com/article.cfm?id=the-semanticweb&print=true.
     Accessed March 26, 2008.
[31] World Wide Web
[32] www.hakia.com
[33] http://en.wikipedia.org/wiki/Sentiment_analysis
[34] http://en.wikipedia.org/wiki/Sentiment_analysis
HLT in Bio-Medical Research
HLT techniques like Named Entity Recognition[35] (NER), SeW and Text
Mining[36] are widely used in the field of Bio-Medical research. This field
of research is now called Bio-medical Natural Language Processing
(Bio-NLP).
HLT in Forensic Science
Another vital area in which HLT is going to be applied is Forensic Science.
HLT techniques are very useful for resolving authorship disputes and
disputes of meaning and use, identifying the author of anonymous texts,
identifying cases of plagiarism[37] and reconstructing mobile phone text
conversations.
HLT for Business
It is well known that web pages can hardly exist without search engines.
Likewise, business can hardly exist without advertisements. The emergence
of new media paved the way for online advertisement techniques. The
marriage of IR and other HLT techniques with online advertisement gave
birth to a new field called 'Computational Advertising'. It helps
advertisers to place their advertisements in appropriate places according
to the taste of consumers. Another vital business-oriented area of R&D is
'Collective Intelligence'[38], where a wide range of HLT techniques is
used. It helps service providers like online stores to give product
recommendations to a consumer based on his/her purchasing behavior and
taste. This is achieved by comparing and analyzing the purchasing behavior
of customers who share similar tastes. So remember, whenever you receive
context-relevant advertising or a product recommendation, the power of HLT
is there!
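
The idea of recommending products by comparing customers with similar purchasing behavior can be sketched as follows. This is a minimal illustration, assuming purchase histories are available as sets of items and using Jaccard overlap as the similarity measure; the customer names, items and function names are hypothetical, and it does not describe any particular store's method.

```python
def jaccard(a, b):
    """Similarity of two item sets: size of overlap divided by size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, purchases, top_n=3):
    """Suggest items bought by customers whose purchases resemble target's."""
    mine = purchases[target]
    scores = {}
    for other, items in purchases.items():
        if other == target:
            continue
        similarity = jaccard(mine, items)
        # Items the target has not bought yet inherit the similarity score.
        for item in items - mine:
            scores[item] = scores.get(item, 0.0) + similarity
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked[:top_n] if score > 0]


# Hypothetical purchase histories.
purchases = {
    "asha": {"dictionary", "grammar book"},
    "ravi": {"dictionary", "grammar book", "thesaurus"},
    "meena": {"novel", "cookbook"},
}
print(recommend("asha", purchases))
```

Here the customer with overlapping purchases contributes a suggestion, while the customer with nothing in common contributes none.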
[35] http://en.wikipedia.org/wiki/Named_entity_recognition
[36] http://en.wikipedia.org/wiki/Text_mining
[37] http://en.wikipedia.org/wiki/Plagiarism
[38] http://en.wikipedia.org/wiki/Collective_intelligence
5 Issues in HLT
The developments in HLT during the past few years are quite promising, and
the future technologies that are slowly coming into practice and on their
way out of the lab are quite exciting. Still, there are lots of open
research issues. This section is dedicated to a discussion of some selected
issues in HLT, with a special focus on Indian Language Technology.
A large number of Language Technology (LT) based products are coming into
the market. How can these technology products be evaluated? Many techniques
have evolved for evaluating LT projects/products, like EAGLES[39]. But most
of the evaluation methodologies are not well suited to handling the
linguistic phenomena of Indian Languages. A typical example is MT
evaluation. BLEU and METEOR are the two major methodologies for evaluating
MT. But neither of these is efficient for evaluating MT between English and
Indian Languages[40]. Another vital issue in evaluating HLT
projects/products is the availability of data for testing the tools. For
example, to evaluate an MT system, reference translation sets are required.
In a way, a reference translation set is just a parallel corpus. But beyond
being a parallel corpus, it has to possess certain qualities: such a
reference translation corpus should cover the different syntactico-semantic
phenomena in the source language as well as the target language. Such data
sets and standards, especially publicly available ones, are lacking in the
case of Indian Language Technology. In India we do not have any defined
standard, policy or standards body to evaluate HLT projects/products. In
short, the issues in HLT can be classified into three broad areas: 1) the
development challenges, which involve algorithm development and baffling
issues in language; 2) the availability of resources and standards in the
public domain; and 3) the evaluation problem. A detailed discussion of the
topic is beyond the scope of this paper.
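
To make the MT evaluation problem concrete, the following is a simplified sketch of a BLEU-style score: clipped n-gram precision against a single reference, combined with a brevity penalty. It is an assumption-laden illustration, not the full metric (standard BLEU is computed over a corpus, usually up to 4-grams and often with several references). The function names and the transliterated Hindi sentences are hypothetical examples.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with clipping."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # A candidate n-gram is credited at most as often as the reference has it.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def simple_bleu(candidate, reference, max_n=2):
    """Geometric mean of 1..max_n precisions times a brevity penalty."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity = 1.0 if c > r else math.exp(1 - r / c)
    return brevity * math.exp(log_mean)


# Illustrative word-order variant of the same transliterated sentence.
reference = "ram ne seb khaya".split()
candidate = "seb ram ne khaya".split()
print(modified_precision(candidate, reference, 1))
print(modified_precision(candidate, reference, 2))
print(simple_bleu(candidate, reference))
```

The candidate uses exactly the words of the reference, so its unigram precision is perfect, yet the reordering lowers the bigram precision and therefore the score. This sensitivity to word order is one reason n-gram metrics can penalize acceptable translations into relatively free word order languages.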
6 Conclusion
Many resources and tools have been developed in HLT in the past few years.
The developments in the field are quite promising, and so is the future. As
discussed at the beginning of this paper, we can hope that all ICT tools
will be powered by HLT in the future. At the same time, we cannot forget
the challenges and issues involved in the field. To solve the major issues
in HLT, especially in the Indian Language scenario, enhanced policies and
standards need to be introduced in the near future to boost R&D activities
in the field.
[39] http://www.issco.unige.ch/en/research/projects/ewg95//ewg95.html
[40] http://www.cse.iitb.ac.in/~pb/papers/icon07bleu.pdf