This presentation was created to present the project done as part of the Applied Management Research Project at Vinod Gupta School of Management, IIT Kharagpur
Text Analytics- An application in Indian Stock Markets
1. Vinod Gupta School of Management, IIT Kharagpur
Text Analytics- An Application in
Indian Stock Market
Applied Management Research Project, 2014
By Sinjana Ghosh
Done under the able guidance of
Prof. A. K. Misra
3. Algorithmic Trading in India
• Involves the use of algorithms in pre-built platforms to
place electronic trades on stocks, futures, options,
currencies and commodities on exchanges, without any
human intervention
• In 2008, India allowed the first Direct-Market-Access
(DMA) and algorithmic trades to go through
• The most commonly used strategies of algorithmic
trading in India include arbitrage, market making and
trend following algorithms
4. Big Data
• Data is available in various forms: not just structured
but also semi-structured, like XML and EDI
documents, and unstructured, like text, multimedia
etc.
• Big Data analytics is the strategy of using this huge
amount of data, which is now accessible through the
internet, mobile messages and various other
platforms, to extract useful information that can be
further analyzed to help in the decision making
process
5. Text Data analytics
• Subset of Big Data analytics which involves extracting
entities like person, location, organization etc. from text
messages, along with the relationships between the
extracted entities, and analysing them for business needs
Predictive analytics
• Involves searching for meaningful relationships among
variables and representing those relationships in models
• Response variables and explanatory variables
• Two common types of model: Regression and
Classification
6. Sentiment Analysis
• Use of natural language processing, text analysis and
computational linguistics to identify and extract
subjective information in source materials
• Aims to determine the attitude of a speaker or a writer
with respect to some topic or the overall contextual
polarity of a document
Machine Learning
• A branch of artificial intelligence concerned with the
construction and study of systems that can learn from
data
7. The Problem
Using text mining of news articles available in the public
domain to analyse the market sentiment and correlate it
with the actual movement in Nifty 50
8. Objective
• Use textual news from a plethora of online
resources to perform data mining and check for the
occurrence of a basic set of keywords in each
article.
• Train a machine learning algorithm to
accurately predict the impact of the most
viewed news articles on market sentiment,
and to predict the movement of the market, represented
in the study by the Nifty 50.
• Validate the results obtained from the training set
using a set of recent news articles (test set) to
check for errors and the level of accuracy.
9. Methodology
• Textual Representation
  • Bag of words
  • Noun Phrasing
  • Named Entities
  • Named Entities with context-capturing feature
• Predictive Modelling Approach
Source: Modeling Techniques in Predictive Analytics: Business Problems and
Solutions with R (Mill)
11. Methodology
• Partitioning data in machine learning
Source: Modeling Techniques in Predictive Analytics: Business Problems and
Solutions with R (Mill)
12. Text Analysis Algorithm
1. Convert all the characters to lowercase
2. Remove stop-words which do not help in sentiment analysis,
like “is”, “are”, “if”, “when”, “where”, “then”, “their”, “there”,
“why”, “which”, “how”
After this the following is done:
1. Create an array of named entities which are of significance,
like “inflation”, “gdp”, “sensex” etc.
2. A script is run which extracts the named entities that
occur in the article, along with the 2 words immediately
preceding and the 3 words immediately succeeding each one. This is done
to capture not only the keywords but also their context.
3. The algorithm is trained by assigning weights to each
keyword so that the sentiment score most closely reflects the
actual returns of the day.
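The preprocessing and extraction steps above can be sketched as follows. This is a minimal illustration in Python, not the R script used in the study: the function names `preprocess` and `extract_contexts` are hypothetical, the stop-word list and lexicon contain only the examples named on the slide, and the punctuation handling is an assumption because the slides do not describe it.

```python
# Stop-words named on the slide; the full list used in the study is not given.
STOP_WORDS = {"is", "are", "if", "when", "where", "then",
              "their", "there", "why", "which", "how"}

# Example named entities from the slide; the full lexicon is not given.
LEXICON = {"inflation", "gdp", "sensex"}


def preprocess(text):
    """Lowercase the text, split on whitespace and drop stop-words."""
    tokens = text.lower().split()
    # Assumption: strip surrounding punctuation so "rising," matches "rising".
    tokens = [t.strip(".,;:!?\"'()") for t in tokens]
    return [t for t in tokens if t and t not in STOP_WORDS]


def extract_contexts(tokens, lexicon=LEXICON, before=2, after=3):
    """Return (keyword, context_words) for every lexicon hit.

    The context is the `before` words preceding and the `after` words
    following the keyword (2 and 3 respectively, as on the slide).
    """
    hits = []
    for i, token in enumerate(tokens):
        if token in lexicon:
            left = tokens[max(0, i - before):i]
            right = tokens[i + 1:i + 1 + after]
            hits.append((token, left + right))
    return hits


tokens = preprocess("RBI says inflation is rising, which hurts the Rupee.")
print(extract_contexts(tokens))
# [('inflation', ['rbi', 'says', 'rising', 'hurts', 'the'])]
```

Because stop-words are removed before the window is taken, the context holds the nearest remaining words rather than the literal neighbours in the original sentence.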
13. Text Analysis Algorithm
4. A set of qualifiers is defined, and the preceding and succeeding words
captured as “context” of the extracted keyword are matched against it. The
algorithm then assigns a weight (-1 for negative, 0 for neutral and +1 for
positive) to each extracted qualifier.
5. The sum product of the qualifier weights and keyword weights gives the
actual sentiment score of the article, from which the returns of the day
due to that news can be predicted.
6. The importance score is simply the sum of the weights of the individual
occurrences of keywords in the article. However, whether the effect
will be positive or negative, and how much the market will react to it,
is determined only by the sentiment score.
7. Regression is performed on the scores versus actual returns for the
training set, and a formula is obtained for converting the scores into
forecasted returns.
8. This is tested on the validation set and errors are calculated.
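Steps 4 to 8 can be sketched in Python as below. This is an illustration under stated assumptions, not the study's R implementation: all keyword and qualifier weights are hypothetical placeholders (the trained values are not given in the slides), an occurrence's qualifier weight is assumed to be the sum of the weights of the qualifiers in its context, the regression is assumed to be a simple one-variable least-squares fit, and mean absolute error is only one possible error measure.

```python
# Hypothetical weights for illustration; the trained values are not in the slides.
KEYWORD_WEIGHTS = {"rbi": 10, "rupee": 8, "inflation": 8, "gdp": 6}
# Qualifier weights follow the slide's scheme: -1 negative, 0 neutral, +1 positive.
QUALIFIER_WEIGHTS = {"rises": 1, "gains": 1, "falls": -1, "slumps": -1}


def score_article(hits):
    """Return (importance, sentiment) for a list of (keyword, context_words)."""
    importance = 0
    sentiment = 0
    for keyword, context in hits:
        keyword_weight = KEYWORD_WEIGHTS.get(keyword, 0)
        # Assumption: sum the qualifier weights found in this occurrence's
        # context; words that are not qualifiers count as neutral (0).
        qualifier_weight = sum(QUALIFIER_WEIGHTS.get(w, 0) for w in context)
        importance += keyword_weight
        sentiment += keyword_weight * qualifier_weight
    return importance, sentiment


def fit_line(scores, returns):
    """Least-squares fit of returns = intercept + slope * score."""
    n = len(scores)
    mean_x = sum(scores) / n
    mean_y = sum(returns) / n
    sxx = sum((x - mean_x) ** 2 for x in scores)
    if sxx == 0:
        raise ValueError("all scores are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, returns))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and forecasted returns."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


hits = [("rbi", ["rate", "falls"]), ("gdp", ["growth", "rises", "sharply"])]
print(score_article(hits))  # (16, -4)

# Made-up training scores and returns (in percent), only to show the call pattern.
intercept, slope = fit_line([-10, 0, 10], [-1.0, 0.0, 1.0])
forecast = intercept + slope * score_article(hits)[1]
```

In the example, the importance score (16) ignores context, while the sentiment score (-4) nets the negative RBI context against the positive GDP context, which is the distinction drawn in step 6.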
14. Training of algorithm
• Training set: Daily returns of 2013-14 with
returns > 1% or returns < -1%
• Several iterations were run and regression was
performed at each level to finalize the set of
keywords in the lexicon, the weights of each keyword,
the set of qualifiers and their scores, and the set of
exceptional items in the lexicon
• Iterations started with 50 articles and ended with 125
articles
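Selecting the training days described above could look like the following sketch. It is written in Python for illustration; the function name `select_training_days`, the day labels and the return values are all made up, and the threshold is read as an absolute daily return above 1%.

```python
def select_training_days(daily_returns, threshold=1.0):
    """Keep days whose daily return (in percent) is above +threshold
    or below -threshold."""
    return {day: r for day, r in daily_returns.items() if abs(r) > threshold}


# Made-up returns in percent, only to show the filter.
sample = {"day1": 1.4, "day2": -0.3, "day3": -2.1, "day4": 0.8}
print(select_training_days(sample))  # {'day1': 1.4, 'day3': -2.1}
```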
15. Analysis and Results
125 news articles in the training set were analyzed using
the script in R, and the following were extracted:
• All the named entities occurring in the news article that
match the lexicon
• The context in which they appear, captured by extracting
the preceding as well as succeeding words of the named
entity
16. Interesting observations
• The number of keywords that a news article contains has
much less bearing on the effect of the news article on the
market than the context in which they appear. Based
simply on the occurrence of keywords, 35 news articles got
an importance score greater than 80, but when the sentiment
score was calculated, most of the contexts led to neutral
scoring (0), giving a low sentiment score and suggesting
low returns (on both the positive and the negative side)
• The keywords assigned the highest weights while training the
algorithm are:
  • RBI
  • Rupee
  • Inflation
  • GDP
17. Interesting observations
• Names of specific indices or industries, or results of
specific companies which contain terms like
“quarterly”, “results”, “annual”, “profit”, “revenue”
etc., are least useful in evaluating the sentiment of the
overall market represented by Nifty
• When gold prices came down drastically, markets
in most nations fell as gold mutual funds incurred
huge losses. However, in India broad indices
outperformed on the same event, which goes to
show that the prices of precious metals have an inverse
effect on the Indian stock market as a whole. So gold
has also been included in the list of exceptional items
in the lexicon.
25. Conclusion
• The algorithm used in the study, along with the weights given to
the terms in the lexicon and to the qualifiers, is able to predict daily
market returns effectively when the daily return is greater than
or equal to 1% in magnitude (positive or negative)
• The Indian stock market does react to systemically important
news articles
• Textual analysis of publicly available news articles has
significant predictive quality
• As the efficiency of the Indian market increases, arbitrage
opportunities will become fewer, so algorithmic traders will have
a significant advantage over manual traders if text analytics
is implemented in algorithmic trading
26. Scope of further work
• News articles can be clustered or classified into “economic
news”, “political news” and “other news” based on the
frequency of specific named entities, to find out which type of
news has the greatest impact on the Indian market
• If minute-wise market returns are available, then news articles
can be collected every hour and the returns can be observed
over a period to find how much time a news article
of a certain importance score takes to affect the market
• This text mining algorithm is not fully automated. The news
articles need to be fed manually into the program for it to run
and predict the returns. However, this process can be
automated to obtain a live news feed from websites and
automatically predict its importance and sentiment scores. If
the score is higher or lower than a particular range, then BUY
or SELL (or short sell) calls can be taken automatically by the
machine.
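The threshold rule in the last point could be sketched as below. This is a hypothetical illustration in Python: the function name `trading_signal` and the band limits of -20 and +20 are placeholders, since the slides do not specify a range, and the "HOLD" outcome for scores inside the band is an assumption added so that every score maps to a decision.

```python
def trading_signal(sentiment_score, lower=-20, upper=20):
    """Map a sentiment score to a call using a hypothetical neutral band."""
    if sentiment_score > upper:
        return "BUY"
    if sentiment_score < lower:
        return "SELL"  # or short sell
    return "HOLD"  # assumption: no call is taken inside the band


for score in (35, -4, -28):
    print(score, trading_signal(score))
# 35 BUY
# -4 HOLD
# -28 SELL
```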