Imperfect look at possible applications of Web Based Sentiment Engine MECB 2012.
CA652A
Semantic Web Based Sentiment Engine
A system to determine online sentiment on current affairs for the purpose of
analysis and prediction
11210889
52595354
ABSTRACT
Sentiment analysis involves classifying opinions from text as "positive", "negative" or
"neutral". Its purpose and benefit is to assist in extracting valuable information and insight
from copious amounts of unstructured data. The proposed system will be able to
determine online sentiment on current affairs for the purpose of analysis and prediction. For
the sentiment analysis a cluster-method approach is recommended, which is a recent
advancement in this area. Various APIs will assist in extracting other data such as location
and time. To evaluate the system, the Pang et al movie review data sets are recommended
for validating basic functionality, and real-life data from the 2008 US presidential race
for evaluating all functionality of the system. Multiple industries are identified as
potential users of this system, from marketing companies to hotels, which adds to its
commercialisation potential.
A report submitted to Dublin City University, School of Computing for module
CA652: Information Access, 2011/2012.
We hereby certify that the work presented and the material contained herein is
my/our own except where explicitly stated references to other material are made.
Student Numbers
52595354
11210889
TABLE OF CONTENTS
Abstract
Introduction
Concept Overview
Constraints and Limitations
Functional Description
Sentiment Search Functions
Techniques
Time Parameter Based Search
Geographical Extraction Based
Social Sentiment Extraction Based Data
Graphical Data Generation Tools
Pros & Cons of Proposed System
Evaluation Plan
Stage One Testing - Validation
Stage Two Testing – Functionality Testing
Stage Three Testing – Real Life Data
Commercialisation Potential
Conclusion and Further Research Opportunities
References
INTRODUCTION
The ‘media’ as we now conceptualise it has changed dramatically. With the internet,
people have an opportunity to ‘weigh in’ on events by providing their opinions and
feedback in real time through blogs, forums, social networks and commenting
systems on news websites. There is a growing interest in measuring sentiment that
can be attributed to the dramatic increase in the volume of digitized information.
“An increasing number of studies in political communication focus on the “sentiment” or
“tone” of news content, political speeches, or advertisements” (Young, L, & Soroka, S 2012).
This report discusses the concept of developing a Semantic Web based sentiment
engine that will be able to analyse public sentiment on current issues, from politics
to reality TV shows. By analysing and tracking popular opinion through social media
channels, and by leveraging research in the area of sentiment analysis,
accurate predictions could become possible for events from presidential elections to
the X-Factor competition.
CONCEPT OVERVIEW
This proposed system is not a standard sentiment engine that returns static data; it
offers increased functionality to assist with data interpretation by allowing end
users to customise their search, filter the returned data under multiple parameters
and view graphical representations of the results.
CONSTRAINTS AND LIMITATIONS
The limitations of this concept are not due to technological constraints but are
simply down to the volatility of public opinion, which is something that cannot be
remedied or corrected by technology.
Another limitation is the scope of the opinion being captured. Users of social
media and participants in online forums are statistically of a younger age group. The
lack of inclusion of the opinion of older age groups could greatly affect the accuracy
of the data, as it would not be entirely representative. This imbalance
would particularly affect politics, with older groups statistically more likely to vote.
FUNCTIONAL DESCRIPTION
SENTIMENT SEARCH FUNCTIONS
• Users can enter multiple search terms for the purpose of data comparison.
Other features would be utilised to improve the analysis returns.
• Multiple Search Parameters
o Time Frame Defined Search - Data retrieved can be limited to a specific
time frame.
o Geographical Location Based Search – Search data retrieved can be
filtered by location of users.
o Narrow Search Scope – Select websites to exclude or restrict search to
a small number of websites.
• Graphical representations of the data are generated.
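The search parameters listed above could be modelled roughly as follows. This is a minimal sketch rather than a specification of the proposed system: the `Post` record shape, the `SentimentQuery` class and the `run_query` helper are hypothetical names introduced only for illustration, and matching is a simple case-insensitive substring test.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Set


@dataclass
class Post:
    """One piece of retrieved content (hypothetical record shape)."""
    text: str
    site: str
    location: str
    posted_at: datetime


@dataclass
class SentimentQuery:
    """One or more search terms plus the optional filters described above."""
    terms: List[str]
    start: Optional[datetime] = None      # time frame defined search
    end: Optional[datetime] = None
    location: Optional[str] = None        # geographical location based search
    include_sites: Set[str] = field(default_factory=set)  # restrict to these sites
    exclude_sites: Set[str] = field(default_factory=set)  # never use these sites

    def matches(self, post: Post) -> bool:
        text = post.text.lower()
        if not any(term.lower() in text for term in self.terms):
            return False
        if self.start and post.posted_at < self.start:
            return False
        if self.end and post.posted_at > self.end:
            return False
        if self.location and post.location.lower() != self.location.lower():
            return False
        if self.include_sites and post.site not in self.include_sites:
            return False
        return post.site not in self.exclude_sites


def run_query(query: SentimentQuery, posts: List[Post]) -> List[Post]:
    """Return only the posts that satisfy every active filter."""
    return [post for post in posts if query.matches(post)]
```

Keeping every filter optional means a bare list of terms behaves like a plain search, while each extra parameter only narrows the returned data.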
TECHNIQUES
Sentiment Analysis Techniques
There is much research in the area of sentiment analysis, the primary objective being
to find a technique where there is no trade-off between speed and accuracy. Several
new and emerging techniques have been researched as part of identifying the best fit
for this system.
• Proximity-Based Approach (Hasan, S, & Adjeroh, D 2011)
o This proposed method uses proximity-based features to determine
sentiment: proximity distribution, mutual information between
proximity types, and proximity patterns.
• Based on Annotation (Shukla, A 2011)
o This proposed method counts all the annotations present and calculates
sentiment scores for every annotation, including comments, to determine
sentiment.
• Sentence-level Lexical Based Semantic Orientation (Khan, A et al, 2011)
o This proposed method uses SentiWordNet to calculate the semantic
‘score’ of sentences it has classified as subjective from reviews and blog
comments.
• Machine Learning approach to contextual information (Yang, C et al, 2008)
o This proposed method differentiates itself from others by taking
context into account when determining the sentiment category. Its
primary focus and test data sets have been blog posts. Figure 1 below
shows the framework employed.
FIGURE 1 - SENTIMENT ANALYSIS FRAMEWORK
• Clustering-Based Sentiment Analysis Approach (Li, G, & Liu, F 2012)
The method deemed most appropriate for this proposed system is based on an
article from the Journal Of Information Science in April this year, which outlined the
Clustering-Based Sentiment Analysis approach. It proposed that by applying a “TF-IDF
weighting method, a voting mechanism and importing term scores, an acceptable and
stable clustering result can be obtained” (Li, G, & Liu, F 2012). The evaluation results
were the most impressive of all techniques reviewed as part of this research. It
appears to have performed well in terms of both accuracy and efficiency with no
need for human participation, as can be seen from Figure 2.
FIGURE 2 - CLUSTER METHOD ACCURACY/EFFICIENCY
Apart from its accuracy and efficiency, this technique was deemed the most suitable
as it can be applied universally to any data set. Other techniques researched have
been developed for particular data types, such as customer reviews or blogs, and their
evaluation appraisals appear to suggest they do not perform as well outside of these
data types.
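To make the quoted ingredients more concrete, the sketch below combines TF-IDF weighting with a two-cluster k-means run several times and a majority vote over the runs. It is an illustration under stated assumptions and not a reproduction of the Li and Liu method: tokenisation is plain whitespace splitting, similarity is cosine, the number of clusters is fixed at two, the "importing term scores" step is omitted, and all function names are hypothetical.

```python
import math
import random
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    tokenised = [doc.lower().split() for doc in docs]
    df = Counter(term for tokens in tokenised for term in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is all zero)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        for t, w in vec.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def kmeans_two(vectors, seed, iterations=10):
    """One k-means run with k=2; the seed controls the initial centres."""
    rng = random.Random(seed)
    centres = rng.sample(vectors, 2)
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [0 if cosine(v, centres[0]) >= cosine(v, centres[1]) else 1
                  for v in vectors]
        for k in (0, 1):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centres[k] = centroid(members)
    # Relabel so the first document is always in cluster 0; this makes
    # the labels from different runs comparable before voting.
    if labels[0] == 1:
        labels = [1 - lab for lab in labels]
    return labels


def vote(vectors, runs=5):
    """Majority vote per document over several differently seeded runs."""
    all_labels = [kmeans_two(vectors, seed) for seed in range(runs)]
    return [Counter(column).most_common(1)[0][0] for column in zip(*all_labels)]
```

Calling `vote(tfidf_vectors(docs))` returns one cluster label per document. Deciding which cluster is "positive" and which is "negative" still needs an extra signal, which is the role term scores would play.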
TIME PARAMETER BASED SEARCH
This sentiment engine would make use of the adaptable Librato API libraries to
allow sentiment returns to be time sensitive. This would let a user
evaluate how sentiment is changing over time or what sentiment was during
specific time periods.
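Independently of whichever time-series service is used, the underlying operation is grouping classified items into time buckets. The sketch below shows one possible way to do that per day with the Python standard library only. It does not use the Librato API, and the numeric score assigned to each label and the function name `sentiment_by_day` are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Assumed numeric value for each sentiment label.
SCORES = {"positive": 1, "neutral": 0, "negative": -1}


def sentiment_by_day(records, start=None, end=None):
    """Average sentiment score per calendar day.

    records: iterable of (timestamp, label) pairs.
    start/end: optional datetime bounds (inclusive) for the time frame.
    """
    buckets = defaultdict(list)
    for timestamp, label in records:
        if start and timestamp < start:
            continue
        if end and timestamp > end:
            continue
        buckets[timestamp.date()].append(SCORES[label])
    return {day: sum(values) / len(values)
            for day, values in sorted(buckets.items())}
```

The same grouping works for weeks or months by changing the bucket key, for example to `timestamp.isocalendar()[:2]` for ISO year and week.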
GEOGRAPHICAL EXTRACTION BASED
Adding a geographical element would be a unique feature allowing for mapping of
sentiment results. Location content would preferably be pulled from the Twitter API, as
it gives access to Twitter profile location. Comment systems used by news websites,
such as that of the Irish Times, request a location prior to posting the comment.
The Facebook API allows access to the location of a user if the privacy setting is turned on.
OAuth would be used to allow the users of the sentiment engine to explore
the opinions of their friends and networked associates and how they would fit on the
sentiment scales. Other free-use location APIs may also be needed.
SOCIAL SENTIMENT EXTRACTION BASED DATA
The content used to create a matrix of information within which sentiment is evaluated
via FLP would likely include, but not be limited to: Twitter; Disqus; Livefyre;
Intensedebate; Drupal comments; Wordpress comments; other blog posts; scraped
open Facebook and fan page comments; the Facebook comment system; text comments;
G+ posts; Slideshare.net; Pinterest pins; Google News articles; various bookmarking
site comments like fark.com and reddit; and other language-relevant wire news services.
GRAPHICAL DATA GENERATION TOOLS
Graphical representations of the data are generated. The results could be rendered as
web-based Flash objects or in a way that is compliant with the evolving HTML5
standards and with iOS 5, given the animosity between Apple and Adobe over
Flash, so that results are useful on mobile devices and tablets. These reports would be
exportable to Crystal Reports.
[Bar chart: number of Positive, Neutral and Negative items for Candidate A and Candidate B; vertical axis from 0 to 1600]
FIGURE 3 - GRAPHICAL REPRESENTATION OF CONTENT
PROS & CONS OF PROPOSED SYSTEM
The primary argument for why sentiment engines via Semantic Web and linked data
are useful is based upon the new information and insight that can be gleaned from them.
The ability to know relative and positional sentiment can be useful in many analytical
or informational arbitrage situations.
In terms of the cons, the primary concern would be data quality. Problems with data
quality are a huge issue and can skew any resulting analysis. The extent of the data
quality problem has often been discovered by information activists working in the
open data movement.
Secondly, privacy concerns and staying within the spirit and letter of the relevant
data privacy laws of the regulatory regime you operate under may at times be an
issue. This can be tricky given the interconnected nature of the web.
Lastly, inaccuracies in the data and its being organised in “short sets” rather than deeper
data may create false sentiments. Is there enough data being looked at to create a realistic
positive or negative sentiment? Some analysis may need additional
parsing to tease out, for example, initial heated emotional responses from the rational
morning-after response.
EVALUATION PLAN
STAGE ONE TESTING - VALIDATION
The evaluation plan would begin with simple software validation. The first test case
would consist of validating the fundamental functionality of the system: its ability to
differentiate between sentiments. The data set to be used is the movie review
data from the Pang et al experiments.¹ Movie review data is widely regarded as the most
challenging data for sentiment engines to analyse; this can be attributed to the fact
that a positive review may contain descriptions of gory or violent scenes and, equally,
a negative review could contain descriptions of light-hearted, pleasant scenes. For
additional testing, other data sets could be used for each iteration of this dynamic
testing stage.
¹ Pang B, Lee L, Vaithyanathan S. Thumbs up? Sentiment classification using
machine learning techniques. In: Conference on empirical methods in natural
language processing (EMNLP). Philadelphia, Pennsylvania, USA, 2002, p. 79.
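The validation step described above amounts to running the engine over labelled reviews and comparing its output with the known labels. The sketch below shows one way this could be scored. The `keyword_baseline` classifier, its word lists and the four example reviews are invented placeholders and are not taken from the Pang et al data; they are only there to show how a naive approach trips over violent words in a positive review.

```python
import re
from collections import Counter

# Tiny invented word lists, for illustration only.
POSITIVE_WORDS = {"great", "pleasant", "masterpiece"}
NEGATIVE_WORDS = {"gory", "violent", "awful"}


def keyword_baseline(text):
    """Placeholder classifier: counts positive minus negative keywords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = (sum(t in POSITIVE_WORDS for t in tokens)
             - sum(t in NEGATIVE_WORDS for t in tokens))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def evaluate(classify, labelled_reviews):
    """Return (accuracy, confusion) where confusion counts (gold, predicted) pairs."""
    confusion = Counter()
    for text, gold in labelled_reviews:
        confusion[(gold, classify(text))] += 1
    total = sum(confusion.values())
    correct = sum(n for (gold, predicted), n in confusion.items()
                  if gold == predicted)
    return (correct / total if total else 0.0), confusion


# Invented examples, not real review data.
REVIEWS = [
    ("A gory, violent masterpiece", "positive"),
    ("Pleasant scenes but an awful film overall", "negative"),
    ("A great film", "positive"),
    ("Simply awful", "negative"),
]

accuracy, confusion = evaluate(keyword_baseline, REVIEWS)
```

Any classifier with the same call shape, including the clustering approach, can be passed to `evaluate` in place of the baseline, and the confusion counts show which kinds of review are being misread.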
[Pie chart: segments labelled Neutral, Positive and Negative; values shown: 20%, 39%, 41%]
FIGURE 4 - BASIC VALIDATION TESTING RESULTS
STAGE TWO TESTING – FUNCTIONALITY TESTING
The second stage of testing would be the validation of the multiple input
functionality; to ensure that data can be retrieved for two or more search terms and
also that they can be accurately differentiated. The test case for this would be built
on the first stage of testing with added content regarding a second movie etc.
[Two pie charts, each split into Neutral, Positive and Negative. Schindler's List: 39%, 20%, 41%. The Usual Suspects: 20%, 21%, 59%]
FIGURE 5 - TWO TOPIC VALIDATION TESTING
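The two-topic check described in this stage can be expressed as a per-term percentage breakdown of already classified items. The following is a rough sketch under simple assumptions: an item is attributed to a term by case-insensitive substring match, and the function name `distribution_by_term` is hypothetical.

```python
from collections import Counter

LABELS = ("positive", "neutral", "negative")


def distribution_by_term(classified, terms):
    """Percentage of each sentiment label per search term.

    classified: iterable of (text, label) pairs from the sentiment engine.
    terms: the search terms being compared.
    """
    counts = {term: Counter() for term in terms}
    for text, label in classified:
        lowered = text.lower()
        for term in terms:
            if term.lower() in lowered:
                counts[term][label] += 1
    result = {}
    for term, tally in counts.items():
        total = sum(tally.values())
        result[term] = {label: (100.0 * tally[label] / total if total else 0.0)
                        for label in LABELS}
    return result
```

An item that mentions both terms is counted for both, which is one of the cases this stage of testing would need to examine.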
STAGE THREE TESTING – REAL LIFE DATA
The final stage of the evaluation plan would be to perform testing using previous
high-profile events as the test cases, such as the US Presidential Election of 2008 and
the X-Factor competition from previous years. This validation is more complex as it
will span the entire internet, not just the staging website.
The testing would be performed over different time intervals: days, weeks, months
and the entire duration of the event. In the case of political elections these time
periods could be chosen to coincide with official opinion polls, for example Gallup and
Rasmussen stateside or RedC for Irish-based events.
The geographical sentiment analysis function would be tested to
gauge the accuracy of the location results. In the case of the US Presidential Election,
the final voting percentages for each candidate per state would give an accurate
basis for comparison.
SAMPLE EVALUATION TEST CASE
Take the ten states where each candidate won by the largest percentage
majority, and graph the percentage of votes each candidate received alongside the
percentage of positive, negative and neutral data regarding that candidate. What one
would expect in a fully evaluated system is a close correlation between
positive data and the candidate's percentage of votes, and also a correlation between the
negative or neutral data and the other candidate’s percentage of votes, as per the sample
charts below for Obama and McCain respectively.
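The "close correlation" expected in this test case can be quantified with a Pearson correlation coefficient computed across the selected states. The sketch below is one way to do that with the standard library; the function name is hypothetical and the figures in the usage note are made-up placeholders, not election results or engine output.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / (spread_x * spread_y)


# Placeholder values for three hypothetical states (not real data).
vote_share = [70.0, 65.0, 60.0]
positive_share = [55.0, 50.0, 48.0]
r = pearson(vote_share, positive_share)
```

A value of `r` close to 1 across the ten states would support the expectation described above, while a value near 0 would suggest the location or sentiment results are not tracking the vote.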
[Bar chart "Obama's Data": Obama's percentage of votes, McCain's percentage of votes, Positive %, Negative % and Neutral %; vertical axis from 0 to 90]
FIGURE 6 - SAMPLE TEST OUTPUT (OBAMA)
[Bar chart "McCain's Data": McCain's percentage of votes, Obama's percentage of votes, Positive %, Negative % and Neutral %; vertical axis from 0 to 70]
FIGURE 7 - SAMPLE TEST DATA (MCCAIN)
COMMERCIALISATION POTENTIAL
In an era where both businesses and individuals are attempting to move further and
further towards data-driven decisions, sentiment engine products have a range of
commercial potential.
Some companies have already begun commercialising Semantic Web applications,
for example IBM's licensing of its WebFountain Internet analytical engine to Factiva and
Thomson Reuters in 2003 for those interested in corporate reputational
data.
There is also a market among those who cannot afford Enterprise Resource Planning
(ERP) add-ons like SAP Business Objects, SAS or LexisNexis Analytics, and for
whom the currently available crop of free semantic sentiment tools is
insufficient, too niche or unscalable (Basu, 2010). Semantic
Web products are becoming important in internal and external business informatics.
However, information arbitrage is not merely for professional market traders. This
system would likely be offered as software as a service (SaaS) on the web; it could be sold
on a freemium basis, as a monthly subscription or under a yearly licence, depending on the
implementation.
Primary clients would depend on the sentiments needing to be parsed and on the
proprietary and public data sets used within the sentiment engine.
Examples include: corporate media; the content publishing industry; PR firms;
polling; market research firms; trading platforms; political parties; elections;
government agencies; security services; and bookmakers deciding odds on
novelty bets (reality TV shows, politics etc.).
CONCLUSION AND FURTHER RESEARCH OPPORTUNITIES
Where does the Semantic Web lead to exactly? We don’t really know, but opening
up the segregated data silos and making sense of deeper, dark ‘big data’ in pursuit
of the benefits of a deeper-rooted “hyperdata” would be a nice path. The
road will be long, but it may improve our day-to-day lives immensely.
"Many applications and services claim to be "semantic" in one manner or another,
but that does not mean they are "Semantic Web." Semantic applications include any
applications that can make sense of meaning, particularly in language such as
unstructured text, or structured data in some cases. By this definition, all search
engines today are somewhat "semantic" but few would qualify as "Semantic Web"
apps." (Spivack, 2007)
How we get from the early steps of Web 3.0 to this deeper data web will be a long
process. It will provide countless benefits, many of which we may not even perceive
today. However, sentiment engines are merely one way to get the public and the
developer community interested in and excited about all the other benefits that this open
data future could hold. For that reason sentiment engines will remain an important
component in the near-term future, as “big data” holds much of the future promise
of the “web of things” and of making sense and use of it.
REFERENCES
Abbasi, A, Hsinchun, C, & Salem, A 2008, 'Sentiment Analysis in Multiple Languages: Feature Selection for Opinion Classification in Web Forums', ACM Transactions On Information Systems, 26, 3, pp. 1-34, Computers & Applied Sciences Complete, viewed 4 May 2012.
Basu, Saikat 2010. 10 Web Tools To Try Out Sentiment Search & Feel the Pulse. Make Use Of [Online] 30 April. http://www.makeuseof.com/tag/10-web-tools-sentiment-search-feel-pulse/ [Accessed 1 May 2012]
Bergman, Mike 2010. I Have Yet to Metadata I Didn’t Like. AI3 [Online] 16 August. http://www.mkbergman.com/902/i-have-yet-to-metadata-i-didnt-like/ [Accessed 1 May 2012]
Bollen, J, Mao, Huina, & Zeng, Xiao-Jun March 2011. Twitter mood predicts the stock market. Journal of Computational Science, 2(1), pp. 1-8. Available from: http://arxiv.org/abs/1010.3003
Cai, K, Spangler, S, Ying, C, & Li, Z 2010, 'Leveraging sentiment analysis for topic detection', Web Intelligence & Agent Systems, 8, 3, pp. 291-302, Academic Search Complete, viewed 20 April 2012.
Dalton, Jeff 2007. Caffè Java Open Source NLP and Text Mining tools. Jeff's Search Engine Caffé [Online] 16 March. http://www.searchenginecaffe.com/2007/03/java-open-source-text-mining-and.html [Accessed 1 May 2012]
Hamouda, A, Marei, M, & Rohaim, M 2011, 'Building Machine Learning Based Senti-word Lexicon for Sentiment Analysis', Journal Of Advances In Information Technology, 2, 4, pp. 199-203, Library, Information Science & Technology Abstracts with Full Text, viewed 1 May 2012.
Hasan, S, & Adjeroh, D 2011, 'Detecting Human Sentiment from Text using a Proximity-Based Approach', Journal Of Digital Information Management, 9, 5, pp. 206-212, Library, Information Science & Technology Abstracts with Full Text, viewed 7 May 2012.
Kang, H, Yoo, S, & Han, D 2012, 'Senti-lexicon and improved Naïve Bayes algorithms for sentiment analysis of restaurant reviews', Expert Systems With Applications, 39, 5, pp. 6000-6010, Academic Search Complete, viewed 10 April 2012.
Lévy, Pierre CRC, FRSC 2007. Elements of Semantic Engineering. I3 workshop / WWW Consortium Conference / Banff 2007. Available from: http://www.ieml.org/text/semantic_space.pdf
Li, G, & Liu, F 2012, 'Application of a clustering method on sentiment analysis', Journal Of Information Science, 38, 2, pp. 127-139, Business Source Complete, viewed 21 April 2012.
Pang B, Lee L, Vaithyanathan S. Thumbs up? Sentiment classification using machine learning techniques. In: Conference on empirical methods in natural language processing (EMNLP). Philadelphia, Pennsylvania, USA, 2002, p. 79.
Shukla, A 2011, 'Sentiment Analysis of Document Based on Annotation', International Journal Of Web & Semantic Technology, 2, 4, pp. 91-103, Computers & Applied Sciences Complete, viewed 6 May 2012.
Spivack, Nova 2007. The Semantic Web, Collective Intelligence and Hyperdata. novaspivack.typepad.com [Online] 18 September. http://novaspivack.typepad.com/nova_spivacks_weblog/2007/09/hyperdata.html [Accessed 1 May 2012]
Vishwanath, J, & Aishwarya, S 2011, 'User Suggestions Extraction from customer Reviews: A Sentiment Analysis approach', International Journal On Computer Science & Engineering, 3, 3, pp. 1203-1206, Academic Search Complete, viewed 1 May 2012.
Yang, C, Lin, K, & Chen, H 2008, 'Sentiment Analysis in Weblog Using Contextual Information: A Machine Learning Approach', International Journal Of Computer Processing Of Languages, 21, 4, pp. 331-345, Academic Search Complete, viewed 27 April 2012.
Young, L, & Soroka, S 2012, 'Affective News: The Automated Coding of Sentiment in Political Texts', Political Communication, 29, 2, pp. 205-231, Academic Search Complete, viewed 10 May 2012.