TRIBHUVAN UNIVERSITY
INSTITUTE OF ENGINEERING
HIMALAYA COLLEGE OF ENGINEERING
A
FINAL YEAR PROJECT REPORT ON
TWEEZER
(CT-755)
Anil Shrestha(070/BCT/01)
Bijay Sahani(070/BCT/05)
Bimal Shrestha(070/BCT/10)
Deshbhakta Khanal(070/BCT/13)
A PROJECT SUBMITTED TO THE DEPARTMENT OF ELECTRONICS
AND COMPUTER ENGINEERING IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE BACHELOR'S DEGREE IN COMPUTER
ENGINEERING
August, 2017
TWEEZER
A
FINAL YEAR PROJECT REPORT
(CT-755)
Anil Shrestha (070/BCT/01)
Bijay Sahani (070/BCT/05)
Bimal Shrestha (070/BCT/10)
Deshbhakta Khanal (070/BCT/13)
PROJECT SUPERVISOR
Er. Nhurendra Shakya
“A major project report submitted in partial fulfillment of the
requirements for the degree of Bachelor’s in Computer Engineering.”
DEPARTMENT OF ELECTRONICS AND COMPUTER
ENGINEERING
HIMALAYA COLLEGE OF ENGINEERING
Tribhuvan University
Lalitpur, Nepal
August, 2017
ACKNOWLEDGEMENT
It is a matter of great pleasure to present this report on the development of
"Tweezer". We are grateful to the Institute of Engineering, Pulchowk, and the
Himalaya College of Engineering management for providing us with this great
opportunity to develop a web application for the major project.
We are also grateful to our project supervisor Er. Nhurendra Shakya for his
valuable advice and suggestions.
We would also like to thank Er. Ashok GM, Head of Department, and Er. Alok
Samrat Kafle, Major Project Coordinator, for providing us all the necessary
resources that met our project requirements.
We would like to convey our thanks to the teaching and non-teaching staff of the
Department of Electronics and Computer Engineering, HCOE, for their invaluable
help and support throughout the period of the project. We will not miss expressing
our gratitude to all our friends and everyone who has been a part of this project
by providing their comments and suggestions.
Group Members
Anil Shrestha
Bijay Sahani
Bimal Shrestha
Deshbhakta Khanal
ABSTRACT
Analysis of public information from social media can yield interesting results and
insights into public opinion about almost any product, service or personality.
Social network data is one of the most effective and accurate indicators of public
sentiment. The explosion of Web 2.0 has led to increased activity in podcasting,
blogging, tagging, contributing to RSS, social bookmarking, and social networking.
As a result, there has been an eruption of interest in mining these vast resources
of data for opinions. Sentiment analysis, or opinion mining, is the computational
treatment of the opinions, sentiments and subjectivity of text. In this report we
discuss a methodology which allows utilization and interpretation of Twitter data
to determine public opinion.
Developing a program for sentiment analysis is an approach that can be used to
computationally measure customers' perceptions. This report describes the design
of a sentiment analysis system that extracts and trains on a vast amount of tweets.
The results classify customers' perspectives, via their tweets, into positive and
negative, represented in pie charts, bar diagrams and scatter plots using PHP, CSS
and HTML pages.
Keywords: Data mining, Natural language processing, SentiWordNet, Naïve
Bayes
TABLE OF CONTENTS
LIST OF FIGURES...................................................................................................ix
ABBREVIATIONS....................................................................................................x
1. INTRODUCTION............................................................................................... 1
1.1 Statement of the problem ................................................................................... 2
1.2 Objectives......................................................................................................... 3
1.3 Scope of project ................................................................................................ 3
1.4 System Overview .............................................................................................. 3
1.5 System Features................................................................................................ 4
2. LITERATURE REVIEW..................................................................................... 5
3. FEASIBILITY ANALYSIS................................................................................. 8
3.1.1 Technical Feasibility....................................................................................... 8
3.1.2 Operational Feasibility.................................................................................... 9
3.1.3 Economic Feasibility ...................................................................................... 9
3.1.4 Schedule Feasibility........................................................................................ 9
3.2 Requirement Definition ................................................................................... 10
3.2.1 Functional Requirements ........................................................................... 10
3.2.2 Non-Functional Requirements.................................................................... 10
4. SYSTEM DESIGN AND ARCHITECTURE ......................................................... 12
4.1 Use Case Diagram........................................................................................... 12
4.2 System Flow Diagram ..................................................................................... 13
4.3 Activity Diagram............................................................................................. 15
4.5 Data Flow Diagram ......................................................................................... 17
4.6 Flow Chart...................................................................................................... 18
5. METHODOLOGY ................................................................ 19
5.1 Machine Learning ........................................................................................... 19
5.1.1 Naïve Bayes Classifier (NB)...................................................................... 20
5.2 Natural Language Processing ........................................................................... 25
5.3 Programming tools .......................................................................................... 26
5.3.1 MongoDB ................................................................................................ 26
5.3.2 Python...................................................................................................... 27
5.3.3 NLTK ...................................................................................................... 27
5.3.4 PHP ......................................................................................................... 27
LIST OF FIGURES
Figure 4.1 Use Case Diagram………………………………………………...….13
Figure 4.2 System Flow Diagram………………………………………………..14
Figure 4.3 Activity Diagram……………………………………………………..15
Figure 4.4 Sequence Diagram……………………………………………………16
Figure 4.5 Data Flow Diagram…………………………………………………...17
Figure 4.6 Flow Chart Diagram………………………………………………….18
Figure 7.2.1 Pie-Chart Representation…………………………………………...30
Figure 7.2.2 Bar Graph Diagram…………………………………………………31
Figure 7.2.3 Scatter Plot………………………………………………………….31
ABBREVIATIONS
API : Application Programming Interface
MongoDB : Mongo Database
NLP : Natural Language Processing
NLTK : Natural Language Toolkit
PHP : Hypertext Preprocessor
POS : Parts of Speech
REST : Representational State Transfer
1. INTRODUCTION
Sentiment is an attitude, thought, or judgment prompted by feeling. Sentiment
analysis [1-3], which is also known as opinion mining, studies people’s sentiments
towards certain entities. Internet is a resourceful place with respect to sentiment
information. From a user’s perspective, people are able to post their own content
through various social media, such as forums, micro-blogs, or online social
networking sites. From a researcher’s perspective, many social media sites release
their application programming interfaces (APIs), prompting data collection and
analysis by researchers and developers. For instance, Twitter currently has three
different versions of APIs available [6], namely the REST API, the Search API,
and the Streaming API. With the REST API, developers are able to gather status
data and user information; the Search API allows developers to query specific
Twitter content, whereas the Streaming API is able to collect Twitter content in
real time. Moreover, developers can mix those APIs to create their own
applications. Hence, sentiment analysis seems to have a strong foundation with the
support of massive online data.
However, those types of online data have several flaws that potentially hinder the
process of sentiment analysis. The first flaw is that since people can freely post
their own content, the quality of their opinions cannot be guaranteed. For example,
instead of sharing topic-related opinions, online spammers post spam on forums.
Some spam is entirely meaningless, while other spam carries irrelevant opinions, also
known as fake opinions [7-9]. The second flaw is that ground truth of such online
data is not always available. A ground truth is more like a tag of a certain opinion,
indicating whether the opinion is positive, negative, or neutral. The Stanford
Sentiment140 Tweet Corpus [10] is one of the datasets that has ground truth and
is also publicly available. The corpus contains 1.6 million machine-tagged Twitter
messages.
Microblogging websites have evolved to become a source of varied kinds of
information. This is due to the nature of microblogs, on which people post real-time
messages about their opinions on a variety of topics, discuss current issues,
complain, and express positive sentiment for products they use in daily life. In fact,
companies manufacturing such products have started to poll these microblogs to get
a sense of the general sentiment for their products. Many times these companies study
user reactions and reply to users on microblogs. One challenge is to build
technology to detect and summarize the overall sentiment.
Our project, Tweezer, analyzes tweets posted by people about the products of
companies or brands, or about the performance of political leaders. In order to do
this we analyzed tweets from Twitter. Tweets are a reliable source of information
mainly because people tweet about anything and everything they do, including
buying new products and reviewing them. Besides, many tweets contain hashtags,
which make identifying relevant tweets a simple task. A number of research works
have already been done on Twitter data, most of which mainly demonstrate how
useful this information is for predicting various outcomes. Our current work deals
with outcome prediction and explores localized outcomes.
We collected data using the Twitter public API, which allows developers to extract
tweets from Twitter programmatically.
The collected data, because of the random and casual nature of tweeting, needs to
be filtered to remove unnecessary information. Filtering out problematic tweets,
such as redundant ones and ones with no proper sentences, was done next.
As the preprocessing phase was done to a certain extent, it was possible to guarantee
that analyzing these filtered tweets would give reliable results. Twitter does not
provide gender as a query parameter, so it is not possible to obtain the gender of
a user from his or her tweets. It turns out that Twitter does not ask for the user's
gender while opening an account, so that information is simply unavailable.
1.1 Statement of the problem
The problem at hand consists of two subtasks:
Phrase Level Sentiment Analysis in Twitter:
Given a message containing a marked instance of a word or a phrase,
determine whether that instance is positive, negative or neutral in that
context.
Sentence Level Sentiment Analysis in Twitter:
Given a message, decide whether the message is of positive, negative, or
neutral sentiment. For messages conveying both a positive and negative
sentiment, whichever is the stronger sentiment should be chosen.
1.2 Objectives
The objectives of this project are:
To implement an algorithm for the automatic classification of text into positive
and negative classes
To perform sentiment analysis to determine whether the attitude of the mass is
positive, negative or neutral towards the subject of interest
To graphically represent the sentiment in the form of a Pie-Chart, Bar Diagram
and Scatter Plot.
1.3 Scope of project
This project will be helpful to companies and political parties, as well as to the
common people. It will help a political party review a program that it is going to
launch or a program that it has already carried out. Similarly, companies can get
reviews of their newly released hardware or software products, and movie makers
can get reviews of a currently running movie. By analyzing the tweets, the analyst
can measure how positive, negative or neutral people are about the subject.
1.4 System Overview
This proposal, entitled "TWEEZER", is for a web application which is used to analyze
tweets. We perform sentiment analysis on tweets and determine whether they are
positive, negative or neutral. This web application can be used by any
organization to review its work, by political leaders, or by any company to review
its products or brands.
1.5 System Features
The main feature of our web application is that it helps determine people's opinions
on products, government work, politics or any other subject by analyzing tweets.
Our system is capable of training on new tweets, taking reference from previously
trained tweets.
The computed and analyzed data is represented in various diagrams such as a
Pie-chart, Bar graph and Scatter Plot.
2. LITERATURE REVIEW
Sentiment analysis has been handled as a Natural Language Processing task at many
levels of granularity. Starting from being a document level classification task
(Turney, 2002; Pang and Lee, 2004) [4,5], it has been handled at the sentence level
(Hu and Liu [2], 2004; Kim and Hovy, 2004 [1]) and more recently at the phrase
level (Wilson et al., 2005; Agarwal et al., 2009). Microblog data like Twitter, on
which users post real time reactions to and opinions about “everything”, poses
newer and different challenges. Some of the early and recent results on sentiment
analysis of Twitter data are by Go et al. (2009), (Bermingham and Smeaton, 2010)
and Pak and Paroubek (2010) [3]. Go et al. (2009) use distant learning to acquire
sentiment data. They use tweets ending in positive emoticons like ":)" ":-)" as
positive and tweets ending in negative emoticons like ":(" ":-(" as negative. They build models using
Naive Bayes, Max Ent and Support Vector Machines (SVM), and they report SVM
outperforms other classifiers. In terms of feature space, they try a Unigram, Bigram
model in conjunction with parts-of-speech (POS) features. They note that the
unigram model outperforms all other models. Specifically, bigrams and POS
features do not help. Pak and Paroubek (2010) [3] collect data following a similar
distant learning paradigm. They perform a different classification task though:
subjective versus objective.
For subjective data they collect the tweets ending with emoticons in the same
manner as Go et al. (2009). For objective data they crawl Twitter accounts of popular
newspapers like the "New York Times", the "Washington Post", etc. They report that POS
and bigrams both help (contrary to results presented by Go et al. (2009)). Both these
approaches, however, are primarily based on ngram models. Moreover, the data
they use for training and testing is collected by search queries and is therefore
biased. In contrast, we present features that achieve a significant gain over a
unigram baseline. In addition we explore a different method of data representation
and report significant improvement over the unigram models. Another contribution
of this paper is that we report results on manually annotated data that does not suffer
from any known biases. Our data will be a random sample of streaming tweets
unlike data collected by using specific queries. The size of our hand-labeled data
will allow us to perform cross-validation experiments and check for the variance in
performance of the classifier across folds. Another significant effort for sentiment
classification on Twitter data is by Barbosa and Feng (2010).
They use polarity predictions from three websites as noisy labels to train a model
and use 1000 manually labeled tweets for tuning and another 1000 manually labeled
tweets for testing. They however do not mention how they collect their test data.
They propose the use of syntax features of tweets like retweet, hashtags, link,
punctuation and exclamation marks in conjunction with features like prior polarity
of words and POS of words. We extend their approach by using real valued prior
polarity, and by combining prior polarity with POS. Our results show that the
features that enhance the performance of our classifiers the most are features that
combine prior polarity of words with their parts of speech. The tweet syntax features
help but only marginally. Gamon (2004) performs sentiment analysis on feedback
data from a Global Support Services survey. One aim of their paper is to analyze the
role of linguistic features like POS tags. They perform extensive feature analysis
and feature selection and demonstrate that abstract linguistic analysis features
contribute to the classifier's accuracy. In this paper we perform extensive feature
analysis and show that the use of only 100 abstract linguistic features performs as
well as a hard unigram baseline.
One fundamental problem in sentiment analysis is categorization of sentiment
polarity [5,11]. Given a piece of written text, the problem is to categorize the text
into one specific sentiment polarity: positive, negative or neutral. Based on the
scope of the text, there are three levels of sentiment polarity categorization, namely
the document level, the sentence level, and the entity and aspect level [12]. The
document level concerns whether a document, as a whole, expresses negative or
positive sentiment, while the sentence level deals with each sentence’s sentiment
categorization. The entity and aspect level then targets what exactly people like
or dislike in their opinions.
For feature selection, Pang and Lee [4] suggested removing objective sentences by
extracting subjective ones. They proposed a text-categorization technique that is
able to identify subjective content using minimum cut. Gann et al. [13] selected
6,799 tokens based on Twitter data, where each token is assigned a sentiment score,
namely TSI (Total Sentiment Index), featuring itself as a positive token or a
negative token. Specifically, a TSI for a certain token is computed as:
TSI = (p - (tp/tn) · n) / (p + (tp/tn) · n)

where p is the number of times the token appears in positive tweets, n is the
number of times the token appears in negative tweets, and tp/tn is the ratio of the
total number of positive tweets to the total number of negative tweets.
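As a quick sketch (our own code, with hypothetical counts, not Gann et al.'s implementation), the TSI formula can be computed as:

```python
def tsi(p, n, total_pos, total_neg):
    """Total Sentiment Index of a token.

    p, n: times the token appears in positive / negative tweets.
    total_pos, total_neg: total numbers of positive / negative tweets (tp, tn).
    """
    ratio = total_pos / total_neg               # tp / tn
    return (p - ratio * n) / (p + ratio * n)
```

A token occurring only in positive tweets scores +1, one occurring only in negative tweets scores -1, and a token whose occurrences mirror the corpus balance scores near 0.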
Moreover, [8] showed that the well-known "geo-tagged" feature in Twitter can be
used to identify the polarity towards political candidates in the US, by employing
sentiment analysis algorithms to predict future events such as presidential election
results. Compared to previous approaches to sentiment topics, additional findings
by [10] showed that adding a semantic feature produces better recall in negative
sentiment classification.
3. FEASIBILITY ANALYSIS
A feasibility study is a preliminary study which investigates the information needs
of prospective users and determines the resource requirements, costs, benefits and
feasibility of the proposed system. A feasibility study takes into account various
constraints within which the system should be implemented and operated. In this
stage, the resources needed for the implementation, such as computing equipment,
manpower and costs, are estimated. The estimates are compared with the available
resources and a cost-benefit analysis of the system is made. The feasibility analysis
activity involves the analysis of the problem and the collection of all relevant
information relating to the project. The main objective of the feasibility study is
to determine whether the project would be feasible in terms of economic, technical,
operational and schedule feasibility. It is also to make sure that the input data
required for the project are available.
Thus we evaluated the feasibility of the system in terms of the following categories:
Technical feasibility
Operational feasibility
Economic feasibility
Schedule feasibility
3.1.1 Technical Feasibility
Evaluating technical feasibility is the trickiest part of a feasibility study. This is
because, at this point in time, there is no detailed design of the system, making
it difficult to assess issues like performance, costs (on account of the kind of
technology to be deployed), etc. A number of issues have to be considered while
doing a technical analysis: understand the different technologies involved in the
proposed system. Before commencing the project, we have to be very clear about
which technologies are required for the development of the new system, and whether
the required technology is available. Our system "Tweezer" is technically feasible
since all the required tools are easily available. Python and PHP with
JavaScript can be easily handled. Although all tools seem to be easily available,
there are challenges too.
3.1.2 Operational Feasibility
A proposed project is beneficial only if it can be turned into an information system
that will meet the operating requirements. Simply stated, this test of feasibility
asks whether the system will work when it is developed and installed, and whether
there are major barriers to implementation.
The proposal was to make a simplified web application. It is simple to operate and
can be used in any web page. It is free and not costly to operate.
3.1.3 Economic Feasibility
Economic feasibility attempts to weigh the costs of developing and implementing a
new system against the benefits that would accrue from having the new system in
place. This feasibility study gives the top management the economic justification
for the new system. A simple economic analysis which gives the actual comparison
of costs and benefits is much more meaningful in this case. In addition, it proves
to be a useful point of reference to compare actual costs as the project progresses.
There could be various types of intangible benefits on account of automation. These
include improvements in product quality, better decision making, timeliness of
information, expedited activities, improved accuracy of operations, better
documentation and record keeping, and faster retrieval of information.
This is a web-based application, and its creation is not costly.
3.1.4 Schedule Feasibility
A project will fail if it takes too long to complete and is no longer useful.
Typically, this means estimating how long the system will take to develop, and
whether it can be completed in a given period of time, using methods like the
payback period. Schedule feasibility is a measure of how reasonable the project
timetable is. Given our technical expertise, are the project deadlines reasonable?
Some projects are initiated
with specific deadlines, and it is necessary to determine whether the deadlines are
mandatory or desirable.
A minor deviation can be encountered from the original schedule decided at the
beginning of the project. The application development is feasible in terms of
schedule.
3.2 Requirement Definition
After the extensive analysis of the problems in the system, we familiarized ourselves
with the requirements of the current system. The requirements that the system needs
are categorized into functional and non-functional requirements. These requirements
are listed below:
3.2.1 Functional Requirements
Functional requirements are the functions or features that must be included in any
system to satisfy the business needs and be acceptable to the users. Based on this,
the functional requirements that the system must satisfy are as follows:
The system should be able to process new tweets stored in the database after
retrieval
The system should be able to analyze the data and classify each tweet's polarity
3.2.2 Non-Functional Requirements
Non-functional requirements are a description of the features, characteristics and
attributes of the system, as well as any constraints that may limit the boundaries
of the proposed system.
The non-functional requirements are essentially based on performance, information,
economy, control, security, efficiency and services.
Based on these the non-functional requirements are as follows:
The system should be user friendly
The system should provide good accuracy
The system should perform with efficient throughput and response time
4. SYSTEM DESIGN AND ARCHITECTURE
4.1 Use Case Diagram
Fig 4.1: Use Case Diagram of Tweezer
5. METHODOLOGY
There are primarily two types of approaches for sentiment classification of
opinionated texts:
Using a Machine learning based text classifier such as Naive Bayes
Using Natural Language Processing
We will be using both machine learning and natural language processing for the
sentiment analysis of tweets.
5.1 Machine Learning
The machine learning based text classifiers are a kind of supervised machine
learning paradigm, where the classifier needs to be trained on some labeled training
data before it can be applied to actual classification task. The training data is usually
an extracted portion of the original data hand labeled manually. After suitable
training they can be used on the actual test data. The Naive Bayes is a statistical
classifier whereas Support Vector Machine is a kind of vector space classifier. The
statistical text classifier scheme of Naive Bayes (NB) can be adapted to be used for
sentiment classification problem as it can be visualized as a 2-class text
classification problem: in positive and negative classes. Support Vector machine
(SVM) is a kind of vector space model based classifier which requires that the text
documents should be transformed to feature vectors before they are used for
classification. Usually the text documents are transformed to multidimensional
vectors. The entire problem of classification is then classifying every text document
represented as a vector into a particular class. It is a type of large margin classifier.
Here the goal is to find a decision boundary between two classes that is maximally
far from any document in the training data.
This approach needs:
A good classifier, such as Naive Bayes
A training set for each class
There are various training sets available on the Internet, such as the Movie Reviews
data set, the Twitter dataset, etc. A class can be positive or negative, and for both
classes we need training data sets.
5.1.1 Naïve Bayes Classifier (NB)
The Naïve Bayes classifier is the simplest and most commonly used classifier. The
Naïve Bayes classification model computes the posterior probability of a class based
on the distribution of the words in the document. The model works with bag-of-words
(BOW) feature extraction, which ignores the position of a word in the document. It
uses Bayes' Theorem to predict the probability that a given feature set belongs to
a particular label:
P(label|features) = P(label) × P(features|label) / P(features)
P(label) is the prior probability of a label, i.e., the likelihood that a random
feature set has that label. P(features|label) is the probability that a given feature
set is observed given the label. P(features) is the prior probability that a given
feature set occurs. Given the naïve assumption, which states that all features are
independent, the equation can be rewritten as follows:
P(label|features) = P(label) × P(f1|label) × … × P(fn|label) / P(features)
5.1.1.1 Multinomial Naïve Bayes Classifier
Accuracy – around 75%
Algorithm:
i. Dictionary generation
Count the occurrences of all words in our whole data set and make a dictionary of
the most frequent words.
ii. Feature set generation
Each document is represented as a feature vector over the space of dictionary
words.
For each document, keep track of the dictionary words along with their number of
occurrences in that document.
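Steps (i) and (ii) can be sketched with the standard library; the whitespace tokenization and the dictionary size used here are simplifying assumptions of ours, not the project's actual parameters:

```python
from collections import Counter

def build_dictionary(documents, size=1000):
    """Count word occurrences over the whole data set, keep the most frequent."""
    counts = Counter(w for doc in documents for w in doc.lower().split())
    return [w for w, _ in counts.most_common(size)]

def feature_vector(document, dictionary):
    """Represent a document as occurrence counts over the dictionary words."""
    counts = Counter(document.lower().split())
    return [counts[w] for w in dictionary]
```

Each document then becomes a fixed-length vector of counts, ready for the classifier.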
Formulas used by the algorithm:

φ(k | label = y) = P(x_j = k | label = y): the probability that a particular word in a
document of label y (neg/pos) will be the kth word in the dictionary.

n_i = number of words in the ith document.
m = total number of documents.

Training
In this phase we have to generate the training data (words with their probabilities
of occurrence in the positive/negative training files).
Calculate P(label = y) for each label y.
Calculate φ(k | label = y) for each dictionary word and store the result (here the
labels are negative and positive). With Laplace smoothing,

φ(k | label = y) = (1 + Σ_{i=1..m} Σ_{j=1..n_i} 1{x_j^(i) = k and label^(i) = y})
/ (|V| + Σ_{i=1..m} 1{label^(i) = y} · n_i)
Now we have each word and its corresponding probability for each of the defined
labels.
Testing
Goal: find the sentiment of a given test data file.
Generate the feature set x for the test data file.
For each document in the test set, find
Decision1 = log P(x | label = pos) + log P(label = pos)
Similarly calculate
Decision2 = log P(x | label = neg) + log P(label = neg)
Compare Decision1 and Decision2 to determine whether the document carries negative
or positive sentiment.
The following tables and calculations show details of tweet data processing, feature
extraction, analysis and tweet polarity classification based on the Naïve Bayes
algorithm and classifier.
Suppose you have a set of documents, each with a classification:
DOC TEXT CLASS
1 I loved the movies +
2 I hated the movies -
3 a great movies. good movies +
4 poor acting -
5 great acting. a good movies +
Ten Unique words:
<I, loved, the, movies, hated, a, great, poor, acting, good>
Convert each document into a feature set, where the attributes are the possible
words and the values are the number of times a word occurs in the given document.
DOC  I  loved  the  movies  hated  a  great  poor  acting  good  CLASS
1    1    1     1      1      0    0    0      0      0      0      +
2    1    0     1      1      1    0    0      0      0      0      -
3    0    0     0      2      0    1    1      0      0      1      +
4    0    0     0      0      0    0    0      1      1      0      -
5    0    0     0      1      0    1    1      0      1      1      +
Documents with positive outcomes:
DOC  I  loved  the  movies  hated  a  great  poor  acting  good  CLASS
1    1    1     1      1      0    0    0      0      0      0      +
3    0    0     0      2      0    1    1      0      0      1      +
5    0    0     0      1      0    1    1      0      1      1      +
P(+) = 3/5 = 0.6
Compute: P(i|+); P(loved|+); P(the|+); P(movies|+);
P(a|+); P(great|+); P(acting|+); P(good|+)
Let n be the number of words in the (+) case: 14. Let nk be the number of times
word k occurs in these (+) cases.
Let P(wk|+) = (nk + 1) / (n + |Vocabulary|)
P(i|+) = (1+1)/(14+10) = 0.0833;  P(loved|+) = (1+1)/(14+10) = 0.0833;
P(the|+) = (1+1)/(14+10) = 0.0833;  P(movies|+) = (4+1)/(14+10) = 0.2083;
P(a|+) = (2+1)/(14+10) = 0.125;  P(great|+) = (2+1)/(14+10) = 0.125;
P(acting|+) = (1+1)/(14+10) = 0.0833;  P(good|+) = (2+1)/(14+10) = 0.125;
P(hated|+) = (0+1)/(14+10) = 0.0417;  P(poor|+) = (0+1)/(14+10) = 0.0417
Now, let’s look at the negative examples
DOC  I  loved  the  movies  hated  a  great  poor  acting  good  CLASS
2    1    0     1      1      1    0    0      0      0      0      -
4    0    0     0      0      0    0    0      1      1      0      -
P(-) = 2/5 = 0.4
P(i|-) = (1+1)/(6+10) = 0.125;  P(loved|-) = (0+1)/(6+10) = 0.0625;
P(the|-) = (1+1)/(6+10) = 0.125;  P(movies|-) = (1+1)/(6+10) = 0.125;
P(a|-) = (0+1)/(6+10) = 0.0625;  P(great|-) = (0+1)/(6+10) = 0.0625;
P(acting|-) = (1+1)/(6+10) = 0.125;  P(good|-) = (0+1)/(6+10) = 0.0625;
P(hated|-) = (1+1)/(6+10) = 0.125;  P(poor|-) = (1+1)/(6+10) = 0.125
Now that we have trained our classifier, let's classify a new sentence according
to:
vNB = argmax over vj in V of  P(vj) * [product over w in words of P(w|vj)]
where vj stands for a "value", i.e. a class (+ or -).
"I hated the poor acting"
If vj = +: P(+) P(I|+) P(hated|+) P(the|+) P(poor|+) P(acting|+) = 6.03 * 10^-7
If vj = -: P(-) P(I|-) P(hated|-) P(the|-) P(poor|-) P(acting|-) = 1.22 * 10^-5
Since 1.22 * 10^-5 > 6.03 * 10^-7, the sentence is classified as negative (-).
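The hand computation above can be reproduced in a few lines of Python. This is a
minimal sketch written only for this worked example: the training documents are
entered as the word-count tables above, and the Laplace-smoothed estimates
follow P(wk|class) = (nk + 1) / (n + |Vocabulary|).

```python
from collections import Counter

# Vocabulary and per-document word counts, copied from the tables above.
VOCAB = ["I", "loved", "the", "movies", "hated",
         "a", "great", "poor", "acting", "good"]

docs = [
    (Counter(I=1, loved=1, the=1, movies=1), "+"),
    (Counter(I=1, the=1, movies=1, hated=1), "-"),
    (Counter(movies=2, a=1, great=1, good=1), "+"),
    (Counter(poor=1, acting=1), "-"),
    (Counter(movies=1, a=1, great=1, acting=1, good=1), "+"),
]

def train(docs):
    """Estimate class priors and Laplace-smoothed word likelihoods."""
    priors, likelihood = {}, {}
    for label in ("+", "-"):
        class_counts = [c for c, lab in docs if lab == label]
        priors[label] = len(class_counts) / len(docs)
        n = sum(sum(c.values()) for c in class_counts)  # words in this class
        for w in VOCAB:
            nk = sum(c[w] for c in class_counts)
            likelihood[(w, label)] = (nk + 1) / (n + len(VOCAB))
    return priors, likelihood

def classify(words, priors, likelihood):
    """Return the argmax class and both unnormalized scores.
    Assumes every word appears in the vocabulary."""
    scores = {}
    for label in ("+", "-"):
        scores[label] = priors[label]
        for w in words:
            scores[label] *= likelihood[(w, label)]
    return max(scores, key=scores.get), scores

priors, likelihood = train(docs)
label, scores = classify(["I", "hated", "the", "poor", "acting"],
                         priors, likelihood)
# label is "-": about 1.22e-5 for (-) versus about 6.03e-7 for (+)
```

The trained probabilities match the hand-computed values above, e.g.
P(movies|+) = 5/24 = 0.2083.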
5.2 Natural Language Processing
Natural language processing (NLP) is a field of computer science, artificial
intelligence, and linguistics concerned with the interactions between computers
and human (natural) languages. This approach utilizes the publicly available
SentiWordNet resource, which provides sentiment polarity values for every term
occurring in the document. In this lexical resource, each term t occurring in
WordNet is associated with three numerical scores obj(t), pos(t) and neg(t),
describing the objective, positive and negative polarities of the term,
respectively. These three
scores are computed by combining the results produced by eight ternary classifiers.
WordNet is a large lexical database of English. Nouns, verbs, adjectives and
adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a
distinct concept.
WordNet is also freely and publicly available for download. WordNet's structure
makes it a useful tool for computational linguistics and natural language
processing. It groups words together based on their meanings; a synset is
nothing but a set of one or more synonyms. This approach uses semantics to
understand the language.
The major NLP tasks that help in extracting sentiment from a sentence are:
- Extracting the part of the sentence that reflects the sentiment
- Understanding the structure of the sentence
- Applying the different tools that help process the textual data
Positive and negative scores are obtained from SentiWordNet for each word
according to its part-of-speech tag; then, by summing the total positive and
negative scores, we determine the sentiment polarity based on which class (i.e.
either positive or negative) has received the highest score.
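The score-and-compare scheme described above can be sketched as follows. Since
the real SentiWordNet database is far too large to reproduce here, the LEXICON
dictionary below is a tiny hypothetical stand-in: in the actual system, each
(pos, neg) pair would come from a SentiWordNet lookup keyed by the word and its
part-of-speech tag.

```python
# Hypothetical stand-in for SentiWordNet: (word, POS tag) -> (pos, neg) scores.
LEXICON = {
    ("good", "ADJ"): (0.75, 0.00),
    ("poor", "ADJ"): (0.00, 0.63),
    ("hated", "VERB"): (0.00, 0.75),
    ("acting", "NOUN"): (0.00, 0.00),
}

def polarity(tagged_words):
    """Sum positive and negative scores over (word, POS tag) pairs and
    return whichever class received the highest total."""
    pos_total = neg_total = 0.0
    for word, tag in tagged_words:
        pos, neg = LEXICON.get((word.lower(), tag), (0.0, 0.0))
        pos_total += pos
        neg_total += neg
    return "positive" if pos_total > neg_total else "negative"
```

Words absent from the lexicon simply contribute nothing, mirroring how purely
objective terms carry no polarity weight.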
5.3 Programming tools
5.3.1 MongoDB
MongoDB is an open source database that uses a document-oriented data model.
MongoDB is one of several database types to arise in the mid-2000s under the
NoSQL banner. Instead of using tables and rows as in relational databases,
MongoDB is built on an architecture of collections and documents. Documents
comprise sets of key-value pairs and are the basic unit of data in MongoDB.
Collections contain sets of documents and function as the equivalent of relational
database tables.
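As an illustration of the collection/document model, a scraped tweet can be
held as one document of key-value pairs. The field names below are hypothetical
rather than the project's actual schema, and a plain Python list stands in for
a collection so the sketch runs without a MongoDB server; the commented pymongo
calls show roughly what the real insert and query would look like.

```python
# One tweet as a MongoDB-style document of key-value pairs.
# All field names and values here are illustrative only.
tweet_doc = {
    "tweet_id": "862310984206224001",  # hypothetical ID
    "text": "I hated the poor acting",
    "query": "some brand",
    "polarity": "negative",
}

# With a running MongoDB server and the pymongo driver, storage would look
# roughly like this (not executed in this sketch):
#   from pymongo import MongoClient
#   col = MongoClient()["tweezer"]["tweets"]
#   col.insert_one(tweet_doc)
#   col.find({"polarity": "negative"})

# An in-memory list of documents stands in for a collection here.
collection = [tweet_doc]
negatives = [d for d in collection if d["polarity"] == "negative"]
```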
5.3.2 Python
Python is a widely used high-level, general-purpose, interpreted, dynamic
programming language. Its design philosophy emphasizes code readability, and its
syntax allows programmers to express concepts in fewer lines of code than possible
in languages such as C or Java. The language provides constructs intended to enable
writing clear programs on both a small and large scale.
5.3.3 NLTK
NLTK is a leading platform for building Python programs to work with human
language data. It provides easy-to-use interfaces to over 50 corpora and lexical
resources such as WordNet, along with a suite of text processing libraries for
classification, tokenization, stemming, tagging, parsing, and semantic reasoning,
wrappers for industrial-strength NLP libraries, and an active discussion forum.
NLTK has been called “a wonderful tool for teaching, and working in,
computational linguistics using Python,” and “an amazing library to play with
natural language.” NLTK is suitable for linguists, engineers, students, educators,
researchers, and industry users alike. Natural Language Processing with
Python provides a practical introduction to programming for language processing.
Written by the creators of NLTK, it guides the reader through the fundamentals of
writing Python programs, working with corpora, categorizing text, analyzing
linguistic structure, and more.
5.3.4 PHP
PHP is an HTML-embedded, server-side scripting language designed for web
development. It is also used as a general-purpose programming language. PHP
code is simply mixed with HTML code and can be used in combination with
various web frameworks. Its scripts are executed on the server, where the code
is processed by a PHP interpreter. The main goal of PHP is to allow web
developers to create dynamically generated pages quickly. A PHP file consists
of text, HTML tags and scripts.
5.3.5 JavaScript
We make use of JavaScript to incorporate different plugins in our web page.
JavaScript is the programming language of the Web. JavaScript resides inside
HTML documents and can provide levels of interactivity to web pages that are
not achievable with HTML alone.
5.3.6 HTML
We use HTML for rendering the analyzed results in the web page. HTML is the
standard markup language for creating Web pages. HTML describes the structure
of Web pages using markup. HTML elements are the building blocks of HTML pages.
5.3.7 Highcharts
Highcharts is a charting library written in pure JavaScript, offering an easy way of
adding interactive charts to your web site or web application.
6. TESTING
6.1. Unit Testing
Unit testing is performed to test modules against the detailed design. Inputs
to the process are usually compiled modules from the coding process. The
modules are assembled into larger units during the unit testing process.
Testing has been performed at each phase of project design and coding. We
carried out testing of each module interface to ensure the proper flow of
information into and out of the program unit. We made sure that temporarily
stored data maintains its integrity throughout the algorithm's execution by
examining the local data structures. Finally, all error-handling paths were
also tested.
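Testing a single module against its detailed design can be sketched with
Python's unittest framework. The clean_tweet helper below is a hypothetical
stand-in for one of the project's preprocessing modules, not its actual code:

```python
import re
import unittest

def clean_tweet(text):
    """Hypothetical helper mirroring the preprocessing step: strip
    punctuation, collapse whitespace, and lowercase the tweet."""
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation -> spaces
    return " ".join(text.split()).lower()

class CleanTweetTest(unittest.TestCase):
    """Unit tests exercising the module interface described above."""

    def test_punctuation_removed(self):
        self.assertEqual(clean_tweet("Great!!! acting..."), "great acting")

    def test_whitespace_collapsed(self):
        self.assertEqual(clean_tweet("  poor   acting "), "poor acting")
```

Tests like these are collected and executed with `python -m unittest`.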
6.2. System Testing
We perform system testing to find errors resulting from unanticipated
interactions between sub-systems and system components. Once the source code is
generated, the software must be tested to detect and rectify as many errors as
possible before delivering it to the customer. To find errors, a series of test
cases must be developed which ultimately uncover the existing errors. Different
software testing techniques can be used for this process. These techniques
provide systematic guidance for designing tests that:
- Exercise the internal logic of the software components, and
- Exercise the input and output domains of the program to uncover errors in
program function, behavior and performance.
We test the software using two methods:
White box testing: internal program logic is exercised using these test case
design techniques.
Black box testing: software requirements are exercised using these test case
design techniques.
Both techniques help in finding maximum number of errors with minimal effort and
time.
6.3. Performance Testing
Performance testing is done to test the run-time performance of the software
within the context of the integrated system. These tests are carried out
throughout the testing process. For example, the performance of individual
modules is assessed during white box testing under unit testing.
6.4. Verification and Validation
The testing process is part of the broader subject of verification and
validation. We have to acknowledge the system specifications and try to meet
the customer's requirements, and for this purpose we have to verify and
validate the product to make sure everything is in place. Verification and
validation are two different things: one is performed to ensure that the
software correctly implements a specific functionality, and the other is done
to ensure that the customer requirements are properly met by the end product.
Verification asks 'are we building the product right?' while validation asks
'are we building the right product?'.
7. ANALYSIS AND RESULTS
7.1 Analysis
We collected a dataset containing positive and negative examples, which served
as training data for a Naive Bayes classifier. Before training the classifier,
unnecessary words, punctuation and meaningless words were removed to obtain
clean data. To determine the positivity and negativity of tweets, we collected
data using the Twitter API. These data were stored in the database and then
retrieved to remove the same unnecessary words and punctuation. To check the
polarity of a test tweet, we apply the classifier trained on that data. The
results were stored in the database and then retrieved and presented using PHP,
HTML, JavaScript and CSS.
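The cleaning step described above can be sketched as follows; the stopword list
and regular expressions are illustrative assumptions, not the project's exact
rules.

```python
import re

# Hypothetical stopword list; the project's actual list may differ.
STOPWORDS = {"the", "a", "an", "is", "rt"}

def preprocess(tweet):
    """Clean a raw tweet before classification: drop URLs, @mentions and
    hashtags, strip punctuation, lowercase, and remove stopwords."""
    tweet = re.sub(r"https?://\S+", "", tweet)   # remove links
    tweet = re.sub(r"[@#]\w+", "", tweet)        # remove mentions and hashtags
    tweet = re.sub(r"[^a-zA-Z\s]", " ", tweet)   # keep letters only
    words = [w.lower() for w in tweet.split()]
    return [w for w in words if w not in STOPWORDS]
```

For example, preprocess("RT @user: I hated the poor acting! http://t.co/xyz")
yields the word list ["i", "hated", "poor", "acting"], which is what the
classifier actually sees.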
7.2 Result
After facing a number of errors and successfully eliminating them, we have
completed our project through continuous effort. At the end of the project, the
results can be summarized as:
- A user-friendly web-based application.
- No expertise is required for using the application.
- Organizations can use the application to visualize product or brand reviews
graphically.
Fig 7.2.1: Pie-Chart Representation
8. LIMITATION AND FUTURE ENHANCEMENT
8.1 Limitation
The system we designed determines the opinion of the public based on Twitter
data. We completed our project but were able to determine only the positivity
and negativity of tweets; for neutral data, we were unable to merge a suitable
dataset. Also, we currently analyze only 25 live tweets, which may not give
representative values and results, so the results are not very accurate.
8.2 Future Enhancement
- Analyzing sentiments on emoticons/smileys.
- Determining neutrality.
- Potential improvements can be made to our data collection and analysis
method.
- Future research can be done with possible improvements such as more refined
data and a more accurate algorithm.
CONCLUSION
We have completed our project using Python as the main language, with PHP,
HTML and JavaScript for output presentation. Although there was a problem in
the integration of Python and PHP, through a number of tutorials we were able
to integrate them.
We were able to determine the positivity and negativity of each tweet. Based on
those tweets, we represented the results in diagrams such as a bar graph, pie
chart and scatter plot. All the diagrams related to the outcome are shown in
fig (7.2.1), fig (7.2.2) and fig (7.2.3). A small conclusion is also shown
during output presentation based on the product or brand entered. Our designed
system is user friendly, and all results are displayed in the web page.
REFERENCES
1. Kim S-M, Hovy E (2004) Determining the sentiment of opinions. In:
   Proceedings of the 20th International Conference on Computational
   Linguistics, page 1367. Association for Computational Linguistics,
   Stroudsburg, PA, USA.
2. Liu B (2010) Sentiment analysis and subjectivity. In: Handbook of Natural
   Language Processing, Second Edition. Taylor and Francis Group, Boca Raton,
   FL. Liu B, Hu M, Cheng J (2005) Opinion observer: Analyzing and comparing
   opinions on the web. In: Proceedings of the 14th International Conference on
   World Wide Web, WWW '05, 342-351. ACM, New York, NY, USA.
3. Pak A, Paroubek P (2010) Twitter as a corpus for sentiment analysis and
   opinion mining. In: Proceedings of the Seventh Conference on International
   Language Resources and Evaluation. European Language Resources Association,
   Valletta, Malta.
4. Pang B, Lee L (2004) A sentimental education: Sentiment analysis using
   subjectivity summarization based on minimum cuts. In: Proceedings of the
   42nd Annual Meeting of the Association for Computational Linguistics,
   ACL '04. Association for Computational Linguistics, Stroudsburg, PA, USA.
5. Pang B, Lee L (2008) Opinion mining and sentiment analysis. Foundations and
   Trends in Information Retrieval 2(1-2): 1-135.
6. Twitter (2014) Twitter APIs. https://dev.twitter.com/start.
7. Liu B (2014) The science of detecting fake reviews.
   http://content26.com/blog/bing-liu-the-science-of-detecting-fake-reviews/
8. Jahanbakhsh K, Moon Y (2014) The predictive power of social media: On the
   predictability of U.S. presidential elections using Twitter.
9. Mukherjee A, Liu B, Glance N (2012) Spotting fake reviewer groups in
   consumer reviews. In: Proceedings of the 21st International Conference on
   World Wide Web, WWW '12, 191-200. ACM, New York, NY, USA.
10. Saif H, He Y, Alani H (2012) Semantic sentiment analysis of Twitter. In:
    The Semantic Web, 508-524. ISWC.
11. Tan LK-W, Na J-C, Theng Y-L, Chang K (2011) Sentence-level sentiment
    polarity classification using a linguistic approach. In: Digital Libraries:
    For Cultural Heritage, Knowledge Dissemination, and Future Creation, 77-87.
    Springer, Heidelberg, Germany.
12. Liu B (2012) Sentiment Analysis and Opinion Mining. Synthesis Lectures on
    Human Language Technologies. Morgan & Claypool Publishers.
13. Gann W-JK, Day J, Zhou S (2014) Twitter analytics for insider trading fraud
    detection system. In: Proceedings of the Second ASE International
    Conference on Big Data. ASE.
14. Joachims T (1997) Probabilistic analysis of the Rocchio algorithm with
    TFIDF for text categorization. In: Proceedings of the ICML conference.
15. Yung-Ming Li, Tsung-Ying Li. Deriving market intelligence from microblogs