Graph-based Analysis and Opinion Mining in Social Network

Project Report: Graph-based Analysis and Opinion Mining
in Social Network
Khan Mostafa
Stony Brook University
Student ID# 109365509
khan.@nafSadh.com
ABSTRACT
This is the final report for the Networks & Data Mining Techniques
project, which mines a social network to estimate public opinion
about entities and their associated keywords. The project mines
Twitter for recent feeds and analyzes them to estimate, for each
tweet, a sentiment score, the entities discussed and the keywords
describing them. This data is then used to elicit the overall
sentiment associated with each entity. The extracted entities and
keywords are also used to form an entity-keyword bigraph. This graph
is further used to detect entity communities and the keywords found
within those communities. The presented implementation works in
linear time.
Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications –
Data Mining.
General Terms
Algorithms, Documentation, Experimentation.
Keywords
Opinion mining, sentiment, graph clustering, graph community
detection.
1. INTRODUCTION
This project focuses on mining opinion from a social network. It
takes Twitter as a model platform because it has a publicly available
stream of posts from people of diverse demographics. The goal is to
report public opinion in two forms: (a) overall opinion about some
entity and (b) opinion-based clusters of entities and keywords.
Public opinion can be mined from posts about an entity of interest.
First, ample posts are fetched from the public stream. Then, each
post is individually scored to find its embedded subjectivity. Not
all posts are subjective: some assert information while others
express feelings. Hence, posts can be broadly classified as
objective, positive or negative. However, subjective bias is not
discrete; rather, each post embodies mixed polarity. Moreover,
attempts to annotate posts manually have shown that different people
associate sentiment with the same post differently. Therefore, this
project focuses on calculating sentiment scores for posts. After each
post is individually scored, overall opinion is represented using a
few aggregate parameters, including overall score, diversity, and
the percentage of each type of polar post. A set of keywords (kw) is
also identified to report how the entity (E) is positively and
negatively described.
In this project sentiment analysis follows an approach similar to
[1], combining two naïve Bayes classifiers to calculate a polarity
score: a PoS-tag-based classifier and an n-gram-based classifier.
Keywords and entities are primarily detected using parts of speech.
Then, in the combined analysis, keywords that occur infrequently for
an entity are discarded, as such a word is not sufficiently
associated with the entity. Likewise, candidate keywords that occur
in the descriptions of too many entities are unlikely to be keywords;
they are more likely stop-words or generic words.
After tweets are individually analyzed, further overall analysis can
be done. Tweets are collected from the recent public feed stream
using the Twitter API. Analysis reports a polarity score, a set of
keywords and a set of entities for each tweet. From the analyzed
tweets, an entity-keyword bigraph (E×kw) is first computed. In the
E×kw bigraph an edge exists between E and kw if both occur in the
same tweet. These edges also have an associated polarity score. The
E×kw bigraph is then used to generate an E×E graph. In E×E, there
is an edge between two entities if they share a keyword with similar
sentiment bias. This E×E graph is then clustered using a local
clustering algorithm in linear time.
This project is implemented mainly using the .NET Framework (C#) and
partially using PHP on an Apache server to access the Twitter API
[2]. PoS tagging is done using a third-party TreeTagger recently
developed for tweets [3].
The main contributions of this project are:
• Implemented a sentiment analysis tool that can elicit scores for
individual tweets
• Implemented a way to report an aggregate sentiment score and
associated keywords for a queried entity
• Devised and implemented a simple approach to identify
entities and keywords in tweets
• Implemented a fast local graph clustering algorithm using split
vectors instead of full-blown matrices
• Used the fast local graph clustering to detect and report entity
groups along with keywords and grouped polarity scores
The following sections of this report give an overview of prior
work, describe the methodology, and present results and analysis of
the mined data.
2. BACKGROUND
Mining a social network to elicit public opinion requires
sentiment analysis, keyword and entity tagging, and graph clustering.
Sentiment analysis has been studied extensively in several fields
and is still an open problem. There has also been ample
investigation into detecting communities, partitioning, and finding
clusters in graphs. In this section a few prior works are briefly
discussed.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise,
or republish, to post on servers or to redistribute to lists, requires prior
specific permission and/or a fee.
CSE590 Network and Data Mining Techniques, Fall, 2013, Stony Brook
University, NY, USA.
Copyright 2013

2.1 Sentiment Analysis
Sentiment analysis has been studied thoroughly for a decade or
more. One of the earliest works, by Pang et al. [4], amongst
others, investigated sentiment classification. This investigation
opened a wide arena of research and has led to many results from a
multitude of researchers in different fields. Statistics,
computational linguistics and machine learning have all been
applied to the challenge of sentiment analysis.
There are several lexicon-based techniques for opinion mining, viz.
[5] and versions of SentiWordNet [6], [7]. A detailed survey of many
lexicon-based approaches is given in [8].
Although earlier studies [9] suggested using only adjectives as a
subjectivity measure, later investigations revealed that sentiment
appraisal is much more diverse. Whitelaw et al. [10] suggested using
appraisal taxonomies for sentiment classification. Similar
observations were made by [11] and [12], which found that
adjectives, verbs and adverbs together work better than adjectives
alone.
Machine learning approaches have widely used Support Vector
Machines, e.g. [1], [13], and naïve Bayes classifiers, e.g. [4],
[14]. Latent Dirichlet Allocation (LDA) has also been utilized, e.g.
[15], [16]. A lexicon-based holistic approach [17] has also been
described to address context dependency.
Opinion mining and sentiment analysis on Twitter have been
investigated using various approaches, viz. [14], [18], [1], [16],
[19].
Most approaches to opinion mining assign a strict subjectivity class
(positive, negative, neutral) to individual texts at different
granularities (i.e. sentence, post, paragraph and document).
However, a score assignment serves better to convey the intensity
of opinion. There is a paucity of studies that try to aggregate
sentiment to identify public opinion. Perception of opinion varies
between individuals, and a better insight into public opinion can
be gained by eliciting a few attributes from social media. Overall
sentiment score, percentages of positive and negative opinions, and
key descriptions are useful attributes that can be elicited. This
project focuses on mining tweets about some entity for these
attributes associated with that entity.
2.2 Keyword and Entity detection
There are diverse approaches to keyword detection. For example,
there are machine learning based approaches that use SVM [20] or
that combine linguistic knowledge such as n-grams and PoS for
supervised keyword extraction [21]. Thesaurus-based approaches [22]
use semantic knowledge for machine-based keyword extraction.
Most keyword identification approaches use some kind of machine
learning technique along with some other knowledge. However, for
this project's purpose, a simple method suffices to identify
keywords. This project employs hints from PoS tagging and then lets
the data itself build a keyword lexicon while simultaneously
detecting keywords.
¹ Modularity is the fraction of the edges that fall within the given
groups minus the expected such fraction if edges were distributed at
random. [Wikipedia, accessed Dec 03, 2013]
2.3 Graph clustering
Graphs have historically been studied extensively from mathematical
and theoretical viewpoints, and in the last few decades they have
increasingly been studied from a data-analytic perspective. Many
real-world and physical phenomena can be modeled naturally as
graphs. These graphs can then be efficiently investigated to find
latent characteristics of the modeled data.
One major operation on graphs in data mining is dividing them into
smaller parts. Partitioning can be of different types. One approach
is to partition the whole graph into disjoint subgraphs of similar
size [23].
For analyzing graphs, a more natural division is often desired.
Vertices in a graph tend to be connected to vertices that share
their other neighbors, and thus form communities. However,
communities differ in size and are not disconnected from one
another. Rather, there are few links between nodes of different
communities compared with links between nodes of the same
community. Newman and others have conducted several studies
[24] [25] [26] on detecting communities in graphs. They exploited
the modularity¹ of the graph to do so. Most of their early work was
limited in scalability, but later spectral optimization of
modularity yielded [27] an algorithm that works in near-linear time.
Modularity-based approaches cluster a graph into disjoint
communities. In practice, however, communities often overlap.
Andersen et al. [28] proposed “Local Graph Partitioning using
PageRank Vectors” and other derived algorithms. The core idea
behind these approaches is to use the conductance² of the graph to
cluster it locally. These approaches run in near-linear time and can
detect communities that overlap.
This project uses an approach derived from Andersen et al., as it
serves several of the project's goals: it can detect communities
that overlap, runs in near-linear time, and can be implemented
without necessarily creating the blown-up full matrix.
3. PROJECT DESCRIPTION
3.1 Problem Statement
People express their opinions about entities (viz. locations,
persons, products etc.) in social networks. In brief, the goal is to
• extract the overall public opinion of some entity
• elicit opinion-based entity groups in the recent stream
The scope of the project is to mine a popular microblogging
platform: Twitter.
² Conductance is a measure of how strongly a subgraph is connected
to the rest of the graph. It is the ratio of out-links from the
subgraph to its volume (the total edge count from nodes in it).
3.1.1 Extract overall public opinion of some entity
The goal is to extract opinion about a given entity, E. This is
done in terms of ample recent tweets about E. The solution shall be
able to yield the following about a given entity, E:
• Overall sentiment: Overall sentiment (viz. positive, negative,
mixed) about E. A sentiment score in the range [-1, 1] is given.
This also shows the percentage of positive, negative and neutral
tweets (a threshold can be applied to distinguish between these
three classes) as well as the count of analyzed tweets. A measure
(e.g. variance) of how diverse the opinion is can also be included.
• Key description: The system yields a set of keywords (kw) that
are used to describe E
An overall sentiment about an entity is useful to a multitude of
clients for various applications. Sets of key descriptive words
along with sentiment provide better insight into public feeling.
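The aggregate attributes listed in this subsection can be
illustrated with a minimal Python sketch (the project itself is
written in C#; Python is used here only for illustration). The
function name, the dictionary keys and the default threshold of 0.5
are assumptions made for this example; the source only requires that
some threshold separates the three classes and suggests variance as
one possible diversity measure.

```python
from statistics import mean, pvariance


def summarize_opinion(scores, threshold=0.5):
    """Aggregate per-tweet polarity scores (each in [-1, 1]) for one entity.

    `threshold` is an assumed cut-off: scores above +threshold count as
    positive, scores below -threshold as negative, the rest as neutral.
    """
    if not scores:
        raise ValueError("no scores to summarize")
    n = len(scores)
    positive = sum(1 for s in scores if s > threshold)
    negative = sum(1 for s in scores if s < -threshold)
    return {
        "score": mean(scores),            # overall sentiment score
        "diversity": pvariance(scores),   # variance as a diversity measure
        "post_count": n,
        "percent_positive": 100.0 * positive / n,
        "percent_negative": 100.0 * negative / n,
        "percent_neutral": 100.0 * (n - positive - negative) / n,
    }


summary = summarize_opinion([0.8, -0.6, 0.0, 0.2])
print(summary["post_count"], summary["percent_positive"])
```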
3.1.2 Opinion based entity groups in recent stream
The goal is to detect how entities are grouped together in terms of
sentiment and descriptive keywords. This is done based on a stream
of recent tweets. Each tweet is individually analyzed, as in 3.1.1.
Analysis of each tweet yields:
• Text in the tweet, T
• Entities discussed in it, E
• Keywords in it, kw
• Polarity score, P
These tuples (T,E,kw,P) are then used to build the E×kw bigraph
such that:
• There exists an edge between Ei and kwj if one or more tweets
contain both Ei and kwj
• The edge has a weight indicating the co-occurrence count of Ei
and kwj, i.e.
weightij = Count({Tk | Ei ∈ Tk.E ∧ kwj ∈ Tk.kw})
• The edge has a pScore that is the average of pScore (= P) over
all such occurrences, i.e.
pScoreij = Sum({Tk.pScore | Ei ∈ Tk.E ∧ kwj ∈ Tk.kw}) / weightij
After this, a filter is run on this graph to eliminate links between
an entity and a keyword where the keyword is not sufficiently
descriptive of the entity. This is done by calculating freq such
that
freqij = weightij / Occurrence(Ei)
If freqij is smaller than a certain threshold, εfreq, then keyword
kwj is filtered out for entity Ei.
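A minimal Python sketch of this frequency filter follows. The
function name and the choice of a dictionary keyed by
(entity, keyword) pairs are assumptions made for the example; only
the rule freq = weight / Occurrence(E) compared against εfreq comes
from the text above.

```python
def filter_weak_keywords(weights, entity_occurrence, eps_freq):
    """Drop (entity, keyword) edges whose relative frequency is too low.

    weights:            {(entity, keyword): co-occurrence count}
    entity_occurrence:  {entity: number of tweets mentioning the entity}
    eps_freq:           threshold on freq = weight / Occurrence(entity)
    """
    kept = {}
    for (entity, keyword), weight in weights.items():
        freq = weight / entity_occurrence[entity]
        if freq >= eps_freq:  # edges with freq below the threshold are removed
            kept[(entity, keyword)] = weight
    return kept


weights = {("Paris", "beautiful"): 8, ("Paris", "went"): 1}
print(filter_weak_keywords(weights, {"Paris": 20}, eps_freq=0.1))
```

With Occurrence(Paris) = 20, the two edges have freq 0.4 and 0.05, so
only the first survives a threshold of 0.1.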
This E×kw bigraph is then used to build the E×E graph, such that
there exists an edge between Ei and Ej if
• Occurrence(Ei) > εeo ∧ Occurrence(Ej) > εeo
• {kwx ∈ kw(Ei) | Occurrence(kwx) < εkwo} ⋂ {kwx ∈ kw(Ej) |
Occurrence(kwx) < εkwo} is not empty
• The polarity biases of both links are similar
In other words, there is an edge between two entities if they share
one or more keywords through links with similar polarity bias. Only
entities that occur more often than a threshold, εeo, are
considered. Only keywords that occur fewer than some threshold,
εkwo, times are considered. This threshold on keywords is motivated
by the following intuition:
• If a potential keyword occurs in the descriptions of most
entities, then it is not a keyword but a generic term
Then, a community detection algorithm is run on this E×E graph to
find groups of entities that are bound together by many
polarity-aligned keyword links. After one such group of entities is
generated, there is a group of keywords that occur on edges within
that community of nodes. Also, a representative averaged pScore can
be calculated for such a group.
To summarize, given a stream of tweets, the system shall be able to
generate:
• (T,E,kw,P) tuples
• E×kw bigraph
• E×E graph
• Groups of entities that have similar opinion
3.2 Data collection
3.2.1 Corpus and entity from Twitter
This project requires collecting two types of data. First, a corpus
of subjective and objective tweets is collected; this data is used
to train the classifier (scorer). After training, the trained model
(not the training data set) can be stored in a file so that the
scorer can later be restored by loading it from the file.
Second, at query time posts are fetched from Twitter.
The following Twitter APIs are used:
• search/tweets
This API is called with ‘q’ = emoticons to gather training
data (positive and negative posts).
At query time, the same API is used with ‘q’ = query term to fetch
related recent posts.
• statuses/user_timeline
This API is used to fetch objective training data by querying
'screen_name' = popular_stream. I used Lifehacker,
Gizmodo, New York Times, and The Atlantic as sources.
The Twitter API does not allow fetching more than 100 posts at once.
Hence, I had to use max_id to iteratively repeat the same call
for different portions of the result. I collected ten thousand
tweets of each type for training. At query time 200~2000 posts are
fetched.
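The max_id iteration can be sketched in Python as below. The sketch
deliberately does not call Twitter: fetch_page is a hypothetical
callable, supplied by the caller, that wraps one search/tweets
request and returns posts newest first with IDs no greater than
max_id. The function names and the dictionary shape of a post are
assumptions for illustration only.

```python
def fetch_all(fetch_page, query, total, page_size=100):
    """Collect up to `total` posts by walking backwards with max_id.

    fetch_page(query, count, max_id) is a hypothetical wrapper around one
    API request. It must return a list of dicts with an integer "id",
    newest first, all with id <= max_id (or the newest posts when
    max_id is None).
    """
    posts = []
    max_id = None
    while len(posts) < total:
        count = min(page_size, total - len(posts))
        page = fetch_page(query, count, max_id)
        if not page:
            break  # no older posts are available
        posts.extend(page)
        # Ask next time only for posts strictly older than the oldest seen.
        max_id = min(p["id"] for p in page) - 1
    return posts


# Stand-in for the real API: 250 fake posts with ids 1000 down to 751.
FAKE_IDS = list(range(1000, 750, -1))


def fake_fetch(query, count, max_id):
    pool = [i for i in FAKE_IDS if max_id is None or i <= max_id]
    return [{"id": i} for i in pool[:count]]


print(len(fetch_all(fake_fetch, "mermaid", 230)))
```

Subtracting one from the smallest ID seen avoids receiving the oldest
post of one page again as the newest post of the next.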
3.2.2 Mining recent twitter stream
To generate an E×E graph large enough to detect groupings of
entities, a large portion of the Twitter public stream has to be
collected. To do this, the Twitter API is again used and strapped
continuously over a large number of rate-limit windows. Note that,
in v1.1, the Twitter API allows only 180 search queries per window
per user and 450 queries per window per app. Each query returns a
maximum of 100 tweets. Currently, windows are 15 minutes each.
Hence, max_id is utilized to continuously fetch tweets using a
q=”.” query. An alternative to the search/tweets API would be a
streaming API.
After tweets are fetched, very short tweets are discarded. I
filtered out tweets with fewer than 50 characters, because shorter
tweets are difficult to interpret. Also, retweets (RT) are discarded
to avoid counting the same tweet many times. Furthermore, another
stage of filtration is applied to remove any remaining duplicate
tweets.
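A minimal Python sketch of these three filters is given below. The
50-character minimum comes from the text; recognizing retweets by a
leading "RT @" and treating tweets as duplicates when they match
after lower-casing and whitespace normalization are assumptions made
for the example, since the source does not say how either check is
implemented.

```python
def clean_tweets(texts, min_length=50):
    """Drop very short tweets, retweets and duplicate tweets."""
    seen = set()
    kept = []
    for text in texts:
        stripped = text.strip()
        if len(stripped) < min_length:
            continue  # too short to interpret reliably
        if stripped.startswith("RT @"):
            continue  # assumed retweet marker
        key = " ".join(stripped.lower().split())  # assumed duplicate key
        if key in seen:
            continue
        seen.add(key)
        kept.append(stripped)
    return kept


sample = "The new phone camera is surprisingly good in low light, really impressed"
print(clean_tweets([sample, "RT @someone: " + sample, "too short"]))
```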

3.2.3 PoS Tagging
After tweets are collected they are passed to a TreeTagger for PoS
tagging. I used the recently developed GATE Twitter part-of-speech
tagger [3], which is based on the Stanford TreeTagger, which in turn
is based on the well-known TreeTagger [29] by Schmid. The PoS tags
yielded are based on the Penn Treebank tagset [30].
3.3 Implementation
3.3.1 Twitter corpus to train sentiment classifier
Each post is individually scored by two scorers. Following
Pak and Paroubek (2010) [1], two classifiers are built. To train
them, tweets are queried as follows: (1) positive tweets are fetched
with a search of q=”” (positive emoticons), (2) negative tweets are
fetched with a search of q=”” (negative emoticons) and (3) objective
tweets are fetched from new media accounts. One classifier exploits
the parts-of-speech (PoS) distribution amongst objective and polar
statements. The PoS distribution also differs between positive and
negative statements; see Figure 1 and Figure 2. The other classifier
exploits the distribution of n-grams (n=2). N-grams correlate
strongly with bias or with objectivity. People usually use common
phrases to express a given type of feeling, while some other phrases
are of an assertive nature. This feature of natural language is
captured using n-grams. See Table 2 for top 20 polar n-grams of 94k
n-grams.
The reference work used the classification results from the two
classifiers to reach a final classification. This project enhances
the approach by implementing the classifiers as scorers that
evaluate a PoS score and an n-gram score for each statement. Both
scores then contribute to the final score of the statement (tweet).
3.3.2 From strapped tweets to graphs
As outlined in 3.1.2, (T,E,kw,P) tuples, the E×kw bigraph and the
E×E graph are generated from a given stream of tweets.
3.3.2.1 Analyzing tweets
First, each tweet is scored using the sentiment classifier
described in 3.3.1.
PoS tags are then used for a first-pass identification of entities
and keywords.
Entity: Our goal is to analyze entities (location, place, person,
product etc.). In English, they are generally represented by proper
nouns. Also, in Twitter, users can be regarded as entities. Hence,
words with the PoS tags NNP, NNPS and USR (proper nouns and user
mentions) are regarded as entities.
Keyword: In English, adjectives, adverbs and verbs are used to
describe an entity. This property is exploited by identifying words
with tags for these PoS (JJ, RB, VB etc.) as keywords. The
algorithm also offers an alternative, selected by a parameter, that
includes common nouns (not NNP) as keywords.
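The tag-based selection of entities and keywords can be sketched in
Python as follows. The input is assumed to be a list of
(token, tag) pairs as a PoS tagger would produce. Matching keyword
tags by the prefixes JJ, RB and VB (so that JJR, RBS, VBD and so on
are included) is this example's reading of "JJ, RB, VB etc.", and
NN/NNS are assumed to be the common-noun tags; the function and
constant names are hypothetical.

```python
ENTITY_TAGS = {"NNP", "NNPS", "USR"}       # proper nouns and user mentions
KEYWORD_PREFIXES = ("JJ", "RB", "VB")      # adjectives, adverbs, verbs
COMMON_NOUN_TAGS = {"NN", "NNS"}           # optional keyword source


def extract(tagged, nouns_as_keywords=False):
    """Split a PoS-tagged tweet into candidate entities and keywords."""
    entities, keywords = [], []
    for token, tag in tagged:
        if tag in ENTITY_TAGS:
            entities.append(token)
        elif tag.startswith(KEYWORD_PREFIXES):
            keywords.append(token.lower())
        elif nouns_as_keywords and tag in COMMON_NOUN_TAGS:
            keywords.append(token.lower())
    return entities, keywords


tagged = [("Paris", "NNP"), ("is", "VBZ"), ("really", "RB"),
          ("beautiful", "JJ"), ("city", "NN")]
print(extract(tagged))
print(extract(tagged, nouns_as_keywords=True))
```

Generic words such as "is" pass this first stage; they are meant to
be removed by the later frequency and generic-term filters.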
3.3.2.2 Entity-keyword bigraph
From the analyzed tweets, (T,E,kw,P) tuples are iterated over to
build an E×kw bigraph as described in 3.1.2. A general intuition,
also confirmed by several studies, is that such graphs are generally
sparse. Thus, instead of building a full-blown matrix, two
dictionaries (maps) are stored to represent the E×kw bigraph:
• A dictionary of entities, with pointers to keywords, as well as
weight and pScore associated with that node
• For ease of iteration, another dictionary of keywords, which
stores pointers back from keywords to entities
This representation ensures small storage for the entire bigraph,
yet describes the entire bigraph with its edges and nodes. It
reduces the storage from (E*kw) to edgeCount. Note that,
2*(E+kw) < edgeCount << (E*kw)
The running time for building the bigraph is proportional to the
number of edges, i.e. O(edges).
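The two-dictionary representation can be sketched in Python as
below. The function name and the exact nesting of the dictionaries
are assumptions for illustration; the weight (co-occurrence count)
and pScore (average polarity over co-occurring tweets) follow the
definitions in 3.1.2.

```python
from collections import defaultdict


def build_bigraph(tuples):
    """Build the E×kw bigraph from (text, entities, keywords, p_score) tuples.

    Returns two dictionaries:
      by_entity[entity][keyword] = {"weight": count, "pScore": average}
      by_keyword[keyword]        = set of entities linked to the keyword
    """
    sums = defaultdict(dict)        # entity -> keyword -> [count, score sum]
    by_keyword = defaultdict(set)   # keyword -> entities (back pointers)
    for _text, entities, keywords, p_score in tuples:
        for entity in set(entities):
            for keyword in set(keywords):
                edge = sums[entity].setdefault(keyword, [0, 0.0])
                edge[0] += 1
                edge[1] += p_score
                by_keyword[keyword].add(entity)
    by_entity = {
        entity: {kw: {"weight": w, "pScore": s / w} for kw, (w, s) in kws.items()}
        for entity, kws in sums.items()
    }
    return by_entity, dict(by_keyword)


tuples = [
    ("t1", ["Paris"], ["beautiful"], 0.8),
    ("t2", ["Paris", "London"], ["beautiful", "rainy"], 0.4),
]
by_entity, by_keyword = build_bigraph(tuples)
print(by_entity["Paris"]["beautiful"]["weight"], sorted(by_keyword["beautiful"]))
```

Only edges that actually occur are stored, so memory grows with the
number of edges rather than with the product of entity and keyword
counts.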
3.3.2.3 Entity-Entity graph
From the E×kw bigraph generated above, an E×E graph is generated by
iterating over each entity. For each entity, Ei, its set of keywords
kw(Ei) is processed. Each keyword points to another set of
entities, E(kw(Ei)). This set of entities is added to the neighbors
of Ei. In this step too, a dictionary is used to represent the
graph. It requires one dictionary of entities, where each entry
also points to its immediate neighbors. This requires storage of
2*edges. The runtime to build this graph is proportional to the
number of edges. In addition, entities are filtered a priori to
remove nodes with very few neighbors from consideration (thus
building a set of significant entities). Filtering generic terms
from the keyword list (thus using only legitimate keywords) reduces
the search space.
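A minimal Python sketch of the E×E construction follows. It expects
the bigraph as two dictionaries: graph[entity][keyword] holding a
"pScore", and by_keyword[keyword] holding the set of linked
entities. Several points are assumptions made for the example and
are not stated in the source: Occurrence(kw) is taken to be the
number of entities a keyword is linked to, "similar polarity bias"
is taken to mean that both edge pScores have the same sign, and the
sketch iterates over keywords rather than entities, which yields the
same edges.

```python
from itertools import combinations


def build_entity_graph(graph, by_keyword, entity_occurrence, eps_eo, eps_kwo):
    """Link entities that share a legitimate keyword with same-sign pScore."""
    neighbors = {}
    for keyword, entities in by_keyword.items():
        if len(entities) >= eps_kwo:
            continue  # generic term: linked to too many entities
        significant = sorted(
            e for e in entities if entity_occurrence.get(e, 0) > eps_eo
        )
        for a, b in combinations(significant, 2):
            # Assumed similarity test: both links lean the same way.
            if graph[a][keyword]["pScore"] * graph[b][keyword]["pScore"] > 0:
                neighbors.setdefault(a, set()).add(b)
                neighbors.setdefault(b, set()).add(a)
    return neighbors


graph = {
    "A": {"good": {"pScore": 0.5}, "the": {"pScore": 0.1}},
    "B": {"good": {"pScore": 0.3}, "the": {"pScore": 0.2}},
    "C": {"good": {"pScore": -0.4}, "the": {"pScore": 0.1}},
}
by_keyword = {"good": {"A", "B", "C"}, "the": {"A", "B", "C", "D"}}
occurrence = {"A": 5, "B": 5, "C": 5, "D": 1}
print(build_entity_graph(graph, by_keyword, occurrence, eps_eo=2, eps_kwo=4))
```

In the example, "the" is skipped as generic, and C is not linked to A
or B because its "good" link has the opposite sign.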
3.3.3 Keywords from data
Keywords are filtered in several steps to let the data define the
legitimate keywords. In the first step, PoS tagging defines a
preliminary set. After all tweets are analyzed, a filter removes
low-frequency terms from the keyword list of each entity. After the
E×kw bigraph is built, another filter rules out generic terms.
Generic terms are those potential keywords that are found for too
many entities. A threshold parameter is supplied to the algorithm
for this. Finally, after communities are generated, a consolidation
step filters out irregular keywords to yield the final set of
keywords.
3.3.4 Community detection: group of entities
After the E×E graph is generated, consisting of legitimate keywords
and significant entities, a community detection algorithm can be
used to detect communities in it. This project implements a fast
derivation of the algorithm of Andersen et al. [28].
Table 1. Community Detection Algorithm
1. Significant_entities := entities in (E×E)
2. seed := supplied_seed
3. if (seed is null or not in Significant_entities) then
       seed := first(Significant_entities)
4. aCommunity := new Community()
5. entity := seed
6. eval := evaluate(entity, aCommunity)
7. if (eval.member) then
       aCommunity.Add(entity)
       remove(entity, Significant_entities)
   remove(entity, aCommunity.Nbor)
8. if (aCommunity.Nbor = empty)
       goto 11
9. entity := first(aCommunity.Nbor)
10. goto 6
11. add(aCommunity, Communities)
12. if (Significant_entities not empty)
       seed := null
       goto 3
13. return

The algorithm described above uses objects of class Community. Its
Add() member function adds the entity and updates the community's
volume (= edges inside) and outward links. The evaluate() function
checks membership by calculating the conductance the community
would have if this node were added, and comparing it with the
original conductance.
Conductance is defined as
Cond = (links outward from community)/(edges inside).
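The loop of Table 1 can be illustrated with a short Python sketch.
It is not the project's C# implementation and makes several
assumptions: volume is taken as the sum of the degrees of the member
nodes (so that a single-node community has a defined conductance), a
candidate is admitted when adding it does not increase conductance,
a rejected candidate is dropped from the frontier, and the first
element is chosen as the smallest name to keep the output
deterministic. For clarity the sketch recomputes conductance from
scratch at every step, so it does not have the linear running time
of an incrementally updated Community object.

```python
def conductance(members, adj):
    """Outward links divided by volume (sum of member degrees, assumed)."""
    volume = sum(len(adj[n]) for n in members)
    cut = sum(1 for n in members for m in adj[n] if m not in members)
    return cut / volume if volume else 1.0


def detect_communities(adj, seed=None):
    """Greedy local grouping; adj maps each node to a set of neighbors."""
    remaining = set(adj)
    communities = []
    while remaining:
        if seed not in remaining:
            seed = min(remaining)  # stand-in for first(Significant_entities)
        members = {seed}
        remaining.discard(seed)
        frontier = sorted(adj[seed] & remaining)
        tried = set()
        while frontier:
            node = frontier.pop(0)
            if node in tried or node not in remaining:
                continue
            tried.add(node)
            if conductance(members | {node}, adj) <= conductance(members, adj):
                members.add(node)
                remaining.discard(node)
                frontier.extend(sorted((adj[node] & remaining) - tried))
        communities.append(members)
        seed = None  # reseed for the next community
    return communities


# Two triangles joined by the single edge c-d.
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
print([sorted(c) for c in detect_communities(adj)])
```

On this toy graph, adding d to the first triangle would raise the
conductance from 1/7 to 2/10, so d is rejected and later seeds the
second group.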
This generates a set of communities. After each community is
generated, a consolidation step further filters its keywords. This
is done as follows:
size := size of community := number of entities in it
Threshold := ln(size)
If (Occurrence(kw) < Threshold) then Remove(kw)
After this step, a set of descriptive keywords is associated with
the group of entities.
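The consolidation rule can be written as a few lines of Python. The
function name and the input format (a dictionary mapping each
keyword to its occurrence count within the community) are
assumptions for this example; the ln(size) threshold comes from the
rule above.

```python
import math


def consolidate_keywords(keyword_counts, community_size):
    """Keep keywords whose in-community occurrence reaches ln(size)."""
    threshold = math.log(community_size)
    return {kw: n for kw, n in keyword_counts.items() if n >= threshold}


# A community of 10 entities has threshold ln(10), about 2.30.
print(consolidate_keywords({"good": 5, "odd": 1}, community_size=10))
```

For a community of two entities the threshold is ln(2), about 0.69,
so any keyword that occurs at least once is kept.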
3.3.5 Storing result
The final set of communities is returned as an XML document by the
implementation. The (T,E,kw,P) tuples are also returned as XML. The
intermediate graphs, the E×kw bigraph and the E×E graph, are
exported as CSV (comma-separated values) files.
4. RESULTS AND FINDINGS
4.1 Findings
Findings reported here are based on 160,711 tweets collected in late
November of 2013.
4.1.1 PoS Distributions and n-grams
The figures later in this section show the PoS distributions over
subjective versus objective statements and over positive versus
negative statements. A positive bias value in Figure 1 indicates
that the presence of that PoS makes a statement more likely to be
positive; the same holds for negative values. The subjectivity score
in Figure 2 is interpreted similarly. Table 2 shows the top few
n-grams. Note that the PoS distributions and top n-grams differ
slightly from the referred work [1]. Moreover, if training data is
collected at a different time, some slight change will occur.
Table 2. Top n-grams with occurrences in each class of data
n-gram                    Positive  Negative  Objective
'enjoying break'          1         328       1
'happy birthday'          22        207       1
'so happy'                106       53        1
'follow back'             10        132       1
'miss my'                 93        10        1
'no one notices'          97        4         1
'notices my'              97        1         1
'good day'                5         82        1
'follow please'           47        38        1
'my phone'                64        18        1
'presenting emotional'    60        20        1
'please follow'           11        66        1
'follow love'             17        60        1
'am sorry'                71        4         1
'so sad'                  71        3         1
'miss u'                  65        7         1
'new followers'           53        17        1
Figure 1. Distribution of PoS in positive and negative
statements
Figure 2. Distribution of PoS between subjective and objective
tweets
4.1.2 Power law in Entity and Keywords
Figure 3 and Figure 4 show how entity and keyword occurrences
follow a power law.
Figure 3. ln(Occurrence) of entities shows a power law
Figure 4. ln(Occurrence) of keywords shows a power law
4.1.3 Distribution of Polarity Score in Entities
Figure 5 shows how the polarity score is distributed amongst
entities. The polarity score has a skewed distribution. Figure 6
shows the distribution of polarity score over the natural logarithm
(ln) of the occurrence of the entity.
Data for Figure 1 (PoS tag, bias):
POS 0.600; WP$ 0.500; PDT 0.333; RBS 0.280; URL 0.229; WP 0.217;
JJS 0.187; SYM 0.176; USR 0.155; FW 0.127; NNP 0.110; CD 0.068;
DT 0.032; VB 0.000; UH -0.004; NN -0.007; JJR -0.010; IN -0.012;
NNS -0.015; JJ -0.019; RBR -0.024; WDT -0.031; VBG -0.034;
NNPS -0.050; VBZ -0.055; EX -0.064; MD -0.099; CC -0.102;
PRP$ -0.114; PRP -0.135; VBP -0.144; TO -0.149; RP -0.175;
RB -0.182; VBD -0.227; VBN -0.245; WRB -0.282
Data for Figure 2 (PoS tag, subjectivity):
WRB 0.164; VBN 0.140; VBD 0.128; RB 0.100; RP 0.096; TO 0.081;
VBP 0.078; PRP 0.072; PRP$ 0.061; CC 0.054; MD 0.052; EX 0.033;
VBZ 0.028; NNPS 0.025; VBG 0.017; WDT 0.016; RBR 0.012; JJ 0.010;
NNS 0.008; IN 0.006; JJR 0.005; NN 0.003; UH 0.002; VB 0.000;
LS 0.000; DT -0.016; CD -0.033; NNP -0.052; FW -0.060; USR -0.072;
SYM -0.081; JJS -0.085; WP -0.098; URL -0.103; RBS -0.123;
PDT -0.143; WP$ -0.200; POS -0.231

Figure 5. Distribution of Polarity Score over entire entity space
Figure 6. Polarity Score over ln(Occurrence) of entities
4.1.4 Graph BFS & communities in adjacency matrix
Starting from an arbitrary node, the E×E graph is traversed in BFS
(breadth-first search) order to generate an arbitrary ordering of
the nodes. This BFS assigns an index to each entity, and the
adjacency matrix is then visualized as in Figure 7. Notice that this
is a near-diagonal matrix, although the diagonal itself is white, as
there are no self-edges. Notice the blocks; these blocks are
representative of communities. There are tiny and large communities.
There are 157 communities, with a maximum size of 136.
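The BFS indexing and the resulting adjacency matrix can be sketched
in Python as below. The function names are hypothetical, neighbors
are visited in sorted order only to make the output reproducible,
and the sketch indexes just the connected component of the start
node; a full run would restart the traversal from every node not yet
indexed.

```python
from collections import deque


def bfs_order(adj, start):
    """Assign consecutive indices to nodes in breadth-first order."""
    index = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in sorted(adj[node]):
            if nbr not in index:
                index[nbr] = len(index)
                queue.append(nbr)
    return index


def adjacency_matrix(adj, index):
    """0/1 matrix whose rows and columns follow the BFS indices."""
    n = len(index)
    matrix = [[0] * n for _ in range(n)]
    for node, i in index.items():
        for nbr in adj[node]:
            if nbr in index:
                matrix[i][index[nbr]] = 1
    return matrix


adj = {"a": {"b", "c"}, "b": {"a", "d"}, "c": {"a"}, "d": {"b"}}
index = bfs_order(adj, "a")
for row in adjacency_matrix(adj, index):
    print(row)
```

Because BFS gives neighboring nodes nearby indices, densely connected
groups show up as blocks close to the diagonal.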
Figure 7. Adjacency matrix of significant entities
4.1.5 Observation of Groups
Feed tweet sets of different sizes are examined. The number of
significant entities and the number of legitimate keywords increase
with the size of the tweet set. They all yield communities of
different sizes. When these communities and keywords were examined
manually, they matched intuition. An interesting community, in
which the keyword "cries" is associated with two stars, is shown in
Figure 8.
<Community id="146" size="2" conductance="0.5"
pScore="0.63566754320156">
<trapped-keywords count="1">
Cries:4,
</trapped-keywords>
<e>Kristen Stewart</e>
<e>Robert Pattinson</e>
</Community>
Figure 8. XML representation of a community
4.2 Results
Figure 9 shows some sample runs where the system is queried for
overall sentiment analysis of an entity.
<opinion entity='mermaid'>
<score>0.21</score>
<analysis
post-count='1086'
percent-positive='52.03'
percent-negative='24.59'/>
</opinion>
<opinion entity='bankrupt'>
<score>-0.18</score>
<analysis
post-count='2073'/>
</opinion>
<opinion entity='drunk man'>
<score>-0.50</score>
<analysis
post-count='1084'/>
</opinion>
<opinion entity='November'>
<score>0.20</score>
<analysis
post-count='2062'/>
</opinion>
Figure 9. Result runs for query over entity
A few parameters are varied on the sample to see how they affect
the result. The kw threshold (εkwo), minimum nodes (εeo) and the
common-noun-as-keyword option are varied, and the results are shown
in Table 3. Using common nouns as keywords yields a few groups of
very large size. Thus, it is recommended to exclude common nouns
from keywords.
Table 3. Effect of parameters change
Kw threshold              350     350     450
Minimum nodes             2       2       2
Common Noun as keyword    false   true    false
Potential kw              15108   31593   15108
Legitimate kw             14967   31368   14997
Entities                  97147   97147   97147
E occurring > 2           7580    7580    7580
Significant E.            1190    2012    1378
Groups                    170     92      157
Largest size              70      1256    136
Polarity scores of each entity and keyword are stored and can be
accessed directly in the E×kw bigraph.
Building a polarity-invariant E×kw bigraph was also tested. For
settings similar to the last column of Table 3, the
polarity-invariant version generated 174 groups, with the largest
group of size 598, for 1854 significant entities. The generated
groups are also significantly different.
The generated files containing the result sets are kept online at
http://meaningofdata.com/mining
4.3 Performance
4.3.1 Sentiment Scoring
There is no available way to evaluate correctness of the overall
sentiment analysis. Therefore, the performance of individual scoring
is tested against publicly available Mechanical Turk annotated
Twitter data [18]. This data set includes 3771 annotated tweets.
Note that each tweet was annotated by three humans: they annotated
21600 tweets, and all three agreed on only 3771 of them. As the test
set has strict classification, test tweets are scored and then
classified for testing purposes with a threshold of 0.5 (i.e. tweets
scoring above +0.5 are regarded as positive, those scoring below
-0.5 as negative, and the rest as neutral). This yields only a 61%
match with the sentiment labels of the test data. However, most
disagreements are seen in entries annotated as non-biased. For
biased posts, the mismatch is around 26%.
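The thresholding used for this evaluation can be sketched in Python
as follows. The function names and the label strings are assumptions
for the example, and the scores and labels shown are made-up toy
values, not data from the evaluation set.

```python
def to_label(score, threshold=0.5):
    """Map a polarity score in [-1, 1] to a strict class label."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


def agreement(scores, gold_labels, threshold=0.5):
    """Fraction of tweets whose thresholded score matches the annotation."""
    matches = sum(
        1 for s, g in zip(scores, gold_labels) if to_label(s, threshold) == g
    )
    return matches / len(gold_labels)


# Toy values for illustration only.
scores = [0.9, -0.7, 0.1, 0.6]
gold = ["positive", "negative", "neutral", "negative"]
print(agreement(scores, gold))
```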
4.3.2 Opinion based entity groups in recent stream
Strapping data from Twitter takes a long time due to the query
restriction per window. The third-party GATE TreeTagger I used
performs slowly, and this hinders overall performance. However,
the part implemented for this project runs fast, in a linear
fashion. Table 4 lists how running time varies with sample size.
Table 4. Performance of graph analysis for different data size
                        Sample 1   Large Sample   Very large Sample
Tweets                  160711     485447         847276
Time to analyze each    48.91s     148.53s        262.01s
Build Bigraph           9.29s      34.24s         66.45s
Generate EE graph       1.54s      3.49s          4.99s
Time to Find Groups     0.126s     0.310s         0.358s
Groups count            157        334            457
Largest Group size      136        183            162
Significant Entities    1378       2627           3560
Legitimate Keywords     14997      25818          35005
5. CONCLUSION
This project has devised and studied an approach to mining a social
network to elicit public opinion about entities. Public opinion is
represented as an analysis of individual entities and a graph
analysis of entities based on polarity-aligned keyword
relationships.
Sentiment analysis itself is still an open problem and needs further
investigation. This project uses an approach to analyzing the
sentiment of tweets that is built with Twitter as the learning
corpus. This approach yields a polarity score rather than a discrete
polarity marker. To elicit the overall opinion about an entity, an
aggregate polarity score and representative keywords are detected.
For grouping entities, an entity graph is built from the
entity-keyword bigraph involving polarity scores. A local community
detection mechanism is then used to cluster them.
The problem of detecting keywords is solved with an embedded
approach. During the steps that build entity groups from strapped
tweets, keywords are filtered from the raw candidate set down to
the final set of keywords. This approach can be useful in building
keyword lexicons dynamically.
A report of sample runs of the implementation is also included in
this document. Several key observations are noted in section 4.
6. REFERENCES
[1] A. Pak and P. Paroubek, "Twitter as a Corpus for Sentiment
Analysis and Opinion Mining," in Language Resources and
Evaluation, 2010.
[2] Twitter, "REST API v1.1 Resources," [Online]. Available:
https://dev.twitter.com/docs/api/1.1.
[3] "GATE Twitter part-of-speech tagger," [Online]. Available:
https://gate.ac.uk/wiki/twitter-postagger.html.
[4] B. Pang, L. Lee and S. Vaithyanathan, "Thumbs up?
Sentiment Classification using Machine Learning
Techniques," in Proceedings of the ACL-02 conference on
Empirical methods in natural language processing,
Philadelphia, PA, USA, 2002.
[5] T. Wilson, J. Wiebe and P. Hoffmann, "Recognizing
contextual polarity in phrase-level sentiment analysis," in
HLT '05 Proceedings of the conference on Human Language
Technology and Empirical Methods in Natural Language
Processing, Stroudsburg, PA, USA, 2005.
[6] A. Esuli and F. Sebastiani, "Sentiwordnet: A publicly
available lexical resource for opinion mining," in
Proceedings of LREC, 2006.
[7] S. Baccianella, A. Esuli and F. Sebastiani, "SentiWordNet
3.0: An Enhanced Lexical Resource for Sentiment Analysis
and Opinion Mining," in LREC, 2010.
[8] M. Taboada, J. Brooke, M. Tofiloski, K. Voll and M. Stede,
"Lexicon-based methods for sentiment analysis,"
Computational linguistics, vol. 37, pp. 267-307, 2011.
[9] V. Hatzivassiloglou and J. M. Wiebe, "Effects of adjective
orientation and gradability on sentence subjectivity," in
Proceedings of the 18th conference on Computational
linguistics-Volume 1, 2000.
[10] C. Whitelaw, N. Garg and S. Argamon, "Using appraisal
groups for sentiment analysis," in Proceedings of the 14th
ACM international conference on Information and
knowledge management, 2005.
[11] F. Benamara, C. Cesarano, A. Picariello, D. Reforgiato and
V. Subrahmanian, "Sentiment Analysis: Adjectives and
Adverbs are better than Adjectives Alone," in International
Conference on Weblogs and Social Media, Boulder, CO
USA, 2007.
[12] V. S. Subrahmanian and D. Reforgiato, "AVA: Adjective-
verb-adverb combinations for sentiment analysis,"
Intelligent Systems, vol. 23, no. 4, pp. 43-50, 2008.

[13] T. Mullen and N. Collier, "Sentiment Analysis using Support
Vector Machines with Diverse Information Sources," in
EMNLP, 2004.
[14] A. Bifet and E. Frank, "Sentiment knowledge discovery in
twitter streaming data," in Discovery Science, Berlin
Heidelberg, Springer, 2010, pp. 1-15.
[15] C. Lin and Y. He, "Joint sentiment/topic model for sentiment
analysis," in Proceedings of the 18th ACM conference on
Information and knowledge management, 2009.
[16] S. Tan, Y. Li, H. Sun, Z. Guan, X. Yan, J. Bu, C. Chen and
X. He, "Interpreting the Public Sentiment Variations on
Twitter," IEEE Transactions on Knowledge and Data
Engineering, vol. 6, no. 1, pp. 1-14, 2012.
[17] X. Ding, B. Liu and P. S. Yu, "A holistic lexicon-based
approach to opinion mining," in WSDM '08 Proceedings of
the 2008 International Conference on Web Search and Data
Mining, New York, NY, USA, 2008.
[18] S. Narr, "Annotated Twitter Sentiment Dataset," [Online].
Available: http://data.dai-labor.de/corpus/sentiment/.
[Accessed 7 10 2013].
[19] "Sentiment140," [Online]. Available:
http://www.sentiment140.com.
[20] K. Zhang, H. Xu, J. Tang and J. Li, "Keyword Extraction
Using Support Vector Machine," in Advances in Web-Age
Information Management, Springer, 2006, pp. 85-96.
[21] A. Hulth, "Improved automatic keyword extraction given
more linguistic knowledge," in EMNLP '03 Proceedings of
the 2003 conference on Empirical methods in natural
language processing, Stroudsburg, PA, USA, 2003.
[22] O. Medelyan and I. H. Witten, "Thesaurus based automatic
keyphrase indexing," in Proceedings of the 6th ACM/IEEE-
CS joint conference on Digital libraries, 2006.
[23] G. Karypis and V. Kumar, "Multilevel k-way Partitioning
Scheme for Irregular Graphs," J. Parallel Distrib. Comput,
vol. 48, no. 1, pp. 96-129, 1998.
[24] M. Girvan and M. E. J. Newman, "Community structure in
social and biological networks," Proc. Natl. Acad. Sci.
USA, vol. 99, no. 12, pp. 7821-7826, 2002.
[25] M. E. J. Newman, "Fast algorithm for detecting community
structure in networks," Phys. Rev. E, vol. 69, 066133, 2004.
[26] A. Clauset, M. E. J. Newman and C. Moore, "Finding
community structure in very large networks," Phys. Rev. E,
vol. 70, 066111, 2004.
[27] M. E. J. Newman, "Modularity and community structure in
networks," Proc. Natl. Acad. Sci. USA, vol. 103,
pp. 8577-8582, 2006.
[28] R. Andersen, F. Chung and K. Lang, "Local graph
partitioning using pagerank vectors," in Foundations of
Computer Science, FOCS'06. 47th Annual IEEE Symposium
on, 2006.
[29] H. Schmid, "TreeTagger," TC project at the Institute for
Computational Linguistics of the University of Stuttgart,
1994.
[30] B. Santorini, Part-of-speech tagging guidelines for the Penn
Treebank Project, 3rd revision ed., 1990.
[31] A. Go, R. Bhayani and L. Huang, "Twitter sentiment
classification using distant supervision," Stanford, 2009.
[32] L. Derczynski, A. Ritter, S. Clark and K. Bontcheva, "Twitter
Part-of-Speech Tagging for All: Overcoming Sparse and
Noisy Data," in Proceedings of the International Conference
on Recent Advances in Natural Language Processing, 2013.
