Project Report: Graph-based Analysis and Opinion Mining
in Social Network
Khan Mostafa
Stony Brook University
Student ID# 109365509
khan.@nafSadh.com
ABSTRACT
This is the final report for the Networks & Data Mining Techniques
project, which mines a social network to estimate public opinion
about entities and their associated keywords. The project fetches
recent tweets from Twitter and analyzes each one to estimate its
sentiment score, the entities it discusses and the keywords that
describe them. This data is then used to elicit the overall
sentiment associated with each entity. The extracted entities and
keywords are also used to form an entity-keyword bigraph. This
graph is further used to detect entity communities and the keywords
found within those communities. The presented implementation runs
in linear time.
Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications –
Data Mining.
General Terms
Algorithms, Documentation, Experimentation.
Keywords
Opinion mining, sentiment, graph clustering, graph community
detection.
1. INTRODUCTION
This project focuses on mining opinion from social networks. It
takes Twitter as a model platform because it has a publicly
available stream of posts from people of diverse demographics. The
goal is to report public opinion in two forms: (a) overall opinion
about some entity and (b) opinion-based clusters of entities and
keywords.
Public opinion can be mined from posts about an entity of interest.
First, a large number of posts are fetched from the public stream.
Then, each post is individually scored to find its embedded
subjectivity. Not all posts are subjective: some assert information
while others express feelings. Hence, posts can be broadly
classified as objective, positive or negative. However, subjective
bias is not discrete; rather, each post embodies mixed polarity.
Moreover, attempts to annotate posts manually have shown that
different people associate different sentiment with the same post.
Therefore, this project focuses on calculating sentiment scores for
posts. After each post is individually scored, overall opinion is
represented using a few aggregate parameters, including overall
score, diversity, and the percentage of each type of polar post. A
set of keywords (kw) is also identified to report how the entity (E)
is positively and negatively described.
In this project, sentiment analysis follows an approach similar to
[1], combining two naïve Bayes classifiers to calculate a polarity
score: a PoS-tag-based classifier and an n-gram-based classifier.
Keywords and entities are primarily detected using parts of speech.
Then, in the combined analysis, keywords that occur infrequently for
an entity are discarded, as such words are not sufficiently
associated with the entity. Likewise, candidate keywords that occur
in the descriptions of too many entities are unlikely to be
keywords; they are more likely stop-words or generic words.
After tweets are individually analyzed, further overall analysis
can be done. Tweets are collected from the recent public feed using
the Twitter API. Analysis reports a polarity score, a set of
keywords and a set of entities for each tweet. From the analyzed
tweets, an entity-keyword bigraph (E×kw) is computed first. In the
E×kw bigraph, an edge exists between E and kw if both occur in the
same tweet. These edges also have an associated polarity score. The
E×kw bigraph is then used to generate an E×E graph. In E×E, there
is an edge between two entities if they share a keyword with
similar sentiment bias. This E×E graph is then clustered in linear
time using a local clustering algorithm.
This project is implemented mainly on the .NET Framework (C#), with
a small part written in PHP on an Apache server to access the
Twitter API [2]. PoS tagging is done using a third-party TreeTagger
recently developed for tweets [3].
The main contributions of this project are:
• Implemented a sentiment analysis tool that can elicit scores for
individual tweets
• Implemented a way to report an aggregate sentiment score and
associated keywords for a queried entity
• Devised and implemented a simple approach to identify
entities and keywords in tweets
• Implemented a fast local graph clustering algorithm using split
vectors instead of full-blown matrices
• Used the fast local graph clustering to detect and report entity
groups along with keywords and grouped polarity scores
The following sections give an overview of prior work, describe the
methodology, and present results and analysis of the mined data.
2. BACKGROUND
Mining a social network to elicit public opinion requires sentiment
analysis, keyword and entity tagging, and graph clustering.
Sentiment analysis has been studied extensively in several fields
and is still an open problem. There has also been ample
investigation into detecting communities, partitioning, and finding
clusters in graphs. This section briefly discusses a few prior
works.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise,
or republish, to post on servers or to redistribute to lists, requires prior
specific permission and/or a fee.
CSE590 Network and Data Mining Techniques, Fall, 2013, Stony Brook
University, NY, USA.
Copyright 2013
2.1 Sentiment Analysis
Sentiment analysis has been studied thoroughly for a decade or
more. One of the earliest works, by Pang et al. [4], investigated
sentiment classification. It opened a wide arena of research and
has led to many results from researchers in different fields.
Statistics, computational linguistics and machine learning have all
been applied to the challenge of sentiment analysis.
There are several lexicon-based techniques for opinion mining, e.g.
[5] and versions of SentiWordNet [6], [7]. A detailed survey of many
lexicon-based approaches is given in [8].
Although earlier studies [9] suggested using only adjectives as a
subjectivity measure, later investigations revealed that sentiment
appraisal is much more diverse. Whitelaw et al. [10] suggested using
appraisal taxonomies for sentiment classification. Similar
observations were made by [11] and [12], stating that “Adjectives,
Verbs and Adverbs are better than Adjectives Alone”.
Machine learning approaches have widely used Support Vector
Machine classifiers, e.g. [1], [13], and naïve Bayes classifiers,
e.g. [4], [14]. Latent Dirichlet Allocation (LDA) has also been
used, e.g. [15], [16]. A holistic lexicon-based approach [17] has
also been described to address context dependency.
Opinion mining and sentiment analysis on Twitter have been
investigated using various approaches, e.g. [14], [18], [1], [16],
[19].
Most approaches to opinion mining assign a strict subjectivity
class (positive, negative, neutral) to individual texts at
different granularities (i.e. sentence, post, paragraph and
document). However, assigning a score better conveys the intensity
of opinion. Few studies have tried to aggregate sentiment to
identify public opinion. Perception of opinion varies for each
individual, and better insight into public opinion can be gained by
eliciting a few attributes from social media. Overall sentiment
score, percentages of positive and negative opinions, and key
descriptions are useful attributes that can be elicited. This
project focuses on mining tweets about some entity for these
attributes associated with that entity.
2.2 Keyword and Entity detection
There are diverse approaches to keyword detection. For example,
there are machine learning approaches using SVM [20], and approaches
that add linguistic knowledge such as n-grams and PoS to supervised
keyword extraction [21]. Thesaurus-based approaches [22] use
semantic knowledge for automatic keyword extraction.
Most keyword identification approaches use some kind of machine
learning technique along with some other knowledge. However, for
this project's purpose, a simple method is required to identify
keywords. This project employs hints from PoS tagging and then
lets the data itself build a keyword lexicon while simultaneously
detecting keywords.
¹ Modularity is the fraction of the edges that fall within the given groups
minus the expected such fraction if edges were distributed at random.
[Wikipedia, Accessed Dec 03, 2013]
2.3 Graph clustering
Graphs have long been studied from mathematical and theoretical
viewpoints, and in recent decades they have been studied more
extensively from a data-analytic perspective. Many real-world and
physical phenomena can be naturally modeled as graphs. These graphs
can then be efficiently investigated to find latent characteristics
of the modeled data.
One major operation on graphs in data mining is dividing them into
smaller parts. Partitioning can be of different types. One approach
is to partition the whole graph into disjoint subgraphs of similar
size [23].
For analyzing graphs, a more natural division is often desired.
Vertices in graphs tend to have edges to vertices that are
themselves connected to each other, and thus form communities.
However, communities differ in size and are not disconnected from
one another. Rather, there are fewer links between nodes of
different communities than between nodes of the same community.
Newman and others have conducted several studies [24] [25] [26] on
detecting communities in graphs. They exploited the modularity¹ of
the graph to do so. Most of their early work was limited in
scalability, but later spectral optimization of modularity yielded
[27] an algorithm that works in near-linear time.
Modularity-based approaches cluster a graph into disjoint
communities. In practice, however, communities often overlap.
Andersen et al. [28] suggested “Local Graph Partitioning using
PageRank Vectors” and other derived algorithms. The core idea
behind these approaches is to use the conductance² of the graph to
cluster it locally. These approaches run in near-linear time and can
detect communities that overlap.
This project uses an approach based on that of Andersen et al., as
it serves several purposes of the project. It can detect
communities that overlap, it runs in near-linear time, and it can
be implemented without creating the full adjacency matrix.
3. PROJECT DESCRIPTION
3.1 Problem Statement
People express their opinions about entities (e.g. locations,
persons, products) in social networks. In brief, the goal is to
• extract overall public opinion of some entity
• elicit opinion-based entity groups in the recent stream
The scope of the project is to mine a popular microblogging
platform: Twitter.
² Conductance is a measure of how strongly a subgraph is connected to
the rest of the graph. It is the ratio of out-links from the
subgraph to its volume (the total edge count from the nodes in it).
3.1.1 Extract overall public opinion of some entity
The goal is to extract opinion about a given entity, E. This is
done from a large number of recent tweets about E. The solution
shall yield the following about a given entity, E:
• Overall sentiment: Overall sentiment (e.g. positive, negative,
mixed) about E. A sentiment score in the range [-1, 1] is given,
along with the percentages of positive, negative and neutral
tweets (a threshold can be applied to distinguish between these
three classes) and the count of analyzed tweets. A measure (e.g.
variance) of how diverse the opinion is can also be included.
• Key description: The system yields a set of keywords (kw)
that are used to describe E
An overall sentiment about an entity is useful to a multitude of
clients for various applications. Sets of key descriptive words
along with sentiment provide better insight into public feelings.
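The aggregate attributes listed above can be sketched as follows. This is only an illustration in Python (the project itself is written in C#); the function name summarize_opinion, the dictionary keys and the default threshold of 0.5 are assumptions made for this example, and population variance is used as one possible diversity measure.

```python
from statistics import mean, pvariance


def summarize_opinion(scores, threshold=0.5):
    """Aggregate per-tweet polarity scores (each in [-1, 1]) for one entity.

    `threshold` separates positive / negative / neutral tweets; its value
    here is an assumption for illustration only.
    """
    if not scores:
        raise ValueError("need at least one scored tweet")
    n = len(scores)
    positive = sum(1 for s in scores if s > threshold)
    negative = sum(1 for s in scores if s < -threshold)
    return {
        "score": mean(scores),                 # overall sentiment score
        "post_count": n,                       # number of analyzed tweets
        "percent_positive": 100.0 * positive / n,
        "percent_negative": 100.0 * negative / n,
        "percent_neutral": 100.0 * (n - positive - negative) / n,
        "variance": pvariance(scores),         # how diverse the opinion is
    }
```

Calling summarize_opinion on the list of scores of all tweets fetched for an entity gives one record per entity, comparable in shape to the opinion records shown in section 4.2.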
3.1.2 Opinion based entity groups in recent stream
The goal is to detect how entities group together in terms of
sentiment and descriptive keywords. This is done based on a stream
of recent tweets. Each tweet is individually analyzed, as in 3.1.1.
Analysis of each tweet yields
• Text of the tweet, T
• Entities discussed in it, E
• Keywords in it, kw
• Polarity score, P
These tuples (T,E,kw,P) are then used to build an E×kw bigraph
such that
• There is an edge between E_i and kw_j if one or more tweets
contain both E_i and kw_j
• The edge has a weight counting co-occurrences of E_i and
kw_j, i.e.
weight_ij = Count({T_k | E_i ∈ T_k.E ∧ kw_j ∈ T_k.kw})
• The edge has a pScore that is the average of pScore (= P) over
all such occurrences, i.e.
pScore_ij =
Sum({T_k.pScore | E_i ∈ T_k.E ∧ kw_j ∈ T_k.kw}) / weight_ij
After this, a filter is run on the graph to eliminate links between
an entity and a keyword where the keyword is not sufficiently
descriptive of the entity. This is done by calculating freq such
that
freq_ij = weight_ij / Occurrence(E_i)
If freq_ij is smaller than a certain threshold, ε_freq, then keyword
kw_j is filtered out for entity E_i.
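A minimal Python sketch of the bigraph construction and frequency filter defined above. It assumes that Occurrence(E_i) is the number of tweets mentioning E_i and that each tweet contributes at most once per (entity, keyword) pair; the function name build_bigraph and the nested-dictionary layout are illustrative choices, not the project's C# code.

```python
from collections import defaultdict


def build_bigraph(tuples, freq_threshold=0.0):
    """Build the E x kw bigraph from (text, entities, keywords, p_score) tuples.

    Returns (bigraph, occurrence), where bigraph[entity][keyword] holds the
    edge weight and averaged pScore, and occurrence[entity] is the number of
    tweets mentioning the entity (an assumed reading of Occurrence(E_i)).
    """
    weight = defaultdict(int)     # (entity, keyword) -> co-occurrence count
    p_sum = defaultdict(float)    # (entity, keyword) -> summed polarity
    occurrence = defaultdict(int)
    for _text, entities, keywords, p_score in tuples:
        for entity in set(entities):
            occurrence[entity] += 1
            for kw in set(keywords):
                weight[(entity, kw)] += 1
                p_sum[(entity, kw)] += p_score

    bigraph = defaultdict(dict)
    for (entity, kw), w in weight.items():
        freq = w / occurrence[entity]
        if freq >= freq_threshold:            # drop weakly associated keywords
            bigraph[entity][kw] = {"weight": w, "pScore": p_sum[(entity, kw)] / w}
    return dict(bigraph), dict(occurrence)
```

Only existing edges are stored, so memory grows with the number of edges rather than with the product of entity and keyword counts.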
This E×kw bigraph is then used to build an E×E graph, such that
there is an edge between E_i and E_j if
• Occurrence(E_i) > ε_eo ∧ Occurrence(E_j) > ε_eo
• {kw_x ∈ kw(E_i) | Occurrence(kw_x) < ε_kwo} ∩ {kw_x ∈ kw(E_j) |
Occurrence(kw_x) < ε_kwo} is not empty
• The polarity bias of both is similar
In other words, there is an edge between two entities if they share
one or more keywords through links with similar polarity bias. Only
entities that occur more than a threshold, ε_eo, times are
considered, and only keywords that occur fewer than some threshold,
ε_kwo, times. The threshold on keywords is motivated by the
following intuition:
• If a candidate word occurs in the description of most entities,
then it is not a keyword but a generic term
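The three edge conditions can be sketched in Python as below. Several points are assumptions made for this example rather than details given in the report: Occurrence(kw) is taken as the number of significant entities a keyword describes, "similar polarity bias" is read as the two pScores having the same sign, and the bigraph uses the nested-dictionary layout bigraph[entity][keyword]["pScore"].

```python
from collections import defaultdict


def build_entity_graph(bigraph, occurrence, min_entity_occ=2, max_kw_occ=350):
    """Derive the E x E graph from an E x kw bigraph (illustrative sketch)."""
    # Invert the bigraph: keyword -> [(entity, pScore)] over significant entities.
    kw_index = defaultdict(list)
    for entity, keywords in bigraph.items():
        if occurrence.get(entity, 0) <= min_entity_occ:
            continue                          # Occurrence(E) must exceed the threshold
        for kw, edge in keywords.items():
            kw_index[kw].append((entity, edge["pScore"]))

    graph = defaultdict(set)
    for kw, members in kw_index.items():
        if len(members) >= max_kw_occ:
            continue                          # generic term, not a legitimate keyword
        for i, (e1, p1) in enumerate(members):
            for e2, p2 in members[i + 1:]:
                if p1 * p2 > 0:               # same sign: assumed "similar bias"
                    graph[e1].add(e2)
                    graph[e2].add(e1)
    return {entity: sorted(nbrs) for entity, nbrs in graph.items()}
```

The pairwise loop is bounded by the keyword threshold, since any keyword shared by that many entities or more is skipped as generic.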
Then, a community detection algorithm is run on this E×E graph to
find groups of entities that are bound together by many
polarity-aligned keyword links. Once such a group of entities is
generated, it has an associated group of keywords, namely those
occurring on edges within that community of nodes. A representative
averaged pScore can also be calculated for such a group.
To summarize, given a stream of tweets, the system shall be able to
generate
• (T,E,kw,P) tuples
• the E×kw bigraph
• the E×E graph
• groups of entities that have similar opinion
3.2 Data collection
3.2.1 Corpus and entity from Twitter
This project requires collecting two types of data. First, a corpus
of subjective and objective tweets is collected; this data is used
to train the classifier (scorer). After training, the trained model
(not the training data set) can be stored in a file so that the
scorer can later be restored by loading it from the file.
Second, at query time, posts are fetched from Twitter.
The following Twitter APIs are used:
• search/tweets
This API is called with ‘q’ = emoticons to gather training
data (positive and negative posts).
At query time, the same API is used with ‘q’ = query term to fetch
related recent posts.
• statuses/user_timeline
This API is used to fetch objective training data by querying
'screen_name' = popular_stream. I used Lifehacker,
Gizmodo, New York Times, and The Atlantic as sources.
The Twitter API does not allow fetching more than 100 posts at once.
Hence, I had to use max_id to repeat the same call iteratively
for different portions of the result. I collected ten thousand
posts of each type for training. At query time, 200~2000 posts are
fetched.
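The max_id iteration described above can be sketched as a small paging loop. The fetch_page callable is a hypothetical stand-in for whatever wrapper issues the actual search/tweets request (the project does this in PHP); it is assumed to return tweets newest first as dictionaries with an integer 'id'. Because max_id is inclusive in the Twitter API, the next request uses the smallest id seen minus one.

```python
def fetch_all(fetch_page, query, wanted, page_size=100):
    """Collect up to `wanted` tweets by paging backwards with max_id.

    `fetch_page(query, count, max_id)` is a hypothetical wrapper around the
    search call; it must return a list of dicts with an integer 'id'.
    """
    collected = []
    max_id = None                      # first request: no upper bound
    while len(collected) < wanted:
        count = min(page_size, wanted - len(collected))
        page = fetch_page(query, count, max_id)
        if not page:
            break                      # no older tweets available
        collected.extend(page)
        # max_id is inclusive, so step just below the oldest tweet seen.
        max_id = min(tweet["id"] for tweet in page) - 1
    return collected
```

A real client would also have to pause between calls to respect the per-window rate limits.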
3.2.2 Mining recent twitter stream
To generate an E×E graph large enough to detect groupings of
entities, a large sample of the Twitter public stream must be
collected. To do this, the Twitter API is again used and polled
continuously over a large number of rate-limit windows. Note that,
in v1.1, the Twitter API allows only 180 search queries per window
per user and 450 queries per window per app. Each query returns a
maximum of 100 tweets. Currently, windows are 15 minutes each.
Hence, max_id is used to continuously fetch tweets with a q=”.”
query. An alternative to the search/tweets API would be a streaming
API.
After tweets are fetched, very short tweets are discarded. I
filtered out tweets with fewer than 50 characters, because shorter
tweets are difficult to interpret. Retweets (RT) are also discarded
to avoid counting the same tweet many times. Furthermore, another
filtering stage removes any remaining duplicate tweets.
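The three filtering stages can be sketched as a single pass over the fetched texts. Detecting retweets by an "RT @" prefix and treating tweets as duplicates when they match after lowercasing and collapsing whitespace are assumptions for this example; the report does not say how either check was implemented.

```python
def filter_tweets(texts, min_length=50):
    """Drop short tweets, retweets and duplicates (illustrative sketch)."""
    seen = set()
    kept = []
    for text in texts:
        stripped = text.strip()
        if len(stripped) < min_length:
            continue                              # too short to interpret
        if stripped.startswith("RT @"):
            continue                              # assumed retweet marker
        key = " ".join(stripped.lower().split())  # normalized form for dedup
        if key in seen:
            continue                              # duplicate of an earlier tweet
        seen.add(key)
        kept.append(stripped)
    return kept
```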
3.2.3 PoS Tagging
After tweets are collected, they are passed to a TreeTagger for PoS
tagging. I used the recently developed GATE Twitter part-of-speech
tagger [3], which is based on the Stanford TreeTagger, which in turn
is based on the well-known TreeTagger [29] by Schmid. The PoS tags
produced follow the Penn Treebank tagset [30].
3.3 Implementation
3.3.1 Twitter corpus to train sentiment classifier
Each post is individually scored by two scorers. Following
Pak and Paroubek (2010) [1], two classifiers are built. To train
them, tweets are queried as follows: (1) positive tweets are
fetched by searching for a positive emoticon, (2) negative tweets
are fetched by searching for a negative emoticon and (3) objective
tweets are fetched from news media accounts. One classifier exploits
the parts-of-speech (PoS) distribution across objective and polar
statements. The PoS distribution also differs between positive and
negative statements; see Figure 1 and Figure 2. The other
classifier exploits the distribution of n-grams (n=2). N-grams
correlate strongly with bias or with objectivity. People usually
use common phrases to express a type of feeling, while other
phrases are assertive in nature. This feature of natural language
is captured using n-grams. See Table 2 for the top polar n-grams
out of 94k n-grams.
The reference work used the classification results of the two
classifiers to decide the final class. This project enhances the
approach by implementing the classifiers as scorers that evaluate a
PoS score and an n-gram score for each statement. Both scores then
contribute to the final score of the statement (tweet).
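To make the idea of a classifier used as a scorer concrete, here is a minimal Python sketch of a bigram naïve Bayes polarity scorer. It is not the project's implementation: the class name BigramPolarityScorer, the add-one smoothing, the tanh squashing into [-1, 1] and the equal-weight combine_scores function are all assumptions chosen for illustration, since the report does not specify how the two scores are computed or combined.

```python
import math
from collections import Counter


def bigrams(text):
    """Split text on whitespace and return its list of word bigrams."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))


class BigramPolarityScorer:
    """Naive Bayes over bigrams, returning a score in [-1, 1] instead of a class."""

    def __init__(self):
        self.counts = {"pos": Counter(), "neg": Counter()}

    def train(self, text, label):
        self.counts[label].update(bigrams(text))

    def score(self, text):
        grams = bigrams(text)
        if not grams:
            return 0.0
        vocab = len(set(self.counts["pos"]) | set(self.counts["neg"])) + 1
        totals = {c: sum(self.counts[c].values()) for c in self.counts}
        log_ratio = 0.0
        for gram in grams:
            # Add-one smoothed likelihood of the bigram in each class.
            p_pos = (self.counts["pos"][gram] + 1) / (totals["pos"] + vocab)
            p_neg = (self.counts["neg"][gram] + 1) / (totals["neg"] + vocab)
            log_ratio += math.log(p_pos / p_neg)
        # Average log-likelihood ratio, squashed into [-1, 1] (assumed choice).
        return math.tanh(log_ratio / len(grams))


def combine_scores(pos_tag_score, ngram_score, ngram_weight=0.5):
    """Blend the PoS-based and n-gram-based scores (weighting is an assumption)."""
    return (1.0 - ngram_weight) * pos_tag_score + ngram_weight * ngram_score
```

A PoS-based scorer could follow the same pattern with tag counts in place of bigram counts.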
3.3.2 From collected tweets to graphs
As outlined in 3.1.2, (T,E,kw,P) tuples, the E×kw bigraph and the
E×E graph are generated from a given stream of tweets.
3.3.2.1 Analyzing tweets
To do so, each tweet is first scored using the sentiment classifier
described in 3.3.1.
PoS tags are used for a first identification of entities and
keywords.
Entity: Our goal is to analyze entities (location, place, person,
product etc.). In English, they are generally represented by proper
nouns. Also, on Twitter, users can be regarded as entities. Hence,
tokens tagged as proper nouns or user mentions (NNP, NNPS, USR) are
regarded as entities.
Keyword: In English, adjectives, adverbs and verbs are used to
describe an entity. This property is exploited by identifying words
tagged with these PoS (JJ, RB, VB etc.) as keywords. The algorithm
also accepts a parameter that additionally includes common nouns
(not NNP) as keywords.
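A small Python sketch of this tag-based selection, taking a list of (token, tag) pairs as a tagger might produce. Two details are assumptions for this example: "JJ, RB, VB etc." is read as every tag starting with JJ, RB or VB, and consecutive proper-noun tokens are merged into one entity name (suggested by multi-word entities such as those in Figure 8, but not stated in the report).

```python
PROPER_NOUN_TAGS = {"NNP", "NNPS"}
KEYWORD_TAG_PREFIXES = ("JJ", "RB", "VB")   # assumed reading of "JJ, RB, VB etc."
COMMON_NOUN_TAGS = {"NN", "NNS"}


def extract_entities_keywords(tagged, nouns_as_keywords=False):
    """Pick entities and keywords from (token, PoS tag) pairs."""
    entities, keywords, run = [], [], []

    def flush():
        # Close the current run of consecutive proper nouns as one entity.
        if run:
            entities.append(" ".join(run))
            run.clear()

    for token, tag in tagged:
        if tag in PROPER_NOUN_TAGS:
            run.append(token)
            continue
        flush()
        if tag == "USR":
            entities.append(token)              # Twitter user mention
        elif tag.startswith(KEYWORD_TAG_PREFIXES) or (
            nouns_as_keywords and tag in COMMON_NOUN_TAGS
        ):
            keywords.append(token.lower())
    flush()
    return entities, keywords
```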
3.3.2.2 Entity-keyword bigraph
The (T,E,kw,P) tuples from the analyzed tweets are iterated over
to build an E×kw bigraph as described in 3.1.2. A general intuition,
also confirmed by several studies, is that such graphs are generally
sparse. Thus, instead of building a full matrix, two
dictionaries (maps) are stored to represent the E×kw bigraph:
• A dictionary of entities, with pointers to keywords, as well as
the weight and pScore associated with that node
• For ease of iteration, another dictionary of keywords, which
stores pointers back from keywords to entities.
This representation keeps storage for the entire bigraph small, yet
describes the entire bigraph with its edges and nodes. It reduces
storage from (E*kw) to edgeCount. Note that
2*(E+kw) < edgeCount << (E*kw)
The running time for building the bigraph is proportional to the
number of edges, i.e. O(edges).
3.3.2.3 Entity-Entity graph
From the E×kw bigraph generated above, an E×E graph is generated by
iterating over each entity. For each entity E_i, its set of keywords
kw(E_i) is processed. Each keyword points to another set of
entities, E(kw(E_i)). These entities are added to the neighbors of
E_i. In this step too, a dictionary is used to represent the graph.
It requires one dictionary of entities, where each entry also points
to its immediate neighbors. This requires storage of 2*edge. The
running time to build this graph is proportional to the number of
edges. However, entities are filtered beforehand to remove nodes
with very few neighbors from the simulation (thus building a set of
significant entities). Filtering generic terms from the keyword
list (thus using only legitimate keywords) reduces the search
space.
3.3.3 Keywords from data
Keywords are filtered in several steps to let the data define
legitimate keywords. In the first step, PoS tagging defines a
preliminary set. After all tweets are analyzed, a filter removes
low-frequency terms from the keyword list of each entity. After the
E×kw bigraph is built, another filter rules out generic terms.
Generic terms are candidate keywords that are found for too many
entities. A threshold parameter is supplied to the algorithm for
this. Finally, after communities are generated, a consolidation step
filters out irregular keywords to yield the final set of keywords.
3.3.4 Community detection: group of entities
After the E×E graph is generated, consisting of legitimate keywords
and significant entities, a community detection algorithm can be
used to detect communities in it. This project implements a fast
derivative of the algorithm of Andersen et al. [28].
Table 1. Community Detection Algorithm
1. Significant_entities := entities in (E×E)
2. seed := supplied_seed
3. if (seed is null or not in Significant_entities) then
seed := first(Significant_entities)
4. aCommunity := new Community()
5. entity := seed
6. eval := evaluate(entity, aCommunity)
7. if (eval.member) then
aCommunity.Add(entity)
remove(entity, Significant_entities)
8. remove(entity, aCommunity.Nbor)
9. if (aCommunity.Nbor is empty)
goto 12
10. entity := first(aCommunity.Nbor)
11. goto 6
12. add(aCommunity, Communities)
13. seed := null
14. if (Significant_entities not empty)
goto 3
15. return Communities
The algorithm described above uses objects of class Community. Its
Add() member function adds the entity and updates the community's
Volume (= edges inside) and outward links. The evaluate() function
checks membership by calculating the conductance the community
would have if this node were added and comparing it with the
original conductance. Conductance is defined as
Cond = (links outward from community)/(edges inside).
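The conductance test can be illustrated with a greedy Python sketch. It is a simplification, not the PageRank-vector method of Andersen et al. and not the project's C# code: a neighbor is accepted whenever adding it does not increase conductance. Volume is taken here as the sum of degrees of the member nodes (the reading given in footnote 2), and the names conductance and grow_community are illustrative.

```python
from collections import deque


def conductance(graph, community):
    """Out-links of `community` divided by its volume (sum of member degrees).

    `graph` maps each node to the set of its neighbors (undirected).
    """
    members = set(community)
    volume = sum(len(graph[node]) for node in members)
    if volume == 0:
        return 1.0
    cut = sum(1 for node in members for nbr in graph[node] if nbr not in members)
    return cut / volume


def grow_community(graph, seed):
    """Greedily grow a community from `seed`, keeping conductance non-increasing."""
    community = {seed}
    frontier = deque(sorted(graph[seed]))
    rejected = set()
    while frontier:
        candidate = frontier.popleft()
        if candidate in community or candidate in rejected:
            continue
        if conductance(graph, community | {candidate}) <= conductance(graph, community):
            community.add(candidate)
            # Newly reachable neighbors become candidates as well.
            frontier.extend(sorted(
                n for n in graph[candidate]
                if n not in community and n not in rejected
            ))
        else:
            rejected.add(candidate)
    return community
```

For example, on two triangles a-b-c and d-e-f joined by the single edge c-d, growing from a stops at {a, b, c}, because adding d would raise conductance from 1/7 to 2/10. This sketch recomputes conductance from scratch at each step for clarity; keeping running totals of volume and out-links, as the Community class described above does, avoids that cost.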
This generates a set of communities. After each community is
generated, a consolidation step further filters its keywords. This
is done as
size := size of community := number of entities in it
Threshold := ln(size)
If (Occurrence(kw) < Threshold) then Remove(kw)
After this step, a set of descriptive keywords is associated with
the group of entities.
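As a worked illustration of the consolidation rule, the following Python sketch keeps only keywords whose occurrence count within a community reaches ln(size). The function name and the dictionary of per-community keyword counts are assumptions for this example.

```python
import math


def consolidate_keywords(keyword_counts, community_size):
    """Keep keywords occurring at least ln(community_size) times."""
    threshold = math.log(community_size)
    return {kw: n for kw, n in keyword_counts.items() if n >= threshold}
```

For a community of 2 entities the threshold is ln(2) ≈ 0.69, so every keyword that occurs at least once survives; for a community of 136 entities it is ln(136) ≈ 4.91, so a keyword must occur at least 5 times.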
3.3.5 Storing result
The final set of communities is returned as an XML document by the
implementation. The (T,E,kw,P) tuples are also returned as XML.
The intermediate graphs, the E×kw bigraph and the E×E graph, are
exported as CSV (comma-separated values) files.
4. RESULTS AND FINDINGS
4.1 Findings
Findings reported here are based on 160,711 tweets collected in late
November of 2013.
4.1.1 PoS Distributions and n-grams
Later in this section are figures of the PoS distributions over
subjective-objective statements and positive-negative statements. A
positive bias value in Figure 1 indicates that the presence of that
PoS makes a statement more likely to be positive; the same holds
for negative values. The subjectivity score in Figure 2 is read
similarly. Table 2 shows the top few n-grams. Note that the PoS
distributions and top n-grams differ slightly from the referenced
work [1]. Also, if the training data is collected at a different
time, some slight changes will occur.
Table 2. Top n-gram with occurrence in each class of data
n-gram                  Positive  Negative  Objective
'enjoying break'               1       328          1
'happy birthday'              22       207          1
'so happy'                   106        53          1
'follow back'                 10       132          1
'miss my'                     93        10          1
'no one notices'              97         4          1
'notices my'                  97         1          1
'good day'                     5        82          1
'follow please'               47        38          1
'my phone'                    64        18          1
'presenting emotional'        60        20          1
'please follow'               11        66          1
'follow love'                 17        60          1
'am sorry'                    71         4          1
'so sad'                      71         3          1
'miss u'                      65         7          1
'new followers'               53        17          1
Figure 1. Distribution of PoS in positive and negative
statements
Figure 2. Distribution of PoS between subjective and objective
tweets
4.1.2 Power law in Entity and Keywords
Figure 3 and Figure 4 show how entity and keyword occurrences
follow a power law.
Figure 3. ln(Occurrence) of entities shows power law
Figure 4. ln(Occurrence) of keywords shows power law
4.1.3 Distribution of Polarity Score in Entities
Figure 5 shows how polarity scores are distributed among entities.
The polarity score has a skewed distribution. Figure 6 shows the
distribution of polarity score over the natural logarithm (ln) of
the occurrence of the entity.
Figure 1 data (PoS tag: bias value):
POS 0.600, WP$ 0.500, PDT 0.333, RBS 0.280, URL 0.229, WP 0.217,
JJS 0.187, SYM 0.176, USR 0.155, FW 0.127, NNP 0.110, CD 0.068,
DT 0.032, VB 0.000, UH -0.004, NN -0.007, JJR -0.010, IN -0.012,
NNS -0.015, JJ -0.019, RBR -0.024, WDT -0.031, VBG -0.034,
NNPS -0.050, VBZ -0.055, EX -0.064, MD -0.099, CC -0.102,
PRP$ -0.114, PRP -0.135, VBP -0.144, TO -0.149, RP -0.175,
RB -0.182, VBD -0.227, VBN -0.245, WRB -0.282
Figure 2 data (PoS tag: subjectivity value):
WRB 0.164, VBN 0.140, VBD 0.128, RB 0.100, RP 0.096, TO 0.081,
VBP 0.078, PRP 0.072, PRP$ 0.061, CC 0.054, MD 0.052, EX 0.033,
VBZ 0.028, NNPS 0.025, VBG 0.017, WDT 0.016, RBR 0.012, JJ 0.010,
NNS 0.008, IN 0.006, JJR 0.005, NN 0.003, UH 0.002, VB 0.000,
LS 0.000, DT -0.016, CD -0.033, NNP -0.052, FW -0.060, USR -0.072,
SYM -0.081, JJS -0.085, WP -0.098, URL -0.103, RBS -0.123,
PDT -0.143, WP$ -0.200, POS -0.231
Figure 5. Distribution of Polarity Score over entire entity space
Figure 6. Polarity Score over ln(Occurrence) of entities
4.1.4 Graph BFS & communities in adjacency matrix
Starting from an arbitrary node, the E×E graph is traversed
breadth-first (BFS) to generate an arbitrary ordering of nodes. The
BFS assigns an index to each entity, and the adjacency matrix
indexed this way is visualized in Figure 7. Notice that this is a
near-diagonal matrix, although the diagonal itself is white, as
there are no self-edges. Notice the blocks; these blocks represent
communities. There are both tiny and large communities. There are
157 communities, with a maximum size of 136.
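The BFS indexing used for this visualization can be sketched in Python as follows. The function names and the dense list-of-lists matrix are illustrative assumptions; restarting the traversal for nodes not reachable from the start node is also an assumption, added so that every entity receives an index.

```python
from collections import deque


def bfs_index(graph, start):
    """Assign consecutive indices to nodes in breadth-first order.

    `graph` maps each node to the set of its neighbors. Nodes unreachable
    from `start` are indexed afterwards by restarting the traversal.
    """
    index = {}
    for root in [start] + sorted(graph):
        if root in index:
            continue
        index[root] = len(index)
        queue = deque([root])
        while queue:
            node = queue.popleft()
            for nbr in sorted(graph[node]):
                if nbr not in index:
                    index[nbr] = len(index)
                    queue.append(nbr)
    return index


def adjacency_matrix(graph, index):
    """Dense 0/1 adjacency matrix with rows and columns in BFS order."""
    size = len(index)
    matrix = [[0] * size for _ in range(size)]
    for node, i in index.items():
        for nbr in graph[node]:
            matrix[i][index[nbr]] = 1
    return matrix
```

Because BFS visits tightly connected nodes close together, members of one community tend to receive nearby indices, which is what produces blocks near the diagonal.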
Figure 7. Adjacency matrix of significant entities
4.1.5 Observation of Groups
Feed tweet sets of different sizes were examined. The number of
significant entities and the number of legitimate keywords increase
with the number of tweets. All sets yield communities of different
sizes. When these communities and their keywords were examined
manually, they matched intuition. An interesting community, in which
the keyword cries is associated with two stars, is shown in
Figure 8.
<Community id="146" size="2" conductance="0.5"
pScore="0.63566754320156">
<trapped-keywords count="1">
Cries:4,
</trapped-keywords>
<e>Kristen Stewart</e>
<e>Robert Pattinson</e>
</Community>
Figure 8. XML representation of a community
4.2 Results
Figure 9 shows some sample runs where the system is queried for
overall sentiment analysis of an entity.
<opinion entity='mermaid'>
<score>0.21</score>
<analysis
post-count='1086'
percent-positive='52.03'
percent-negative='24.59'/>
</opinion>
<opinion entity='bankrupt'>
<score>-0.18</score>
<analysis
post-count='2073'
percent-positive='30.29'
percent-negative='47.03'/>
</opinion>
<opinion entity='drunk man'>
<score>-0.50</score>
<analysis
post-count='1084'
percent-positive='11.99'
percent-negative='65.59'/>
</opinion>
<opinion entity='November'>
<score>0.20</score>
<analysis
post-count='2062'
percent-positive='53.25'
percent-negative='25.12'/>
</opinion>
Figure 9. Result runs for query over entity
A few parameters were varied on the sample to see their effect.
The Kw threshold (ε_kwo), Minimum nodes (ε_eo) and Common Noun as
keyword settings were varied, and the results are shown in Table 3.
Using common nouns as keywords yields a few groups of very large
size. Thus, it is recommended to exclude common nouns from
keywords.
Table 3. Effect of parameters change
Kw threshold               350     350     450
Minimum nodes                2       2       2
Common Noun as keyword   false    true   false
Potential kw             15108   31593   15108
Legitimate kw            14967   31368   14997
Entities                 97147   97147   97147
E occurring > 2           7580    7580    7580
Significant E.            1190    2012    1378
Groups                     170      92     157
Largest size                70    1256     136
Polarity scores of each entity and keyword are stored and can be
accessed directly in the E×kw bigraph.
Building a polarity-invariant E×kw bigraph was also tested. For
the same settings as the last column of Table 3, the
polarity-invariant version generated 174 groups, with the largest
group of size 598, for 1854 significant entities. The generated
groups are also significantly different.
Files generated containing result sets are kept online at
http://meaningofdata.com/mining
4.3 Performance
4.3.1 Sentiment Scoring
There is no available way to evaluate the correctness of overall
sentiment analysis. Therefore, the performance of individual scoring
is tested against publicly available Mechanical Turk annotated
Twitter data [18]. This data set includes 3771 annotated tweets.
Note that each tweet was annotated by three humans. They annotated
21600 tweets, and all three agreed on only 3771 of them. As the
test set has strict classes, test tweets are scored and then
classified for testing purposes with a threshold of 0.5 (i.e. tweets
scored above +0.5 are regarded as positive, tweets scored below -0.5
as negative, and the rest as neutral). This yields only a 61% match
of sentiment with the test data. However, most disagreements occur
on entries annotated as non-biased. For biased posts, the mismatch
is around 26%.
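The evaluation procedure described above amounts to thresholding each score and counting matches against the annotated labels. A minimal Python sketch, with the function names and the label strings assumed for illustration:

```python
def classify(score, threshold=0.5):
    """Map a polarity score to a strict class using a symmetric threshold."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


def agreement(scores, labels, threshold=0.5):
    """Fraction of tweets whose thresholded class equals the annotated label."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    matches = sum(
        classify(score, threshold) == label
        for score, label in zip(scores, labels)
    )
    return matches / len(labels)
```

Restricting scores and labels to the tweets annotated as positive or negative gives the agreement on biased posts separately.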
4.3.2 Opinion based entity groups in recent stream
Collecting data from Twitter takes a long time due to the per-window
query restriction. The third-party GATE TreeTagger I used is slow,
and this hinders overall performance. However, the part implemented
for this project runs fast, in linear fashion. Table 4 lists how
running time varies with sample size.
Table 4. Performance of graph analysis for different data size
                       Sample 1   Large Sample   Very large Sample
Tweets                   160711         485447              847276
Time to analyze each     48.91s        148.53s             262.01s
Build Bigraph             9.29s         34.24s              66.45s
Generate EE graph         1.54s          3.49s               4.99s
Time to Find Groups      0.126s         0.310s              0.358s
Groups count                157            334                 457
Largest Group size          136            183                 162
Significant Entities       1378           2627                3560
Legitimate Keywords       14997          25818               35005
5. CONCLUSION
This project has devised and studied an approach to mining a
social network to elicit public opinion about entities. Public
opinion is represented as an analysis of individual entities and a
graph analysis of entities based on polarity-aligned keyword
relationships.
Sentiment analysis itself is still an open problem and needs
further investigation. This project analyzes the sentiment of
tweets using an approach trained on Twitter as the learning corpus.
The approach yields a polarity score rather than a discrete
polarity marker.
To elicit overall opinion about an entity, an aggregate polarity
score and representative keywords are determined.
For grouping entities, an entity graph is built from the
entity-keyword bigraph using polarity scores. A local community
detection mechanism is then used to cluster them.
The problem of detecting keywords is solved with an embedded
approach. During the steps that build entity groups from collected
tweets, keywords are filtered from a raw candidate set down to a
final set. This approach can be useful for building keyword
lexicons dynamically.
A report of sample runs of the implementation is also included in
this document. Several key observations are noted in section 4.
6. REFERENCES
[1] A. Pak and P. Paroubek, "Twitter as a Corpus for Sentiment
Analysis and Opinion Mining," in Language Resources and
Evaluation, 2010.
[2] Twitter, "REST API v1.1 Resources," [Online]. Available:
https://dev.twitter.com/docs/api/1.1.
[3] "GATE Twitter part-of-speech tagger," [Online]. Available:
https://gate.ac.uk/wiki/twitter-postagger.html.
[4] B. Pang, L. Lee and S. Vaithyanathan, "Thumbs up?
Sentiment Classification using Machine Learning
Techniques," in Proceedings of the ACL-02 conference on
Empirical methods in natural language processing,
Philadelphia, PA, USA, 2002.
[5] T. Wilson, J. Wiebe and P. Hoffmann, "Recognizing
contextual polarity in phrase-level sentiment analysis," in
HLT '05 Proceedings of the conference on Human Language
Technology and Empirical Methods in Natural Language
Processing, Stroudsburg, PA, USA, 2005.
[6] A. Esuli and F. Sebastiani, "Sentiwordnet: A publicly
available lexical resource for opinion mining," in
Proceedings of LREC, 2006.
[7] S. Baccianella, A. Esuli and F. Sebastiani, "SentiWordNet
3.0: An Enhanced Lexical Resource for Sentiment Analysis
and Opinion Mining," in LREC, 2010.
[8] M. Taboada, J. Brooke, M. Tofiloski, K. Voll and M. Stede,
"Lexicon-based methods for sentiment analysis,"
Computational linguistics, vol. 37, pp. 267-307, 2011.
[9] V. Hatzivassiloglou and J. M. Wiebe, "Effects of adjective
orientation and gradability on sentence subjectivity," in
Proceedings of the 18th conference on Computational
linguistics-Volume 1, 2000.
[10] C. Whitelaw, N. Garg and S. Argamon, "Using appraisal
groups for sentiment analysis," in Proceedings of the 14th
ACM international conference on Information and
knowledge management, 2005.
[11] F. Benamara, C. Cesarano, A. Picariello, D. Reforgiato and
V. Subrahmanian, "Sentiment Analysis: Adjectives and
Adverbs are better than Adjectives Alone," in International
Conference on Weblogs and Social Media, Boulder, CO
USA, 2007.
[12] V. S. Subrahmanian and D. Reforgiato, "AVA: Adjective-
verb-adverb combinations for sentiment analysis,"
Intelligent Systems, vol. 23, no. 4, pp. 43-50, 2008.
[13] T. Mullen and N. Collier, "Sentiment Analysis using Support
Vector Machines with Diverse Information Sources," in
EMNLP, 2004.
[14] A. Bifet and E. Frank, "Sentiment knowledge discovery in
twitter streaming data," in Discovery Science, Berlin
Heidelberg, Springer, 2010, pp. 1-15.
[15] C. Lin and Y. He, "Joint sentiment/topic model for sentiment
analysis," in Proceedings of the 18th ACM conference on
Information and knowledge management, 2009.
[16] S. Tan, Y. Li, H. Sun, Z. Guan, X. Yan, J. Bu, C. Chen and
X. He, "Interpreting the Public Sentiment Variations on
Twitter," IEEE Transactions on Knowledge and Data
Engineering, vol. 6, no. 1, pp. 1-14, 2012.
[17] X. Ding, B. Liu and P. S. Yu, "A holistic lexicon-based
approach to opinion mining," in WSDM '08 Proceedings of
the 2008 International Conference on Web Search and Data
Mining, New York, NY, USA, 2008.
[18] S. Narr, "Annotated Twitter Sentiment Dataset," [Online].
Available: http://data.dai-labor.de/corpus/sentiment/.
[Accessed 7 10 2013].
[19] "Sentiment140," [Online]. Available:
http://www.sentiment140.com.
[20] K. Zhang, H. Xu, J. Tang and J. Li, "Keyword Extraction
Using Support Vector Machine," in Advances in Web-Age
Information Management, Springer, 2006, pp. 85-96.
[21] A. Hulth, "Improved automatic keyword extraction given
more linguistic knowledge," in EMNLP '03 Proceedings of
the 2003 conference on Empirical methods in natural
language processing, Stroudsburg, PA, USA, 2003.
[22] O. Medelyan and I. H. Witten, "Thesaurus based automatic
keyphrase indexing," in Proceedings of the 6th ACM/IEEE-
CS joint conference on Digital libraries, 2006.
[23] G. Karypis and V. Kumar, "Multilevel k-way Partitioning
Scheme for Irregular Graphs," J. Parallel Distrib. Comput,
vol. 48, no. 1, pp. 96-129, 1998.
[24] M. Girvan and M. E. J. Newman, "Community structure in
social and biological networks," in Proc. Natl. Acad. Sci.
USA 99, 7821-7826, 2002.
[25] M. E. J. Newman, "Fast algorithm for detecting community
structure in networks," in Phys. Rev. E 69, 066133, 2004.
[26] A. Clauset, M. E. J. Newman and C. Moore, "Finding
community structure in very large networks," in Phys. Rev.
E 70, 066111, 2004.
[27] M. E. J. Newman, "Modularity and community structure in
networks," in Proc. Natl. Acad. Sci. USA 103, 8577–8582,
2006.
[28] R. Andersen, F. Chung and K. Lang, "Local graph
partitioning using pagerank vectors," in Foundations of
Computer Science, FOCS'06. 47th Annual IEEE Symposium
on, 2006.
[29] H. Schmid, "TreeTagger," TC project at the Institute for
Computational Linguistics of the University of Stuttgart,
1994.
[30] B. Santorini, Part-of-speech tagging guidelines for the Penn
Treebank Project, 3rd revision ed., 1990.
[31] A. Go, R. Bhayani and L. Huang, "Twitter sentiment
classification using distant supervision," Stanford, 2009.
[32] L. Derczynski, A. Ritter, S. Clark and K. Bontcheva, "Twitter
Part-of-Speech Tagging for All: Overcoming Sparse and
Noisy Data," in Proceedings of the International Conference
on Recent Advances in Natural Language Processing, 2013.

 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
 
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 

Graph-based Analysis and Opinion Mining in Social Network

Project Report: Graph-based Analysis and Opinion Mining in Social Network
Khan Mostafa
Stony Brook University
Student ID# 109365509
khan.@nafSadh.com

ABSTRACT
This is the final report for the Networks & Data Mining Techniques project, which focuses on mining a social network to estimate public opinion about entities and their associated keywords. The project mines Twitter for recent feeds and analyzes them to estimate the sentiment score, the discussed entities and the descriptive keywords of each tweet. This data is then used to elicit the overall sentiment associated with each entity. The extracted entities and keywords are also used to form an entity-keyword bigraph. This graph is further used to detect entity communities and the keywords found within those communities. The presented implementation works in linear time.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications - Data Mining.

General Terms
Algorithms, Documentation, Experimentation.

Keywords
Opinion mining, sentiment, graph clustering, graph community detection.

1. INTRODUCTION
This project focuses on mining opinion from a social network. It takes Twitter as a model platform because it has a publicly available stream of posts from people of diverse demographics. The goal is to report public opinion in two forms: (a) the overall opinion about some entity and (b) opinion-based clusters of entities and keywords.

Public opinion can be mined from posts about an entity of interest. First, ample posts are fetched from the public stream. Then, each post is individually scored to find its embedded subjectivity. Not all posts are subjective: some assert information while others express feelings. Hence, posts can generally be classified as objective, positive or negative. However, subjective bias is not discrete; rather, each post embodies mixed polarity. Moreover, attempts to annotate posts manually have shown that different people associate sentiment with the same posts differently.
Therefore, this project focuses on calculating sentiment scores for posts. After each post is individually scored, overall opinion is represented using a few aggregate parameters, including overall score, diversity, and the percentage of each type of polar post. A set of keywords (kw) is also identified to report how the entity (E) is positively and negatively described.

In this project, sentiment analysis is done using an approach similar to [1], combining two naïve Bayes classifiers to calculate a polarity score: a PoS-tag-based classifier and an n-gram-based classifier. Keywords and entities are primarily detected using parts of speech. Then, in combined analysis, keywords that occur infrequently for an entity are discarded, as such a word is not sufficiently associated with the entity. Likewise, words that occur in the descriptions of too many entities are unlikely to be keywords; they are more likely stop-words or generic words.

After tweets are individually analyzed, further overall analysis can be done. To do so, an entity-keyword bigraph (E×kw) is first computed from the analyzed tweets. Tweets are collected from the recent public feed stream using the Twitter API. Analysis reports a polarity score, a set of keywords and a set of entities for each tweet. In the E×kw bigraph, an edge exists between E and kw if both occur in the same tweet. These edges also have an associated polarity score. The E×kw bigraph can be used to generate an E×E graph. In E×E, there is an edge between two entities if they share a keyword with similar sentiment bias. The E×E graph is then clustered using a local clustering algorithm in linear time.

This project is implemented mainly using the .NET Framework (C#) and partially using PHP on an Apache server to access the Twitter API [2]. PoS tagging is done using a third-party TreeTagger developed recently for tweets [3].
The main contributions of this project are:
• Implemented a sentiment analysis tool that can elicit scores for individual tweets
• Implemented a way to report the aggregate sentiment score and associated keywords for a queried entity
• Devised and implemented a simple approach to identify entities and keywords in tweets
• Implemented a fast local graph clustering algorithm using split vectors instead of full-blown matrices
• Used the fast local graph clustering to detect and report entity groups along with keywords and grouped polarity scores

The following sections of this report give an overview of prior work, a description of the methodology, and the results and analysis of the mined data.

2. BACKGROUND
Mining a social network to elicit public opinion requires sentiment analysis, keyword and entity tagging, and graph clustering. Sentiment analysis has been studied extensively in several fields and is still an open problem. There has also been ample investigation into detecting communities, partitioning, and finding clusters in graphs. This section briefly discusses a few prior works.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CSE590 Network and Data Mining Techniques, Fall, 2013, Stony Brook University, NY, USA. Copyright 2013
2.1 Sentiment Analysis
Sentiment analysis has been studied thoroughly for a decade or more. One of the earliest works, by Pang et al. [4], among others, investigated sentiment classification. This investigation opened a wide arena of research and has led to many results from a multitude of researchers in different fields. Statistics, computational linguistics and machine learning have all been applied to the challenge of sentiment analysis. There are several lexicon-based techniques for opinion mining, viz. [5] and versions of SentiWordNet [6], [7]. A detailed survey of many lexicon-based approaches is given in [8]. Although earlier studies [9] suggested using only adjectives as a subjectivity measure, later investigations revealed that sentiment appraisal is much more diverse. Whitelaw et al. [10] suggested using appraisal taxonomies for sentiment classification. Similar observations were made by [11] and [12], stating that “Adjectives, Verbs and Adverbs are better than Adjectives Alone”.

Machine learning approaches have widely used Support Vector Machines, e.g. [1], [13], and naïve Bayes classifiers, e.g. [4], [14]. Latent Dirichlet Allocation (LDA) has also been utilized, e.g. [15], [16]. A lexicon-based holistic approach [17] has also been described to address context dependency. Opinion mining and sentiment analysis on Twitter have been investigated using various approaches, viz. [14] [18] [1] [16] [19].

Most approaches to opinion mining assign a strict subjectivity class (positive, negative, neutral) to individual texts at different granularities (i.e. sentence, post, paragraph and document). However, assigning a score serves better to convey the intensity of opinion. There is a paucity of studies that have tried to aggregate sentiment to identify public opinion. The perception of opinion varies between individuals, and a better insight into public opinion can be gained by eliciting a few attributes from social media.
Overall sentiment score, percentage of positive and negative opinions, and key descriptions are useful attributes that can be elicited. This project focuses on mining tweets about some entity for these attributes associated with that entity.

2.2 Keyword and Entity Detection
There are many diverse approaches to keyword detection. For example, there are machine learning based approaches using SVM [20], and approaches that use linguistic knowledge such as n-grams and PoS for supervised keyword extraction [21]. Thesaurus-based approaches [22] use semantic knowledge for machine-based keyword extraction. Most keyword identification approaches use some kind of machine learning technique along with some other knowledge. However, for this project's purpose, a simple method to identify keywords is sufficient. This project employs hints from PoS tagging and then lets the data itself build a keyword lexicon while simultaneously detecting keywords.

¹ Modularity is the fraction of the edges that fall within the given groups minus the expected such fraction if edges were distributed at random. [Wikipedia, Accessed Dec 03, 2013]

2.3 Graph Clustering
Graphs have long been studied extensively from mathematical and theoretical viewpoints, and in the last few decades they have been studied even more extensively from a data-analytic perspective. Many real-world and physical phenomena can be modeled naturally as graphs. These graphs can then be investigated efficiently to find latent characteristics of the modeled data. One major operation on graphs in data mining is dividing them into smaller parts. Partitioning can be of different types. One approach is to partition the whole graph into disjoint subgraphs of similar size [23]. For analyzing graphs, a more natural division is often desired. Vertices in graphs tend to share edges with vertices whose own neighbors are also connected to one another, and thus form communities.
However, communities differ in size, and they are not disconnected from one another. Rather, there are fewer links between nodes of different communities than between nodes of the same community. Newman and others have conducted several studies [24] [25] [26] on detecting communities in graphs, exploiting the modularity¹ of the graph to do so. Most of their early works were restricted in scalability, but later spectral optimization of modularity yielded [27] an algorithm that works in near-linear time. Modularity-based approaches cluster a graph into disjoint communities. In practice, however, communities often overlap. Andersen et al. [28] suggested “Local Graph Partitioning using PageRank Vectors” and other derived algorithms. The core idea behind these approaches is to use the conductance² of the graph to cluster it locally. These approaches work in near-linear time and can detect communities that overlap. This project uses an approach devised by Andersen et al., as it serves several purposes of the project goal: it can detect communities that overlap, it works in near-linear time, and it can be implemented without creating the blown-up full matrix.

3. PROJECT DESCRIPTION
3.1 Problem Statement
People express their opinions about entities (viz. locations, persons, products etc.) in social networks. In brief, the goal is to:
• extract the overall public opinion of some entity
• elicit opinion-based entity groups in the recent stream
The scope of the project is to mine a popular microblogging platform: Twitter.

3.1.1 Extract overall public opinion of some entity
The goal is to extract opinion about a given entity, E. This is done in terms of ample recent tweets about E. The solution shall be able to yield the following about a given entity E:
• Overall sentiment: the overall sentiment (viz. positive, negative, mixed) about E. A sentiment score in the range [-1, 1] will be given.
  This will also show the percentage of positive, negative and neutral tweets (some threshold can be applied to distinguish between these three classes), as well as the count of analyzed tweets. A measure (e.g. variance) of how diverse the opinion is can also be included.
• Key description: the system will yield a set of keywords (kw) that are used to describe E.
An overall sentiment about an entity is useful to a multitude of clients for various applications. Sets of key descriptive words along with sentiment provide better insight into public feelings.

² Conductance is a measure of a subgraph denoting how strongly it is connected to the rest of the graph. It is the ratio of out-links from the subgraph to its volume (total edge count from the nodes in it).

3.1.2 Opinion based entity groups in recent stream
The goal is to detect how entities are grouped together in terms of sentiment and descriptive keywords. This is done based on a stream of recent tweets. Each tweet is individually analyzed, as in 3.1.1. Analysis of each tweet yields:
• The text of the tweet, T
• The entities discussed in it, E
• The keywords in it, kw
• The polarity score, P
These tuples (T, E, kw, P) are then used to build the E×kw bigraph such that:
• There exists an edge between E_i and kw_j if one or more tweets contain both E_i and kw_j.
• The edge has a weight indicating the co-occurrence of E_i and kw_j, i.e.
  weight_ij = Count({T_k | E_i ∈ T_k.E ∧ kw_j ∈ T_k.kw})
• The edge has a pScore that is the average of pScore (= P) over all such occurrences, i.e.
  pScore_ij = Sum({T_k.pScore | E_i ∈ T_k.E ∧ kw_j ∈ T_k.kw}) / weight_ij
After this, a filter is run on the graph to eliminate links between an entity and a keyword where the keyword is not sufficiently descriptive of the entity. This is done by calculating freq such that
  freq_ij = weight_ij / Occurrence(E_i)
If freq_ij is smaller than a certain threshold, ε_freq, then that keyword is filtered out for the entity E_i.
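The edge definitions above can be sketched in a few lines of Python. This is only an illustrative sketch, not the project's C# implementation: the function name `build_bigraph`, the input shape (one `(entities, keywords, p_score)` triple per analyzed tweet) and the default value of `eps_freq` are assumptions made for the example.

```python
from collections import defaultdict

def build_bigraph(tuples, eps_freq=0.05):
    """Build E×kw edges from (entities, keywords, p_score) triples.

    Hypothetical helper: names and the default eps_freq are illustrative.
    """
    weight = defaultdict(int)      # (E_i, kw_j) -> co-occurrence count
    p_sum = defaultdict(float)     # (E_i, kw_j) -> summed polarity score
    occurrence = defaultdict(int)  # E_i -> number of tweets mentioning E_i

    for entities, keywords, p_score in tuples:
        for e in set(entities):
            occurrence[e] += 1
            for kw in set(keywords):
                weight[(e, kw)] += 1
                p_sum[(e, kw)] += p_score

    edges = {}
    for (e, kw), w in weight.items():
        freq = w / occurrence[e]          # freq_ij = weight_ij / Occurrence(E_i)
        if freq >= eps_freq:              # drop keywords with freq below eps_freq
            edges[(e, kw)] = {"weight": w, "pScore": p_sum[(e, kw)] / w}
    return edges, dict(occurrence)
```

Storing the edges in a dictionary keyed by (entity, keyword) keeps memory proportional to the number of edges, which matches the sparse representation the report describes later.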
This E×kw bigraph is then used to build the E×E graph, such that there exists an edge between E_i and E_j if:
• Occurrence(E_i) > ε_eo ∧ Occurrence(E_j) > ε_eo
• {kw_x ∈ kw(E_i) | Occurrence(kw_x) < ε_kwo} ∩ {kw_x ∈ kw(E_j) | Occurrence(kw_x) < ε_kwo} is not empty
• The polarity bias for both is similar
In other words, there is an edge between two entities if they share one or more keywords through links with similar polarity bias. The entities must occur more often than a threshold, ε_eo. The keywords must not occur more than some threshold, ε_kwo, times. This threshold on keywords is motivated by the following intuition:
• If a potential word occurs in the description of most entities, then it is not a keyword but a generic term.
Then, a community detection algorithm is run on the E×E graph to find groups of entities that are bound together by many polarity-aligned keyword links. Once such a group of entities is generated, there is a group of keywords that occur on the edges within that community of nodes. A representative averaged pScore can also be calculated for such a group.

To summarize, given a stream of tweets, the system shall be able to generate:
• (T, E, kw, P) tuples
• The E×kw bigraph
• The E×E graph
• Groups of entities that have similar opinion

3.2 Data collection
3.2.1 Corpus and entity from Twitter
This project requires collecting two types of data. First, a corpus of subjective and objective tweets is collected; this data is used to train the classifier (scorer). After training, the trained model (not the training data set) can be stored in a file so that the scorer can later load it from the file. Secondly, at query time, posts are fetched from Twitter. The following Twitter APIs are used:
• search/tweets
  This API is called with 'q' = emoticons to gather training data (positive and negative posts). At query time, the same API is used with 'q' = query term to fetch related recent posts.
• statuses/user_timeline
  This API is used to fetch objective training data by querying 'screen_name' = popular_stream. I used Lifehacker, Gizmodo, the New York Times, and The Atlantic as sources.
The Twitter API does not allow fetching more than 100 posts at once. Hence, I had to use max_id to repeat the same call iteratively for different portions of the result. I collected on the order of ten thousand posts of each type for training. At query time, 200~2000 posts are fetched.

3.2.2 Mining recent twitter stream
To generate an E×E graph large enough to detect groupings of entities, a large portion of the Twitter public stream has to be collected. To do this, the Twitter API is again used and polled continuously over a large number of windows. Note that, in v1.1, the Twitter API allows only 180 search queries per window per user and 450 queries per window per app. Each query returns a maximum of 100 tweets. Currently, windows are 15 minutes each. Hence, max_id is used to fetch tweets continuously using a q=”.” query. An alternative to the search/tweets API could be a streaming API.

After tweets are fetched, very short tweets are discarded. I filtered out tweets with fewer than 50 characters, because shorter tweets are difficult to understand. Retweets (RT) are also discarded to avoid the same tweet occurring many times. Furthermore, another stage of filtering is applied to remove any remaining duplicate tweets.
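The collection and clean-up steps of Section 3.2 can be sketched as follows. This is a hedged illustration rather than the project's PHP/C# code: `fetch_page` is a hypothetical callable standing in for one search/tweets request (it takes a `max_id` and returns a list of dicts with `id` and `text`), retweets are assumed to be recognizable by a leading "RT @", and duplicates are assumed to be detected on case-insensitive, whitespace-normalized text.

```python
def fetch_all(fetch_page, max_pages):
    """Page backwards through results using max_id.

    fetch_page is a hypothetical stand-in for one API request.
    """
    tweets, max_id = [], None
    for _ in range(max_pages):
        page = fetch_page(max_id)
        if not page:
            break
        tweets.extend(page)
        # Ask next for tweets strictly older than the oldest one seen so far.
        max_id = min(t["id"] for t in page) - 1
    return tweets


def filter_tweets(texts, min_chars=50):
    """Drop short tweets, retweets and duplicates (assumed heuristics)."""
    seen, kept = set(), []
    for text in texts:
        norm = " ".join(text.split())       # collapse whitespace
        if len(norm) < min_chars:           # fewer than 50 characters
            continue
        if norm.startswith("RT @"):         # assumed retweet marker
            continue
        key = norm.lower()
        if key in seen:                     # remaining duplicates
            continue
        seen.add(key)
        kept.append(norm)
    return kept
```

In a real collector, the loop in `fetch_all` would also have to pause once the per-window query limit mentioned above is reached.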
3.2.3 PoS Tagging
After the tweets are collected, they are passed to a TreeTagger for PoS tagging. I used the recently developed GATE Twitter part-of-speech tagger [3], which is based on Stanford TreeTagger, which in turn is based on the famous TreeTagger [29] by Schmid. The PoS tags yielded are based on the Penn-Treebank-Tagset [30].

3.3 Implementation
3.3.1 Twitter corpus to train sentiment classifier
Each post is individually scored by two scorers. Following (Pak and Paroubek 2010) [1], two classifiers are built. To train them, tweets are queried as follows: (1) positive tweets are fetched with a search of q=””, (2) negative tweets are fetched with a search of q=”” and (3) objective tweets are fetched from news media accounts.

One classifier exploits the parts-of-speech (PoS) distribution among objective and polar statements. The PoS distribution also differs between positive and negative statements. See Figure 1 and Figure 2. The other classifier exploits the distribution of n-grams (n=2). N-grams correlate strongly with bias or with objectivity. Humans usually use common phrases to express a type of feeling. On the other hand, some phrases are of an assertive nature. This feature of natural language is captured using n-grams. See Table 2 for the top 20 polar n-grams of 94k n-grams.

The reference work used the classification results of the two classifiers to decide the final classification. This project enhances the approach by implementing the classifiers as scorers that evaluate a PoS score and an n-gram score for each statement. Both scores then contribute to the final score of the statement (tweet).

3.3.2 From collected tweets to graphs
As outlined in 3.1.2, (T, E, kw, P) tuples, the E×kw bigraph and the E×E graph are generated from a given stream of tweets.

3.3.2.1 Analyzing tweets
To do so, each tweet is first scored using the sentiment classifier described in 3.3.1. PoS tags are exploited to make a preliminary identification of entities and keywords.
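To make the idea of a classifier used as a scorer concrete, the following sketch scores a token list with add-one-smoothed naïve Bayes log-odds over bigrams and squashes the result into [-1, 1]. Everything specific here is an assumption for illustration: the report does not give its scoring formula, so the equal class priors, the `tanh` squashing, the class name `BigramPolarityScorer` and the weighted average in `combine` are not taken from the project.

```python
import math
from collections import Counter

def bigrams(tokens):
    """Adjacent token pairs (n-grams with n = 2)."""
    return list(zip(tokens, tokens[1:]))

class BigramPolarityScorer:
    """Illustrative scorer: naive Bayes log-odds over bigrams, equal priors."""

    def __init__(self):
        self.counts = {"pos": Counter(), "neg": Counter()}

    def train(self, label, tokens):
        self.counts[label].update(bigrams(tokens))

    def score(self, tokens):
        pos, neg = self.counts["pos"], self.counts["neg"]
        # Vocabulary size plus one extra bin for unseen bigrams.
        vocab = len(set(pos) | set(neg)) + 1
        tot_pos, tot_neg = sum(pos.values()), sum(neg.values())
        log_odds = 0.0
        for g in bigrams(tokens):
            p_pos = (pos[g] + 1) / (tot_pos + vocab)   # add-one smoothing
            p_neg = (neg[g] + 1) / (tot_neg + vocab)
            log_odds += math.log(p_pos / p_neg)
        return math.tanh(log_odds)                      # squash into [-1, 1]

def combine(pos_tag_score, ngram_score, w=0.5):
    """Assumed combination rule: a simple weighted average of both scores."""
    return w * pos_tag_score + (1 - w) * ngram_score
```

A PoS-based scorer could follow the same pattern with PoS tags in place of bigrams; how the project actually weights the two scores is not stated in the report.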
Entity: Our goal is to analyze entities (location, place, person, product etc.). In English, they are generally represented by proper nouns. Also, on Twitter, users can be regarded as entities. Hence, words tagged as proper nouns or users (NNP, NNPS, USR) are regarded as entities.

Keyword: In English, adjectives, adverbs and verbs are used to describe an entity. This property is exploited by identifying words tagged with these parts of speech (JJ, RB, VB etc.) as keywords. The algorithm also offers an alternative, controlled by a parameter, that includes common nouns (not NNP) as keywords.

3.3.2.2 Entity-keyword bigraph
The (T, E, kw, P) tuples from the analyzed tweets are iterated over to build the E×kw bigraph as described in 3.1.2. A general intuition, also confirmed by several studies, is that graphs are generally sparse. Thus, instead of building a full-blown matrix, two dictionaries (maps) are stored to represent the E×kw bigraph:
• A dictionary of entities, with pointers to keywords, as well as the weight and pScore associated with each link
• For ease of iteration, another dictionary of keywords, which stores pointers back from keywords to entities
This representation ensures small storage for the entire bigraph, yet describes the entire bigraph with its edges and nodes. It reduces the storage from (E*kw) to edgeCount. Note that 2*(E+kw) < edgeCount << (E*kw). The running time for building the bigraph is proportional to the number of edges, i.e. O(edges).

3.3.2.3 Entity-Entity graph
From the E×kw bigraph generated above, the E×E graph is generated by iterating over each entity. For each entity E_i, its set of keywords kw(E_i) is processed. Each keyword points to another set of entities, E(kw(E_i)). These entities are added to the neighbors of E_i. In this step too, a dictionary is used to represent the graph. It requires one dictionary of entities, where each entry also points to its immediate neighbors. This requires a storage of 2*edge. The runtime to build this graph is proportional to the number of edges.
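The construction of the E×E graph from the bigraph can be sketched as below. The sketch makes several assumptions that the report leaves open: the occurrence of a keyword is taken to be the number of entities it is linked to, "similar polarity bias" is taken to mean that both pScores have the same sign, and the function name, input shape and default thresholds are illustrative.

```python
from collections import defaultdict

def build_ee_graph(edges, occurrence, eps_eo=2, eps_kwo=350):
    """edges: {(entity, keyword): {"weight": int, "pScore": float}}.

    Returns {entity: set of neighbouring entities}. Thresholds and the
    same-sign test for polarity bias are assumptions for illustration.
    """
    # Reverse index: keyword -> [(entity, pScore of that link), ...]
    by_keyword = defaultdict(list)
    for (entity, kw), attrs in edges.items():
        by_keyword[kw].append((entity, attrs["pScore"]))

    neighbors = defaultdict(set)
    for kw, links in by_keyword.items():
        if len(links) >= eps_kwo:          # generic term, not a keyword
            continue
        for i, (a, p_a) in enumerate(links):
            for b, p_b in links[i + 1:]:
                frequent = occurrence[a] > eps_eo and occurrence[b] > eps_eo
                same_bias = p_a * p_b > 0  # assumed meaning of "similar bias"
                if frequent and same_bias:
                    neighbors[a].add(b)
                    neighbors[b].add(a)
    return dict(neighbors)
```

Pairing entities per keyword is quadratic in the number of entities sharing that keyword, so the keyword threshold also bounds the work done per keyword in this sketch.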
However, entities are filtered beforehand to remove nodes with very few neighbors from the simulation (thus building a set of significant entities). Filtering generic terms from the keyword list (thus using only legitimate keywords) reduces the search space.

3.3.3 Keywords from data
Keywords are filtered in several steps to let the data define the legitimate keywords. In the first step, PoS tagging defines a preliminary set. After all tweets are analyzed, a filter removes low-frequency terms from the keyword list of each entity. After the E×kw bigraph is built, another filter rules out generic terms. Generic terms are those potential keywords that are found with too many entities. A threshold parameter is supplied to the algorithm for this. Finally, after the communities are generated, a consolidation step filters out irregular keywords to yield the final set of keywords.

3.3.4 Community detection: group of entities
After the E×E graph is generated, consisting of legitimate keywords and significant entities, a community detection algorithm can be used to detect communities in it. This project implements a fast derivation of Andersen et al. [28].

Table 1. Community Detection Algorithm
1. Significant_entities := entities in (E×E)
2. Seed_node := supplied_seed
3. if (seed null or not exist) then seed := first(Significant_entities)
4. aCommunity := new Community()
5. entity := seed
6. eval := evaluate(entity, aCommunity)
7. if (eval.member) then
     aCommunity.Add(entity)
     remove(entity, Significant_entities)
     remove(aCommunity.Nbor, entity)
8. if (aCommunity.Nbor = empty) goto 11
9. entity := first(aCommunity.Nbor)
10. goto 5
11. add(aCommunity, Communities)
12. if (Significant_entities not empty) goto 4
13. return
The algorithm described above uses objects of the class Community. Its Add() member function adds the entity and updates the community's Volume (= edges inside) and outward links. The evaluate() function checks membership by calculating the conductance the community would have if the node were added and comparing it with the current conductance. Conductance is defined as
  Cond = (links outward from community) / (edges inside)
This generates a set of communities. After each community is generated, a consolidation step is performed to further filter keywords. This is done as follows:
  size := number of entities in the community
  Threshold := ln(size)
  if (Occurrence(kw) < Threshold) then Remove(kw)
After this step, a set of descriptive keywords is associated with the group of entities.

3.3.5 Storing result
The final set of communities is returned by the implementation as an XML document. The (T, E, kw, P) tuples are also returned as XML. The intermediate graphs, the E×kw bigraph and the E×E graph, are exported as CSV (comma separated value) files.

4. RESULTS AND FINDINGS
4.1 Findings
The findings reported here are based on 160,711 tweets collected in late November of 2013.

4.1.1 PoS Distributions and n-grams
Later in this section are figures of the PoS distributions over subjective-objective statements and positive-negative statements. A positive bias value in Figure 1 indicates that the presence of that PoS makes a statement more likely to be positive. The same holds for negative values. The subjectivity score in Figure 2 is to be read similarly. Table 2 shows the top few n-grams. Note that the PoS distributions and top n-grams differ slightly from the referred work [1]. Also, if training data is collected at a different time, some slight changes will occur.

Table 2. Top n-gram with occurrence in each class of data
n-gram                   Positive  Negative  Objective
'enjoying break'         1         328       1
'happy birthday'         22        207       1
'so happy'               106       53        1
'follow back'            10        132       1
'miss my'                93        10        1
'no one notices'         97        4         1
'notices my'             97        1         1
'good day'               5         82        1
'follow please'          47        38        1
'my phone'               64        18        1
'presenting emotional'   60        20        1
'please follow'          11        66        1
'follow love'            17        60        1
'am sorry'               71        4         1
'so sad'                 71        3         1
'miss u'                 65        7         1
'new followers'          53        17        1

Figure 1. Distribution of PoS in positive and negative statements
Bias by PoS tag: POS 0.600, WP$ 0.500, PDT 0.333, RBS 0.280, URL 0.229, WP 0.217, JJS 0.187, SYM 0.176, USR 0.155, FW 0.127, NNP 0.110, CD 0.068, DT 0.032, VB 0.000, UH -0.004, NN -0.007, JJR -0.010, IN -0.012, NNS -0.015, JJ -0.019, RBR -0.024, WDT -0.031, VBG -0.034, NNPS -0.050, VBZ -0.055, EX -0.064, MD -0.099, CC -0.102, PRP$ -0.114, PRP -0.135, VBP -0.144, TO -0.149, RP -0.175, RB -0.182, VBD -0.227, VBN -0.245, WRB -0.282

Figure 2. Distribution of PoS between subjective and objective tweets
Subjectivity by PoS tag: WRB 0.164, VBN 0.140, VBD 0.128, RB 0.100, RP 0.096, TO 0.081, VBP 0.078, PRP 0.072, PRP$ 0.061, CC 0.054, MD 0.052, EX 0.033, VBZ 0.028, NNPS 0.025, VBG 0.017, WDT 0.016, RBR 0.012, JJ 0.010, NNS 0.008, IN 0.006, JJR 0.005, NN 0.003, UH 0.002, VB 0.000, LS 0.000, DT -0.016, CD -0.033, NNP -0.052, FW -0.060, USR -0.072, SYM -0.081, JJS -0.085, WP -0.098, URL -0.103, RBS -0.123, PDT -0.143, WP$ -0.200, POS -0.231

4.1.2 Power law in Entity and Keywords
Figure 3 and Figure 4 show how entities and keywords follow a power law.

Figure 3. ln(Occurrence) of entities shows power law
Figure 4. ln(Occurrence) of keywords shows power law

4.1.3 Distribution of Polarity Score in Entities
Figure 5 shows how the polarity score is distributed among entities. It can be seen that the polarity score has a skewed distribution. Figure 6 shows the distribution of the polarity score over the natural logarithm (ln) of the occurrence of the entity.
Figure 5. Distribution of Polarity Score over entire entity space
Figure 6. Polarity Score over ln(Occurrence) of entities

4.1.4 Graph BFS & communities in adjacency matrix
Starting from an arbitrary node, the E×E graph is traversed breadth-first (BFS) to generate an arbitrary ordering of the nodes. This BFS assigns an index to each entity, and the adjacency matrix is then visualized as in Figure 7. Notice that this is a near-diagonal matrix, although the diagonal itself is white, as there are no self-edges. Notice the blocks; these blocks are representative of communities. There are tiny and large communities. There are 157 communities, with a maximum size of 136.

Figure 7. Adjacency matrix of significant entities

4.1.5 Observation of Groups
Feed tweet sets of different sizes were examined. The number of significant entities and the number of legitimate keywords increase with the number of tweets. All sets yield communities of different sizes. When these communities and keywords were examined manually, they matched intuition. An interesting community, in which the keyword "cries" is associated with two stars, is shown in Figure 8.

<Community id="146" size="2" conductance="0.5" pScore="0.63566754320156">
  <trapped-keywords count="1"> Cries:4, </trapped-keywords>
  <e>Kristen Stewart</e>
  <e>Robert Pattinson</e>
</Community>
Figure 8. XML representation of a community

4.2 Results
Figure 9 shows some sample runs in which the system is queried for the overall sentiment analysis of an entity.
<opinion entity='mermaid'>
  <score>0.21</score>
  <analysis post-count='1086' percent-positive='52.03' percent-negative='24.59'/>
</opinion>
<opinion entity='bankrupt'>
  <score>-0.18</score>
  <analysis post-count='2073' percent-positive='30.29' percent-negative='47.03'/>
</opinion>
<opinion entity='drunk man'>
  <score>-0.50</score>
  <analysis post-count='1084' percent-positive='11.99' percent-negative='65.59'/>
</opinion>
<opinion entity='November'>
  <score>0.20</score>
  <analysis post-count='2062' percent-positive='53.25' percent-negative='25.12'/>
</opinion>

Figure 9. Result runs for query over entity

A few parameters are varied on the sample to see how they affect the outcome. The keyword threshold (εkwo), the minimum number of nodes (εeo) and the use of common nouns as keywords are varied; the results are shown in Table 3. Using common nouns as keywords yields a few groups of very large size. It is therefore recommended to discard common nouns from the keywords.

Table 3. Effect of parameter changes

Kw threshold                 350      350      450
Minimum nodes                  2        2        2
Common Noun as keyword     false     true    false
Potential kw               15108    31593    15108
Legitimate kw              14967    31368    14997
Entities                   97147    97147    97147
E occurring > 2             7580     7580     7580
Significant E.              1190     2012     1378
Groups                       170       92      157
Largest size                  70     1256      136

The polarity scores of all entities and keywords are stored and can be accessed directly in the E×kw bigraph. Building a polarity-invariant E×kw bigraph is also tested. With settings similar to those of the last column in Table 3, the polarity-invariant version generated 174 groups, the largest of size 598, for 1854 significant entities. The generated groups are also significantly different.
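The per-entity summaries of Figure 9 can be approximated from a list of per-tweet polarity scores. The sketch below is illustrative only and is not the project's implementation: it assumes that the overall score is the mean of the tweet scores and that a tweet counts as positive or negative when its score lies beyond ±0.5, the threshold used for testing in Section 4.3.1. The function name `summarize_opinion`, the dictionary keys and the sample scores are hypothetical.

```python
def summarize_opinion(entity, scores, threshold=0.5):
    """Summarize per-tweet polarity scores for one entity.

    Assumptions (not confirmed by the report): the overall score is the
    arithmetic mean, and a tweet is positive if score > threshold and
    negative if score < -threshold.
    """
    if not scores:
        raise ValueError("no scored posts for this entity")
    n = len(scores)
    positive = sum(1 for s in scores if s > threshold)
    negative = sum(1 for s in scores if s < -threshold)
    return {
        "entity": entity,
        "score": round(sum(scores) / n, 2),
        "post_count": n,
        "percent_positive": round(100.0 * positive / n, 2),
        "percent_negative": round(100.0 * negative / n, 2),
    }


# Hypothetical scores for four tweets mentioning one entity.
print(summarize_opinion("mermaid", [0.8, 0.6, -0.7, 0.1]))
```

Tweets whose scores fall between the two thresholds count toward neither percentage, which is why the positive and negative shares need not sum to 100.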
Files containing the generated result sets are kept online at http://meaningofdata.com/mining

4.3 Performance
4.3.1 Sentiment Scoring
There is no available way to evaluate the correctness of the overall sentiment analysis. Therefore, the performance of individual scoring is tested against a publicly available, Mechanical Turk annotated Twitter data set [18]. This data set includes 3771 annotated tweets. Note that each tweet was annotated by three humans: of the 21600 tweets they annotated, all three agreed on only 3771. Because the test set has a strict classification, test tweets are scored and then classified with a threshold of 0.5 (i.e. tweets scoring above +0.5 are regarded as positive, those scoring below -0.5 as negative, and the rest as neutral). This yields only a 61% match with the sentiment labels of the test data. However, most disagreements occur in entries annotated as non-biased. For biased posts, the mismatch is around 26%.

4.3.2 Opinion based entity groups in recent stream
Collecting data from Twitter takes a long time because of the restriction on queries per window. The third-party GATE TreeTagger used here is slow, and this hinders overall performance. However, the part implemented for this project is fast and scales roughly linearly with sample size. Table 4 lists how the running time varies with the size of the sample.

Table 4. Performance of graph analysis for different data sizes

                           Sample 1   Large Sample   Very large Sample
Tweets                       160711         485447              847276
Time to analyze each sample  48.91s        148.53s             262.01s
Build Bigraph                 9.29s         34.24s              66.45s
Generate EE graph             1.54s          3.49s               4.99s
Time to Find Groups          0.126s         0.310s              0.358s
Groups count                    157            334                 457
Largest Group size              136            183                 162
Significant Entities           1378           2627                3560
Legitimate Keywords           14997          25818               35005

5. Conclusion
This project has devised and studied an approach to mining a social network in order to elicit public opinion about entities.
Public opinion is represented in two ways: as an analysis of individual entities, and as a graph analysis of entities based on polarity-aligned keyword relationships. Sentiment analysis itself is still an open problem and needs further investigation. This project analyzes the sentiment of tweets with an approach that is built using Twitter as the learning corpus. The approach yields a polarity score rather than a discrete polarity marker. To elicit the overall opinion about an entity, an aggregate polarity score and representative keywords are determined. To group entities, an entity graph is built from the entity-keyword bigraph involving polarity scores, and a local community detection mechanism is then used to cluster them. The problem of detecting keywords is solved as an embedded step: while entity groups are built from the collected tweets, keywords are filtered from the raw candidate set down to the final set of keywords. This approach can be useful for building keyword lexicons dynamically. A report of sample runs of the implementation is also included in this document. Several key observations are noted in Section 4.

6. References
[1] A. Pak and P. Paroubek, "Twitter as a Corpus for Sentiment Analysis and Opinion Mining," in Language Resources and Evaluation, 2010.
[2] Twitter, "REST API v1.1 Resources," [Online]. Available: https://dev.twitter.com/docs/api/1.1.
[3] "GATE Twitter part-of-speech tagger," [Online]. Available: https://gate.ac.uk/wiki/twitter-postagger.html.
[4] B. Pang, L. Lee and S. Vaithyanathan, "Thumbs up? Sentiment Classification using Machine Learning Techniques," in Proceedings of the ACL-02 conference on Empirical methods in natural language processing, Philadelphia, PA, USA, 2002.
[5] T. Wilson, J. Wiebe and P. Hoffmann, "Recognizing contextual polarity in phrase-level sentiment analysis," in HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, Stroudsburg, PA, USA, 2005.
[6] A. Esuli and F. Sebastiani, "Sentiwordnet: A publicly available lexical resource for opinion mining," in Proceedings of LREC, 2006.
[7] S. Baccianella, A. Esuli and F. Sebastiani, "SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining," in LREC, 2010.
[8] M. Taboada, J. Brooke, M. Tofiloski, K. Voll and M. Stede, "Lexicon-based methods for sentiment analysis," Computational linguistics, vol. 37, pp. 267-307, 2011.
[9] V. Hatzivassiloglou and J. M. Wiebe, "Effects of adjective orientation and gradability on sentence subjectivity," in Proceedings of the 18th conference on Computational linguistics-Volume 1, 2000.
[10] C. Whitelaw, N. Garg and S. Argamon, "Using appraisal groups for sentiment analysis," in Proceedings of the 14th ACM international conference on Information and knowledge management, 2005.
[11] F. Benamara, C. Cesarano, A. Picariello, D. Reforgiato and V. Subrahmanian, "Sentiment Analysis: Adjectives and Adverbs are better than Adjectives Alone," in International Conference on Weblogs and Social Media, Boulder, CO USA, 2007.
[12] V. S. Subrahmanian and D. Reforgiato, "AVA: Adjective-verb-adverb combinations for sentiment analysis," Intelligent Systems, vol. 23, no. 4, pp. 43-50, 2008.
[13] T. Mullen and N. Collier, "Sentiment Analysis using Support Vector Machines with Diverse Information Sources," in EMNLP, 2004.
[14] A. Bifet and E. Frank, "Sentiment knowledge discovery in twitter streaming data," in Discovery Science, Berlin Heidelberg, Springer, 2010, pp. 1-15.
[15] C. Lin and Y. He, "Joint sentiment/topic model for sentiment analysis," in Proceedings of the 18th ACM conference on Information and knowledge management, 2009.
[16] S. Tan, Y. Li, H. Sun, Z. Guan, X. Yan, J. Bu, C. Chen and X. He, "Interpreting the Public Sentiment Variations on Twitter," IEEE Transactions on Knowledge and Data Engineering, vol. 6, no. 1, pp. 1-14, 2012.
[17] X. Ding, B. Liu and P. S. Yu, "A holistic lexicon-based approach to opinion mining," in WSDM '08 Proceedings of the 2008 International Conference on Web Search and Data Mining, New York, NY, USA, 2008.
[18] S. Narr, "Annotated Twitter Sentiment Dataset," [Online]. Available: http://data.dai-labor.de/corpus/sentiment/. [Accessed 7 10 2013].
[19] "Sentiment140," [Online]. Available: http://www.sentiment140.com.
[20] K. Zhang, H. Xu, J. Tang and J. Li, "Keyword Extraction Using Support Vector Machine," in Advances in Web-Age Information Management, Springer, 2006, pp. 85-96.
[21] A. Hulth, "Improved automatic keyword extraction given more linguistic knowledge," in EMNLP '03 Proceedings of the 2003 conference on Empirical methods in natural language processing, Stroudsburg, PA, USA, 2003.
[22] O. Medelyan and I. H. Witten, "Thesaurus based automatic keyphrase indexing," in Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries, 2006.
[23] G. Karypis and V. Kumar, "Multilevel k-way Partitioning Scheme for Irregular Graphs," J. Parallel Distrib. Comput., vol. 48, no. 1, pp. 96-129, 1998.
[24] M. Girvan and M. E. J. Newman, "Community structure in social and biological networks," in Proc. Natl. Acad. Sci. USA, 1999.
[25] M. E. J. Newman, "Fast algorithm for detecting community structure in networks," in Phys. Rev. E 69, 066133, 2004.
[26] A. Clauset, M. E. J. Newman and C. Moore, "Finding community structure in very large networks," in Phys. Rev. E 70, 066111, 2004.
[27] M. E. J. Newman, "Modularity and community structure in networks," in Proc. Natl. Acad. Sci. USA 103, 8577–8582, 2006.
[28] R. Andersen, F. Chung and K. Lang, "Local graph partitioning using pagerank vectors," in Foundations of Computer Science, FOCS'06. 47th Annual IEEE Symposium on, 2006.
[29] H. Schmid, "TreeTagger," TC project at the Institute for Computational Linguistics of the University of Stuttgart, 1994.
[30] B. Santorini, Part-of-speech tagging guidelines for the Penn Treebank Project, 3rd revision ed., 1990.
[31] A. Go, R. Bhayani and L. Huang, "Twitter sentiment classification using distant supervision," Stanford, 2009.
[32] L. Derczynski, A. Ritter, S. Clark and K. Bontcheva, "Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data," in Proceedings of the International Conference on Recent Advances in Natural Language Processing, 2013.