The document discusses using sentiment analysis on tweets to predict time series data like stock markets or box office success. It involves 3 parts: 1) classifying tweet sentiment, 2) building a network of Twitter users, and 3) finding a time series of sentiment for each user. Methods discussed include classifying tweets as spam/not spam and objective/subjective/positive/negative, using an algorithm to extract opinion words and targets, and using community detection and LDA to analyze the Twitter user network. The end goal is to use the sentiment time series to predict real-world time series data.
Twitter Sentiment Analysis for Time Series Prediction
1. Sentiment Analysis
1. Discover a niche network of Twitter users
2. Model their emotions on topics
3. Use feelings to more accurately predict a
time series, e.g. the stock market or
box office success
4. Are some [users/networks] more influential
than others?
2. This Talk
The Design Decision
The Core Goals
The 3 parts of the project:
1. Classifying the SENTIMENT of tweets
2. Building a NETWORK of Twitter users
3. Finding a TIME SERIES of sentiment for each
user
3. Sentiment Analysis Used Already
Derwent Capital Markets - "The Twitter
hedge fund"
£25m fund
Uses 10% of tweets
Predicts Dow Jones movement direction with
87.6% accuracy
Returned 1.85% in its first month of trading
Johan Bollen, Indiana University, used a
bag-of-words approach
6. Design Decision
Many paragraphs of text (Product Reviews)
+ : Better accuracy of prediction
- : Less data overall
Huge number of short texts (Twitter)
+ : Opinions of a greater number of people,
at high enough frequency to model as a signal
- : Classification of opinion is very poor
=> TWITTER
7. 2 Current Aims (will change later)
1. Project aims to be context-independent
(e.g. works for both movies & products)
2. When context is given, use it to
better classify tweets
8. 1: Sentiment Analysis of Tweets
Three-tier classification process:
tweet -> spam / not spam
not spam -> objective / subjective
subjective -> positive / negative
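The three tiers can be read as a cascade in which each stage only sees what the previous stage let through. The following is a minimal sketch of that control flow, not the classifiers used in the project: the word lists and the functions `is_spam`, `is_subjective` and `polarity` are hypothetical placeholders standing in for trained models.

```python
# Hypothetical placeholder word lists; a real system would use trained classifiers.
SPAM_MARKERS = {"free", "click", "winner"}
POSITIVE_WORDS = {"good", "great", "fantastic", "love"}
NEGATIVE_WORDS = {"bad", "awful", "boring", "hate"}


def tokens(text):
    """Lowercase the text and strip surrounding punctuation from each word."""
    return [w.strip(".,!?\"'").lower() for w in text.split()]


def is_spam(text):
    """Tier 1 (placeholder): spam if any spam marker word appears."""
    return any(t in SPAM_MARKERS for t in tokens(text))


def is_subjective(text):
    """Tier 2 (placeholder): subjective if any opinion word appears."""
    return any(t in POSITIVE_WORDS | NEGATIVE_WORDS for t in tokens(text))


def polarity(text):
    """Tier 3 (placeholder): count positive words minus negative words."""
    toks = tokens(text)
    score = sum(t in POSITIVE_WORDS for t in toks) - sum(t in NEGATIVE_WORDS for t in toks)
    return "positive" if score >= 0 else "negative"


def classify(text):
    """Run the cascade: spam check, then objectivity check, then polarity."""
    if is_spam(text):
        return "spam"
    if not is_subjective(text):
        return "objective"
    return polarity(text)


if __name__ == "__main__":
    for tweet in ["Click here for a free phone",
                  "The film opens on Friday",
                  "Fantastic movie!"]:
        print(tweet, "->", classify(tweet))
```

Keeping each tier as a separate function means any one stage can be swapped for a better model without touching the others.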
9. 1: Sentiment Analysis of Tweets
Double-Back Propagation Algorithm
ACL Journal, March 2011, MIT Press
Opinion Word Extraction & Target Extraction
4 rules
"The phone has a good screen"
=> add "good" to list of adjectives
=> add "screen" to list of nouns
Etc.
Great for rating features of a product
Not great for tweets
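To make the extraction idea concrete, here is a heavily simplified sketch of the propagation loop. It assumes words are already POS-tagged ('A' adjective, 'N' noun) and uses a single stand-in rule, "an adjective directly before a noun modifies it", in place of the 4 rules from the paper, which are not reproduced here. The function name `propagate` and the example sentences are illustrative only.

```python
def propagate(sentences, seed_opinions):
    """Grow opinion-word and target sets from a seed list.

    sentences: list of sentences, each a list of (word, tag) pairs.
    Stand-in rule: in an adjective-noun pair, a known opinion word
    marks the noun as a target, and a known target marks the
    adjective as an opinion word. Repeat until nothing new is found.
    """
    opinions = set(seed_opinions)
    targets = set()
    changed = True
    while changed:
        changed = False
        for tagged in sentences:
            for (word, tag), (nxt, nxt_tag) in zip(tagged, tagged[1:]):
                if tag != "A" or nxt_tag != "N":
                    continue
                if word in opinions and nxt not in targets:
                    targets.add(nxt)
                    changed = True
                if nxt in targets and word not in opinions:
                    opinions.add(word)
                    changed = True
    return opinions, targets


# Hand-tagged example sentences (illustrative only).
SENTENCES = [
    [("the", "D"), ("phone", "N"), ("has", "V"), ("a", "D"), ("good", "A"), ("screen", "N")],
    [("a", "D"), ("sharp", "A"), ("screen", "N")],
    [("a", "D"), ("sharp", "A"), ("camera", "N")],
]

if __name__ == "__main__":
    print(propagate(SENTENCES, {"good"}))
```

Starting from the seed "good", the loop finds "screen", then "sharp" (because it modifies "screen"), then "camera". This dependence on clean grammatical structure is one reason the approach suits product reviews better than tweets.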
10. 1: Sentiment Analysis of Tweets
Twitter Part Of Speech (POS) tagger:
www.ark.cs.cmu.edu/TweetNLP/
Written in Java
Max Ent
Example output (token, tag):
" ^
Drive ^
" ^
, ,
go V
and &
watch V
it O
! ,
Fantastic A
movie N
. ,
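Output like the token/tag listing above is easy to consume from Python. The sketch below assumes one token and one tag per line, separated by whitespace, as shown on the slide; the tagger's actual output format may differ, and `parse_tagged` and `words_with_tag` are hypothetical helper names.

```python
# Meaning of the tags that appear in the example (from the tagger's tagset).
TAG_NAMES = {
    "^": "proper noun",
    "V": "verb",
    "&": "coordinating conjunction",
    "O": "pronoun",
    "A": "adjective",
    "N": "common noun",
    ",": "punctuation",
}


def parse_tagged(lines):
    """Split each 'token tag' line on its last run of whitespace."""
    pairs = []
    for line in lines:
        parts = line.rsplit(None, 1)
        if len(parts) == 2:  # skip blank or malformed lines
            pairs.append((parts[0], parts[1]))
    return pairs


def words_with_tag(pairs, tag):
    """Return the tokens carrying the given tag, in order."""
    return [token for token, t in pairs if t == tag]


EXAMPLE = ["Drive ^", ", ,", "go V", "and &", "watch V", "it O",
           "! ,", "Fantastic A", "movie N", ". ,"]

if __name__ == "__main__":
    pairs = parse_tagged(EXAMPLE)
    print("adjectives:", words_with_tag(pairs, "A"))
    print("nouns:", words_with_tag(pairs, "N"))
```

Pulling out adjectives and nouns this way gives the raw material for opinion-word and target lists.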
14. 2: Building a Network
Community detection:
Paper 1: Near linear time algorithm for
detecting community structures on large
scale networks
Paper 2: An LDA-based Community Structure
Discovery Approach for Large-Scale Social
Networks (Haizheng Zhang)
15. 2: Building a Network
GraphLab: like MapReduce, but
instead of "map" and "reduce":
Map = 'Update':
modify overlapping sets of data
Reduce = 'Sync': perform reductions in the
background while sync is running
Label Propagation & LDA
16. 3: Time series prediction
Will pass time series from Python to R
using the rpy2 module
R has a great package, "quantmod", for
importing financial market data.
Can also import other time series
very easily, and R has many great libraries.
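Before anything is handed to R, the per-tweet sentiment labels have to be turned into a time series. The sketch below shows one plausible way to do that in plain Python: average the scores per day, then check how often the sign of the sentiment matches the sign of the next day's market change. The function names, the one-day lag and all of the numbers are assumptions made for illustration; none of them come from the project.

```python
from collections import defaultdict
from datetime import date


def daily_sentiment(scored_tweets):
    """Average +1/-1 tweet scores per day; returns [(day, mean)] sorted by day."""
    by_day = defaultdict(list)
    for day, score in scored_tweets:
        by_day[day].append(score)
    return [(day, sum(scores) / len(scores)) for day, scores in sorted(by_day.items())]


def direction_agreement(sentiment, market_change, lag=1):
    """Fraction of days where the sign of sentiment on day t matches
    the sign of the market change on day t + lag."""
    pairs = list(zip(sentiment, market_change[lag:]))
    if not pairs:
        return 0.0
    hits = sum((s > 0) == (m > 0) for s, m in pairs)
    return hits / len(pairs)


# Invented example data: (day, score) with +1 = positive, -1 = negative.
TWEETS = [
    (date(2012, 1, 2), 1), (date(2012, 1, 2), 1),
    (date(2012, 1, 3), -1), (date(2012, 1, 3), -1),
    (date(2012, 1, 4), 1),
]
# Invented daily market changes, one more day than the sentiment series.
MARKET_CHANGE = [0.2, 0.5, -0.3, 0.1]

if __name__ == "__main__":
    series = daily_sentiment(TWEETS)
    values = [mean for _, mean in series]
    print(series)
    print(direction_agreement(values, MARKET_CHANGE, lag=1))
```

A direction-agreement score is only a crude first check; the proper time series modelling is the part intended for R.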
17. Built With
Python - For majority of code
Packages: numpy, scipy, matplotlib
networkx, graphviz, rpy2
django, twython, nltk
R - For time series analysis
PostgreSQL - SQL database
Java - Twitter POS tagger
C/C++ - GraphLab
18. End Product
[Diagram; recoverable box labels: IMDB Movie Review Corpora, Double-Back Prop. Algo, Sentiment Analysis, Tweets]
19. Thank You
Mike Davies
Documented at www.m1ked.com
20. Notes: Vowpal Wabbit LDA
Vowpal Wabbit is an open source library
for fast online learning (mostly SGD)
mainly developed by a guy at Yahoo.
Optimised for speed
Its LDA uses clever tricks, such as vectorisation
and floating-point representation tricks, to avoid
using the pow() and exp() functions.
21. Notes: Label Propagation
Label Propagation has been proven to be an
effective semi-supervised learning approach in
many applications. The key idea behind label
propagation is to first construct a graph in which
each node represents a data point and each
edge is assigned a weight often computed as
the similarity between data points, then
propagate the class labels of labeled data to
neighbors in the constructed graph in order to
make predictions.
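The description above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the GraphLab implementation: the graph is a plain dict of weighted neighbours, seed labels stay fixed, and each unlabelled node repeatedly takes the label with the largest total edge weight among its labelled neighbours. The function name, the update schedule and the example graph are assumptions.

```python
def propagate_labels(edges, seed_labels, max_iters=10):
    """Spread seed labels over a weighted graph.

    edges: dict node -> dict neighbour -> weight (assumed symmetric).
    seed_labels: dict node -> label for the labelled data points.
    Each round, every non-seed node adopts the label with the highest
    summed edge weight among its currently labelled neighbours.
    """
    labels = dict(seed_labels)
    for _ in range(max_iters):
        updated = dict(labels)
        for node in sorted(edges):
            if node in seed_labels:
                continue  # seed labels are never overwritten
            votes = {}
            for neighbour, weight in edges[node].items():
                if neighbour in labels:
                    label = labels[neighbour]
                    votes[label] = votes.get(label, 0.0) + weight
            if votes:
                # sorted() makes tie-breaking deterministic
                updated[node] = max(sorted(votes), key=votes.get)
        if updated == labels:
            break  # converged
        labels = updated
    return labels


# Illustrative chain a - b - c - d with a weak link between b and c.
GRAPH = {
    "a": {"b": 1.0},
    "b": {"a": 1.0, "c": 0.2},
    "c": {"b": 0.2, "d": 1.0},
    "d": {"c": 1.0},
}

if __name__ == "__main__":
    print(propagate_labels(GRAPH, {"a": "pos", "d": "neg"}))
```

In the example, b follows a and c follows d because the edge between b and c carries little weight. The same mechanism, with similarity as the edge weight, carries labels between similar data points.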