Finding Relevant Tweets in Social Media

Separating the Wheat from the Chaﬀ
Finding Relevant Tweets in Social Media Streams
Na’im Tyson, PhD
Sciences, About.com
April 20, 2017
Tyson (About.com) Finding Relevance in Tweets April 20, 2017 1 / 23

1 Introduction

1 Introduction
2 Ingesting Text Data

1 Introduction
3 Document Preprocessing

1 Introduction
4 Process Steps

1 Introduction
4 Process Steps
5 Tokenization

1 Introduction
4 Process Steps
5 Tokenization
6 Vectorization

1 Introduction
4 Process Steps
5 Tokenization
6 Vectorization
7 Clustering

1 Introduction
4 Process Steps
5 Tokenization
6 Vectorization
7 Clustering
8 Model Diagnostics

1 Introduction
4 Process Steps
5 Tokenization
6 Vectorization
7 Clustering
8 Model Diagnostics
9 Roads Not Taken

Introduction Consultant Role
Your Role as Consultant. . .
• Advise on open source and proprietary analytical solutions for small- to
medium-sized businesses

• Build solutions to solve business goals using Open Source Software
(whenever possible)

(whenever possible)
• Develop systems to monitor solutions over time (when requested) OR

(whenever possible)
• Develop systems to monitor solutions over time (when requested) OR
• Develop diagnostics to monitor model behaviour

Introduction Client Description
Brand Intelligence Firm
• Boutique Social Monitoring & Analysis Firm

• Provide quantitative summaries from qualitative data (tweets, facebook
posts, web pages, etc.)

• Analytics Dashboards

• How do they acquire data?

• How do they acquire data?
• Data Collector/Aggregation Services
• Collect social data from multiple APIs
• Saves engineering resources

Introduction Project Scope
Business Problem
Imagine: One batch of data - tweets

Business Problem
Relevance: How do you know which ones are relevant to the brand?

Business Problem
Labeling: Would Turkers make good labelers for marking tweets as relevant?

Business Problem
Cost: How many tweets will they label for creating a
model?

Business Problem
model?
Scalability: Labeling thousands or hundreds of thousands of
tweets

Business Problem
model?
tweets
Consistency: How do you know whether they are consistent
labelers?

Business Problem
model?
tweets
labelers?
• Implementation of consistency labeling
statistics

Business Problem
model?
tweets
labelers?
• Implementation of consistency labeling
statistics
Goal: Establish a system for programmatically computing relevance of tweets

Ingesting Text Data Scraping & Crawling
Most of the methods in this section—except the last two—came from
[Bengfort (2016)]

Two Sides of the Same Coin?
• Scraping (from a web page) is an information extraction task

• Text content, publish data, page links or any other goodies

• Crawling is an information processing task

• Traversal of a website’s link network by crawler or spider

• Find out what you can crawl before you start crawling!

• Find out what you can crawl before you start crawling!
• Type into Google search: <DOMAIN NAME> robots.txt

Sample Scrape & Crawl in Python
import bs4
import requests
from slugify import slugify
sources = ['https://www.washingtonpost.com', 'http://www.nytimes.com/',
'http://www.chicagotribune.com/', 'http://www.bostonherald.com/',
'http://www.sfchronicle.com/']
def scrape_content(url, page_name):
try:
page = requests.get(url).content
filename = slugify(page_name).lower() + '.html'
with open(filename, 'wb') as f:
f.write(page)
except:
pass
def crawl(url):
domain = url.split("//www.")[-1].split("/")[0]
html = requests.get(url).content
soup = bs4.BeautifulSoup(html, "lxml")
links = set(soup.find_all('a', href=True))
for link in links:
sub_url = link['href']
page_name = link.string
if domain in sub_url:
scrape_content(sub_url, page_name)
if __name__ == '__main__':
for url in sources:
crawl(url)

Ingesting Text Data RSS Reading
• RSS = Real Simple Syndication

• Standardized XML format for syndicated text content

• Standardized XML format for syndicated text content
import bs4
import feedparser
from slugify import slugify
feeds = ['http://blog.districtdatalabs.com/feed',
'http://feeds.feedburner.com/oreilly/radar/atom',
'http://blog.revolutionanalytics.com/atom.xml']
def rss_parse(feed):
parsed = feedparser.parse(feed)
posts = parsed.entries
for post in posts:
html = post.content[0].get('value')
soup = bs4.BeautifulSoup(html, 'lxml')
post_title = post.title
filename = slugify(post_title).lower() + '.xml'
TAGS = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'h7', 'p', 'li']
for tag in soup.find_all(TAGS):
paragraphs = tag.get_text()
with open(filename, 'a') as f:
f.write(paragraphs + 'n n')

Ingesting Text Data APIs
API Details & Sample Python

• API = application programming interface

• Allows interaction between a client and server-side service that are
independent of each other

• Usually requires an API key, an API secret, an access token, and an
access token secret
• Twitter requires registration at https://apps.twitter.com for API credentials
—import tweepy

• Usually requires an API key, an API secret, an access token, and an
access token secret
• Twitter requires registration at https://apps.twitter.com for API credentials
—import tweepy
import oauth2
API_KEY = ' '
API_SECRET = ' '
TOKEN_KEY = ' '
TOKEN_SECRET = ' '
def oauth_req(url, key, secret, http_method="GET", post_body="",
http_headers=None):
consumer = oauth2.Consumer(key=API_KEY, secret=API_SECRET)
token = oauth2.Token(key=key, secret=secret)
client = oauth2.Client(consumer, token)
resp, content = client.request(url, method=http_method,
body=post_body, headers=http_headers)
return content

Ingesting Text Data PDF Miner
PDF to Text
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
def convert_pdf_to_txt(path, codec='utf-8', password='', maxpages=0, caching=True, pages=None):
''' convert pdf to text using PDFMiner.
:param codec: target encoding of text
:param password: password for the pdf if it is password-protected
:param maxpages: maximum number of pages to extract
:param caching: boolean
:param pages: a list of page number to extract from the pdf (zero-based)
:return: text string of all pages specified in the pdf
'''
rsrcmgr = PDFResourceManager()
retstr = StringIO()
device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=LAParams())
interpreter = PDFPageInterpreter(rsrcmgr, device)
pagenos = set(pages) if pages else set()
with open(path, 'rb') as fp:
for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
password=password, caching=caching,
check_extractable=True):
interpreter.process_page(page)
device.close()
txt = retstr.getvalue()
retstr.close()
return txt

Document Preprocessing Business Considerations
• Every tweet is a document

• Reject retweets

• Reject retweets
• Ignore (toss) hypertext links

• Reject retweets
• Why might this be a bad idea?

• Reject retweets
• Hint: can links tell you about relevant tweets?

• Reject retweets
• Hint: can links tell you about relevant tweets?
RT @chriswheeldon2: #pinchmeplease so honored. #Beginnersluck Congrats to all at @AmericanInParis for Best Mus
------------------------------------------------------------
RT @VanKaplan: .@AmericanInParis won Best Musical @ Outer Critics Awards! http://t.co/3y9Xem0c9I @PittsburghCLO
------------------------------------------------------------
RT @cope_leanne: Congratulations @AmericanInParis @chriswheeldon2 @robbiefairchild 4 outer Critic Circle wins .
------------------------------------------------------------
.@robbiefairchild @chriswheeldon2 Congrats on Outer Critics Circle Awards for your brilliant work in @AmericanI

Document Preprocessing Cleaning Code
def extract_links(text):
''' get hypertext links in a piece of text. '''
regex = r'https?://[^s<>"]+|www.[^s<>"]+'
return re.findall(regex, text)
def clean_posts(postList):
''' remove retweets found w/in posts. keep a cache of urls to keep track
of a mapping b/t a unique token for that url and the url itself. '''
retweet_regex = r'^RT @w+:'
url_cache = {}
link_num = 1
cleaned_posts = []
for post in postList:
if re.match(retweet_regex, post): continue
urls = extract_links(post)
for url in urls:
if url not in url_cache:
url_cache.setdefault(url, 'LINK{0}'.format(link_num))
link_num = link_num + 1
post = post.replace(url, url_cache[url])
cleaned_posts.append(post.strip())
return cleaned_posts
def get_posts(post_filepath):
postlist = open(post_filepath).read().splitlines()
postlist = [p for p in postlist if len(p) > 0 and not p.startswith('---')]
return postlist

Process Steps
Inspired by [Richert (2014)]
• Feature Extraction

Process Steps
• Extract salient features from each tweet; store it as a vector

Process Steps
• Cluster Vectors (of Tweets)

Process Steps
• Cluster Vectors (of Tweets)
• Determine the cluster for the tweet in question

Tokenization Tokenizing Tweets
from nltk.tokenize import RegexpTokenizer
POST_PATTERN = r'''(?x) # set flag to allow verbose regexps
([A-Z].)+ # abbreviations, e.g. U.S.A.
| https?://[^s<>"]+|www.[^s<>"]+ # html links
| w+([-']w+)* # words with optional internal hyphens
| $?d+(.d+)?%? # currency and percentages, e.g. $12.40, 82%
| #w+b # hashtags
| @w+b # handles
'''
class MediaTokenizer(RegexpTokenizer):
''' regex tokenization class for tokenizing media posts given a pattern. '''
def __init__(self, tokPattern, **kwargs):
super(self.__class__, self).__init__(tokPattern, **kwargs)
def __call__(self, text):
return self.tokenize(text)
tweet_tokenizer = MediaTokenizer(POST_PATTERN)
print tweet_tokenizer('The quick brown fox jumped over the lazy dog.')

Vectorization
Sci-kit Learn’s Vectorizer Implemented
from ast import literal_eval
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
english_stemmer = SnowballStemmer('english')
class StemmedCountVectorizer(CountVectorizer):
''' stem words using english stemmer so they can be vectorized by count. '''
def build_analyzer(self):
analyzer = super(StemmedCountVectorizer, self).build_analyzer()
return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))
config = {'encoding': 'utf-8', 'decode_error': 'strict', 'strip_accents': 'ascii',
'ngram_range': '(1,2)', 'stop_words': 'english', 'lowercase': True, 'min_df': 5,
'max_df': 0.8, 'binary': False, 'smooth_idf': False}
vectorizer = StemmedCountVectorizer(min_df=config['min_df'], max_df=config['max_df'],
encoding=config['encoding'], binary=config['binary'],
lowercase=config['lowercase'],
strip_accents=config['strip_accents'],
stop_words=config['stop_words'],
ngram_range=literal_eval(config['ngram_range']),
smooth_idf=config['smooth_idf'],
tokenizer=tweet_tokenizer # FROM LAST SLIDE!
# NOTE: tokenizer MUST have __call__()
)
vec_posts = vectorizer.fit_transform(posts)

Clustering
What is KMeans?
• Clustering algorithm that segments data into k clusters

Clustering
What is KMeans?
• Nondeterministic: diﬀerent starting values may result in a diﬀerent
assignment of points to clusters

Clustering
What is KMeans?
• Run the k-means algorithm several times and then compare the results

Clustering
What is KMeans?
• This assumes you have time to do this!

Clustering
What is KMeans?
• Might be simpler to change tokenization and vectorization methods

Clustering
What is KMeans?
• Might be simpler to change tokenization and vectorization methods
Algorithm [Janert (2010), p. 662-663]
choose initial positions for the cluster centroids
repeat:
for each point:
calculate its distance from each cluster centroid
assign the point to the nearest cluster
recalculate the positions of the cluster centroids

Clustering How is it implemented?
import scipy as sp, sys, yaml
from sklearn.cluster import KMeans
seed = 2
sp.random.seed(seed) # to reproduce the data later on
def train_cluster_model(posts, configDoc='prelim.yaml', tokenizer=None,
vectorizer_type=StemmedCountVectorizer):
try:
config = yaml.load(open(configDoc))
except IOError, ie:
sys.stderr.write("Can't open config file: %s" % str(ie))
sys.exit(1)
if not tokenizer:
tokenizer = MediaTokenizer(POST_PATTERN)
vectorizer = vectorizer_type(
min_df=config['min_df'],
max_df=config['max_df'],
encoding=config['encoding'],
lowercase=config['lowercase'],
strip_accents=config['strip_accents'],
stop_words=config['stop_words'],
ngram_range=literal_eval(
config['ngram_range']),
smooth_idf=config['smooth_idf'],
tokenizer=tokenizer)
vec_posts = vectorizer.fit_transform(posts)
cls_model = KMeans(n_clusters=2, init='k-means++', n_jobs=2)
cls_model.fit(vec_posts)
return {'model':cls_model, 'vectorizer': vectorizer}

Clustering Model Testing
import cPickle as pickle, sys, yaml
from scipy.spatial.distance import euclidean
def test_model(posts_path, cls_mod_path, vectorizer_path, yaml_filepath):
orig, posts = vectorize_posts(posts_path, vectorizer_path)
try:
config = yaml.load(open(yaml_filepath))
except IOError, ie:
sys.stderr.write("Can't open yaml file: %s" % str(ie))
sys.exit(1)
vectorizer = pickle.load(open(vectorizer_path, 'rb'))
vec_posts = vectorizer.transform(posts)
cls_model = pickle.load(open(cls_mod_path, 'rb'))
cls_labels = cls_model.predict(vec_posts).tolist()
dists = [None] * len(cls_labels)
for i, label in enumerate(cls_labels):
dists[i] = euclidean(vec_posts.getrow(i).toarray(),
cls_model.cluster_centers_[label])
for t, l, d in zip(orig, cls_labels, dists):
print '{0}t{1}t{2:.6f}'.format(t, l, d)

Model Diagnostics Top Terms Per Cluster
def top_terms_per_cluster(km, vectorizer, outFile, k=2, topNTerms=10):
''' print top terms from each cluster '''
from warnings import warn, simplefilter
''' NOTE: ignore the following (annoying) deprecation warning:
/Library/Python/2.7/site-packages/sklearn/utils/__init__.py:94:
DeprecationWarning: Function fixed_vocabulary is deprecated;
The `fixed_vocabulary` attribute is deprecated and will be removed in 0.18.
Please use `fixed_vocabulary_` instead. '''
simplefilter('ignore', DeprecationWarning)
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
# check to see if top n terms is beyond centroid length
centroid_vec_length = order_centroids[0,].shape[0]
if topNTerms > centroid_vec_length:
warn('Top n terms parameter exceed centroid vector length!')
warn('Switching to centroid vector length: %d' % centroid_vec_length)
topNTerms = centroid_vec_length
terms = vectorizer.get_feature_names()
with open(outFile, 'w') as topFeatsFile:
topFeatsFile.write("Top terms per cluster:n")
for i in range(k):
topFeatsFile.write("Cluster %d:n" % (i + 1))
for ind in order_centroids[i, :topNTerms]:
topFeatsFile.write(" %sn" % terms[ind])
topFeatsFile.write('n')

Model Diagnostics Model Visualization [Bari (2014)]
>>> from sklearn.decomposition import PCA
>>> from sklearn.cluster import KMeans
>>> import pylab as pl
>>> pca = PCA(n_components=2).fit(vectorized_posts)
>>> pca_2d = pca.transform(vectorized_posts)
>>> pl.figure('Reference Plot')
>>> pl.scatter(pca_2d[:, 0], pca_2d[:, 1], c=vectorized_posts_targets)
>>> kmeans = KMeans(n_clusters=2) # REFER TO PRECEDING SLIDES
>>> kmeans.fit(vectorized_posts)
>>> pl.figure('K-means with 2 clusters')
>>> pl.scatter(pca_2d[:, 0], pca_2d[:, 1], c=kmeans.labels_)
>>> pl.show()

Roads Not Taken
• Batch vs. Stream Processing

Roads Not Taken
• Batch KMeans (sklearn.cluster.MiniBatchKMeans)

Roads Not Taken
• Other types of vectorization and tokenization

Roads Not Taken
• Using unsupervised machine learning as a segue to a supervised solution

Roads Not Taken
• Using unsupervised machine learning as a segue to a supervised solution
• What happened in the end with the client?

References
A. Bari, M. Chaouchi and T. Jung.
Predictive Analytics for Dummies (1st Edition).
For Dummies, 2014.
B. Bengfort, R. Bilbro and T. Ojeda
Applied Text Analysis with Python.
O’Reilly Media, 2016.
Philipp K. Janert.
Data Analysis with Open Source Tools.
O’Reilly Media, 2010.
W. Richert and L. Pedro Coelho
Building Machine Learning Systems with Python.
Packt Publishing, 2014.

Finding Relevant Tweets in Social Media

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Ähnlich wie Finding Relevant Tweets in Social Media

Ähnlich wie Finding Relevant Tweets in Social Media (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Finding Relevant Tweets in Social Media