1. Unleashing Twitter Data for fun and insight
Matthew A. Russell
http://linkedin.com/in/ptwobrussell
@ptwobrussell
2. Mining the Social Web (Agile Data Solutions)
3. Mining the Social Web
Chapters 1-5
Introduction: Trends, Tweets, and Twitterers
Microformats: Semantic Markup and Common Sense Collide
Mailboxes: Oldies but Goodies
Friends, Followers, and Setwise Operations
Twitter: The Tweet, the Whole Tweet, and Nothing but the Tweet
4. Mining the Social Web
Chapters 6-10
LinkedIn: Clustering Your Professional Network For Fun (and Profit?)
Google Buzz: TF-IDF, Cosine Similarity, and Collocations
Blogs et al: Natural Language Processing (and Beyond)
Facebook: The All-In-One Wonder
The Semantic Web: A Cocktail Discussion
5. Overview
• Trends, Tweets, and Retweet Visualizations
• Friends, Followers, and Setwise Operations
• The Tweet, the Whole Tweet, and Nothing but the Tweet
6. Insight Matters
• What is @user's potential influence?
• What are @user's passions right now?
• Who are @user's most trusted friends?
7. Part 1: Tweets, Trends, and Retweet Visualizations
8. A point to ponder:
Twitter : Data :: JavaScript : Programming Languages (???)
19. Search Results (continued)
"profile_image_url": "http://a1.twimg.com/profile_images/80...",
"source": "<a href="http://twitter.com/&quo...",
"text": "im nt gonna go to sleep happy unless i see ...",
"to_user_id": null
}
... output truncated - 99 more tweets ...
],
"results_per_page": 100,
"since_id": 0
},
... output truncated - 4 more pages ...
]
20. Lexical Diversity
• Ratio of unique terms to total terms
• A measure of "stickiness"?
• A measure of "group think"?
• A crude indicator of retweets to originally authored tweets?
21. Distilling Tweet Text
>>> # search_results is already defined
>>> tweets = [ r['text']
... for result in search_results
... for r in result['results'] ]
>>> words = []
>>> for t in tweets:
... words += [ w for w in t.split() ]
...
22. Analyzing Data
23. Lexical Diversity
>>> len(words)
7238
>>> # unique words
>>> len(set(words))
1636
>>> # lexical diversity
>>> 1.0*len(set(words))/len(words)
0.22602928985907708
>>> # average number of words per tweet
>>> 1.0*sum([ len(t.split()) for t in tweets ])/len(tweets)
14.476000000000001
24. Size Frequency Matters
• Counting: always the first step
• Simple but effective
• NLTK saves us a little trouble
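NLTK's FreqDist does the counting bookkeeping for you; as a sketch of the same idea using only the standard library, collections.Counter computes identical term frequencies (the word list here is toy data, since the harvested tweets aren't reproduced on the slide):

```python
from collections import Counter

# Toy stand-in for the words harvested from tweets (hypothetical data)
words = "the quick brown fox jumps over the lazy dog the end".split()

freq = Counter(words)       # term -> count, same idea as nltk.FreqDist(words)
top3 = freq.most_common(3)  # most frequent terms first
print(top3)
```

Counting like this is "always the first step" because the frequency distribution immediately exposes stopwords, hashtags, and retweet markers worth filtering.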
27. Tweet and RT were sitting on a fence.
Tweet fell off. Who was left?
28. RTs: past, present, & future
• Retweet: Tweeting a tweet that's already been tweeted
• RT or via followed by @mention
• Example: RT @SocialWebMining Justin Bieber is on SNL 2nite. w00t?!?
• Relatively new APIs were rolled out last year for retweeting sans
conventions
29. Some people, when confronted with a problem, think "I know,
I'll use regular expressions." Now they have two
problems. -- Jamie Zawinski
30. Parsing Retweets
>>> example_tweets = ["Visualize Twitter search results w/ this simple script http://bit.ly/cBu0l4 - Gist instructions http://bit.ly/9SZ2kb (via @SocialWebMining @ptwobrussell)"]
>>> import re
>>> rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)",
...     re.IGNORECASE)
>>> rt_origins = []
>>> for t in example_tweets:
...     try:
...         rt_origins += [mention.strip()
...             for mention in rt_patterns.findall(t)[0][1].split()]
...     except IndexError, e:
...         pass
...
>>> [rto.strip("@") for rto in rt_origins]
['SocialWebMining', 'ptwobrussell']
41. Insight Matters
• What is my potential influence?
• Who are the most popular people in my network?
• Who are my mutual friends?
• What common friends/followers do I have with @user?
• Who is not following me back?
• What can I learn from analyzing my friendship cliques?
42. Getting Data
43. OAuth (1.0a)
import twitter
from twitter.oauth_dance import oauth_dance
# Get these from http://dev.twitter.com/apps/new
consumer_key, consumer_secret = 'key', 'secret'
(oauth_token, oauth_token_secret) = oauth_dance('MiningTheSocialWeb',
    consumer_key, consumer_secret)
auth = twitter.oauth.OAuth(oauth_token, oauth_token_secret,
    consumer_key, consumer_secret)
t = twitter.Twitter(domain='api.twitter.com', auth=auth)
44. Getting Friendship Data
friend_ids = t.friends.ids(screen_name='timoreilly', cursor=-1)
follower_ids = t.followers.ids(screen_name='timoreilly', cursor=-1)
# store the data somewhere...
47. Rate Limits
• 350 requests/hr for authenticated requests
• 150 requests/hr for anonymous requests
• Coping mechanisms:
• Caching & Archiving Data
• Streaming API
• HTTP 400 codes
• See http://dev.twitter.com/pages/rate-limiting
48. The Beloved Fail Whale
• Twitter is sometimes "overcapacity"
• HTTP 503 Error
• Handle it just as any other HTTP error
• RESTfulness has its advantages
49. Abstraction Helps
friend_ids = []
wait_period = 2  # secs
cursor = -1
while cursor != 0:
    response = makeTwitterRequest(t,  # twitter.Twitter instance
        t.friends.ids,
        screen_name=screen_name,
        cursor=cursor)
    friend_ids += response['ids']
    cursor = response['next_cursor']
    # break out of loop early if you don't need all ids
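The slides use makeTwitterRequest without showing it. One plausible sketch, assuming its only job is to retry transient failures (rate limiting, fail-whale 503s) with a growing wait between attempts; the signature mirrors how the slides call it, and the `t` argument is accepted for parity but unused here:

```python
import time

def makeTwitterRequest(t, request, max_errors=3, wait_period=2, **kwargs):
    # Retry a flaky API call, doubling the wait after each failure.
    # In practice you'd catch twitter.api.TwitterHTTPError specifically.
    errors = 0
    while True:
        try:
            return request(**kwargs)
        except Exception:
            errors += 1
            if errors > max_errors:
                raise  # give up after max_errors attempts
            time.sleep(wait_period)
            wait_period *= 2
```

Centralizing the retry logic here means every harvesting loop in the deck gets rate-limit handling for free.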
50. Abstracting Abstractions
screen_name = 'timoreilly'
# This is what you ultimately want...
friend_ids = getFriends(screen_name)
follower_ids = getFollowers(screen_name)
51. Storing Data
54. A relational database?
import sqlite3 as sqlite
conn = sqlite.connect('data.db')
c = conn.cursor()
c.execute('''create table
friends...''')
c.execute('''insert into friends...
''')
# Lots of fun...sigh...
55. Redis (A Data Structures Server)
import redis
r = redis.Redis()
[ r.sadd("timoreilly$friend_ids", i) for i in friend_ids ]
r.smembers("timoreilly$friend_ids") # returns a set
Project page: http://redis.io
Windows binary: http://code.google.com/p/servicestack/wiki/RedisWindowsDownload
56. Redis Set Operations
• Key/value store...on typed values!
• Common set operations
• smembers, scard
• sinter, sdiff, sunion
• sadd, srem, etc.
• See http://code.google.com/p/redis/wiki/CommandReference
• Don't forget to $ easy_install redis
57. Analyzing Data
60. Count Your Blessings
# A utility function
def getRedisIdByScreenName(screen_name, key_name):
    return 'screen_name$' + screen_name + '$' + key_name
# Number of friends
n_friends = r.scard(getRedisIdByScreenName(screen_name, 'friend_ids'))
# Number of followers
n_followers = r.scard(getRedisIdByScreenName(screen_name, 'follower_ids'))
61. Asymmetric Relationships
# Friends who aren't following back
friends_diff_followers = r.sdiffstore('temp', [
getRedisIdByScreenName(screen_name, 'friend_ids'),
getRedisIdByScreenName(screen_name, 'follower_ids')
])
# ... compute interesting things ...
r.delete('temp')
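To make the set algebra explicit, the same asymmetric-relationship queries with plain Python sets and hypothetical toy ids (Redis's SDIFF, SINTER, and SUNION mirror these operators):

```python
# Hypothetical toy ids
friend_ids = {1, 2, 3, 4}
follower_ids = {3, 4, 5}

not_following_back = friend_ids - follower_ids  # like SDIFF
mutual_friends = friend_ids & follower_ids      # like SINTER
everyone = friend_ids | follower_ids            # like SUNION

print(not_following_back, mutual_friends)
```

Doing this server-side in Redis avoids shipping a million-plus follower ids over the wire just to intersect them.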
64. Sample Output
timoreilly is following 663
timoreilly is being followed by 1,423,704
131 of 663 are not following timoreilly back
1,423,172 of 1,423,704 are not being followed back by
timoreilly
timoreilly has 532 mutual friends
65. Who Isn't Following Back?
user_ids = [ ... ]  # Resolve these to user info objects
while len(user_ids) > 0:
    user_ids_str = ','.join([str(i) for i in user_ids[:100]])
    user_ids = user_ids[100:]
    response = t.users.lookup(user_id=user_ids_str)
    if type(response) is dict:  # only one user came back
        response = [response]
    r.mset(dict([(getRedisIdByUserId(resp['id'], 'info.json'),
        json.dumps(resp)) for resp in response]))
    r.mset(dict([(getRedisIdByScreenName(resp['screen_name'], 'info.json'),
        json.dumps(resp)) for resp in response]))
66. Friends in Common
# Assume we've harvested friends/followers and it's in Redis...
screen_names = ['timoreilly', 'mikeloukides']
r.sinterstore('temp$friends_in_common',
[getRedisIdByScreenName(screen_name, 'friend_ids')
for screen_name in screen_names])
r.sinterstore('temp$followers_in_common',
[getRedisIdByScreenName(screen_name,'follower_ids')
for screen_name in screen_names])
# Manipulate the sets
67. Potential Influence
• My followers?
• My followers' followers?
• My followers' followers' followers?
for n in range(1, 7):  # 6 degrees?
    print "My " + "followers' "*n + "followers?"
71. Breadth-First Traversal
Create an empty graph
Create an empty queue to keep track of unprocessed nodes
Add the starting point to the graph as the "root node"
Add the root node to a queue for processing
Repeat until some maximum depth is reached or the queue is empty:
    Remove a node from the queue
    For each of the node's neighbors:
        If the neighbor hasn't already been processed:
            Add it to the graph
            Add it to the queue
            Add an edge to the graph connecting the node & its neighbor
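The pseudocode above, sketched as runnable Python; a hypothetical adjacency dict stands in for the follower API so the traversal itself is testable:

```python
from collections import deque

# Hypothetical follower graph standing in for API calls
graph_data = {
    'timoreilly': ['a', 'b'],
    'a': ['c'],
    'b': ['c', 'd'],
    'c': [], 'd': [],
}

def get_neighbors(node):
    return graph_data.get(node, [])

def bfs(root, max_depth):
    # Depth-limited breadth-first traversal collecting nodes and edges
    edges, seen = [], set([root])
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for neighbor in get_neighbors(node):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
                edges.append((node, neighbor))
    return seen, edges

nodes, edges = bfs('timoreilly', 2)
```

Swapping `get_neighbors` for a rate-limited `getFollowers` call turns this sketch into the harvest on the next slide.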
72. Breadth-First Harvest
next_queue = ['timoreilly']  # seed node
d = 1
while d < depth:
    d += 1
    queue, next_queue = next_queue, []
    for screen_name in queue:
        follower_ids = getFollowers(screen_name=screen_name)
        next_queue += follower_ids
    getUserInfo(user_ids=next_queue)  # resolve ids to user info for next round
73. The Most Popular Followers
freqs = {}
for follower in followers:
    cnt = follower['followers_count']
    if not freqs.has_key(cnt):
        freqs[cnt] = []
    freqs[cnt].append({'screen_name': follower['screen_name'],
        'user_id': follower['id']})
popular_followers = sorted(freqs, reverse=True)[:100]
74. Average # of Followers
all_freqs = [k for k in freqs for user in freqs[k]]
avg = sum(all_freqs) / len(all_freqs)
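With toy data the computation works out like this; note that the comprehension repeats each follower count once per user who has it, so the average is weighted by users, not by distinct counts:

```python
# Hypothetical frequency map: followers_count -> users with that count
freqs = {1: ['a', 'b'], 10: ['c'], 100: ['d']}

all_freqs = [k for k in freqs for user in freqs[k]]  # one entry per user
avg = float(sum(all_freqs)) / len(all_freqs)         # 112 / 4
print(avg)
```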
75. @timoreilly's Popular Followers
The top 10 followers from the sample:
aplusk 4,993,072
BarackObama 4,114,901
mashable 2,014,615
MarthaStewart 1,932,321
Schwarzenegger 1,705,177
zappos 1,689,289
Veronica 1,612,827
jack 1,592,004
stephenfry 1,531,813
davos 1,522,621
76. Futzing the Numbers
• The average number of timoreilly's followers' followers: 445
• Discarding the top 10 lowers the average to around 300
• Discarding any follower with fewer than 10 followers of their own increases the average to over 1,000!
• Doing both brings the average to around 800
78. Friendship Graphs
for i in ids:  # ids is timoreilly's id along with friend ids
    info = json.loads(r.get(getRedisIdByUserId(i, 'info.json')))
    screen_name = info['screen_name']
    friend_ids = list(r.smembers(getRedisIdByScreenName(screen_name,
        'friend_ids')))
    for friend_id in [fid for fid in friend_ids if fid in ids]:
        friend_info = json.loads(r.get(getRedisIdByUserId(friend_id,
            'info.json')))
        g.add_edge(screen_name, friend_info['screen_name'])
nx.write_gpickle(g, 'timoreilly.gpickle')  # see also nx.read_gpickle
80. Calculating Cliques
cliques = [c for c in nx.find_cliques(g)]
num_cliques = len(cliques)
clique_sizes = [len(c) for c in cliques]
max_clique_size = max(clique_sizes)
avg_clique_size = sum(clique_sizes) / num_cliques
max_cliques = [c for c in cliques if len(c) == max_clique_size]
num_max_cliques = len(max_cliques)
people_in_every_max_clique = list(reduce(
lambda x, y: x.intersection(y),[set(c) for c in max_cliques]
))
81. Cliques for @timoreilly
Num cliques: 762573
Avg clique size: 14
Max clique size: 26
Num max cliques: 6
Num people in every max clique: 20
83. Graphs, etc
• Your first instinct is naturally
G = (V, E) ?
84. Dorling Cartogram
• A location-aware bubble chart (ish)
• At least 3-dimensional
• Position, color, size
• Look at friends/followers by state
85. Sunburst of Friends
• A very compact visualization
• Slice and dice friends/followers by
gender, country, locale, etc.
86. Part 3:
The Tweet, the Whole Tweet, and
Nothing but the Tweet
87. Insight Matters
• Which entities frequently appear in @user's tweets?
• How often does @user talk about specific friends?
• Who does @user retweet most frequently?
• How frequently is @user retweeted (by anyone)?
• How many #hashtags are usually in @user's tweets?
93. Entities & Annotations
• Entities
• Opt-in now but will "soon" be standard
• $ easy_install twitter_text
• Annotations
• User-defined metadata
• See http://dev.twitter.com/pages/annotations_overview
94. Manual Entity Extraction
import twitter_text
extractor = twitter_text.Extractor(tweet['text'])
mentions = extractor.extract_mentioned_screen_names_with_indices()
hashtags = extractor.extract_hashtags_with_indices()
urls = extractor.extract_urls_with_indices()
# Splice info into a tweet object
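twitter_text handles the gnarly edge cases in Twitter's entity rules; purely as an illustration of the shape of what the extractor returns, here is a naive regex version (a rough stand-in, not a substitute for the library, and the tweet is hypothetical):

```python
import re

def naive_entities(text):
    # Naive approximations of tweet entities; twitter_text is far stricter
    mentions = re.findall(r'@(\w+)', text)
    hashtags = re.findall(r'#(\w+)', text)
    urls = re.findall(r'https?://\S+', text)
    return mentions, hashtags, urls

tweet = 'RT @SocialWebMining Justin Bieber is on SNL 2nite. w00t?!? #fb http://bit.ly/9SZ2kb'
mentions, hashtags, urls = naive_entities(tweet)
print(mentions, hashtags, urls)
```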
98. As easy as sitting on the couch
• Get it - http://www.couchone.com/get
• Install it
• Relax - http://localhost:5984/_utils/
• Also - $ easy_install couchdb
99. Storing Timeline Data
import couchdb
import twitter
TIMELINE_NAME = "user"  # or "home" or "public"
MAX_PAGES = 5  # illustrative; defined elsewhere in the original script
DB = 'tweets-' + TIMELINE_NAME + '-timeline'  # illustrative database name
t = twitter.Twitter(domain='api.twitter.com', api_version='1')
server = couchdb.Server('http://localhost:5984')
db = server.create(DB)
page_num = 1
while page_num <= MAX_PAGES:
    api_call = getattr(t.statuses, TIMELINE_NAME + '_timeline')
    tweets = makeTwitterRequest(t, api_call, page=page_num)
    db.update(tweets, all_or_nothing=True)
    print 'Fetched %i tweets' % len(tweets)
    page_num += 1
106. Filtering Tweet Entities
• Let's find out how often someone talks about specific friends
• We have friend info on hand
• We've extracted @mentions from the tweets
• Let's count friend vs. non-friend mentions
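A sketch of that counting step, assuming the extracted @mention entities and the harvested friend screen names are already in hand (the data below is hypothetical):

```python
# Hypothetical extracted @mentions and known friend screen names
mentions = ['gnat', 'n2vip', 'slashdot', 'gnat']
friend_screen_names = set(['gnat', 'slashdot'])

friend_mentions = [m for m in mentions if m in friend_screen_names]
non_friend_mentions = [m for m in mentions if m not in friend_screen_names]

print(len(friend_mentions), len(non_friend_mentions))
```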
107. @timoreilly's friend mentions
Number of @user entities in tweets: 20
Number of @user entities in tweets who are not friends: 2
  n2vip, timoreilly
Number of @user entities in tweets who are friends: 18
  ahier, andrewsavikas, pkedrosky, gnat, CodeforAmerica, slashdot,
  nytimes, OReillyMedia, brady, dalepd, carlmalamud, mikeloukides,
  pahlkadot, monkchips, make, fredwilson, jamesoreilly, digiphile
112. Retweet Counts
• An API resource /statuses/retweet_count exists (and is now functional)
• Example: http://twitter.com/statuses/show/29016139807.json
• retweet_count
• retweeted
123. Hashtag Analysis
• TeaParty: ~ 5 hashtags per tweet.
• Example: “Rarely is the question asked: Is our children learning?” - G.W. Bush #p2 #topprog #tcot #tlot #teaparty #GOP #FF
• JustinBieber: ~ 2 hashtags per tweet
• Example: #justinbieber is so coool
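The per-tweet hashtag averages can be computed straight from the entity extraction; a sketch over two hypothetical tweets modeled on the examples above:

```python
import re

tweets = [
    'Is our children learning? #p2 #topprog #tcot #tlot #teaparty #GOP #FF',
    '#justinbieber is so coool',
]

counts = [len(re.findall(r'#\w+', t)) for t in tweets]  # hashtags per tweet
avg_hashtags = float(sum(counts)) / len(tweets)
print(counts, avg_hashtags)
```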
128. Juxtaposing Friendships
• Harvest search results for #JustinBieber and #TeaParty
• Get friend ids for each @mention with /friends/ids
• Resolve screen names with /users/lookup
• Populate a NetworkX graph
• Analyze it
• Visualize with Graphviz