Unleashing Twitter Data
     for fun and insight

Matthew A. Russell
http://linkedin.com/in/ptwobrussell
@ptwobrussell

                                      Agile Data Solutions | Mining the Social Web
Happy Groundhog Day!
Mining the Social Web
          Chapters 1-5
Introduction: Trends, Tweets, and Twitterers
Microformats: Semantic Markup and Common Sense Collide
Mailboxes: Oldies but Goodies
Friends, Followers, and Setwise Operations
Twitter: The Tweet, the Whole Tweet, and
Nothing but the Tweet
Mining the Social Web
            Chapters 6-10

LinkedIn: Clustering Your Professional Network For Fun (and
Profit?)
Google Buzz: TF-IDF, Cosine Similarity, and Collocations
Blogs et al: Natural Language Processing (and Beyond)
Facebook: The All-In-One Wonder
The Semantic Web: A Cocktail Discussion
Overview


• Trends, Tweets, and Retweet Visualizations
• Friends, Followers, and Setwise Operations
• The Tweet, the Whole Tweet, and Nothing but the Tweet
Insight Matters


• What is @user's potential influence?
• What are @user's passions right now?
• Who are @user's most trusted friends?
Part 1:
Tweets, Trends, and Retweet
       Visualizations



A point to ponder:
Twitter : Data :: JavaScript : Programming Languages (???)
Getting Ready To Code




Python Installation


• Mac users already have it
• Linux users probably have it
• Windows users should grab ActivePython
easy_install
• Installs packages from PyPI
• Get it:
  • http://pypi.python.org/pypi/setuptools
  • Ships with ActivePython
• It really is easy:
 easy_install twitter
 easy_install nltk
 easy_install networkx
Git It?
• http://github.com/ptwobrussell/Mining-the-Social-Web
• git clone git://github.com/ptwobrussell/Mining-the-Social-Web.git
 • introduction__*.py
 • friends_followers__*.py
 • the_tweet__*.py
Getting Data




Twitter Data Sources


• Twitter API Resources
• GNIP
• Infochimps
• Library of Congress
Trending Topics

>>>   import twitter # Remember to "easy_install twitter"
>>>   twitter_search = twitter.Twitter(domain="search.twitter.com")
>>>   trends = twitter_search.trends()
>>>   [ trend['name'] for trend in trends['trends'] ]

[u'#ZodiacFacts', u'#nowplaying', u'#ItsOverWhen',
 u'#Christoferdrew', u'Justin Bieber', u'#WhatwouldItBeLike',
 u'#Sagittarius', u'SNL', u'#SurveySays', u'#iDoit2']
Search Results


>>> search_results = []
>>> for page in range(1,6):
...   search_results.append(twitter_search.search(q="SNL",rpp=100, page=page))
Search Results (continued)
 >>> import json
 >>> print json.dumps(search_results, sort_keys=True, indent=1)
 [
   {
     "completed_in": 0.088122000000000006,
     "max_id": 11966285265,
     "next_page": "?page=2&max_id=11966285265&rpp=100&q=SNL",
     "page": 1,
     "query": "SNL",
     "refresh_url": "?since_id=11966285265&q=SNL",

   ...more...
Search Results (continued)
  "results": [
   {
     "created_at": "Sun, 11 Apr 2010 01:34:52 +0000",
     "from_user": "bieber_luv2",
     "from_user_id": 106998169,
     "geo": null,
     "id": 11966285265,
     "iso_language_code": "en",
     "metadata": {
      "result_type": "recent"
     },
     ...more...
Search Results (continued)
       "profile_image_url": "http://a1.twimg.com/profile_images/80...",
       "source": "<a href="http://twitter.com/&quo...",
       "text": "im nt gonna go to sleep happy unless i see ...",
       "to_user_id": null
       }
       ... output truncated - 99 more tweets ...
     ],
     "results_per_page": 100,
     "since_id": 0
    },
    ... output truncated - 4 more pages ...
]
Lexical Diversity

• Ratio of unique terms to total terms
  • A measure of "stickiness"?
  • A measure of "group think"?
  • A crude indicator of retweets to originally authored tweets?
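As a self-contained sketch of the computation (the sample tweets here are invented, not live search results):

```python
# Lexical diversity: ratio of unique terms to total terms.
tweets = [
    "SNL tonight with Justin Bieber",        # hypothetical sample data
    "cant wait for SNL tonight",
    "watching SNL with tina fey tonight",
]

words = [w.lower() for t in tweets for w in t.split()]

lexical_diversity = len(set(words)) / len(words)
avg_words_per_tweet = sum(len(t.split()) for t in tweets) / len(tweets)

print(lexical_diversity)      # 11 unique words / 16 total
print(avg_words_per_tweet)    # 16 words / 3 tweets
```

A low ratio suggests many repeated terms, which for search results often means heavy retweeting of the same few tweets.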
Distilling Tweet Text
 >>> # search_results is already defined

 >>> tweets = [ r['text'] 
 ...     for result in search_results 
 ...         for r in result['results'] ]

 >>> words = []

 >>> for t in tweets:
 ...     words += [ w for w in t.split() ]
 ...
Analyzing Data




Lexical Diversity
 >>> len(words)
 7238

 >>> # unique words
 >>> len(set(words))
 1636

 >>> # lexical diversity
 >>> 1.0*len(set(words))/len(words)
 0.22602928985907708

 >>> # average number of words per tweet
 >>> 1.0*sum([ len(t.split()) for t in tweets ])/len(tweets)
 14.476000000000001
Size Frequency Matters


• Counting: always the first step
• Simple but effective
• NLTK saves us a little trouble
Frequency Analysis
 >>> import nltk
 >>> freq_dist = nltk.FreqDist(words)
 >>> freq_dist.keys()[:50] #50 most frequent tokens

 [u'snl', u'on', u'rt', u'is', u'to', u'i', u'watch', u'justin',
  u'@justinbieber', u'be', u'the', u'tonight', u'gonna', u'at',
  u'in', u'bieber', u'and', u'you', u'watching', u'tina', u'for',
  u'a', u'wait', u'fey', u'of', u'@justinbieber:', u'if', u'with',
  u'so', u"can't", u'who', u'great', u'it', u'going', u'im', u':)',
  u'snl...', u'2nite...', u'are', u'cant', u'dress', u'rehearsal',
  u'see', u'that', u'what', u'but', u'tonight!', u':d', u'2',
  u'will']
Frequency Visualization
Tweet and RT were sitting on a fence.
    Tweet fell off. Who was left?
RTs: past, present, & future


• Retweet: Tweeting a tweet that's already been tweeted
• RT or via followed by @mention
• Example: RT @SocialWebMining Justin Bieber is on SNL 2nite. w00t?!?
• Relatively new APIs were rolled out last year for retweeting sans
 conventions
Some people, when confronted with a problem, think "I know,
   I'll use regular expressions." Now they have two
                 problems. -- Jamie Zawinski
Parsing Retweets
 >>> example_tweets = ["Visualize Twitter search results w/ this simple script
 http://bit.ly/cBu0l4 - Gist instructions http://bit.ly/9SZ2kb (via
 @SocialWebMining @ptwobrussell)"]

 >>> import re
 >>> rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)",
 ...                          re.IGNORECASE)

 >>> rt_origins = []
 >>> for t in example_tweets:
 ...    try:
 ...         rt_origins += [mention.strip() 
 ...         for mention in rt_patterns.findall(t)[0][1].split()]
 ...    except IndexError, e:
 ...         pass

 >>> [rto.strip("@") for rto in rt_origins]
Visualizing Data




Graph Construction

 >>> import networkx as nx
 >>> g = nx.DiGraph()
 >>> g.add_edge("@SocialWebMining", "@ptwobrussell", 
 ...            {"tweet_id" : 4815162342},)
Writing out DOT
OUT_FILE = "out_file.dot"

try:
    nx.drawing.write_dot(g, OUT_FILE)
except ImportError, e:
    dot = ['"%s" -> "%s" [tweet_id=%s]' %
           (n1, n2, g[n1][n2]['tweet_id']) for n1, n2 in g.edges()]

    f = open(OUT_FILE, 'w')
    f.write('strict digraph {\n%s\n}' % (';\n'.join(dot),))
    f.close()
Example DOT Language

 strict digraph {
   "@ericastolte" -> "bonitasworld" [tweet_id=11965974697];
   "@mpcoelho" ->"Lil_Amaral" [tweet_id=11965954427];
   "@BieberBelle123" -> "BELIEBE4EVER" [tweet_id=11966261062];
   "@BieberBelle123" -> "sabrina9451" [tweet_id=11966197327];
 }
DOT to Image


• Download Graphviz: http://www.graphviz.org/
• $ dot -Tpng out_file.dot > graph.png
• Windows users might prefer GVEdit
Graphviz: Extreme Closeup
But you want more sexy?
Protovis: Extreme Closeup




It Doesn't Have To Be a Graph

                Graph Connectedness
Part 2:
Friends, Followers, and Setwise
          Operations



Insight Matters

• What is my potential influence?
• Who are the most popular people in my network?
• Who are my mutual friends?
• What common friends/followers do I have with @user?
• Who is not following me back?
• What can I learn from analyzing my friendship cliques?
Getting Data



OAuth (1.0a)
import twitter
from twitter.oauth_dance import oauth_dance

# Get these from http://dev.twitter.com/apps/new
consumer_key, consumer_secret = 'key', 'secret'

(oauth_token, oauth_token_secret) = oauth_dance('MiningTheSocialWeb',
                                       consumer_key, consumer_secret)

auth=twitter.oauth.OAuth(oauth_token, oauth_token_secret,
                         consumer_key, consumer_secret)

t = twitter.Twitter(domain='api.twitter.com', auth=auth)
Getting Friendship Data


 friend_ids = t.friends.ids(screen_name='timoreilly', cursor=-1)
 follower_ids = t.followers.ids(screen_name='timoreilly', cursor=-1)

 # store the data somewhere...
Perspective: Fetching all of Lady Gaga's
~7M followers would take ~4 hours
But there's always a catch...
Rate Limits
• 350 requests/hr for authenticated requests
• 150 requests/hr for anonymous requests
• Coping mechanisms:
  • Caching & Archiving Data
  • Streaming API
  • HTTP 400 codes
• See http://dev.twitter.com/pages/rate-limiting
The Beloved Fail Whale


 • Twitter is sometimes "overcapacity"
 • HTTP 503 Error
 • Handle it just as any other HTTP error
 • RESTfulness has its advantages
Abstraction Helps
 friend_ids = []
 wait_period = 2 # secs
 cursor = -1

 while cursor != 0:
     response = makeTwitterRequest(t, # twitter.Twitter instance
                                   t.friends.ids,
                                   screen_name=screen_name,
                                   cursor=cursor)

     friend_ids += response['ids']
     cursor = response['next_cursor']
     # break out of loop early if you don't need all ids
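The makeTwitterRequest helper used above isn't defined on these slides; a minimal sketch of what such a wrapper might do (the name, signature, and backoff constants are assumptions, not the book's actual implementation):

```python
import time

def make_twitter_request(api_call, max_errors=5, wait_period=2, **kwargs):
    """Call api_call(**kwargs), retrying with exponential backoff when
    the call raises (e.g. on rate-limit 400s or overcapacity 503s)."""
    errors = 0
    while True:
        try:
            return api_call(**kwargs)
        except Exception:     # a real version would inspect the HTTP status
            errors += 1
            if errors > max_errors:
                raise
            time.sleep(wait_period)
            wait_period *= 2  # exponential backoff
```

A production version would distinguish rate-limit responses (sleep until the reset time from the rate-limit headers) from transient 503s (back off and retry).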
Abstracting Abstractions
 screen_name = 'timoreilly'

 # This is what you ultimately want...

 friend_ids = getFriends(screen_name)
 follower_ids = getFollowers(screen_name)
Storing Data



Flat Files?
  ./
  screen_name1/
      friend_ids.json
      follower_ids.json
      user_info.json

  screen_name2/
      ...

  ...
Pickles?
import cPickle

o = {
    'friend_ids'   : friend_ids,
    'follower_ids' : follower_ids,
    'user_info'    : user_info
}

f = open('screen_name1.pickle', 'wb')
cPickle.dump(o, f)
f.close()
A relational database?
 import sqlite3 as sqlite

 conn = sqlite.connect('data.db')
 c = conn.cursor()

 c.execute('''create table
              friends...''')


 c.execute('''insert into friends...
 ''')


 # Lots of fun...sigh...
Redis (A Data Structures Server)


  import redis

  r = redis.Redis()

  [ r.sadd("timoreilly$friend_ids", i) for i in friend_ids ]

  r.smembers("timoreilly$friend_ids") # returns a set


         Project page: http://redis.io
         Windows binary: http://code.google.com/p/servicestack/wiki/RedisWindowsDownload
Redis Set Operations
• Key/value store...on typed values!
• Common set operations
  • smembers, scard
  • sinter, sdiff, sunion
  • sadd, srem, etc.
• See http://code.google.com/p/redis/wiki/CommandReference
• Don't forget to $ easy_install redis
Analyzing Data



Setwise Operations

• Union
• Intersection
• Difference
• Complement
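Before reaching for Redis, the same setwise logic can be seen with plain Python sets (the id values here are made up):

```python
friend_ids   = {1, 2, 3, 4, 5}   # hypothetical user ids
follower_ids = {4, 5, 6, 7}

mutual        = friend_ids & follower_ids   # intersection: mutual friends
not_following = friend_ids - follower_ids   # friends not following back
not_followed  = follower_ids - friend_ids   # followers not friended back
everyone      = friend_ids | follower_ids   # union of both sets

print(sorted(mutual))   # [4, 5]
```

Redis's sinterstore/sdiffstore/sunionstore perform exactly these operations, but server-side and persisted, which matters once the sets hold millions of ids.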
Venn Diagrams

[Venn diagram: overlapping Friends and Followers circles, showing the
 regions Friends - Followers, Followers - Friends, and Friends ∪ Followers]
Count Your Blessings
# A utility function
def getRedisIdByScreenName(screen_name, key_name):
    return 'screen_name$' + screen_name + '$' + key_name


# Number of friends
n_friends = r.scard(getRedisIdByScreenName(screen_name,
                                           'friend_ids'))

# Number of followers
n_followers = r.scard(getRedisIdByScreenName(screen_name,
                                             'follower_ids'))
Asymmetric Relationships


# Friends who aren't following back
friends_diff_followers = r.sdiffstore('temp', [
                 getRedisIdByScreenName(screen_name, 'friend_ids'),
                 getRedisIdByScreenName(screen_name, 'follower_ids')
                 ])
# ... compute interesting things ...
r.delete('temp')
Asymmetric Relationships


# Followers who aren't friended
followers_diff_friends = r.sdiffstore('temp', [
                  getRedisIdByScreenName(screen_name, 'follower_ids'),
                  getRedisIdByScreenName(screen_name, 'friend_ids')
                  ])
# ... compute interesting things ...
r.delete('temp')
Symmetric Relationships

 mutual_friends = r.sinterstore('temp', [
         getRedisIdByScreenName(screen_name, 'follower_ids'),
         getRedisIdByScreenName(screen_name, 'friend_ids')
         ])
 # ... compute interesting things ...
 r.delete('temp')
Sample Output

 timoreilly is following 663

 timoreilly is being followed by 1,423,704

 131 of 663 are not following timoreilly back

 1,423,172 of 1,423,704 are not being followed back by
 timoreilly

 timoreilly has 532 mutual friends
Who Isn't Following Back?
 user_ids = [ ... ] # Resolve these to user info objects

 while len(user_ids) > 0:
   user_ids_str = ','.join([ str(i) for i in user_ids[:100] ])
   user_ids = user_ids[100:]

   response = t.users.lookup(user_id=user_ids_str)

   if type(response) is dict: response = [response]
   r.mset(dict([(getRedisIdByUserId(resp['id'], 'info.json'), json.dumps(resp))
                for resp in response]))

   r.mset(dict([(getRedisIdByScreenName(resp['screen_name'],'info.json'),
                json.dumps(resp)) for resp in response]))
Friends in Common
# Assume we've harvested friends/followers and it's in Redis...
screen_names = ['timoreilly', 'mikeloukides']

r.sinterstore('temp$friends_in_common',
              [getRedisIdByScreenName(screen_name, 'friend_ids')
              for screen_name in screen_names])

r.sinterstore('temp$followers_in_common',
              [getRedisIdByScreenName(screen_name,'follower_ids')
              for screen_name in screen_names])

# Manipulate the sets
Potential Influence

• My followers?
• My followers' followers?
• My followers' followers' followers?
• for n in range(1, 7): # 6 degrees?
      print "My " + "followers' "*n + "followers?"
Saving a Thousand Words...




[Figure: a complete tree with branching factor 2 and depth 3;
 nodes 1 through 15, root node 1 at the top]
Same Data, Different Layout
[Figure: the same 15-node tree drawn with a radial layout,
 root node 1 at the center]
Space Complexity
                        Depth
                 1     2     3      4      5

            2    3     7    15     31     63
Branching   3    4    13    40    121    364
 Factor     4    5    21    85    341   1365
            5    6    31   156    781   3906
            6    7    43   259   1555   9331
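Each cell in the table is the total node count of a complete tree, a geometric series that can be computed directly:

```python
def tree_size(branching_factor, depth):
    # 1 + b + b**2 + ... + b**depth == (b**(depth + 1) - 1) / (b - 1)
    return (branching_factor ** (depth + 1) - 1) // (branching_factor - 1)

print(tree_size(2, 3))   # 15
print(tree_size(6, 5))   # 9331, the bottom-right cell of the table
```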
Breadth-First Traversal
Create an empty graph
Create an empty queue to keep track of unprocessed nodes

Add the starting point to the graph as the "root node"
Add the root node to a queue for processing

Repeat until some maximum depth is reached or the queue is empty:
  Remove a node from queue
  For each of the node's neighbors:
    If the neighbor hasn't already been processed:
      Add it to the graph
      Add it to the queue
      Add an edge to the graph connecting the node & its neighbor
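A direct translation of this pseudocode into Python (get_neighbors stands in for a follower-fetching API call, and the toy follower map below is invented; the resulting edge list could feed a NetworkX DiGraph):

```python
from collections import deque

def breadth_first_edges(root, get_neighbors, max_depth):
    """Collect (node, neighbor) edges by BFS from root, up to max_depth hops."""
    edges = []
    seen = {root}
    queue = deque([(root, 0)])          # unprocessed nodes with their depth
    while queue:
        node, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for neighbor in get_neighbors(node):
            if neighbor not in seen:    # not already processed
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
                edges.append((node, neighbor))
    return edges

followers = {'timoreilly': ['a', 'b'], 'a': ['c'], 'b': [], 'c': []}
edges = breadth_first_edges('timoreilly', lambda n: followers.get(n, []), 2)
print(edges)   # [('timoreilly', 'a'), ('timoreilly', 'b'), ('a', 'c')]
```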
Breadth-First Harvest

 next_queue = [ 'timoreilly' ] # seed node
 d = 1

 while d < depth:
     d += 1
     queue, next_queue = next_queue, []
     for screen_name in queue:
         follower_ids = getFollowers(screen_name=screen_name)
         next_queue += follower_ids
     getUserInfo(user_ids=next_queue)
The Most Popular Followers

 freqs = {}
 for follower in followers:
     cnt = follower['followers_count']
     if not freqs.has_key(cnt):
         freqs[cnt] = []

     freqs[cnt].append({'screen_name': follower['screen_name'],
                        'user_id': follower['id']})

 popular_followers = sorted(freqs, reverse=True)[:100]
Average # of Followers

 all_freqs = [k for k in freqs for user in freqs[k]]
 avg = sum(all_freqs) / len(all_freqs)
@timoreilly's Popular Followers

          The top 10 followers from the sample:

          aplusk              4,993,072
          BarackObama         4,114,901
          mashable            2,014,615
          MarthaStewart       1,932,321
          Schwarzenegger      1,705,177
          zappos              1,689,289
          Veronica            1,612,827
          jack                1,592,004
          stephenfry          1,531,813
          davos               1,522,621
Futzing the Numbers

• The average number of timoreilly's followers' followers: 445
• Discarding the top 10 lowers the average to around 300
• Discarding any follower with less than 10 followers of their
 own increases the average to over 1,000!
• Doing both brings the average to around 800
The Right Tool For the Job:
NetworkX for Networks
Friendship Graphs
for i in ids: #ids is timoreilly's id along with friend ids
  info = json.loads(r.get(getRedisIdByUserId(i, 'info.json')))
  screen_name = info['screen_name']
  friend_ids = list(r.smembers(getRedisIdByScreenName(screen_name,
                                                      'friend_ids')))

  for friend_id in [fid for fid in friend_ids if fid in ids]:
      friend_info = json.loads(r.get(getRedisIdByUserId(friend_id, 'info.json')))
      g.add_edge(screen_name, friend_info['screen_name'])

nx.write_gpickle(g, 'timoreilly.gpickle') # see also nx.read_gpickle
Clique Analysis


                                              • Cliques
                                              • Maximum Cliques
                                              • Maximal Cliques

http://en.wikipedia.org/wiki/Clique_problem
Calculating Cliques
cliques = [c for c in nx.find_cliques(g)]

num_cliques = len(cliques)
clique_sizes = [len(c) for c in cliques]

max_clique_size = max(clique_sizes)
avg_clique_size = sum(clique_sizes) / num_cliques
max_cliques = [c for c in cliques if len(c) == max_clique_size]
num_max_cliques = len(max_cliques)

people_in_every_max_clique = list(reduce(
    lambda x, y: x.intersection(y),[set(c) for c in max_cliques]
))
Cliques for @timoreilly


         Num   cliques:                762573
         Avg   clique size:                14
         Max   clique size:                26
         Num   max cliques:                 6
         Num   people in every max clique: 20
Visualizing Data



Graphs, etc


    • Your first instinct is naturally
      G = (V, E) ?
Dorling Cartogram

  • A location-aware bubble chart (ish)
  • At least 3-dimensional
    • Position, color, size
  • Look at friends/followers by state
Sunburst of Friends


 • A very compact visualization
 • Slice and dice friends/followers by
  gender, country, locale, etc.
Part 3:
The Tweet, the Whole Tweet, and
     Nothing but the Tweet



Insight Matters

• Which entities frequently appear in @user's tweets?
• How often does @user talk about specific friends?
• Who does @user retweet most frequently?
• How frequently is @user retweeted (by anyone)?
• How many #hashtags are usually in @user's tweets?
Pen : Sword :: Tweet : Machine Gun (?!?)
Getting Data



Let me count the APIs...

• Timelines
• Tweets
• Favorites
• Direct Messages
• Streams
Anatomy of a Tweet (1/2)
{
    "created_at" : "Thu Jun 24 14:21:11 +0000 2010",
    "id" : 16932571217,
    "text" : "Great idea from @crowdflower: Crowdsourcing ... #opengov",
    "user" : {
       "description" : "Founder and CEO, O'Reilly Media. Watching the alpha geeks...",
       "id" : 2384071,
       "location" : "Sebastopol, CA",
       "name" : "Tim O'Reilly",
       "screen_name" : "timoreilly",
       "url" : "http://radar.oreilly.com"
    },

    ...
Anatomy of a Tweet (2/2)

    ...

    "entities" : {
      "hashtags" : [    {"indices" : [ 97, 103 ], "text" : "gov20"},
                        {"indices" : [ 104, 112 ], "text" : "opengov"} ],

        "urls" : [{"expanded_url" : null, "indices" : [ 76, 96 ],
                   "url" : "http://bit.ly/9o4uoG"} ],

        "user_mentions" : [{"id" : 28165790, "indices" : [ 16, 28 ],
                            "name" : "crowdFlower","screen_name" : "crowdFlower"}]
    }
}
Entities & Annotations

• Entities
  • Opt-in now but will "soon" be standard
 • $ easy_install twitter_text
• Annotations
  • User-defined metadata
  • See http://dev.twitter.com/pages/annotations_overview
Manual Entity Extraction
 import twitter_text

 extractor = twitter_text.Extractor(tweet['text'])

 mentions = extractor.extract_mentioned_screen_names_with_indices()
 hashtags = extractor.extract_hashtags_with_indices()
 urls = extractor.extract_urls_with_indices()

 # Splice info into a tweet object
Storing Data



Storing Tweets

• Flat files? (Really, who does that?)
• A relational database?
• Redis?
• CouchDB (Relax...?)
CouchDB: Relax

• Document-oriented key/value
• Map/Reduce
• RESTful API
• Erlang
As easy as sitting on the couch


• Get it - http://www.couchone.com/get
• Install it
• Relax - http://localhost:5984/_utils/
• Also - $ easy_install couchdb
Storing Timeline Data
import couchdb
import twitter

TIMELINE_NAME = "user" # or "home" or "public"
MAX_PAGES = 15
DB = TIMELINE_NAME + '-timeline'

t = twitter.Twitter(domain='api.twitter.com', api_version='1')

server = couchdb.Server('http://localhost:5984')
db = server.create(DB)

page_num = 1
while page_num <= MAX_PAGES:
    api_call = getattr(t.statuses, TIMELINE_NAME + '_timeline')
    tweets = makeTwitterRequest(t, api_call, page=page_num)
    db.update(tweets, all_or_nothing=True)
    print 'Fetched %i tweets' % len(tweets)
    page_num += 1
Analyzing & Visualizing Data



Approach:
Map/Reduce on Tweets
Map/Reduce Paradigm

• Mapper: yields key/value pairs
• Reducer: operates on keyed mapper output
• Example: Computing the sum of squares
  • Mapper Input: (k, [2,4,6])
  • Mapper Output: (k, [4,16,36])
  • Reducer Input: [(k, 4,16), (k, 36)]
  • Reducer Output: 56
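In plain Python, the sum-of-squares example can be simulated like this (a sketch of the paradigm only, not CouchDB's actual map/reduce API):

```python
def mapper(key, values):
    # Emit a (key, value**2) pair for each input value.
    return [(key, v * v) for v in values]

def reducer(pairs):
    # Sum all mapped values that share a key.
    return sum(v for _, v in pairs)

mapped = mapper('k', [2, 4, 6])   # [('k', 4), ('k', 16), ('k', 36)]
print(reducer(mapped))            # 56
```

In CouchDB the mapper and reducer are JavaScript view functions, and the keyed grouping between the two phases happens inside the database.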
Which entities frequently appear in
       @mention's tweets?
@timoreilly's Tweet Entities
How often does @timoreilly
 mention specific friends?
Filtering Tweet Entities

• Let's find out how often someone talks about
 specific friends
• We have friend info on hand
• We've extracted @mentions from the tweets
• Let's count friend vs. non-friend mentions
@timoreilly's friend mentions
 Number of @user entities in tweets: 20
 Number of @user entities in tweets who are friends: 18
 Number of @user entities in tweets who are not friends: 2

 Friends mentioned:
   ahier            andrewsavikas    pkedrosky        gnat
   CodeforAmerica   slashdot         nytimes          OReillyMedia
   brady            dalepd           carlmalamud      mikeloukides
   pahlkadot        monkchips        make             fredwilson
   jamesoreilly     digiphile

 Not friends:
   n2vip            timoreilly
Who does @timoreilly retweet
     most frequently?
Counting Retweets

• Map @mentions out of tweets using a regex
• Reduce to sum them up
• Sort the results
• Display results
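The steps above can be sketched with a Counter, reusing the retweet regex from the earlier slide (the sample tweets are invented):

```python
import re
from collections import Counter

rt_pattern = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)

tweets = [                                   # hypothetical sample data
    "RT @gnat: open data ftw",
    "Interesting read (via @gnat)",
    "RT @monkchips: on developer tools",
]

counts = Counter()
for t in tweets:
    for match in rt_pattern.findall(t):      # map: extract @mentions
        for mention in match[1].split():
            counts[mention.strip('@')] += 1  # reduce: sum per screen name

print(counts.most_common())   # [('gnat', 2), ('monkchips', 1)]
```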
Retweets by @timoreilly
How frequently is @timoreilly
        retweeted?
Retweet Counts


• An API resource /statuses/retweet_count exists (and is now functional)
• Example: http://twitter.com/statuses/show/29016139807.json
  • retweet_count
  • retweeted
Survey Says...
@timoreilly is retweeted about 2/3
            of the time
How often does @timoreilly
include #hashtags in tweets?
Counting Hashtags


• Use a mapper to emit #hashtag entities from tweets
• Use a reducer to sum them all up
• Been there, done that...
Survey Says...
About 1 out of every 3 tweets by
 @timoreilly contains #hashtags
But if you order within the next 5
            minutes...



Bonus Material:
What do #JustinBieber and #TeaParty
         have in common?


Tweet Entities
#JustinBieber co-occurrences

 #bieberblast           http://tinyurl.com/343kax4   #music
 #Eclipse               @JustBieberFact              @justinbieber
 #somebodytolove        @TinselTownDirt              #nowplaying
 http://bit.ly/aARD4t   #beliebers                   #Justinbieber
 http://bit.ly/b2Kc1L   #BieberFact                  #JUSTINBIEBER
 #Escutando             #Celebrity                   #Proform
 #justinBieber          #Dschungel                   http://migre.me/TJwj
 #Restart               @_Yassi_                     @ProSieben
 #TT                    #musicmonday                 @lojadoaltivo
 #Telezwerge            #video                       #JustinBieber
 @rheinzeitung          #tickets                     #justinbieber
 #WTF
#TeaParty co-occurrences
                   @STOPOBAMA2012               #jcot
@blogging_tories   @TheFlaCracker               #tweetcongress
#cdnpoli           #palin2012                   #Obama
#fail              #AZ                          #topprog
#nra               #TopProg                     #palin
#roft              #conservative                #dems
@BrnEyeSuss        http://tinyurl.com/386k5hh   #acon
@crispix49         @ResistTyranny               #cspj
@koopersmith       #tsot                        #immigration
@Kriskxx           @ALIPAC                      #politics
#Kagan             #majority                    #hhrs
@Liliaep           #NoAmnesty                   #TeaParty
#nvsen             #patriottweets               #vote2010
@First_Patriots    @Drudge_Report               #libertarian
#patriot           #military                    #obama
#pjtv              #palin12                     #ucot
@andilinks         #rnc                         #iamthemob
@RonPaulNews       #TCOT                        #GOP
#ampats            http://tinyurl.com/24h36zq   #tpp
#cnn               #spwbt                       #dnc
#jews              @welshman007                 #twisters
#GOPDeficit        #FF                          #sgp
#wethepeople       #liberty                     #ocra
#asamom            #glennbeck                   #gop
@thenewdeal        #news                        #tlot
#AFIRE             #oilspill                    #p2
#Dems              #rs                          #tcot
@JIDF              #Teaparty                    #teaparty
Hashtag Distributions
Hashtag Analysis

• TeaParty: ~ 5 hashtags per tweet.
• Example: “Rarely is the question asked: Is our children
 learning?” - G.W. Bush #p2 #topprog #tcot #tlot #teaparty
 #GOP #FF
• JustinBieber: ~ 2 hashtags per tweet
• Example: #justinbieber is so coool
Common #hashtags
 #lol              #dancing
 #jesus            #music
 #worldcup         #glennbeck
 #teaparty         @addthis
 #AZ               #nowplaying
 #milk             #news
 #ff               #WTF
 #guns             #fail
 #WorldCup         #toomanypeople
 #bp               #oilspill
 #News             #catholic
Retweet Patterns
Retweet Behaviors
Friendship Networks
Juxtaposing Friendships

• Harvest search results for #JustinBieber and #TeaParty
• Get friend ids for each @mention with /friends/ids
• Resolve screen names with /users/lookup
• Populate a NetworkX graph
• Analyze it
• Visualize with Graphviz
Node Degrees
Two Kinds of Hairballs...




       #JustinBieber        #TeaParty
The twitterverse is your oyster
• Twitter: @SocialWebMining
• GitHub: http://bit.ly/socialwebmining
• Facebook: http://facebook.com/MiningTheSocialWeb





Twitter Presentation: #APIConSFRyan Choi
 
PyData Texas 2015 Keynote
PyData Texas 2015 KeynotePyData Texas 2015 Keynote
PyData Texas 2015 KeynotePeter Wang
 
Datasploit - An Open Source Intelligence Tool
Datasploit - An Open Source Intelligence ToolDatasploit - An Open Source Intelligence Tool
Datasploit - An Open Source Intelligence ToolShubham Mittal
 
Mining the social web ch1
Mining the social web ch1Mining the social web ch1
Mining the social web ch1HyeonSeok Choi
 
Idea2app
Idea2appIdea2app
Idea2appFlumes
 
Python and Oracle : allies for best of data management
Python and Oracle : allies for best of data managementPython and Oracle : allies for best of data management
Python and Oracle : allies for best of data managementLaurent Leturgez
 
Creating More Engaging Content For Social
Creating More Engaging Content For SocialCreating More Engaging Content For Social
Creating More Engaging Content For SocialEric T. Tung
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and PythonTravis Oliphant
 
Build a Twitter Bot with Basic Python
Build a Twitter Bot with Basic PythonBuild a Twitter Bot with Basic Python
Build a Twitter Bot with Basic PythonThinkful
 
ASFWS 2012 - Contourner les conditions d’utilisation et l’API du service Twit...
ASFWS 2012 - Contourner les conditions d’utilisation et l’API du service Twit...ASFWS 2012 - Contourner les conditions d’utilisation et l’API du service Twit...
ASFWS 2012 - Contourner les conditions d’utilisation et l’API du service Twit...Cyber Security Alliance
 
Protect Your Payloads: Modern Keying Techniques
Protect Your Payloads: Modern Keying TechniquesProtect Your Payloads: Modern Keying Techniques
Protect Your Payloads: Modern Keying TechniquesLeo Loobeek
 
Goodle Developer Days Munich 2008 - Open Social Update
Goodle Developer Days Munich 2008 - Open Social UpdateGoodle Developer Days Munich 2008 - Open Social Update
Goodle Developer Days Munich 2008 - Open Social UpdatePatrick Chanezon
 
Anaconda and PyData Solutions
Anaconda and PyData SolutionsAnaconda and PyData Solutions
Anaconda and PyData SolutionsTravis Oliphant
 
Working With Facebook, Twitter, et al. - Social Media Camp
Working With Facebook, Twitter, et al. - Social Media CampWorking With Facebook, Twitter, et al. - Social Media Camp
Working With Facebook, Twitter, et al. - Social Media CampMike Anderson
 

Ähnlich wie Unleashing twitter data for fun and insight (20)

Mining social data
Mining social dataMining social data
Mining social data
 
Life at Twitter + Career Advice for Students
Life at Twitter + Career Advice for StudentsLife at Twitter + Career Advice for Students
Life at Twitter + Career Advice for Students
 
Developing apps using Perl
Developing apps using PerlDeveloping apps using Perl
Developing apps using Perl
 
"R & Text Analytics" (15 January 2013)
"R & Text Analytics" (15 January 2013)"R & Text Analytics" (15 January 2013)
"R & Text Analytics" (15 January 2013)
 
Big data. Opportunità e rischi
Big data. Opportunità e rischiBig data. Opportunità e rischi
Big data. Opportunità e rischi
 
The Web Application Hackers Toolchain
The Web Application Hackers ToolchainThe Web Application Hackers Toolchain
The Web Application Hackers Toolchain
 
Twitter Presentation: #APIConSF
Twitter Presentation: #APIConSFTwitter Presentation: #APIConSF
Twitter Presentation: #APIConSF
 
PyData Texas 2015 Keynote
PyData Texas 2015 KeynotePyData Texas 2015 Keynote
PyData Texas 2015 Keynote
 
Datasploit - An Open Source Intelligence Tool
Datasploit - An Open Source Intelligence ToolDatasploit - An Open Source Intelligence Tool
Datasploit - An Open Source Intelligence Tool
 
Mining the social web ch1
Mining the social web ch1Mining the social web ch1
Mining the social web ch1
 
Idea2app
Idea2appIdea2app
Idea2app
 
Python and Oracle : allies for best of data management
Python and Oracle : allies for best of data managementPython and Oracle : allies for best of data management
Python and Oracle : allies for best of data management
 
Creating More Engaging Content For Social
Creating More Engaging Content For SocialCreating More Engaging Content For Social
Creating More Engaging Content For Social
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and Python
 
Build a Twitter Bot with Basic Python
Build a Twitter Bot with Basic PythonBuild a Twitter Bot with Basic Python
Build a Twitter Bot with Basic Python
 
ASFWS 2012 - Contourner les conditions d’utilisation et l’API du service Twit...
ASFWS 2012 - Contourner les conditions d’utilisation et l’API du service Twit...ASFWS 2012 - Contourner les conditions d’utilisation et l’API du service Twit...
ASFWS 2012 - Contourner les conditions d’utilisation et l’API du service Twit...
 
Protect Your Payloads: Modern Keying Techniques
Protect Your Payloads: Modern Keying TechniquesProtect Your Payloads: Modern Keying Techniques
Protect Your Payloads: Modern Keying Techniques
 
Goodle Developer Days Munich 2008 - Open Social Update
Goodle Developer Days Munich 2008 - Open Social UpdateGoodle Developer Days Munich 2008 - Open Social Update
Goodle Developer Days Munich 2008 - Open Social Update
 
Anaconda and PyData Solutions
Anaconda and PyData SolutionsAnaconda and PyData Solutions
Anaconda and PyData Solutions
 
Working With Facebook, Twitter, et al. - Social Media Camp
Working With Facebook, Twitter, et al. - Social Media CampWorking With Facebook, Twitter, et al. - Social Media Camp
Working With Facebook, Twitter, et al. - Social Media Camp
 

Kürzlich hochgeladen

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 

Kürzlich hochgeladen (20)

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 

Unleashing Twitter Data for Fun and Insight

  • 1. Unleashing Twitter Data for Fun and Insight
    Matthew A. Russell
    http://linkedin.com/in/ptwobrussell
    @ptwobrussell
    Mining the Social Web (Agile Data Solutions)
  • 3. Mining the Social Web, Chapters 1-5
    Introduction: Trends, Tweets, and Twitterers
    Microformats: Semantic Markup and Common Sense Collide
    Mailboxes: Oldies but Goodies
    Friends, Followers, and Setwise Operations
    Twitter: The Tweet, the Whole Tweet, and Nothing but the Tweet
  • 4. Mining the Social Web, Chapters 6-10
    LinkedIn: Clustering Your Professional Network For Fun (and Profit?)
    Google Buzz: TF-IDF, Cosine Similarity, and Collocations
    Blogs et al: Natural Language Processing (and Beyond)
    Facebook: The All-In-One Wonder
    The Semantic Web: A Cocktail Discussion
  • 5. Overview
    • Trends, Tweets, and Retweet Visualizations
    • Friends, Followers, and Setwise Operations
    • The Tweet, the Whole Tweet, and Nothing but the Tweet
  • 6. Insight Matters
    • What is @user's potential influence?
    • What are @user's passions right now?
    • Who are @user's most trusted friends?
  • 7. Part 1: Tweets, Trends, and Retweet Visualizations
  • 8. A point to ponder: Twitter : Data :: JavaScript : Programming Languages (???)
  • 9. Getting Ready To Code
  • 10. Python Installation
    • Mac users already have it
    • Linux users probably have it
    • Windows users should grab ActivePython
  • 11. easy_install
    • Installs packages from PyPI
    • Get it:
      • http://pypi.python.org/pypi/setuptools
      • Ships with ActivePython
    • It really is easy:
      easy_install twitter
      easy_install nltk
      easy_install networkx
  • 12. Git It?
    • http://github.com/ptwobrussell/Mining-the-Social-Web
    • git clone git://github.com/ptwobrussell/Mining-the-Social-Web.git
    • introduction__*.py
    • friends_followers__*.py
    • the_tweet__*.py
  • 13. Getting Data
  • 14. Twitter Data Sources
    • Twitter API Resources
    • GNIP
    • Infochimps
    • Library of Congress
  • 15. Trending Topics
    >>> import twitter # Remember to "easy_install twitter"
    >>> twitter_search = twitter.Twitter(domain="search.twitter.com")
    >>> trends = twitter_search.trends()
    >>> [ trend['name'] for trend in trends['trends'] ]
    [u'#ZodiacFacts', u'#nowplaying', u'#ItsOverWhen', u'#Christoferdrew',
     u'Justin Bieber', u'#WhatwouldItBeLike', u'#Sagittarius', u'SNL',
     u'#SurveySays', u'#iDoit2']
  • 16. Search Results
    >>> search_results = []
    >>> for page in range(1,6):
    ...     search_results.append(twitter_search.search(q="SNL", rpp=100, page=page))
  • 17. Search Results (continued)
    >>> import json
    >>> print json.dumps(search_results, sort_keys=True, indent=1)
    [
     {
      "completed_in": 0.088122000000000006,
      "max_id": 11966285265,
      "next_page": "?page=2&max_id=11966285265&rpp=100&q=SNL",
      "page": 1,
      "query": "SNL",
      "refresh_url": "?since_id=11966285265&q=SNL",
      ...more...
  • 18. Search Results (continued)
      "results": [
       {
        "created_at": "Sun, 11 Apr 2010 01:34:52 +0000",
        "from_user": "bieber_luv2",
        "from_user_id": 106998169,
        "geo": null,
        "id": 11966285265,
        "iso_language_code": "en",
        "metadata": { "result_type": "recent" },
        ...more...
  • 19. Search Results (continued)
        "profile_image_url": "http://a1.twimg.com/profile_images/80...",
        "source": "&lt;a href=&quot;http://twitter.com/&quo...",
        "text": "im nt gonna go to sleep happy unless i see ...",
        "to_user_id": null
       }
       ... output truncated - 99 more tweets ...
      ],
      "results_per_page": 100,
      "since_id": 0
     },
     ... output truncated - 4 more pages ...
    ]
  • 20. Lexical Diversity
    • Ratio of unique terms to total terms
    • A measure of "stickiness"?
    • A measure of "group think"?
    • A crude indicator of retweets to originally authored tweets?
  • 21. Distilling Tweet Text
    >>> # search_results is already defined
    >>> tweets = [ r['text']
    ...     for result in search_results
    ...     for r in result['results'] ]
    >>> words = []
    >>> for t in tweets:
    ...     words += [ w for w in t.split() ]
    ...
  • 22. Analyzing Data
  • 23. Lexical Diversity
    >>> len(words)
    7238
    >>> # unique words
    >>> len(set(words))
    1636
    >>> # lexical diversity
    >>> 1.0*len(set(words))/len(words)
    0.22602928985907708
    >>> # average number of words per tweet
    >>> 1.0*sum([ len(t.split()) for t in tweets ])/len(tweets)
    14.476000000000001
  • 24. Size Frequency Matters
    • Counting: always the first step
    • Simple but effective
    • NLTK saves us a little trouble
  • 25. Frequency Analysis
    >>> import nltk
    >>> freq_dist = nltk.FreqDist(words)
    >>> freq_dist.keys()[:50] # 50 most frequent tokens
    [u'snl', u'on', u'rt', u'is', u'to', u'i', u'watch', u'justin',
     u'@justinbieber', u'be', u'the', u'tonight', u'gonna', u'at', u'in',
     u'bieber', u'and', u'you', u'watching', u'tina', u'for', u'a', u'wait',
     u'fey', u'of', u'@justinbieber:', u'if', u'with', u'so', u"can't",
     u'who', u'great', u'it', u'going', u'im', u':)', u'snl...', u'2nite...',
     u'are', u'cant', u'dress', u'rehearsal', u'see', u'that', u'what',
     u'but', u'tonight!', u':d', u'2', u'will']
  • 27. Tweet and RT were sitting on a fence. Tweet fell off. Who was left?
  • 28. RTs: past, present, & future
    • Retweet: Tweeting a tweet that's already been tweeted
    • RT or via followed by @mention
    • Example: RT @SocialWebMining Justin Bieber is on SNL 2nite. w00t?!?
    • Relatively new APIs were rolled out last year for retweeting sans conventions
  • 29. Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. -- Jamie Zawinski
  • 30. Parsing Retweets
    >>> example_tweets = ["Visualize Twitter search results w/ this simple script http://bit.ly/cBu0l4 - Gist instructions http://bit.ly/9SZ2kb (via @SocialWebMining @ptwobrussell)"]
    >>> import re
    >>> rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)",
    ...     re.IGNORECASE)
    >>> rt_origins = []
    >>> for t in example_tweets:
    ...     try:
    ...         rt_origins += [ mention.strip()
    ...             for mention in rt_patterns.findall(t)[0][1].split() ]
    ...     except IndexError, e:
    ...         pass
    >>> [ rto.strip("@") for rto in rt_origins ]
  • 31. Visualizing Data
  • 32. Graph Construction
    >>> import networkx as nx
    >>> g = nx.DiGraph()
    >>> g.add_edge("@SocialWebMining", "@ptwobrussell",
    ...     {"tweet_id" : 4815162342},)
  • 33. Writing out DOT
    OUT_FILE = "out_file.dot"
    try:
        nx.drawing.write_dot(g, OUT_FILE)
    except ImportError, e:
        dot = ['"%s" -> "%s" [tweet_id=%s]' % (n1, n2, g[n1][n2]['tweet_id'])
               for n1, n2 in g.edges()]
        f = open(OUT_FILE, 'w')
        f.write('strict digraph {\n%s\n}' % (';\n'.join(dot),))
        f.close()
  • 34. Example DOT Language
    strict digraph {
        "@ericastolte" -> "bonitasworld" [tweet_id=11965974697];
        "@mpcoelho" -> "Lil_Amaral" [tweet_id=11965954427];
        "@BieberBelle123" -> "BELIEBE4EVER" [tweet_id=11966261062];
        "@BieberBelle123" -> "sabrina9451" [tweet_id=11966197327];
    }
  • 35. DOT to Image
    • Download Graphviz: http://www.graphviz.org/
    • $ dot -Tpng out_file.dot > graph.png
    • Windows users might prefer GVEdit
  • 37. But you want more sexy?
  • 38. Protovis: Extreme Closeup
  • 39. It Doesn't Have To Be a Graph [figure: Graph Connectedness]
  • 40. Part 2: Friends, Followers, and Setwise Operations
  • 41. Insight Matters
    • What is my potential influence?
    • Who are the most popular people in my network?
    • Who are my mutual friends?
    • What common friends/followers do I have with @user?
    • Who is not following me back?
    • What can I learn from analyzing my friendship cliques?
  • 42. Getting Data
  • 43. OAuth (1.0a)
    import twitter
    from twitter.oauth_dance import oauth_dance

    # Get these from http://dev.twitter.com/apps/new
    consumer_key, consumer_secret = 'key', 'secret'

    (oauth_token, oauth_token_secret) = oauth_dance('MiningTheSocialWeb',
        consumer_key, consumer_secret)

    auth = twitter.oauth.OAuth(oauth_token, oauth_token_secret,
        consumer_key, consumer_secret)

    t = twitter.Twitter(domain='api.twitter.com', auth=auth)
  • 44. Getting Friendship Data
    friend_ids = t.friends.ids(screen_name='timoreilly', cursor=-1)
    follower_ids = t.followers.ids(screen_name='timoreilly', cursor=-1)
    # store the data somewhere...
  • 45. Perspective: Fetching all of Lady Gaga's ~7M followers would take ~4 hours
  • 46. But there's always a catch...
  • 47. Rate Limits
    • 350 requests/hr for authenticated requests
    • 150 requests/hr for anonymous requests
    • Coping mechanisms:
      • Caching & Archiving Data
      • Streaming API
      • HTTP 400 codes
    • See http://dev.twitter.com/pages/rate-limiting
  • 48. The Beloved Fail Whale
    • Twitter is sometimes "overcapacity"
    • HTTP 503 Error
    • Handle it just as any other HTTP error
    • RESTfulness has its advantages
  • 49. Abstraction Helps
    friend_ids = []
    wait_period = 2 # secs
    cursor = -1
    while cursor != 0:
        response = makeTwitterRequest(t, # twitter.Twitter instance
                                      t.friends.ids,
                                      screen_name=screen_name,
                                      cursor=cursor)
        friend_ids += response['ids']
        cursor = response['next_cursor']
        # break out of loop early if you don't need all ids
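The makeTwitterRequest helper isn't shown on the slide. Here is a minimal sketch of what such a retry wrapper might look like; the exponential-backoff policy and the catch-all exception handling are assumptions for illustration, not the book's exact implementation (a real helper would inspect the HTTP status code, sleeping out the rate-limit window on a 400 and simply retrying on a 503):

```python
import time

def make_twitter_request(request, max_retries=5, wait_period=2):
    # Call request(); on failure, sleep and retry, doubling the wait
    # each time (exponential backoff). Re-raise after max_retries.
    for attempt in range(max_retries):
        try:
            return request()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(wait_period)
            wait_period *= 2
```

A call like `make_twitter_request(lambda: t.friends.ids(screen_name='timoreilly', cursor=-1))` would then survive transient fail-whale errors.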
  • 50. Abstracting Abstractions
    screen_name = 'timoreilly'

    # This is what you ultimately want...
    friend_ids = getFriends(screen_name)
    follower_ids = getFollowers(screen_name)
  • 51. Storing Data
  • 52. Flat Files?
    ./
        screen_name1/
            friend_ids.json
            follower_ids.json
            user_info.json
        screen_name2/
            ...
        ...
  • 53. Pickles?
    import cPickle

    o = {
        'friend_ids' : friend_ids,
        'follower_ids' : follower_ids,
        'user_info' : user_info
    }

    f = open('screen_name1.pickle', 'wb')
    cPickle.dump(o, f)
    f.close()
  • 54. A relational database?
    import sqlite3 as sqlite

    conn = sqlite.connect('data.db')
    c = conn.cursor()
    c.execute('''create table friends...''')
    c.execute('''insert into friends... ''')
    # Lots of fun...sigh...
  • 55. Redis (A Data Structures Server)
    import redis

    r = redis.Redis()
    [ r.sadd("timoreilly$friend_ids", i) for i in friend_ids ]
    r.smembers("timoreilly$friend_ids") # returns a set

    Project page: http://redis.io
    Windows binary: http://code.google.com/p/servicestack/wiki/RedisWindowsDownload
  • 56. Redis Set Operations
    • Key/value store...on typed values!
    • Common set operations
      • smembers, scard
      • sinter, sdiff, sunion
      • sadd, srem, etc.
    • See http://code.google.com/p/redis/wiki/CommandReference
    • Don't forget to $ easy_install redis
  • 57. Analyzing Data
  • 58. Setwise Operations
    • Union
    • Intersection
    • Difference
    • Complement
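Before reaching for Redis, the same setwise operations can be sketched with plain Python sets; the id values below are made up purely for illustration:

```python
# Hypothetical friend/follower ids for a single user
friend_ids = {1, 2, 3, 4, 5}
follower_ids = {3, 4, 5, 6, 7}

mutual_friends = friend_ids & follower_ids       # intersection
not_following_back = friend_ids - follower_ids   # difference
not_followed_back = follower_ids - friend_ids    # difference, other way
everyone = friend_ids | follower_ids             # union
```

Redis mirrors each of these with sinterstore, sdiffstore, and sunionstore, just computed server-side over persisted sets.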
  • 59. Venn Diagrams [figure: Venn diagram of Friends and Followers, showing Friends - Followers, Followers - Friends, and their intersection within Friends U Followers]
  • 60. Count Your Blessings
    # A utility function
    def getRedisIdByScreenName(screen_name, key_name):
        return 'screen_name$' + screen_name + '$' + key_name

    # Number of friends
    n_friends = r.scard(getRedisIdByScreenName(screen_name, 'friend_ids'))

    # Number of followers
    n_followers = r.scard(getRedisIdByScreenName(screen_name, 'follower_ids'))
  • 61. Asymmetric Relationships
    # Friends who aren't following back
    friends_diff_followers = r.sdiffstore('temp',
        [ getRedisIdByScreenName(screen_name, 'friend_ids'),
          getRedisIdByScreenName(screen_name, 'follower_ids') ])

    # ... compute interesting things ...

    r.delete('temp')
  • 62. Asymmetric Relationships
    # Followers who aren't friended
    followers_diff_friends = r.sdiffstore('temp',
        [ getRedisIdByScreenName(screen_name, 'follower_ids'),
          getRedisIdByScreenName(screen_name, 'friend_ids') ])

    # ... compute interesting things ...

    r.delete('temp')
  • 63. Symmetric Relationships
    mutual_friends = r.sinterstore('temp',
        [ getRedisIdByScreenName(screen_name, 'follower_ids'),
          getRedisIdByScreenName(screen_name, 'friend_ids') ])

    # ... compute interesting things ...

    r.delete('temp')
  • 64. Sample Output
    timoreilly is following 663
    timoreilly is being followed by 1,423,704
    131 of 663 are not following timoreilly back
    1,423,172 of 1,423,704 are not being followed back by timoreilly
    timoreilly has 532 mutual friends
  • 65. Who Isn't Following Back?
    user_ids = [ ... ] # Resolve these to user info objects

    while len(user_ids) > 0:
        user_ids_str = ','.join([ str(i) for i in user_ids[:100] ])
        user_ids = user_ids[100:]
        response = t.users.lookup(user_id=user_ids_str)
        if type(response) is dict:
            response = [response]
        r.mset(dict([(getRedisIdByUserId(resp['id'], 'info.json'),
                      json.dumps(resp)) for resp in response]))
        r.mset(dict([(getRedisIdByScreenName(resp['screen_name'], 'info.json'),
                      json.dumps(resp)) for resp in response]))
  • 66. Friends in Common
    # Assume we've harvested friends/followers and it's in Redis...
    screen_names = ['timoreilly', 'mikeloukides']

    r.sinterstore('temp$friends_in_common',
        [ getRedisIdByScreenName(screen_name, 'friend_ids')
          for screen_name in screen_names ])

    r.sinterstore('temp$followers_in_common',
        [ getRedisIdByScreenName(screen_name, 'follower_ids')
          for screen_name in screen_names ])

    # Manipulate the sets
  • 67. Potential Influence
    • My followers?
    • My followers' followers?
    • My followers' followers' followers?
    • for n in range(1, 7): # 6 degrees?
          print "My " + "followers' "*n + "followers?"
  • 68. Saving a Thousand Words... [figure: a tree with Branching Factor = 2 and Depth = 3; 15 nodes numbered 1-15]
  • 69. Same Data, Different Layout [figure: the same 15-node tree drawn in a radial layout]
  • 70. Space Complexity (total nodes by branching factor and depth)

                          Depth
                     1    2    3     4     5
    Branching   2    3    7    15    31    63
    Factor      3    4    13   40    121   364
                4    5    21   85    341   1365
                5    6    31   156   781   3906
                6    7    43   259   1555  9331
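The space-complexity figures above are just geometric series: a root node plus b nodes at depth 1, b squared at depth 2, and so on. A quick sanity check:

```python
def total_nodes(branching_factor, depth):
    # 1 root + b nodes at depth 1 + b**2 at depth 2 + ... + b**depth
    return sum(branching_factor ** d for d in range(depth + 1))

# Matches the table: total_nodes(2, 3) -> 15, total_nodes(6, 5) -> 9331
```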
  • 71. Breadth-First Traversal
    Create an empty graph
    Create an empty queue to keep track of unprocessed nodes
    Add the starting point to the graph as the "root node"
    Add the root node to a queue for processing
    Repeat until some maximum depth is reached or the queue is empty:
        Remove a node from queue
        For each of the node's neighbors:
            If the neighbor hasn't already been processed:
                Add it to the graph
                Add it to the queue
                Add an edge to the graph connecting the node & its neighbor
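The steps above can be sketched in a few lines of plain Python. Here, get_neighbors stands in for whatever call fetches a node's neighbors (getFollowers, in the harvesting case), and the binary-tree neighbor function at the end is a made-up toy that matches the branching-factor-2 example:

```python
from collections import deque

def breadth_first_crawl(seed, get_neighbors, max_depth):
    # Returns the edges discovered as (node, neighbor) pairs
    edges = []
    seen = {seed}
    queue = deque([(seed, 0)])  # (node, depth) pairs awaiting processing
    while queue:
        node, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for neighbor in get_neighbors(node):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
                edges.append((node, neighbor))
    return edges

# Toy neighbor function: node n's "followers" are 2n and 2n+1
edges = breadth_first_crawl(1, lambda n: [2 * n, 2 * n + 1], max_depth=3)
```

With branching factor 2 and depth 3 this discovers 15 nodes and 14 edges, agreeing with the space-complexity table.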
• 72. Breadth-First Harvest
next_queue = ['timoreilly'] # seed node
d = 1
while d < depth:
    d += 1
    queue, next_queue = next_queue, []
    for screen_name in queue:
        follower_ids = getFollowers(screen_name=screen_name)
        next_queue += follower_ids
getUserInfo(user_ids=next_queue)
• 73. The Most Popular Followers
freqs = {}
for follower in followers:
    cnt = follower['followers_count']
    if cnt not in freqs:
        freqs[cnt] = []
    freqs[cnt].append({'screen_name': follower['screen_name'],
                       'user_id': follower['id']})
popular_followers = sorted(freqs, reverse=True)[:100]
• 74. Average # of Followers
all_freqs = [k for k in freqs for follower in freqs[k]]
avg = sum(all_freqs) / len(all_freqs)
  • 75. @timoreilly's Popular Followers The top 10 followers from the sample: aplusk 4,993,072 BarackObama 4,114,901 mashable 2,014,615 MarthaStewart 1,932,321 Schwarzenegger 1,705,177 zappos 1,689,289 Veronica 1,612,827 jack 1,592,004 stephenfry 1,531,813 davos 1,522,621
  • 76. Futzing the Numbers • The average number of timoreilly's followers' followers: 445 • Discarding the top 10 lowers the average to around 300 • Discarding any follower with less than 10 followers of their own increases the average to over 1,000! • Doing both brings the average to around 800
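The futzing above amounts to a trimmed average: drop the top-k outliers and/or anything below a minimum follower count before averaging. A minimal sketch (the function name and sample numbers are mine, not the book's data):

```python
def futzed_average(counts, drop_top=0, min_count=0):
    # Discard the drop_top largest values, then anything below min_count,
    # and average whatever is left.
    kept = sorted(counts)
    if drop_top:
        kept = kept[:len(kept) - drop_top]
    kept = [c for c in kept if c >= min_count]
    return sum(kept) / float(len(kept))

# e.g. one mega-follower skews a small sample badly:
futzed_average([1, 5, 100], drop_top=1)   # outlier removed
futzed_average([1, 5, 100], min_count=5)  # tiny accounts removed
```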
  • 77. The Right Tool For the Job: NetworkX for Networks
• 78. Friendship Graphs
g = nx.Graph()
for i in ids: # ids is timoreilly's id along with friend ids
    info = json.loads(r.get(getRedisIdByUserId(i, 'info.json')))
    screen_name = info['screen_name']
    friend_ids = list(r.smembers(getRedisIdByScreenName(screen_name, 'friend_ids')))
    for friend_id in [fid for fid in friend_ids if fid in ids]:
        friend_info = json.loads(r.get(getRedisIdByUserId(friend_id, 'info.json')))
        g.add_edge(screen_name, friend_info['screen_name'])
nx.write_gpickle(g, 'timoreilly.gpickle') # see also nx.read_gpickle
  • 79. Clique Analysis • Cliques • Maximum Cliques • Maximal Cliques http://en.wikipedia.org/wiki/Clique_problem
• 80. Calculating Cliques
cliques = [c for c in nx.find_cliques(g)]
num_cliques = len(cliques)
clique_sizes = [len(c) for c in cliques]
max_clique_size = max(clique_sizes)
avg_clique_size = sum(clique_sizes) / num_cliques
max_cliques = [c for c in cliques if len(c) == max_clique_size]
num_max_cliques = len(max_cliques)
people_in_every_max_clique = list(reduce(
    lambda x, y: x.intersection(y),
    [set(c) for c in max_cliques]))
  • 81. Cliques for @timoreilly Num cliques: 762573 Avg clique size: 14 Max clique size: 26 Num max cliques: 6 Num people in every max clique: 20
• 82. Visualizing Data
  • 83. Graphs, etc • Your first instinct is naturally G = (V, E) ?
  • 84. Dorling Cartogram • A location-aware bubble chart (ish) • At least 3-dimensional • Position, color, size • Look at friends/followers by state
  • 85. Sunburst of Friends • A very compact visualization • Slice and dice friends/followers by gender, country, locale, etc.
• 86. Part 3: The Tweet, the Whole Tweet, and Nothing but the Tweet
  • 87. Insight Matters • Which entities frequently appear in @user's tweets? • How often does @user talk about specific friends? • Who does @user retweet most frequently? • How frequently is @user retweeted (by anyone)? • How many #hashtags are usually in @user's tweets?
  • 88. Pen : Sword :: Tweet : Machine Gun (?!?)
• 89. Getting Data
  • 90. Let me count the APIs... • Timelines • Tweets • Favorites • Direct Messages • Streams
• 91. Anatomy of a Tweet (1/2)
{
  "created_at" : "Thu Jun 24 14:21:11 +0000 2010",
  "id" : 16932571217,
  "text" : "Great idea from @crowdflower: Crowdsourcing ... #opengov",
  "user" : {
    "description" : "Founder and CEO, O'Reilly Media. Watching the alpha geeks...",
    "id" : 2384071,
    "location" : "Sebastopol, CA",
    "name" : "Tim O'Reilly",
    "screen_name" : "timoreilly",
    "url" : "http://radar.oreilly.com"
  },
  ...
• 92. Anatomy of a Tweet (2/2)
  ...
  "entities" : {
    "hashtags" : [
      {"indices" : [ 97, 103 ], "text" : "gov20"},
      {"indices" : [ 104, 112 ], "text" : "opengov"}
    ],
    "urls" : [
      {"expanded_url" : null, "indices" : [ 76, 96 ], "url" : "http://bit.ly/9o4uoG"}
    ],
    "user_mentions" : [
      {"id" : 28165790, "indices" : [ 16, 28 ], "name" : "crowdFlower", "screen_name" : "crowdFlower"}
    ]
  }
}
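Once a tweet is parsed, the entities field is just nested lists of dicts, so pulling out hashtags and mentions is a one-liner each. A small sketch using a trimmed-down version of the tweet above:

```python
import json

# A cut-down version of the tweet structure from the previous slides
tweet = json.loads('''{
  "text": "Great idea from @crowdflower: Crowdsourcing ... #opengov",
  "entities": {
    "hashtags": [{"indices": [97, 103], "text": "gov20"},
                 {"indices": [104, 112], "text": "opengov"}],
    "urls": [{"indices": [76, 96], "url": "http://bit.ly/9o4uoG"}],
    "user_mentions": [{"indices": [16, 28], "screen_name": "crowdFlower"}]
  }
}''')

hashtags = [h['text'] for h in tweet['entities']['hashtags']]
mentions = [m['screen_name'] for m in tweet['entities']['user_mentions']]
urls = [u['url'] for u in tweet['entities']['urls']]
```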
  • 93. Entities & Annotations • Entities • Opt-in now but will "soon" be standard • $ easy_install twitter_text • Annotations • User-defined metadata • See http://dev.twitter.com/pages/annotations_overview
• 94. Manual Entity Extraction
import twitter_text
extractor = twitter_text.Extractor(tweet['text'])
mentions = extractor.extract_mentioned_screen_names_with_indices()
hashtags = extractor.extract_hashtags_with_indices()
urls = extractor.extract_urls_with_indices()
# Splice info into a tweet object
• 95. Storing Data
  • 96. Storing Tweets • Flat files? (Really, who does that?) • A relational database? • Redis? • CouchDB (Relax...?)
  • 97. CouchDB: Relax • Document-oriented key/value • Map/Reduce • RESTful API • Erlang
  • 98. As easy as sitting on the couch • Get it - http://www.couchone.com/get • Install it • Relax - http://localhost:5984/_utils/ • Also - $ easy_install couchdb
• 99. Storing Timeline Data
import couchdb
import twitter
TIMELINE_NAME = "user" # or "home" or "public"
t = twitter.Twitter(domain='api.twitter.com', api_version='1')
server = couchdb.Server('http://localhost:5984')
db = server.create(DB)
page_num = 1
while page_num <= MAX_PAGES:
    api_call = getattr(t.statuses, TIMELINE_NAME + '_timeline')
    tweets = makeTwitterRequest(t, api_call, page=page_num)
    db.update(tweets, all_or_nothing=True)
    print 'Fetched %i tweets' % len(tweets)
    page_num += 1
• 100. Analyzing & Visualizing Data
• 102. Map/Reduce Paradigm • Mapper: yields key/value pairs • Reducer: operates on keyed mapper output • Example: Computing the sum of squares • Mapper Input: (k, [2,4,6]) • Mapper Output: (k, [4,16,36]) • Reducer Input: [(k, [4,16]), (k, [36])] • Reducer Output: 56
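The sum-of-squares example above can be sketched as a tiny in-memory map/reduce (CouchDB runs the same idea server-side over documents; the function names here are just illustrative):

```python
# Mapper: yields key/value pairs -- here, (key, squared value)
def mapper(key, values):
    return [(key, v * v) for v in values]

# Reducer: collapses the keyed mapper output into a single result
def reducer(pairs):
    return sum(value for _, value in pairs)

mapped = mapper('k', [2, 4, 6])
result = reducer(mapped)
```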
• 103. Which entities frequently appear in @user's tweets?
  • 105. How often does @timoreilly mention specific friends?
• 106. Filtering Tweet Entities • Let's find out how often someone talks about specific friends • We have friend info on hand • We've extracted @mentions from the tweets • Let's count friend vs non-friend mentions
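With the friends harvested as a set and the @mentions already extracted, the tally is a set-membership test per mention. A sketch with made-up stand-ins for the harvested data:

```python
from collections import Counter

# Hypothetical harvested data, not the book's actual results
friend_screen_names = {'gnat', 'slashdot', 'nytimes'}
mentions_in_tweets = ['gnat', 'n2vip', 'gnat', 'nytimes']

# Bucket each mention as friend vs non-friend and count
counts = Counter('friend' if m in friend_screen_names else 'non-friend'
                 for m in mentions_in_tweets)
```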
• 107. @timoreilly's friend mentions
Number of @user entities in tweets: 20
Number of @user entities in tweets who are friends: 18
Number of @user entities in tweets who are not friends: 2
n2vip timoreilly ahier andrewsavikas pkedrosky gnat CodeforAmerica slashdot nytimes OReillyMedia brady dalepd carlmalamud mikeloukides pahlkadot monkchips make fredwilson jamesoreilly digiphile
  • 108. Who does @timoreilly retweet most frequently?
  • 109. Counting Retweets • Map @mentions out of tweets using a regex • Reduce to sum them up • Sort the results • Display results
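The four steps above boil down to a regex in the map step and a counter in the reduce step. A minimal sketch over made-up sample tweets:

```python
import re
from collections import Counter

# Map step: pull the retweeted screen name out of "RT @user" patterns
rt_pattern = re.compile(r'\bRT\s+@(\w+)', re.IGNORECASE)

tweets = [  # hypothetical sample tweets
    'RT @n2vip: check this out',
    'RT @gnat: great read',
    'rt @n2vip: again',
]

# Reduce/sort step: Counter tallies and most_common sorts the results
retweet_counts = Counter(name for t in tweets
                         for name in rt_pattern.findall(t))
```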
  • 111. How frequently is @timoreilly retweeted?
  • 112. Retweet Counts • An API resource /statuses/retweet_count exists (and is now functional) • Example: http://twitter.com/statuses/show/29016139807.json • retweet_count • retweeted
  • 113. Survey Says... @timoreilly is retweeted about 2/3 of the time
  • 114. How often does @timoreilly include #hashtags in tweets?
• 115. Counting Hashtags • Use a mapper to emit #hashtag entities for tweets • Use a reducer to sum them all up • Been there, done that...
• 116. Survey Says... About 1 out of every 3 tweets by @timoreilly contains #hashtags
• 117. But if you order within the next 5 minutes...
• 118. Bonus Material: What do #JustinBieber and #TeaParty have in common?
  • 120. #JustinBieber co-occurrences #bieberblast http://tinyurl.com/ #music #Eclipse 343kax4 @justinbieber #somebodytolove @JustBieberFact #nowplaying http://bit.ly/aARD4t @TinselTownDirt #Justinbieber http://bit.ly/b2Kc1L #beliebers #JUSTINBIEBER #Escutando #BieberFact #Proform #justinBieber #Celebrity http://migre.me/TJwj #Restart #Dschungel @ProSieben #TT @_Yassi_ @lojadoaltivo #Telezwerge #musicmonday #JustinBieber @rheinzeitung #video #justinbieber #WTF #tickets
  • 121. #TeaParty co-occurrences @STOPOBAMA2012 #jcot @blogging_tories @TheFlaCracker #tweetcongress #cdnpoli #palin2012 #Obama #fail #AZ #topprog #nra #TopProg #palin #roft #conservative #dems @BrnEyeSuss http://tinyurl.com/386k5hh #acon @crispix49 @ResistTyranny #cspj @koopersmith #tsot #immigration @Kriskxx @ALIPAC #politics #Kagan #majority #hhrs @Liliaep #NoAmnesty #TeaParty #nvsen #patriottweets #vote2010 @First_Patriots @Drudge_Report #libertarian #patriot #military #obama #pjtv #palin12 #ucot @andilinks #rnc #iamthemob @RonPaulNews #TCOT #GOP #ampats http://tinyurl.com/24h36zq #tpp #cnn #spwbt #dnc #jews @welshman007 #twisters #GOPDeficit #FF #sgp #wethepeople #liberty #ocra #asamom #glennbeck #gop @thenewdeal #news #tlot #AFIRE #oilspill #p2 #Dems #rs #tcot @JIDF #Teaparty #teaparty
• 123. Hashtag Analysis • TeaParty: ~ 5 hashtags per tweet. • Example: "Rarely is the question asked: Is our children learning?" - G.W. Bush #p2 #topprog #tcot #tlot #teaparty #GOP #FF • JustinBieber: ~ 2 hashtags per tweet • Example: #justinbieber is so coool
  • 124. Common #hashtags #lol #dancing #jesus #music #worldcup #glennbeck #teaparty @addthis #AZ #nowplaying #milk #news #ff #WTF #guns #fail #WorldCup #toomanypeople #bp #oilspill #News #catholic
  • 128. Juxtaposing Friendships • Harvest search results for #JustinBieber and #TeaParty • Get friend ids for each @mention with /friends/ids • Resolve screen names with /users/lookup • Populate a NetworkX graph • Analyze it • Visualize with Graphviz
  • 130. Two Kinds of Hairballs... #JustinBieber #TeaParty
  • 131. The world twitterverse is your oyster
• 132. • Twitter: @SocialWebMining • GitHub: http://bit.ly/socialwebmining • Facebook: http://facebook.com/MiningTheSocialWeb