Ch1. Introduction: Hacking on
   Twitter Data
   chois79

   2011.10.15


Thursday, October 20, 2011
Installing Python Development
   Tools
   ✤   python
       ✤ http://www.python.org/download

   ✤   python package manager tools
       ✤ let you install Python packages effortlessly

       ✤ easy_install

          ✤ http://pypi.python.org/pypi/setuptools

       ✤ pip

          ✤ http://www.pip-installer.org/en/latest/installing.html

   ✤   networkx
       ✤ creating and manipulating graphs and networks

       ✤ e.g., easy_install networkx or pip install networkx




Collecting and Manipulating
   Twitter Data




Tinkering with Twitter’s API(1/2)

   ✤   Setup

        ✤   easy_install twitter

        ✤   However, Twitter’s API has since been updated; see:

            ✤    http://github.com/sixohsix/twitter/issues/56

   ✤   The twitter package (The Minimalist Twitter API for Python) is a Python wrapper for Twitter’s API

        ✤   API calls map directly onto REST queries, for example:

            ✤   http://search.twitter.com/trends.json

Tinkering with Twitter’s API(2/2)

  ✤   Retrieving Twitter search trends
       # ex.3
       import twitter
       twitter_api = twitter.Twitter()
       WORLD_WOE_ID = 1 # The Yahoo! Where On Earth ID for the entire world
       world_trends = twitter_api.trends._(WORLD_WOE_ID) # get back a callable
       # Equivalent one-liner: [ trend["name"] for trend in world_trends()[0]['trends'] ]
       for trend in world_trends()[0]['trends']: # call the callable
           print(trend["name"])




  ✤   Paging through Twitter search results
       # ex.4
       search_results = []
       for page in range(1,6):
           search_results.append(twitter_api.search(q="Dennis Ritchie", rpp=20, page=page))
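
The paging loop returns one response per page, and the later slides work on a flat list of tweet texts. The following is a minimal sketch of that flattening step, not the deck's own code: `fake_search` is a hypothetical stand-in for `twitter_api.search` (which needs network access and the 2011-era Search API), and the `results` and `text` keys are assumed to mirror the response shape used elsewhere in these slides.

```python
def fake_search(q, rpp, page):
    """Hypothetical stand-in for twitter_api.search: returns one page of results."""
    start = (page - 1) * rpp
    return {"results": [{"text": "%s tweet %d" % (q, i)}
                        for i in range(start, start + rpp)]}


def collect_pages(search, q, rpp=20, pages=5):
    """Call `search` once per page (pages are 1-indexed) and collect the responses."""
    return [search(q=q, rpp=rpp, page=page) for page in range(1, pages + 1)]


def flatten_texts(search_results):
    """Pull the text of every tweet out of a list of page responses."""
    return [tweet["text"] for page in search_results for tweet in page["results"]]


search_results = collect_pages(fake_search, "Dennis Ritchie")
tweets = flatten_texts(search_results)
print(len(search_results), len(tweets))
```

Passing the search function in as an argument keeps the paging logic testable without a live API.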




Frequency Analysis and Lexical
   Diversity(1/5)
   ✤   Lexical diversity
        ✤   One of the most intuitive measurements that can be applied to
            unstructured text
        ✤   Defined as the number of unique tokens in the text divided by
            the total number of tokens
        >>> words = []
        >>> for t in tweets:
        ...     words += [ w for w in t.split() ]
        >>> len(words) # total words
        7238
        >>> len(set(words)) # unique words
        1636
        >>> 1.0*len(set(words))/len(words) # lexical diversity
        0.22602928985907708
        >>> 1.0*sum([ len(t.split()) for t in tweets ])/len(tweets) # avg words per tweet
        14.476000000000001


        ✤   A lexical diversity of about 0.23 suggests that each tweet carries roughly 20 percent unique information
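
The same two measurements can be wrapped in small functions. This is a minimal Python 3 sketch assuming whitespace tokenization, as in the session above; the function names and the three sample tweets are made up for illustration.

```python
def lexical_diversity(tweets):
    """Unique tokens divided by total tokens, using whitespace tokenization."""
    words = [w for t in tweets for w in t.split()]
    if not words:
        return 0.0
    return len(set(words)) / len(words)


def average_words(tweets):
    """Mean number of whitespace-separated tokens per tweet."""
    if not tweets:
        return 0.0
    return sum(len(t.split()) for t in tweets) / len(tweets)


# Illustrative data only, not taken from the original search results.
sample_tweets = [
    "justin bieber is on snl tonight",
    "rt tina fey is on snl tonight",
    "watching snl",
]
print(lexical_diversity(sample_tweets))  # 10 unique tokens out of 15
print(average_words(sample_tweets))      # 15 tokens over 3 tweets
```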

Frequency Analysis and Lexical
   Diversity(2/5)
   ✤   Frequency Analysis: Use NLTK or collections.Counter
        ✤    Very simple, powerful tool
       >>> import nltk
       >>> import cPickle
       >>> words = cPickle.load(open("myData.pickle"))
       >>> freq_dist = nltk.FreqDist(words)
       >>> freq_dist.keys()[:50] # 50 most frequent tokens
       [u'snl', u'on', u'rt', u'is', u'to', u'i', u'watch', u'justin', u'@justinbieber', u'be', u'the', u'tonight', u'gonna', u'at', u'in', u'bieber', u'and', u'you',
       u'watching', u'tina', u'for', u'a', u'wait', u'fey', u'of', u'@justinbieber:', u'if', u'with', u'so', u"can't", u'who', u'great', u'it', u'going',
       u'im', u':)', u'snl...', u'2nite...', u'are', u'cant', u'dress', u'rehearsal', u'see', u'that', u'what', u'but', u'tonight!', u':d', u'2', u'will']

       >>> freq_dist.keys()[-50:] # 50 least frequent tokens
       [u'what?!', u'whens', u'where', u'while', u'white', u'whoever', u'whoooo!!!!', u'whose', u'wiating', u'wii', u'wiig', u'win...', u'wink.', u'wknd.',
        u'wohh', u'won', u'wonder', u'wondering', u'wootwoot!', u'worked', u'worth', u'xo.', u'xx', u'ya', u'ya<3miranda', u'yay', u'yay!',
        u'ya\u2665', u'yea', u'yea.', u'yeaa', u'yeah!', u'yeah.', u'yeahhh.', u'yes,', u'yes;)', u'yess', u'yess,', u'you!!!!!', u"you'll", u'you+snl=', u'you,',
        u'youll', u'youtube??', u'youu<3', u'youuuuu', u'yum', u'yumyum', u'~', u'\xac\xac']

              ✤    Frequent tokens refer to entities such as people, times, activities
              ✤    Infrequent terms amount to mostly noise
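
For readers without NLTK, a comparable frequency count can be sketched with the standard library's `collections.Counter`. This is a minimal Python 3 illustration, not the deck's code; the helper name `token_frequencies` and the sample tweets are assumptions.

```python
from collections import Counter


def token_frequencies(tweets):
    """Count lowercased, whitespace-separated tokens across all tweets."""
    return Counter(w.lower() for t in tweets for w in t.split())


# Illustrative data only.
sample_tweets = [
    "Justin Bieber is on SNL tonight",
    "RT Tina Fey is on SNL tonight",
    "watching SNL",
]
freq = token_frequencies(sample_tweets)
top = freq.most_common(1)                                  # most frequent token
rare = sorted(tok for tok, n in freq.items() if n == 1)    # tokens seen only once
print(top)
print(rare)
```

`most_common(n)` plays the role of slicing the head of the frequency distribution, and the tokens with a count of 1 correspond to its noisy tail.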

Frequency Analysis and Lexical
   Diversity(3/5)
   ✤   Extracting relationships from the tweets
        ✤   The social web is first and foremost about the linkages between people
        ✤   One highly convenient format for storing social web data is a graph
        ✤   Using regular expressions to find retweets
            ✤   RT followed by a username
            ✤   via followed by a username
                >>> import re
                >>> rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)
                >>> example_tweets = ["RT @SocialWebMining Justin Bieber is on SNL 2nite. w00t?!?",
                ... "Justin Bieber is on SNL 2nite. w00t?!? (via @SocialWebMining)"]
                >>> for t in example_tweets:
                ...     rt_patterns.findall(t)
                [('RT', ' @SocialWebMining')]
                [('via', ' @SocialWebMining')]
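
The pattern can be wrapped in a helper that returns only the usernames. This is a minimal Python 3 sketch using the same regular expression; it takes the second capture group directly instead of filtering out the "RT" and "via" markers, which is a simplification of the approach used on the next slide.

```python
import re

# "RT" or "via", followed by one or more @usernames.
RT_PATTERN = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)


def get_rt_sources(text):
    """Return the username part of every 'RT @user' or 'via @user' in text."""
    return [users.strip() for _marker, users in RT_PATTERN.findall(text)]


example_tweets = [
    "RT @SocialWebMining Justin Bieber is on SNL 2nite. w00t?!?",
    "Justin Bieber is on SNL 2nite. w00t?!? (via @SocialWebMining)",
    "Just watching SNL",
]
for t in example_tweets:
    print(get_rt_sources(t))
```

Note that the pattern has no leading word boundary, so it would also match "rt" at the end of a longer word that is followed by a username; whether that matters depends on the data.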




Frequency Analysis and Lexical
      Diversity(4/5)
        >>> import networkx as nx
        >>> import re
        >>> g = nx.DiGraph()
        >>>
        >>> all_tweets = [ tweet
        ...                for page in search_results
        ...                    for tweet in page["results"] ]
        >>> def get_rt_sources(tweet):
        ...     rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)
        ...     return [ source.strip()
        ...              for tuple in rt_patterns.findall(tweet)
        ...                  for source in tuple
        ...                      if source not in ("RT", "via") ]
        >>> for tweet in all_tweets:
        ...     rt_sources = get_rt_sources(tweet["text"])
        ...     if not rt_sources: continue
        ...     for rt_source in rt_sources:
        ...         g.add_edge(rt_source, tweet["from_user"], {"tweet_id" : tweet["id"]})
        >>> g.number_of_nodes()
        160
        >>> g.number_of_edges()
        125
        >>> g.edges(data=True)[0]
        (u'@ericastolte', u'bonitasworld', {'tweet_id': 11965974697L})
        >>> len(nx.connected_components(g.to_undirected()))
        37
        >>> sorted(nx.degree(g))
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 6, 6, 9, 37]




Frequency Analysis and Lexical
   Diversity(5/5)
   ✤   Analysis
        ✤   500 tweets
            ✤   160 nodes
                 ✤   160 users are involved in retweet relationships with one another
            ✤   125 edges
                 ✤   The node-to-edge ratio is 1.28 (160/125): some nodes are connected to more than one
                     other node
            ✤   37 connected components: the graph consists of 37 subgraphs and is not fully
                connected
            ✤   The output of degree
                 ✤   Most nodes have degree 1; a few are connected to many others (the maximum degree is 37)
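
To make these four numbers concrete, the sketch below computes them for a tiny hand-made edge list using only the Python 3 standard library. It is a stand-in for the networkx calls in the deck, not a replacement for them; the function name `graph_stats` and the toy edges are assumptions for illustration.

```python
from collections import defaultdict


def graph_stats(edges):
    """Return (node count, edge count, component count, sorted degrees)
    for a list of directed (source, target) edges."""
    neighbors = defaultdict(set)  # undirected adjacency, used for components
    degree = defaultdict(int)     # in-degree plus out-degree per node
    unique_edges = set(edges)
    for src, dst in unique_edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)
        degree[src] += 1
        degree[dst] += 1

    # Count connected components of the undirected graph with a depth-first walk.
    seen = set()
    components = 0
    for start in neighbors:
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(neighbors[node] - seen)

    return len(neighbors), len(unique_edges), components, sorted(degree.values())


# Toy retweet edges (retweeted source -> retweeting user); illustrative only.
edges = [("@a", "u1"), ("@a", "u2"), ("@a", "u3"),
         ("@b", "u4"), ("@c", "u5"), ("@c", "u1")]
nodes, n_edges, components, degrees = graph_stats(edges)
print(nodes, n_edges, components, degrees)
```

In this toy graph, as in the real one, most nodes have degree 1 and a small number of sources account for the higher degrees.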

Visualizing Tweet Graphs(1/3)

   ✤   Dot language
        ✤   Text graph description language
        ✤   Provides a simple way of describing graphs that both humans and
            computer programs can use
   ✤   Graphviz
        ✤   install from source: http://www.graphviz.org/
        ✤   pygraphviz
            ✤   easy_install pygraphviz
                 ✤   may require setting library_path and include_path in setup.py


Visualizing Tweet Graphs(2/3)

   ✤   Generating DOT language output
        OUT = "snl_search_results.dot"
        try:
           nx.drawing.write_dot(g, OUT)
        except ImportError as e:
           # Help for Windows users:
           # Not a general-purpose method, but representative of
           # the same output write_dot would provide for this graph
           # if installed and easy to implement
           dot = ['"%s" -> "%s" [tweet_id=%s]' % (n1, n2, g[n1][n2]['tweet_id'])
              for n1, n2 in g.edges()]
           f = open(OUT, 'w')
           f.write('strict digraph {\n%s\n}' % (';\n'.join(dot),))
           f.close()

   ✤   Output
        strict digraph {
        "@ericastolte" -> "bonitasworld" [tweet_id=11965974697];
        "@mpcoelho" -> "Lil_Amaral" [tweet_id=11965954427];
        "@BieberBelle123" -> "BELIEBE4EVER" [tweet_id=11966261062];
        "@BieberBelle123" -> "sabrina9451" [tweet_id=11966197327];
        }
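
The fallback branch boils down to string formatting, so it can be tried without networkx or Graphviz installed. This is a minimal Python 3 sketch: the helper `to_dot` is hypothetical, the two edges are copied from the output above, and node names are assumed to contain no double quotes (no escaping is done).

```python
def to_dot(edges):
    """Render (source, target, tweet_id) triples as a strict DOT digraph."""
    lines = ['"%s" -> "%s" [tweet_id=%s]' % (src, dst, tweet_id)
             for src, dst, tweet_id in edges]
    return "strict digraph {\n%s\n}" % ";\n".join(lines)


edges = [
    ("@ericastolte", "bonitasworld", 11965974697),
    ("@mpcoelho", "Lil_Amaral", 11965954427),
]
dot_text = to_dot(edges)
print(dot_text)
```

The resulting text can be written to a `.dot` file and passed to a Graphviz layout program.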



Visualizing Tweet Graphs(3/3)

   ✤   Convert
        ✤   $ circo -Tpng -Osnl_search_results snl_search_results.dot








Closing Remarks


   ✤   Illustrated how easy it is to use Python’s interactive interpreter to
       explore and visualize Twitter data
        ✤   Get comfortable with your Python development environment
        ✤   Spend some time with the Twitter APIs and Graphviz
            ✤   Canviz project
                 ✤   Draws Graphviz graphs on a web browser <canvas> element




