Mining Social Data
     FOSDEM 2013
Credits
  Speaker     Romeu "@malk_zameth" MOURA
  Company     @linagora
  License     CC-BY-SA 3.0
  SlideShare  j.mp/XXgBAn
  Sources     ● Mining Graph Data
              ● Mining the Social Web
              ● Social Network Analysis for startups
              ● Social Media Mining and Social Network Analysis
              ● Graph Mining
I work at Linagora, a French FLOSS company.
EloData &
      OpenGraphMiner
Linagora's foray into ESN, DataStorage, Graphs &
                      Mining.
Why mine social data at
         all?
    Without being a creepy stalker
To see what humans
       can't.
  Influence, centers of interest.
To remember what humans
        can't.
What worked in the past? Objectively how did I behave
                     until now?
To discover what humans
         won't.
Serendipity
Find what you were not looking for
Real life social data
   What is so specific about it?
Always graphs
Dense substructures
  Every vertex is a unique entity (someone).
Several dense subgraphs: relations within pockets of
                     people
Usually it has no good cuts
  Even the best partition algorithms cannot find
        partitions that are just not there
There will be
errors & unknowns
 Exact matching is not an option
Plenty of vanity-metric
       pollution.
    Sometimes very surprising ones.
Number of followers is a
    vanity metric
@GuyKawasaki (~1.5M followers) is retweeted much more
 than the user with the most followers
           (@justinbieber, ~34M)
Why use graphs?
What is the itch with Inductive Logic that Inductive
                  Graphs scratch?
'Classic' Data Mining
       Pros and cons
pro: Solid known
   techniques
   of good performance
con: Complex structures
     are translated
 into Bayesian networks or multi-relational tables,
incurring either data loss or combinatorial explosion.
Graph Mining
  'The new deal'
pro: Expressiveness
         and simplicity
The input and output are graphs, no conversions, graph
                algorithms all around.
con: The unit of operation
       is comparing
      isomorphisms
 Subgraph isomorphism is NP-complete
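To make the cost concrete, here is a minimal brute-force sketch of subgraph matching. It is an illustration only, not the method of any particular graph-mining tool: the function name `find_embedding` and the edge-list representation are assumptions made for this example. It tries every injective assignment of pattern nodes to graph nodes, which is exactly the combinatorial blow-up the slide warns about.

```python
from itertools import permutations

def find_embedding(pattern_edges, graph_edges):
    """Brute-force search for the pattern inside the graph (undirected).

    Returns a dict mapping pattern nodes to graph nodes such that every
    pattern edge lands on a graph edge, or None if no such mapping exists.
    """
    pattern_nodes = sorted({n for edge in pattern_edges for n in edge})
    graph_nodes = sorted({n for edge in graph_edges for n in edge})
    graph_set = {frozenset(edge) for edge in graph_edges}
    # Try every injective assignment: factorial growth in the worst case.
    for candidate in permutations(graph_nodes, len(pattern_nodes)):
        mapping = dict(zip(pattern_nodes, candidate))
        if all(frozenset((mapping[a], mapping[b])) in graph_set
               for a, b in pattern_edges):
            return mapping
    return None

triangle = [("x", "y"), ("y", "z"), ("z", "x")]
graph = [(1, 2), (2, 3), (3, 1), (3, 4)]   # a triangle with a tail
path = [(1, 2), (2, 3), (3, 4)]            # no triangle anywhere

print(find_embedding(triangle, graph))
print(find_embedding(triangle, path))
```

Real miners avoid this exhaustive search with pruning and canonical labels, but the worst case remains hard.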
Extraction
 Getting the data
Is the easy part
  A commodity really.
Social networks provide
          API
Facebook Graph API, Twitter REST API, Yammer API,
                      etc.
Worst case:
Crawl the website
Crawling The Web For Fun And Profit:
   http://youtu.be/eQtxbaw__W8
import sys
import twitter
import networkx as nx
# Helper from the "Mining the Social Web" example code: returns the
# screen names a tweet was retweeted from.
from recipe__get_rt_origins import get_rt_origins

def create_rt_graph(tweets):
    """Build a directed graph with one edge per (retweeted user, retweeter)."""
    g = nx.DiGraph()
    for tweet in tweets:
        rt_origins = get_rt_origins(tweet)
        if not rt_origins:
            continue
        for rt_origin in rt_origins:
            # Keep the tweet id as an edge attribute.
            g.add_edge(rt_origin, tweet['from_user'], tweet_id=tweet['id'])
    return g

if __name__ == '__main__':
    # Join all command-line arguments into a single search query.
    Q = ' '.join(sys.argv[1:])
    MAX_PAGES = 15
    RESULTS_PER_PAGE = 100
    # Legacy Twitter search endpoint used on the original slide (since retired).
    twitter_search = twitter.Twitter(domain='search.twitter.com')
    search_results = []
    for page in range(1, MAX_PAGES + 1):
        search_results.append(
            twitter_search.search(q=Q, rpp=RESULTS_PER_PAGE, page=page)
        )
    all_tweets = [tweet for page in search_results for tweet in page['results']]
    g = create_rt_graph(all_tweets)

    # Summary statistics go to stderr.
    print("Number nodes:", g.number_of_nodes(), file=sys.stderr)
    print("Num edges:", g.number_of_edges(), file=sys.stderr)
    print("Num connected components:",
          nx.number_connected_components(g.to_undirected()), file=sys.stderr)
    print("Node degrees:", sorted(d for _, d in g.degree()), file=sys.stderr)
Finding patterns
  substructures that repeat
Older options
Apriori-based, Pattern growth
Stepwise pair expansion
Separate the graph by pairs, count frequencies, keep the
   most frequent, augment them by one, repeat.
"Chunk": Separate the graph by pairs
Keep only the frequent ones
Expand them
Find your frequent pattern
con: Chunkiness
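The first two steps (chunk by pairs, keep the frequent ones) can be sketched as follows. This is a simplified illustration, assuming each node carries a label and a pair is identified by the two labels at the ends of an edge; the function name `frequent_pairs` and the sample data are hypothetical.

```python
from collections import Counter

def frequent_pairs(labels, edges, min_count):
    """Count label pairs over all edges and keep those seen often enough."""
    # Sort the two labels so ("dev", "ops") and ("ops", "dev") are one pair.
    counts = Counter(tuple(sorted((labels[u], labels[v]))) for u, v in edges)
    return {pair: n for pair, n in counts.items() if n >= min_count}

# Toy graph: node id -> label, plus undirected edges.
labels = {1: "dev", 2: "dev", 3: "ops", 4: "dev", 5: "ops"}
edges = [(1, 3), (2, 3), (4, 5), (1, 2), (3, 5)]

print(frequent_pairs(labels, edges, min_count=2))
```

Here only the dev/ops pair occurs at least twice, so it is the only chunk that would be expanded in the next round.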
"ChunkingLess"
Graph Based Induction
     CL-GBI [Cook et al.]
Inputs needed
1. Minimal frequency at which we consider a
   conformation to be a pattern: threshold
2. Number of most frequent patterns we will
   retain: beam size
3. Arbitrary number of times we will iterate:
   levels
1. "Chunk": Separate the graph by pairs
2. Select the beam-size most frequent ones
3. Turn selected pairs into pseudo-nodes
4. Expand & rechunk
Keep going back to step 2
    until you have done it levels times.
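The loop above can be sketched in a few lines. This is a rough illustration under strong simplifying assumptions, not the published algorithm: a pattern is identified only by the sorted labels of its nodes (internal structure is ignored), pseudo-nodes grow by one adjacent original node per level, and all names (`cl_gbi_sketch`, `pool`, the sample graph) are made up for the example. The three parameters correspond to the threshold, beam size and levels listed earlier.

```python
from collections import defaultdict

def cl_gbi_sketch(labels, adjacency, threshold, beam_size, levels):
    """Return {pattern: instance count} for the patterns kept by the beam."""
    # Pseudo-nodes are added to a pool of node sets; the graph itself is
    # never contracted, which is the "chunkingless" idea.
    pool = {frozenset([node]) for node in labels}
    found = {}
    for _ in range(levels):
        # 1. "Chunk": pair every pooled instance with one adjacent node.
        instances = defaultdict(set)
        for inst in pool:
            for node in inst:
                for neighbour in adjacency[node]:
                    if neighbour not in inst:
                        grown = inst | {neighbour}
                        key = tuple(sorted(labels[n] for n in grown))
                        instances[key].add(grown)
        # 2. Keep the beam_size most frequent new patterns above threshold.
        ranked = sorted(
            ((key, insts) for key, insts in instances.items()
             if key not in found and len(insts) >= threshold),
            key=lambda item: (-len(item[1]), item[0]),
        )[:beam_size]
        # 3. Turn the selected pairs into pseudo-nodes for the next level.
        for key, insts in ranked:
            found[key] = len(insts)
            pool |= insts
    return found

labels = {1: "a", 2: "b", 3: "c", 4: "a", 5: "b", 6: "c", 7: "d"}
adjacency = defaultdict(set)
for u, v in [(1, 2), (2, 3), (4, 5), (5, 6), (3, 7)]:
    adjacency[u].add(v)
    adjacency[v].add(u)

result = cl_gbi_sketch(labels, adjacency, threshold=2, beam_size=2, levels=2)
print(result)
```

On this toy graph the first level keeps the a-b and b-c pairs, and the second level grows them into the a-b-c pattern that occurs twice.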
Decision Trees
A Tree of patterns
Finding a pattern on a branch yields a decision
DT-CLGBI
DT-CLGBI(graph: D)
begin
  create_node DT in D
  if threshold-attained
    return DT
  else
    P <- select_most_discriminative(CL-GBI(D))
    (Dy, Dn) <- branch_DT_on_predicate(P)
    for Di <- Dy
      DT.branch_yes.add-child(DT-CLGBI(Di))
    for Di <- Dn
      DT.branch_no.add-child(DT-CLGBI(Di))
    return DT
end
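A runnable sketch of the same recursion, under loud assumptions: pattern mining is taken as already done, so each example graph is reduced to the set of pattern names it contains, and "most discriminative" is interpreted here as highest information gain. The names `build_tree`, `classify` and the sample data are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def build_tree(examples):
    """examples: list of (set of pattern names, class label) pairs."""
    labels = [label for _, label in examples]
    majority = Counter(labels).most_common(1)[0][0]
    candidates = set().union(*(patterns for patterns, _ in examples))
    if len(set(labels)) == 1 or not candidates:
        return majority  # leaf: nothing left to separate

    def gain(pattern):
        yes = [l for p, l in examples if pattern in p]
        no = [l for p, l in examples if pattern not in p]
        if not yes or not no:
            return -1.0  # the pattern does not split the examples
        split = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(examples)
        return entropy(labels) - split

    best = max(sorted(candidates), key=gain)
    if gain(best) <= 0:
        return majority
    yes = [(p, l) for p, l in examples if best in p]
    no = [(p, l) for p, l in examples if best not in p]
    return {"pattern": best, "yes": build_tree(yes), "no": build_tree(no)}

def classify(tree, patterns):
    """Walk the tree: finding the pattern on a branch yields a decision."""
    while isinstance(tree, dict):
        tree = tree["yes"] if tree["pattern"] in patterns else tree["no"]
    return tree

examples = [
    ({"a-b", "b-c"}, "active"),
    ({"a-b"}, "active"),
    ({"b-c"}, "passive"),
    ({"c-d"}, "passive"),
]
tree = build_tree(examples)
print(tree)
print(classify(tree, {"a-b", "x-y"}))
```

In the slide's version the patterns would be mined afresh by CL-GBI at every node; the fixed pattern sets here only keep the example short.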
