Mining Social Web APIs
with IPython Notebook
Matthew A. Russell - @ptwobrussell - http://MiningTheSocialWeb.com
New York City - 28 October 2013

Intro

Hello, My Name Is ... Matthew
Background in Computer Science
Data mining & machine learning
CTO @ Digital Reasoning Systems
Data mining; machine learning
Author @ O'Reilly Media
5 published books on technology
Principal @ Zaffra
Selective boutique consulting

Transforming Curiosity Into Insight
An open source software (OSS) project
http://bit.ly/MiningTheSocialWeb2E
A book
http://bit.ly/135dHfs
Accessible to (virtually) everyone
Virtual machine with turn-key coding templates for data science experiments
Think of the book as "premium" support for the OSS project

The Social Web Is All the Rage
World population: ~7B people
Facebook: 1.15B users
Twitter: 500M users
Google+: 343M users
LinkedIn: 238M users
~200M+ blogs (conservative estimate)

Overview
Intro (5 mins)
Module 1 - Virtual Machine Setup (10 mins)
Module 2 - Mining Twitter (40 mins)
Module 3 - Mining Facebook (35 mins)
BREAK (30 mins)
Module 4 - Mining LinkedIn (40 mins)
Module 5 - Open Hack (40 mins)
Final Q&A; Wrap Up (10 mins)

Module Format
~10-15 minutes of exposition
I talk; you listen

~25-30 minutes of independent (or collaborative) work
You hack while I walk around and help you

~5 minutes of Q&A
You ask; I try to answer

Workshop Objective

To send you away as a social web hacker
Broad working knowledge of popular social web APIs
Hands-on experience hacking on social web data with a common toolkit

Not to listen to me talk to you for 3 hours

Just a Few More Things
This workshop is...
An adaptation of Mining the Social Web, 2nd Edition
More of a guided hacking session where you follow along (vs. a presentation)
Wider than it is deep
There's only so much you can do in a few hours

I'm available 24/7 this week (and beyond) to help you be successful

Assumptions
At some point in your life, you have
Programmed with Python
Worked with JSON
Made requests and processed responses to/from web servers

Or you want to learn to do these things now...
And you're a quick learner

Module 1: Virtual Machine Setup

Why do you need a VM?
To save time
Because installation and configuration management are harder than they first appear
So that you can focus on the task at hand instead
So that I can support you regardless of your hardware and operating system

But I can do all of that myself...
True...
If you would rather troubleshoot unexpected installation/configuration issues instead of immediately focusing on the real task at hand

At least give it a shot before resorting to your own devices so that you don't have to install specific versions of ~40 Python packages
Including scientific computing tools that require underlying C/C++ code to be compiled
Which requires specific versions of developer libraries to be installed

You get the idea...

The Virtual Machine Experience
Vagrant
A nice abstraction around virtual machine providers
One ring to rule them all
VirtualBox, VMware, AWS, ...

IPython Notebook
The easiest way to program with Python
A better REPL (interpreter)
Great for hacking

What happens when you vagrant up?
Vagrant follows the instructions in your Vagrantfile
Starts up a VirtualBox instance
Uses Chef to provision it
Installs OS patches/updates
Installs MTSW software dependencies
Starts IPython Notebook server on port 8888

Why Should I Use IPython Notebook?
Because it's great for hacking
And hacking is usually the first step

Because it's great for collaboration
Sharing/publishing results is trivial

Because the UX is as easy as working in a notepad
Think of it as "executable paper"

VM Quick Start Instructions
Go to http://MiningTheSocialWeb.com/quick-start/
Follow the instructions
And watch the screencasts!

Basically:
Install VirtualBox & Vagrant
Run "vagrant up" in a terminal to start a guest VM
Then, go to http://localhost:8888 on your host machine's web browser
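
Before opening the browser, you can check that something is actually listening on port 8888. This is a minimal sketch using only the Python standard library; `is_port_open` is a hypothetical helper name, not part of Vagrant or IPython, and the code assumes Python 3.

```python
import socket

def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unreachable: nothing usable is listening.
        return False

if __name__ == "__main__":
    # 8888 is the port the notebook server is started on after "vagrant up".
    print("Notebook server reachable:", is_port_open("localhost", 8888))
```

If this prints False right after "vagrant up", provisioning may simply not have finished yet.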

What Could Be Easier?
A hosted version of the VM!
But only for a few hours during this workshop
Because it costs money to run these servers

Go to <the URL provided in the session> and pick a machine
Do not share the URLs outside of this workshop!
Please don't try to hack the machines
I'll verbally provide the connection details (port and password)

A Hosted Virtual Machine
Yes, please.
Is it free?
Perhaps...
...Sign up for the AWS free tier at http://aws.amazon.com/free/
But not right now. Do it later

Stand by for the step-by-step instructions on how to do it
I'll publish a post on it in the next day or so

Module 2: Mining Twitter

Objectives
Be able to identify Twitter primitives
Understand tweet metadata and how to use it
Learn how to extract entities such as user mentions, hashtags, and URLs from tweets
Apply techniques for performing frequency analysis with Python
Be able to plot histograms of Twitter data with IPython Notebook

Twitter Primitives
Account Types: "Anything"
"Following" Relationships
Favorites
Retweets
Replies
(Almost) No Privacy Controls

API Requests
RESTful requests
Everything is a "resource"
You GET, PUT, POST, and DELETE resources
Standard HTTP "verbs"

Example: GET https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=SocialWebMining

Streaming API filters
JSON responses
Cursors (not quite pagination)
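
To make the request format and cursoring concrete, here is a minimal offline sketch. The helper names (`build_url`, `iterate_cursor`) and the fake pages are assumptions for illustration; real requests also need OAuth credentials, which are left out here. The cursor convention shown (start at -1, stop when `next_cursor` is 0) is how Twitter's cursored v1.1 endpoints behave.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.twitter.com/1.1"

def build_url(resource, **params):
    """Build a GET URL for a v1.1 resource such as 'statuses/user_timeline'."""
    query = urlencode(sorted(params.items()))
    return "{0}/{1}.json?{2}".format(API_ROOT, resource, query)

def iterate_cursor(fetch_page, items_key="ids"):
    """Yield items from a cursored endpoint.

    fetch_page(cursor) must return a dict with a list under items_key and
    a 'next_cursor' value. Cursoring starts at -1 and ends at 0.
    """
    cursor = -1
    while cursor != 0:
        page = fetch_page(cursor)
        for item in page[items_key]:
            yield item
        cursor = page["next_cursor"]

# Stand-in for real HTTP calls: two hand-written pages of follower IDs.
FAKE_PAGES = {
    -1: {"ids": [1, 2, 3], "next_cursor": 42},
    42: {"ids": [4, 5], "next_cursor": 0},
}

print(build_url("statuses/user_timeline", screen_name="SocialWebMining"))
print(list(iterate_cursor(FAKE_PAGES.get)))
```

Unlike page numbers, a cursor is an opaque token: you can only move to the position the previous response handed you, which is why it is "not quite pagination".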

Twitter is an Interest Graph
[Figure: graph diagram with nodes labeled Johnny Araya, Roberto, Mercedes, Rodolfo Hernández, Ana, Jorge, and Nina]

What's in a Tweet?
140 Characters ...
... Plus ~5KB of metadata!
Authorship
Time & location
Tweet "entities"
Replying, retweeting, favoriting, etc.

What are Tweet Entities?
Essentially, the "easy to get at" data in the 140 characters
@usermentions
#hashtags
URLs
multiple variations

(financial) symbols
stock tickers

media
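
Entities arrive already parsed in each tweet's JSON, so extracting them is mostly dictionary access. The sketch below uses a hand-written, heavily trimmed tweet (not real data) and a hypothetical helper, `extract_entities`; the field names follow the v1.1 tweet format.

```python
# A hand-written, trimmed stand-in for one tweet's JSON (not real data).
tweet = {
    "text": "Mining the #SocialWeb with @ptwobrussell http://t.co/abc $AAPL",
    "entities": {
        "hashtags": [{"text": "SocialWeb"}],
        "user_mentions": [{"screen_name": "ptwobrussell"}],
        "urls": [{"expanded_url": "http://MiningTheSocialWeb.com"}],
        "symbols": [{"text": "AAPL"}],
    },
}

def extract_entities(tweet):
    """Pull the 'easy to get at' values out of a tweet's entities."""
    entities = tweet.get("entities", {})
    return {
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "screen_names": [m["screen_name"]
                         for m in entities.get("user_mentions", [])],
        "urls": [u["expanded_url"] for u in entities.get("urls", [])],
        "symbols": [s["text"] for s in entities.get("symbols", [])],
        # 'media' is only present when the tweet has attached media.
        "media": [m["media_url"] for m in entities.get("media", [])],
    }

print(extract_entities(tweet))
```

Using `.get(..., [])` throughout keeps the helper from failing on tweets that lack a given entity type.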

Data Mining Is...

Counting
Comparing
Filtering
Ranking
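
All four verbs fit in a few lines with `collections.Counter` from the standard library. The hashtag list here is made up for illustration.

```python
from collections import Counter

# Made-up hashtags, as if collected from a batch of tweets.
hashtags = ["python", "data", "python", "strata", "data", "python"]

counts = Counter(hashtags)                               # counting
more_python = counts["python"] > counts["data"]          # comparing
frequent = {t: c for t, c in counts.items() if c > 1}    # filtering
top_two = counts.most_common(2)                          # ranking

print(counts, more_python, frequent, top_two)
```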

Histograms

A chart that is handy for frequency analysis
They look like bar charts...except they're not bar charts
Each value on the x-axis is a range (or "bin") of values
Not categorical data
Each value on the y-axis is the combined frequency of values in each range
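
The binning idea can be sketched without any plotting library. This is an illustrative pure-Python version, assuming a non-empty list of non-negative integers (such as retweet counts); the function names and sample values are made up. In the notebook you would normally hand the raw values to a plotting function and let it do the binning.

```python
from collections import Counter

def histogram(values, bin_width):
    """Map each bin's lower edge to the count of values in [edge, edge + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    # Include empty bins between the smallest and largest occupied bin.
    edges = range(min(counts), max(counts) + bin_width, bin_width)
    return {edge: counts[edge] for edge in edges}

def render(hist, bin_width):
    """Draw one text row per bin, one '#' per value that fell in the bin."""
    rows = []
    for edge, freq in hist.items():
        label = "{0:>3}-{1:<3}".format(edge, edge + bin_width - 1)
        rows.append("{0} {1}".format(label, "#" * freq))
    return "\n".join(rows)

retweet_counts = [0, 1, 1, 3, 7, 12, 14, 15, 22, 29]  # made-up sample
print(render(histogram(retweet_counts, bin_width=10), 10))
```

Each row is a range of values rather than a category, which is the difference from a bar chart.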

Plotting with IPython Notebook

Example: Histogram of Retweets

Social Media Analysis Framework
A memorable four-step process to guide data science experiments:
Aspire
To test a hypothesis (answer a question)

Acquire
Get the data

Analyze
Count things

Summarize
Plot the results

Exercises
Review Python idioms in the "Appendix C (Python Tips & Tricks)" notebook
Follow the setup instructions in the "Chapter 1 (Mining Twitter)" notebook
Fill in Example 1-1 with credentials and begin work
Execute each example sequentially
Customize queries
Explore tweet metadata; count tweet entities; plot histograms of results
Explore the "Chapter 9 (Twitter Cookbook)" notebook
Think of it as a collection of building blocks

Module 3: Mining Facebook

Objectives

Be able to identify Facebook primitives
Learn about Facebook’s Social Graph API and how to make API requests
Understand how the Open Graph protocol extends Facebook's Social Graph API

Be able to analyze likes from Facebook pages and friends

Facebook Primitives

Account Types: People & Pages
Mutual Connections
Likes
Shares
Comments
Extensive Privacy Controls

API Requests
Social Graph API requests
Not RESTful but easy to learn and use
Special "field expansion" syntax
Example: GET http://graph.facebook.com/ptwobrussell/?fields=id,name,friends.fields(likes.limit(10))

JSON responses
Traditional pagination
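
Here is a minimal offline sketch of both ideas: building a field expansion URL and walking paginated results by following each response's `paging` / `next` link. The helper names and the fake pages are assumptions for illustration; real requests also need an access token, which is left out.

```python
from urllib.parse import urlencode

def graph_url(node, fields):
    """Build a Graph API URL; safe='(),' leaves field expansion syntax unescaped."""
    query = urlencode({"fields": fields}, safe="(),")
    return "http://graph.facebook.com/{0}/?{1}".format(node, query)

def iterate_pages(first_page, fetch):
    """Yield every item under 'data', following 'paging' -> 'next' URLs."""
    page = first_page
    while True:
        for item in page.get("data", []):
            yield item
        next_url = page.get("paging", {}).get("next")
        if not next_url:
            break
        page = fetch(next_url)

# Hand-written stand-ins for API responses; "page-2" plays the role of a URL.
FIRST_PAGE = {
    "data": [{"name": "MiningTheSocialWeb"}, {"name": "OReilly"}],
    "paging": {"next": "page-2"},
}
FAKE_PAGES = {"page-2": {"data": [{"name": "CrossFit"}], "paging": {}}}

print(graph_url("ptwobrussell", "id,name,friends.fields(likes.limit(10))"))
print([like["name"] for like in iterate_pages(FIRST_PAGE, FAKE_PAGES.get)])
```

Passing `fetch` in as a parameter keeps the loop testable; in the notebook it would be whatever function performs the HTTP GET and decodes the JSON.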

Facebook is an Interest Graph
[Figure: graph diagram with nodes labeled Johnny Araya, Roberto, Mercedes, Rodolfo Hernández, Ana, Jorge, and Nina]

Facebook API Explorer

Go to https://developers.facebook.com/tools/explorer
Really, go there right now...

Retrieve Your Likes

Facebook Permissions

Facebook Permissions

Explore Facebook Pages
Names of pages
MiningTheSocialWeb
CrossFit
OReilly

Web URLs (OGP extensions to Facebook's Social Graph)
http://www.imdb.com/title/tt0117500

Social Media Analysis Framework

Recall the same four-step process to guide data science experiments:
Aspire
Acquire
Analyze
Summarize

Embedded Visualizations with IPython NB

Social Network Diagram with D3

Exercises
Copy/paste your access token from the Graph API Explorer into the "Chapter 2 (Mining Facebook)" notebook
Paste the value and execute the cell just before Example 2-1
Execute examples sequentially (try to at least make it to Example 2-10)
Analyze your likes, your friends' likes, and likes from pages of interest
If you have time...
Remaining examples

Module 4: Mining LinkedIn

Objectives
Learn about LinkedIn’s Developer Platform
Understand how clustering works
A fundamental type of machine learning

Be able to employ geocoding services to arrive at a set of coordinates from a textual reference to a location
Visualize geographic data with cartograms

LinkedIn Primitives
Account Types: People, Companies
The data seems "more closely held" than Facebook or Twitter
No FOAF visibility
Richest data source
Profile descriptions from mutual connections
A little messier than it first appears
Not necessarily a bad thing

API Requests

(Weirdly) RESTful Requests
Not really RESTful
Field selector syntax
http://api.linkedin.com/v1/people/~:(first-name,last-name,headline,picture-url)

XML responses
CSV address book download
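
A small sketch of the field selector syntax and of reading an XML response with the standard library. The helper names are hypothetical, and `SAMPLE_XML` is a hand-written placeholder shaped like a profile response (an assumption, not captured output); real requests also need OAuth credentials.

```python
import xml.etree.ElementTree as ET

def field_selector_url(fields, person="~"):
    """Build a people URL; '~' refers to the authenticated member."""
    return "http://api.linkedin.com/v1/people/{0}:({1})".format(
        person, ",".join(fields))

def parse_person(xml_text):
    """Turn a flat <person> element into a dict of tag -> text."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}

# Hand-written placeholder, not a real API response.
SAMPLE_XML = (
    "<person>"
    "<first-name>Jane</first-name>"
    "<last-name>Doe</last-name>"
    "<headline>Data Scientist</headline>"
    "</person>"
)

print(field_selector_url(["first-name", "last-name", "headline", "picture-url"]))
print(parse_person(SAMPLE_XML))
```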

Is LinkedIn an Interest Graph?
Fundamentally: yes. But not so much at the developer API level
Less trivial to find some of the "pivots"
No Skills API (yet)
But the data is there (mostly in profile descriptions) for your direct connections
Companies, job titles, job descriptions
Lots of richness is tucked away in human language data

Clustering

An unsupervised machine learning technique
Think: an algorithm that organizes the data into partitions

Example: Clustered Job Titles

3 Steps to Clustering Your Data
Normalization
Compare (similarity/distance measurement)
n-grams, edit distance, and Jaccard are common, but your imagination is the limit
Why can't you just compare everything to everything?
Dimensionality Reduction
Ideally, your clustering algorithm will mitigate the pain
k-means is among the most common clustering techniques in use

Jaccard Similarity
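
Jaccard similarity is the size of the intersection of two sets divided by the size of their union. The sketch below applies it to job titles after a simple normalization step, then groups titles with a naive greedy threshold scheme. The abbreviation table, the 0.5 threshold, the greedy grouping, and the sample titles are all assumptions for illustration, not necessarily what the notebook does.

```python
ABBREVIATIONS = {
    "sr.": "senior",
    "sr": "senior",
    "vp": "vice president",
}

def normalize(title):
    """Lowercase, drop commas, expand abbreviations, return a set of tokens."""
    tokens = []
    for word in title.lower().replace(",", " ").split():
        tokens.extend(ABBREVIATIONS.get(word, word).split())
    return frozenset(tokens)

def jaccard(a, b):
    """|a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def greedy_cluster(titles, threshold=0.5):
    """Put each title in the first cluster whose first member is similar enough."""
    clusters = []
    for title in titles:
        tokens = normalize(title)
        for cluster in clusters:
            if jaccard(tokens, normalize(cluster[0])) >= threshold:
                cluster.append(title)
                break
        else:
            clusters.append([title])
    return clusters

titles = [
    "Sr. Software Engineer",
    "Senior Software Engineer",
    "Software Engineer",
    "VP, Engineering",
    "Vice President Engineering",
]
for cluster in greedy_cluster(titles):
    print(cluster)
```

Comparing against only one representative per cluster avoids the all-pairs comparison, at the cost of results that depend on input order.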

k-Means Explained
1. Randomly pick k points in the data space as initial values that will be used to compute the k clusters: K1, K2, ..., Kk.
2. Assign each of the n points to a cluster by finding the nearest Ki, effectively creating k clusters and requiring k*n comparisons.
3. For each of the k clusters, calculate the centroid, or the mean of the cluster, and reassign its Ki value to be that value. (Hence, you're computing "k-means" during each iteration of the algorithm.)
4. Repeat steps 2–3 until the members of the clusters do not change between iterations. Generally speaking, relatively few iterations are required for convergence.
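
The four steps above can be sketched in plain Python as follows. This is an illustrative implementation, not the notebook's code: it picks the initial values by sampling from the data (a common variant of step 1), accepts optional explicit starting centroids, and assumes Python 3.8+ for `math.dist`. The sample points are made up.

```python
import math
import random

def kmeans(points, k, initial_centroids=None, seed=0, max_iterations=100):
    """Cluster points (tuples of numbers); return (centroids, assignments)."""
    # Step 1: pick k initial values (here: sampled from the data itself).
    if initial_centroids is None:
        centroids = random.Random(seed).sample(points, k)
    else:
        centroids = list(initial_centroids)

    assignments = None
    for _ in range(max_iterations):
        # Step 2: assign each point to its nearest centroid (k*n comparisons).
        new_assignments = [
            min(range(k), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Step 4: stop once cluster membership no longer changes.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(
                    sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignments

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]  # made-up data
centroids, assignments = kmeans(points, k=2)
print(centroids, assignments)
```

Results can depend on the starting values, so running with a few different seeds and comparing the clusters is a reasonable habit.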

k-Means: Initialize

k-Means: Step 1

k-Means: Step 2

k-Means: Step 3

k-Means: (Fast-Forward) Step 9

Geocoding
Transforming a location to a set of coordinates
Nashville, TN => (36.16783905029297, -86.77816009521484)
A harder problem than it first appears
The Bing API is especially generous
Requires an account sign up: http://bingmapsportal.com
Use the API key with the geopy package
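
A hedged sketch of wiring this together: `make_bing_geocoder` assumes a recent geopy release, where `geopy.geocoders.Bing` takes an `api_key` and `geocode()` returns a location object or None. `geocode_unique` is a hypothetical helper that looks up each distinct place only once, which matters when many contacts share a location. The demo at the bottom uses an offline stand-in, so it runs without geopy or a key.

```python
def make_bing_geocoder(api_key):
    """Return a function mapping a place name to (latitude, longitude) or None."""
    # Imported lazily so the rest of the sketch runs without geopy installed.
    from geopy.geocoders import Bing
    geocoder = Bing(api_key=api_key)

    def geocode(place):
        location = geocoder.geocode(place)
        if location is None:
            return None
        return (location.latitude, location.longitude)

    return geocode

def geocode_unique(places, geocode):
    """Geocode each distinct place once; return {place: coordinates or None}."""
    results = {}
    for place in places:
        key = place.strip()
        if key not in results:
            results[key] = geocode(key)
    return results

# Offline stand-in for the geocoder, using the coordinates from the slide.
FAKE_COORDINATES = {"Nashville, TN": (36.16783905029297, -86.77816009521484)}
locations = ["Nashville, TN", "Nashville, TN ", "Greater Atlantis Area"]
print(geocode_unique(locations, FAKE_COORDINATES.get))
```

With a real key, `geocode_unique(locations, make_bing_geocoder("YOUR_KEY"))` would perform the lookups over the network; unresolvable strings come back as None, which is part of why this is harder than it first appears.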

Cartograms

Unless you use a Dorling Cartogram

Social Media Analysis Framework

Remember: Use the same four-step process to guide data science experiments:
Aspire
Acquire
Analyze
Summarize

Exercises
Follow the instructions in the "Chapter 3 (Mining LinkedIn)" notebook to create an API connection and follow along with the first few examples
Download your connections as a CSV file from http://www.linkedin.com/people/export-settings and save them to your VM
A deviation from instructions in Example 3-6 is necessary for remote VMs
See http://bit.ly/mtsw-ch03-helper-code

Create a Bing Maps portal account and get your API key for Examples 3-8 and beyond
Try clustering your contacts in Example 3-12
Try Example 3-13 (visualizing data in Google Earth) at home...

Social Media Is All the Rage
World population: ~7B people
Facebook: 1.15B users
Twitter: 500M users
Google+: 343M users
LinkedIn: 238M users
~200M+ blogs (conservative estimate)

Module 5: Open Hack

Objectives

To work on "loose ends" or areas of interest from previous modules
To hack on code in notebooks not yet encountered
To set up the virtual machine on your own box if you haven't yet
To collaborate/talk and otherwise make the most of our togetherness

Social Media Analysis Framework

Remember:
Aspire
Acquire
Analyze
Summarize

Recommendations
Set up your own development environment if you haven't already
Appendix A
Text Mining & Natural Language Processing
Chapter 4 (Mining Google+) & Chapter 5 (Mining Web Pages)
Graph Mining
Chapter 7 (Mining GitHub)
Analyzing Semantic Markup
Chapter 8 (Mining the Semantically Marked-Up Web)

Final Q&A; Wrap Up

Free Stuff
http://MiningTheSocialWeb.com
Mining the Social Web 2E Chapter 1 (Chimera)
http://bit.ly/13XgNWR
Source Code (GitHub)
http://bit.ly/MiningTheSocialWeb2E
http://bit.ly/1fVf5ej (numbered examples)
Screencasts (Vimeo)
http://bit.ly/mtsw2e-screencasts

Mining Social Web APIs with IPython Notebook (Strata 2013)
