Mining the Social Web for Fun and Profit: A Getting Started Guide
Matthew A. Russell - @ptwobrussell - http://MiningTheSocialWeb.com
Front Range PyData Meetup - 21 May 2014
Overview
Intro (5 mins)
Virtual Machine Experience (10 mins)
Virtual Machine and IPython Notebook Demonstration (10 mins)
Mining Twitter: A Primer (20 mins)
Wrap Up/Final Q&A (10 mins)
Intro
Hello, My Name Is ... Matthew
Background in Computer Science
Data mining & machine learning
CTO @ Digital Reasoning Systems
Data mining; machine learning
Author @ O'Reilly Media
5 published books on technology
Principal @ Zaffra
Selective boutique consulting
Transforming Curiosity Into Insight
An open source software (OSS) project
http://bit.ly/MiningTheSocialWeb2E
A (rewritten) book
http://bit.ly/135dHfs
Accessible to (virtually) everyone
Virtual machine with turn-key coding
templates for data science experiments
Think of the book as "premium" support for the
OSS project
The Social Web Is All the Rage
World population: ~7B people
Facebook: 1.15B users
Twitter: 500M users
Google+: 343M users
LinkedIn: 238M users
~200M+ blogs (conservative estimate)
Table of Contents (1/2)
Chapter 1 - Mining Twitter: Exploring Trending Topics, Discovering What People Are Talking
About, and More
Chapter 2 - Mining Facebook: Analyzing Fan Pages, Examining Friendships, and More
Chapter 3 - Mining LinkedIn: Faceting Job Titles, Clustering Colleagues, and More
Chapter 4 - Mining Google+: Computing Document Similarity, Extracting Collocations, and
More
Chapter 5 - Mining Web Pages: Using Natural Language Processing to Understand Human
Language, Summarize Blog Posts, and More
Chapter 6 - Mining Mailboxes: Analyzing Who's Talking to Whom About What, How Often, and
More
Table of Contents (2/2)
Chapter 7 - Mining GitHub: Inspecting Software Collaboration Habits, Building Interest Graphs,
and More
Chapter 8 - Mining the Semantically Marked-Up Web: Extracting Microformats, Inferencing
over RDF, and More
Chapter 9 - Twitter Cookbook
Appendix A - Information About This Book's Virtual Machine Experience
Appendix B - OAuth Primer
Appendix C - Python and IPython Notebook Tips & Tricks
Anatomy of Each Chapter
Brief Intro
Objectives
API Primer
Analysis Technique(s)
Data Visualization
Recap
Suggested Exercises
Recommended Resources
The Virtual Machine Experience
Why do you need a VM?
To save time
Because installation and configuration management is harder than it first
appears
So that you can focus on the task at hand instead
So that I can support you regardless of your hardware and operating
system
Arguably, it's even a best practice for a dev environment
But I can do all of that myself...
True...
If you would rather troubleshoot unexpected installation/configuration issues
instead of immediately focusing on the real task at hand
At least give it a shot before resorting to your own devices so that you
don't have to install specific versions of ~40 Python packages
Including scientific computing tools that require underlying C/C++ code to
be compiled
Which requires specific versions of developer libraries to be installed
You get the idea...
The Virtual Machine Experience
Vagrant
A nice abstraction around virtual machine providers
One ring to rule them all
VirtualBox, VMware, AWS, ...
IPython Notebook
The easiest way to program with Python
A better REPL (interpreter)
Great for hacking
What happens when you vagrant up?
Vagrant follows the instructions in your Vagrantfile
Starts up a VirtualBox instance
Uses Chef to provision it
Installs OS patches/updates
Installs MTSW software dependencies
Starts IPython Notebook server on port 8888
Why Should I Use IPython Notebook?
Because it's great for hacking
And hacking is usually the first step
Because it's great for collaboration
Sharing/publishing results is trivial
Because the UX is as easy as working in a notepad
Think of it as "executable paper"
VM Quick Start Instructions
Go to http://MiningTheSocialWeb.com/quick-start/
Follow the instructions
And watch the screencasts!
Basically:
Install VirtualBox & Vagrant
Run "vagrant up" in a terminal to start a guest VM
Then, go to http://localhost:8888 on your host machine's web browser
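If the browser shows nothing at http://localhost:8888, it can help to check whether anything is listening on that port before digging into Vagrant itself. The following is a minimal sketch using only the Python standard library; the function name `notebook_port_open` is made up for illustration and is not part of the project.

```python
import socket


def notebook_port_open(host="localhost", port=8888, timeout=1.0):
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper: it only checks that the port is reachable,
    not that the listener is really an IPython Notebook server.
    """
    try:
        # create_connection raises OSError (refused, timed out, ...) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `notebook_port_open()` on the host machine after "vagrant up" finishes should return True once the guest's port 8888 is forwarded and the server is running.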
An (AWS) Hosted Virtual Machine
Is it free?
Perhaps...
...Sign up for the AWS free tier at http://aws.amazon.com/free/
But not right now. Do it later
See this blog post for some inspiration on how to easily build your own
AMI from Vagrant boxes
http://wp.me/p3QiJd-3T
Virtual Machine and IPython Notebook Demonstration
Demonstration of Virtual Machine
http://nbviewer.ipython.org
http://MiningTheSocialWeb.com/quick-start/
Your first "vagrant up"
Mining Twitter: A Primer
Objectives
Be able to identify Twitter primitives
Understand tweet metadata and how to use it
Learn how to extract entities such as user mentions, hashtags, and URLs
from tweets
Apply techniques for performing frequency analysis with Python
Be able to plot histograms of Twitter data with IPython Notebook
Twitter Primitives
Account Types: "Anything"
"Following" Relationships
Favorites
Retweets
Replies
(Almost) No Privacy Controls
API Requests
RESTful requests
Everything is a "resource"
You GET, PUT, POST, and DELETE resources
Standard HTTP "verbs"
Example: GET https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=SocialWebMining
Streaming API filters
JSON responses
Cursors (not quite pagination)
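To make the resource URL and the cursor idea concrete, here is a rough sketch in Python 3. It only builds the request URL and walks a cursored result set; it does not talk to Twitter. The helper names `resource_url` and `iterate_cursor`, and the `FAKE_PAGES` data, are invented for illustration. Real v1.1 requests also need OAuth credentials (see Appendix B).

```python
from urllib.parse import urlencode

API_ROOT = "https://api.twitter.com/1.1"


def resource_url(resource, **params):
    """Build the URL for a GET request against a named resource."""
    url = "{0}/{1}.json".format(API_ROOT, resource)
    query = urlencode(sorted(params.items()))
    return url + "?" + query if query else url


def iterate_cursor(fetch_page, key):
    """Yield items from a cursored resource, one page at a time.

    fetch_page is any callable that takes a cursor value and returns
    a dict shaped like a cursored response. Cursoring starts at -1
    and stops when the response reports a next_cursor of 0.
    """
    cursor = -1
    while cursor != 0:
        page = fetch_page(cursor)
        for item in page[key]:
            yield item
        cursor = page["next_cursor"]


# Stand-in for real responses (assumption: two pages of follower ids).
FAKE_PAGES = {
    -1: {"ids": [1, 2], "next_cursor": 10},
    10: {"ids": [3], "next_cursor": 0},
}

timeline_url = resource_url("statuses/user_timeline",
                            screen_name="SocialWebMining")
follower_ids = list(iterate_cursor(FAKE_PAGES.__getitem__, "ids"))
```

The difference from page-number pagination is that the client never computes the next position itself; it passes back whatever cursor the previous response returned.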
Twitter is an Interest Graph
[Figure: an interest graph; node labels include Roberto, Mercedes, Jorge, Ana, Nina, Johnny Araya, and Rodolfo Hernández]
What's in a Tweet?
140 Characters ...
... Plus ~5KB of metadata!
Authorship
Time & location
Tweet "entities"
Replying, retweeting, favoriting, etc.
What are Tweet Entities?
Essentially, the "easy to get at" data in the 140 characters
@usermentions
#hashtags
URLs
multiple variations
(financial) symbols
stock tickers
media
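Because entities arrive already parsed in the tweet's metadata, pulling them out is mostly dictionary access. The sketch below assumes a tweet shaped like a v1.1 JSON response that has been decoded into a Python dict; `sample_tweet` is made-up data and `extract_entities` is a hypothetical helper name.

```python
def extract_entities(tweet):
    """Collect the "easy to get at" entities from one decoded tweet."""
    entities = tweet.get("entities", {})
    return {
        "screen_names": [m["screen_name"]
                         for m in entities.get("user_mentions", [])],
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "urls": [u["expanded_url"] for u in entities.get("urls", [])],
        "symbols": [s["text"] for s in entities.get("symbols", [])],
        # "media" is only present when the tweet carries media
        "media": [m["media_url"] for m in entities.get("media", [])],
    }


# Invented example tweet, trimmed to the fields used above.
sample_tweet = {
    "text": "Counting things with @ptwobrussell #PyData http://t.co/abc $TWTR",
    "entities": {
        "user_mentions": [{"screen_name": "ptwobrussell"}],
        "hashtags": [{"text": "PyData"}],
        "urls": [{"url": "http://t.co/abc",
                  "expanded_url": "http://MiningTheSocialWeb.com"}],
        "symbols": [{"text": "TWTR"}],
    },
}

found = extract_entities(sample_tweet)
```

Using `.get` with an empty default keeps the helper from failing on tweets that lack one of the entity types.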
Data Mining Is...
Counting
Comparing
Filtering
Ranking
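All four operations fit in a few lines of standard-library Python. This is only an illustration on an invented list of hashtags, not data from the talk.

```python
from collections import Counter

# Invented sample: hashtags pulled from a handful of tweets.
hashtags = ["python", "pydata", "python", "ipython", "pydata", "python"]

# Counting: how often does each hashtag occur?
counts = Counter(hashtags)

# Comparing: how far ahead is one hashtag of another?
lead = counts["python"] - counts["pydata"]

# Filtering: keep only hashtags seen at least twice.
frequent = {tag: n for tag, n in counts.items() if n >= 2}

# Ranking: the two most common hashtags, most frequent first.
top_two = counts.most_common(2)
```

Here `top_two` is `[("python", 3), ("pydata", 2)]`, which is already enough to answer a simple "what are people talking about" question.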
Histograms
A chart that is handy for frequency analysis
They look like bar charts...except they're not bar charts
Each value on the x-axis is a range (or "bin") of values
Not categorical data
Each value on the y-axis is the combined frequency of values in each range
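The binning step is what separates a histogram from a bar chart, so it is worth seeing on its own. The sketch below bins an invented list of retweet counts with plain Python; in a notebook you would more likely hand the raw values to a plotting call such as matplotlib's `hist`. The function name `histogram` and the sample numbers are assumptions for illustration.

```python
BIN_WIDTH = 10


def histogram(values, bin_width):
    """Return sorted (bin_start, frequency) pairs for fixed-width bins."""
    bins = {}
    for value in values:
        # Every value in [start, start + bin_width) lands in the same bin.
        start = (value // bin_width) * bin_width
        bins[start] = bins.get(start, 0) + 1
    return sorted(bins.items())


# Invented sample: retweet counts for ten tweets.
retweet_counts = [0, 1, 3, 7, 12, 14, 15, 29, 30, 31]
binned = histogram(retweet_counts, BIN_WIDTH)

# Text rendering: x-axis is the range, bar length is the frequency.
for start, freq in binned:
    print("{0:>3}-{1:<3} {2}".format(start, start + BIN_WIDTH - 1, "#" * freq))
```

Each row covers a range of retweet counts rather than a single category, and the bar length is the number of tweets that fell inside that range.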
Example: Histogram of Retweets
Social Media Analysis Framework
A memorable four step process to guide data science experiments:
Aspire
To test a hypothesis (answer a question)
Acquire
Get the data
Analyze
Count things
Summarize
Plot the results
Recommended Exercises
Review Python idioms in the "Appendix C (Python Tips & Tricks)" notebook
Follow the setup instructions in the "Chapter 1 (Mining Twitter)" notebook
Fill in Example 1-1 with credentials and begin work
Execute each example sequentially
Customize queries
Explore tweet metadata; count tweet entities; plot histograms of results
Explore the "Chapter 9 (Twitter Cookbook)" notebook
Think of it as a collection of building blocks
Final Q&A / Wrap Up
Recommended Resources
http://MiningTheSocialWeb.com
Mining the Social Web 2E Chapter 1 (Chimera)
http://bit.ly/13XgNWR
Source Code (GitHub)
http://bit.ly/MiningTheSocialWeb2E
http://bit.ly/1fVf5ej (numbered examples)
Screencasts (Vimeo)
http://bit.ly/mtsw2e-screencasts

Mining the Social Web for Fun and Profit: A Getting Started Guide

  • 1. Mining the Social Web for Fun and Profit: A Getting Started Guide. Matthew A. Russell (@ptwobrussell), http://MiningTheSocialWeb.com ; Front Range PyData Meetup, 21 May 2014
  • 2. Overview: Intro (5 mins); Virtual Machine Experience (10 mins); Virtual Machine and IPython Notebook Demonstration (10 mins); Mining Twitter: A Primer (20 mins); Wrap Up/Final Q&A (10 mins)
  • 4. Hello, My Name Is ... Matthew. Background in Computer Science (data mining & machine learning); CTO @ Digital Reasoning Systems (data mining; machine learning); Author @ O'Reilly Media (5 published books on technology); Principal @ Zaffra (selective boutique consulting)
  • 5. Transforming Curiosity Into Insight. An open source software (OSS) project: http://bit.ly/MiningTheSocialWeb2E ; a (rewritten) book: http://bit.ly/135dHfs ; accessible to (virtually) everyone: a virtual machine with turn-key coding templates for data science experiments. Think of the book as "premium" support for the OSS project
  • 6. The Social Web Is All the Rage. World population: ~7B people; Facebook: 1.15B users; Twitter: 500M users; Google+: 343M users; LinkedIn: 238M users; ~200M+ blogs (conservative estimate)
  • 7. Table of Contents (1/2). Chapter 1 - Mining Twitter: Exploring Trending Topics, Discovering What People Are Talking About, and More; Chapter 2 - Mining Facebook: Analyzing Fan Pages, Examining Friendships, and More; Chapter 3 - Mining LinkedIn: Faceting Job Titles, Clustering Colleagues, and More; Chapter 4 - Mining Google+: Computing Document Similarity, Extracting Collocations, and More; Chapter 5 - Mining Web Pages: Using Natural Language Processing to Understand Human Language, Summarize Blog Posts, and More; Chapter 6 - Mining Mailboxes: Analyzing Who's Talking to Whom About What, How Often, and More
  • 8. Table of Contents (2/2). Chapter 7 - Mining GitHub: Inspecting Software Collaboration Habits, Building Interest Graphs, and More; Chapter 8 - Mining the Semantically Marked-Up Web: Extracting Microformats, Inferencing over RDF, and More; Chapter 9 - Twitter Cookbook; Appendix A - Information About This Book's Virtual Machine Experience; Appendix B - OAuth Primer; Appendix C - Python and IPython Notebook Tips & Tricks
  • 9. Anatomy of Each Chapter: Brief Intro; Objectives; API Primer; Analysis Technique(s); Data Visualization; Recap; Suggested Exercises; Recommended Resources
  • 10. The Virtual Machine Experience
  • 11. Why do you need a VM? To save time; because installation and configuration management is harder than it first appears; so that you can focus on the task at hand instead; so that I can support you regardless of your hardware and operating system. Arguably, it's even a best practice for a dev environment
  • 12. But I can do all of that myself... True, if you would rather troubleshoot unexpected installation/configuration issues instead of immediately focusing on the real task at hand. At least give it a shot before resorting to your own devices, so that you don't have to install specific versions of ~40 Python packages, including scientific computing tools that require underlying C/C++ code to be compiled, which requires specific versions of developer libraries to be installed. You get the idea...
  • 13. The Virtual Machine Experience. Vagrant: a nice abstraction around virtual machine providers (VirtualBox, VMware, AWS, ...); one ring to rule them all. IPython Notebook: the easiest way to program with Python; a better REPL (interpreter); great for hacking
  • 14. What happens when you vagrant up? Vagrant follows the instructions in your Vagrantfile; starts up a VirtualBox instance; uses Chef to provision it; installs OS patches/updates; installs MTSW software dependencies; starts the IPython Notebook server on port 8888
  • 15. Why Should I Use IPython Notebook? Because it's great for hacking, and hacking is usually the first step. Because it's great for collaboration: sharing/publishing results is trivial. Because the UX is as easy as working in a notepad: think of it as "executable paper"
  • 18. VM Quick Start Instructions. Go to http://MiningTheSocialWeb.com/quick-start/ , follow the instructions, and watch the screencasts! Basically: install VirtualBox & Vagrant; run "vagrant up" in a terminal to start a guest VM; then go to http://localhost:8888 in your host machine's web browser
  • 19. An (AWS) Hosted Virtual Machine. Is it free? Perhaps... Sign up for the AWS free tier at http://aws.amazon.com/free/ (but not right now; do it later). See this blog post for some inspiration on how to easily build your own AMI from Vagrant boxes: http://wp.me/p3QiJd-3T
  • 20. Virtual Machine and IPython Notebook Demonstration
  • 21. Demonstration of Virtual Machine: http://nbviewer.ipython.org ; http://MiningTheSocialWeb.com/quick-start/ ; your first "vagrant up"
  • 22. Mining Twitter: A Primer
  • 23. Objectives: be able to identify Twitter primitives; understand tweet metadata and how to use it; learn how to extract entities such as user mentions, hashtags, and URLs from tweets; apply techniques for performing frequency analysis with Python; be able to plot histograms of Twitter data with IPython Notebook
  • 24. Twitter Primitives: accounts (types: "anything"); "following" relationships; favorites; retweets; replies; (almost) no privacy controls
  • 25. API Requests. RESTful requests: everything is a "resource"; you GET, PUT, POST, and DELETE resources (standard HTTP "verbs"). Example: GET https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=SocialWebMining ; Streaming API filters; JSON responses; cursors (not quite pagination)
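
The request pattern on this slide can be sketched in a few lines of Python 3. This is a minimal, offline sketch rather than the talk's own code: `build_url` and `iterate_cursor` are hypothetical helper names, `fetch_page` stands in for an authenticated HTTP call (OAuth is omitted entirely), and the cursor convention assumed here (start at -1, stop when `next_cursor` is 0) should be checked against the documentation of the endpoint you actually call.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.twitter.com/1.1"


def build_url(resource, **params):
    """Return the GET URL for a REST resource such as 'statuses/user_timeline'."""
    return f"{API_ROOT}/{resource}.json?{urlencode(params)}"


def iterate_cursor(fetch_page, key="ids"):
    """Yield the items from every page of a cursored response.

    fetch_page is a hypothetical callable: it takes a cursor value and
    returns one decoded JSON page as a dict.
    """
    cursor = -1  # assumed convention: -1 requests the first page
    while cursor != 0:  # assumed convention: 0 means there are no more pages
        page = fetch_page(cursor)
        yield from page[key]
        cursor = page["next_cursor"]


# Made-up pages standing in for real API responses.
fake_pages = {
    -1: {"ids": [1, 2], "next_cursor": 5},
    5: {"ids": [3], "next_cursor": 0},
}

timeline_url = build_url("statuses/user_timeline", screen_name="SocialWebMining")
all_ids = list(iterate_cursor(fake_pages.__getitem__))
print(timeline_url)
print(all_ids)
```

Passing the page fetcher in as a callable keeps the cursor loop testable without network access; in real use it would wrap whatever authenticated client you prefer.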
  • 26. Twitter is an Interest Graph (figure labels: Roberto, Mercedes, Jorge, Ana, Nina, Johnny Araya, Rodolfo Hernández)
  • 27. What's in a Tweet? 140 characters ... plus ~5KB of metadata: authorship; time & location; tweet "entities"; replying, retweeting, favoriting, etc.
  • 28. What are Tweet Entities? Essentially, the "easy to get at" data in the 140 characters: @usermentions; #hashtags; URLs (multiple variations); (financial) symbols (stock tickers); media
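
As a rough illustration of pulling these entities out of a tweet, the sketch below works on a hand-written dictionary shaped like a v1.1 tweet. The helper name `extract_entities` and the sample tweet are made up for this example, and the field names (`screen_name`, `text`, `expanded_url`, `media_url`) are assumptions about the JSON layout that should be confirmed against a real API response.

```python
def extract_entities(tweet):
    """Collect the entity values from one tweet dictionary.

    Missing entity types are treated as empty lists, since not every
    tweet carries every kind of entity.
    """
    entities = tweet.get("entities", {})
    return {
        "user_mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "urls": [u["expanded_url"] for u in entities.get("urls", [])],
        "symbols": [s["text"] for s in entities.get("symbols", [])],
        "media": [m["media_url"] for m in entities.get("media", [])],
    }


# Made-up tweet for illustration only; not real API output.
sample_tweet = {
    "text": "Mining the #SocialWeb with @ptwobrussell $TWTR http://t.co/abc",
    "entities": {
        "user_mentions": [{"screen_name": "ptwobrussell"}],
        "hashtags": [{"text": "SocialWeb"}],
        "urls": [{"url": "http://t.co/abc",
                  "expanded_url": "http://MiningTheSocialWeb.com"}],
        "symbols": [{"text": "TWTR"}],
    },
}

found = extract_entities(sample_tweet)
for kind, values in found.items():
    print(kind, values)
```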
  • 30. Histograms. A chart that is handy for frequency analysis. They look like bar charts...except they're not bar charts. Each value on the x-axis is a range (or "bin") of values, not categorical data. Each value on the y-axis is the combined frequency of values in each range
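
To make the binning idea concrete, here is a small sketch that groups integer values into fixed-width ranges and counts how many fall into each. The function name `histogram`, the bin width, and the retweet counts are all invented for illustration; a plotting library would normally do this binning for you.

```python
from collections import Counter


def histogram(values, bin_width):
    """Return [((low, high), count), ...] for integer values.

    Each bin covers low <= value < high. Empty bins between the lowest
    and highest occupied bins are kept, with a count of 0.
    """
    counts = Counter(v // bin_width for v in values)
    if not counts:
        return []
    first, last = min(counts), max(counts)
    return [
        ((i * bin_width, (i + 1) * bin_width), counts[i])
        for i in range(first, last + 1)
    ]


# Made-up retweet counts for a handful of tweets.
retweet_counts = [0, 1, 1, 3, 7, 12, 12, 23]
bins = histogram(retweet_counts, bin_width=5)

# Text rendering: one row per range, one '#' per value in that range.
for (low, high), count in bins:
    print(f"{low:>3}-{high:<3} {'#' * count}")
```

Keeping the empty bins is what separates this from a bar chart of categories: the x-axis stays a continuous scale, so gaps in the data remain visible.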
  • 32. Social Media Analysis Framework. A memorable four-step process to guide data science experiments: Aspire (to test a hypothesis, i.e. answer a question); Acquire (get the data); Analyze (count things); Summarize (plot the results)
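
The four steps can be walked through on toy data. In the sketch below the question, the statuses, and the variable names are all made up, and a plain-text table stands in for the plot so the example has no dependencies beyond the standard library.

```python
from collections import Counter

# Aspire: which hashtags appear most often in a set of tweets?

# Acquire: made-up statuses standing in for data fetched from the API.
statuses = [
    {"entities": {"hashtags": [{"text": "PyData"}, {"text": "Python"}]}},
    {"entities": {"hashtags": [{"text": "pydata"}]}},
    {"entities": {"hashtags": [{"text": "IPython"}, {"text": "PyData"},
                               {"text": "python"}]}},
]

# Analyze: count things (hashtags, compared case-insensitively).
hashtag_counts = Counter(
    hashtag["text"].lower()
    for status in statuses
    for hashtag in status["entities"]["hashtags"]
)

# Summarize: a text table in place of a plot.
for tag, count in hashtag_counts.most_common():
    print(f"{tag:<10} {count:>3} {'*' * count}")

top_two = hashtag_counts.most_common(2)
```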
  • 33. Recommended Exercises. Review Python idioms in the "Appendix C (Python Tips & Tricks)" notebook. Follow the setup instructions in the "Chapter 1 (Mining Twitter)" notebook: fill in Example 1-1 with credentials and begin work; execute each example sequentially; customize queries; explore tweet metadata, count tweet entities, and plot histograms of results. Explore the "Chapter 9 (Twitter Cookbook)" notebook; think of it as a collection of building blocks
  • 35. Recommended Resources: http://MiningTheSocialWeb.com ; Mining the Social Web 2E, Chapter 1 (Chimera): http://bit.ly/13XgNWR ; source code (GitHub): http://bit.ly/MiningTheSocialWeb2E and http://bit.ly/1fVf5ej (numbered examples); screencasts (Vimeo): http://bit.ly/mtsw2e-screencasts