All Things Cloud Developer Meetup.
Filtering From the Firehose: Real Time Social Media Streaming with Jim Moffitt from Gnip. Gnip is the world's largest and most trusted provider of social data.
Learn about collecting and filtering social media data with streaming APIs. Jim will cover best practices, use case examples and live demos of filtering data from Twitter.
Filtering From the Firehose: Real Time Social Media Streaming
1. Filtering from the Firehose
Real-time streaming of social network data
Jim Moffitt – Developer Advocate @gnip
@jimmoffitt
2. Who is this guy and what is he going to talk about?
• Introduction
• Social media firehoses
• Data sources
• Use-cases
• Needle in the haystack
• Filtering from the firehose
• Example use-case
• Server-side
• Apache Kafka
• Apache Cassandra
• Client-side
• HTTP streaming code examples
• Live streaming and search
3. What is a firehose?
• Continuous stream of flexibly structured (JSON) social media activities in near-real time.
• Potentially extreme amounts of data.
5. Accessing Social Data for Analytics
• Crawling/Scraping
  Pros: It's free!
  Cons: TOS issues, high latency, fragile
• Public APIs
  Pros: Open access!
  Cons: Rate limits, not guaranteed
• Licensed Access: Publisher provides data "firehose"
  Pros: No rate limits, compliant, reliable
  Cons: Financial investment, not all publishers are covered
6. Example firehose volumes
Publisher             Daily Activity
Twitter               450 M
Tumblr                96 M + 54 M votes
Foursquare            4.3 M
Disqus                1.9 M
WordPress Comments    1.4 M
WordPress Posts       0.6 M
GetGlue               0.6 M
7. Daily Tweet Activity Count
[Figure: Tweets/Day versus Date (Jan through Dec), one panel per year for 2006 to 2011. The y-axis scale tops out at 5 k in 2006, 200 k in 2007, 1.6 M in 2008, 25 M in 2009, 80 M in 2010, and 250 M in 2011.]
8. Use-cases for Social Media Analysis
• Sales & Marketing
• Brand monitoring
• Customer Service
• Public Relations
• Emergency Response
• All kinds of academic research…
9. So you are building something around social media?
Some business considerations:
• Objective – what are the questions that you are trying to answer?
• Timeframe – real-time or historical use-case (or both)?
• Coverage – do I need all the data or some statistical sample?
• Licensing and Terms of Service
• Budgets
  • Data costs.
  • Software development.
  • Infrastructure (bandwidth, servers, storage).
10. So you are building something around social media?
Some technical considerations:
• Data transfer protocols: RESTful or 'keep-alive' Streaming?
• What software language?
• Bandwidth: what does your peak volume need to be?
• Data storage
  • How and where are you storing the data?
  • What metadata do you need to store?*
• Redundant streams?
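The bandwidth question can be roughed out with simple arithmetic. The sketch below starts from the 450 M daily Twitter activities listed in the firehose volumes slide; the average payload size (2,500 bytes) and the peak-to-average factor (3x) are illustrative assumptions, not figures from the talk.

```python
SECONDS_PER_DAY = 86_400


def required_mbps(activities_per_day, avg_bytes, peak_factor=1.0):
    """Estimate uncompressed bandwidth in megabits per second.

    avg_bytes and peak_factor are assumptions the caller must supply;
    measure them on your own stream before sizing anything.
    """
    per_second = activities_per_day / SECONDS_PER_DAY
    return per_second * avg_bytes * 8 * peak_factor / 1_000_000


# 450 M activities/day is from the volumes slide; the rest is assumed.
average = required_mbps(450_000_000, avg_bytes=2_500)
peak = required_mbps(450_000_000, avg_bytes=2_500, peak_factor=3.0)

print(f"average: {average:.1f} Mbps, assumed 3x peak: {peak:.1f} Mbps")
```

These are uncompressed numbers; a gzip-compressed stream would need less, by an amount that depends on the data.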
11. What data comes with a tweet?
{"id":"tag:search.twitter.com,2005:388326436685103105",
"objectType":"activity",
"actor":{"objectType":"person","id":"id:twitter.com:17200003","link":"http://www.twitter.com/jimmoffitt","displayName":"jimmoffitt","postedTime":"2008-11-05T23:06:37.000Z","image":"https://si0.twimg.com/profile_images/3678478654/6aac91cc6bd5711b82c83ebab0a55de0_normal.jpeg","summary":"Once studied snow hydrology. Recently developed real-time weather monitoring and flood warning software. Have started a new adventure at an amazing company...","links":[{"href":null,"rel":"me"}],"friendsCount":69,"followersCount":71,"listedCount":1,"statusesCount":189,"twitterTimeZone":"Mountain Time (US & Canada)","verified":false,"utcOffset":"-21600","preferredUsername":"jimmoffitt","languages":["en"],"location":{"objectType":"place","displayName":"Longmont, Colorado"},"favoritesCount":17},
"verb":"post",
"postedTime":"2013-10-10T15:33:31.000Z",
"generator":{"displayName":"TweetDeck","link":"http://www.tweetdeck.com"},
"provider":{"objectType":"service","displayName":"Twitter","link":"http://www.twitter.com"},
"link":"http://twitter.com/jimmoffitt/statuses/388326436685103105",
"body":"Looking forward to this \"All Things Cloud\" meet-up in Denver next Tuesday 10/15 http://t.co/EQSCWMW4hL @gnip",
"object":{"objectType":"note","id":"object:search.twitter.com,2005:388326436685103105","summary":"Looking forward to this \"All Things Cloud\" meet-up in Denver next Tuesday 10/15 http://t.co/EQSCWMW4hL @gnip","link":"http://twitter.com/jimmoffitt/statuses/388326436685103105","postedTime":"2013-10-10T15:33:31.000Z"},
"favoritesCount":0,
"twitter_entities":{"hashtags":[],"symbols":[],"urls":[{"url":"http://t.co/EQSCWMW4hL","expanded_url":"http://meetu.ps/1Fywpg","display_url":"meetu.ps/1Fywpg","indices":[80,102]}],"user_mentions":[{"screen_name":"gnip","name":"Gnip, Inc.","id":16958875,"id_str":"16958875","indices":[103,108]}]},
"twitter_filter_level":"medium",
"twitter_lang":"en",
"retweetCount":0,
"gnip":{"matching_rules":[{"value":"\"All Things Cloud\"","tag":null},{"value":"from:jimmoffitt","tag":null}],"urls":[{"url":"http://t.co/EQSCWMW4hL","expanded_url":"http://www.meetup.com/All-things-Cloud-PaaS-SaaS-PaaS-XaaS/events/124584092/"}],"klout_score":49,"klout_profile":{"topics":[{"klout_topic_id":"10000000000000000020","displayName":"Tablets","link":"http://klout.com/topic/id/10000000000000000020"}],"klout_user_id":"26177177599171892","link":"http://klout.com/user/id/26177177599171892"},"language":{"value":"en"},"profileLocations":[{"objectType":"place","geo":{"type":"point","coordinates":[-105.10193,40.16721]},"address":{"country":"United States","countryCode":"US","locality":"Longmont","region":"Colorado"},"displayName":"Longmont, Colorado, United States"}]}}
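Most applications keep only a handful of these fields. As a minimal sketch, the Python below parses a trimmed copy of the activity above and pulls out a few of them. The field names come from the sample payload; the `summarize` helper and the choice of fields are illustrative, not part of any Gnip library.

```python
import json
from datetime import datetime, timezone

# Trimmed copy of the sample activity (only the keys used below).
SAMPLE = r'''
{
  "id": "tag:search.twitter.com,2005:388326436685103105",
  "actor": {"preferredUsername": "jimmoffitt", "followersCount": 71},
  "postedTime": "2013-10-10T15:33:31.000Z",
  "body": "Looking forward to this \"All Things Cloud\" meet-up in Denver next Tuesday 10/15 http://t.co/EQSCWMW4hL @gnip",
  "gnip": {
    "matching_rules": [{"value": "\"All Things Cloud\"", "tag": null},
                       {"value": "from:jimmoffitt", "tag": null}],
    "profileLocations": [{"geo": {"type": "point",
                                  "coordinates": [-105.10193, 40.16721]},
                          "displayName": "Longmont, Colorado, United States"}]
  }
}
'''


def summarize(activity):
    """Reduce one activity dict to the fields a simple app might store."""
    gnip = activity.get("gnip", {})
    places = gnip.get("profileLocations", [])
    # In the sample, coordinates are ordered [longitude, latitude].
    lon, lat = places[0]["geo"]["coordinates"] if places else (None, None)
    posted = datetime.strptime(activity["postedTime"], "%Y-%m-%dT%H:%M:%S.%fZ")
    return {
        "tweet_id": activity["id"].rsplit(":", 1)[-1],
        "user": activity["actor"]["preferredUsername"],
        "posted": posted.replace(tzinfo=timezone.utc),
        "text": activity["body"],
        "rules": [rule["value"] for rule in gnip.get("matching_rules", [])],
        "profile_lon_lat": (lon, lat),
    }


summary = summarize(json.loads(SAMPLE))
print(summary["user"], summary["posted"].isoformat(), summary["rules"])
```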
12. Methods for filtering data
• Token filter (e.g. "pizza", "beer")
• Substrings (contains:sport)
• Exact phrases ("all things cloud")
• Operators: metadata (geo, language, profiles, account stats, ...)
• Operators: sampling (e.g. sample:10%)
• Publisher-specific Operators: hashtags, user mentions/from/to, retweets, ...
Examples:
(pizza beer)
"all things cloud" profile_region:colorado
twins (baseball OR minnesota OR sports OR "small market") -(cute OR baby OR olsen OR olson)
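To make the difference between token, substring, and phrase matching concrete, here is a toy matcher in Python. It is a sketch of the semantics only, not Gnip's rule engine: the `matches` function, its `all_of`/`any_of`/`none_of` parameters, and the tokenizer are all hypothetical, and the metadata and sampling operators are left out.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9_']+", text.lower())


def has_phrase(tokens, phrase):
    """True if the phrase's tokens appear consecutively in tokens."""
    want = tokenize(phrase)
    n = len(want)
    return n > 0 and any(
        tokens[i:i + n] == want for i in range(len(tokens) - n + 1)
    )


def term_matches(term, text, tokens):
    """Match one term: contains:x (substring), "a b" (phrase), or a token."""
    if term.startswith("contains:"):
        return term[len("contains:"):].lower() in text.lower()
    if term.startswith('"') and term.endswith('"'):
        return has_phrase(tokens, term.strip('"'))
    return term.lower() in tokens


def matches(text, all_of=(), any_of=(), none_of=()):
    """AND over all_of, OR over any_of (if given), NOT over none_of."""
    tokens = tokenize(text)

    def hit(term):
        return term_matches(term, text, tokens)

    return (
        all(hit(t) for t in all_of)
        and (not any_of or any(hit(t) for t in any_of))
        and not any(hit(t) for t in none_of)
    )


# The "twins" example from the slide, expressed with the toy matcher.
twins_rule = dict(
    all_of=["twins"],
    any_of=["baseball", "minnesota", "sports", '"small market"'],
    none_of=["cute", "baby", "olsen", "olson"],
)

print(matches("The Twins are a small market team", **twins_rule))
print(matches("Cute twins at the baseball game", **twins_rule))
```

The token filter `sport` does not match "sportsmanship" here, while `contains:sport` does, which is the distinction the first two bullets draw.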
13. Example use-case: Early-warning systems
Is there a Twitter 'signal' around local rain and flood events?
Business logic:
rain OR raining OR rained OR pouring OR weather OR hail OR lightning OR contains:flood OR "cats and dogs" OR wxreport OR contains:storm OR contains:precip
See http://blog.gnip.com/tweeting-in-the-rain Parts 1, 2 & 3
14. Social media and early-warning systems
There are generally three methods for geo-referencing Twitter data:
• Activity Location: tweets that are geo-tagged.
• Mentioned Location: parsing the tweet message for geographic location.
• Profile Location: parsing the Twitter Account Profile location provided by the user.

• User account profile: 82%
• Tweet text: 17%
• Tweet geo-tagging: 1%
See http://blog.gnip.com/tweeting-in-the-rain Parts 1, 2 & 3
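The three geo-referencing methods can be combined as a fallback chain. The Python sketch below is one possible ordering, chosen here for illustration (most precise source first), not a method described in the talk. The `gnip.profileLocations` and `body` keys appear in the sample activity; the top-level `geo` key for geo-tagged tweets and the tiny `gazetteer` lookup table are assumptions.

```python
def locate(activity, gazetteer):
    """Return (method, coordinates) for an activity, or (None, None).

    gazetteer is a hypothetical {place_name: coordinates} lookup used
    to spot locations mentioned in the tweet text.
    """
    # 1. Activity location: the tweet itself is geo-tagged (assumed key).
    geo = activity.get("geo")
    if geo and geo.get("coordinates"):
        return "activity", geo["coordinates"]

    # 2. Mentioned location: a known place name appears in the text.
    text = activity.get("body", "").lower()
    for name, coordinates in gazetteer.items():
        if name.lower() in text:
            return "mentioned", coordinates

    # 3. Profile location: derived from the user's profile.
    profiles = activity.get("gnip", {}).get("profileLocations", [])
    if profiles:
        return "profile", profiles[0]["geo"]["coordinates"]

    return None, None


sample = {
    "body": "Looking forward to this meet-up in Denver next Tuesday",
    "gnip": {"profileLocations": [
        {"geo": {"type": "point", "coordinates": [-105.10193, 40.16721]}}
    ]},
}

print(locate(sample, {"denver": (39.7392, -104.9903)}))
print(locate(sample, {}))
```

Coordinates are returned exactly as found, so callers should confirm the axis order of each source before mixing them.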
15. Social media and early-warning systems
• Profile Location (old):
• bio_location_contains:louisville -(bio_location_contains:"co " OR bio_location_contains:colorado) -(bio_location_contains:"tn " OR bio_location_contains:tennessee)
• Profile Location (new):
• profile_locality:louisville profile_region:kentucky
See http://blog.gnip.com/tweeting-in-the-rain Parts 1, 2 & 3
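One way to see why the structured operators are an improvement is to model both rules in a few lines of Python. This is a sketch under two assumptions: that `bio_location_contains` behaves like a plain case-insensitive substring test on the free-text profile location, and that `profile_locality` / `profile_region` compare against the `locality` and `region` fields seen under `profileLocations[].address` in the sample activity. Neither function is Gnip's implementation.

```python
def old_rule(bio_location):
    """Substring logic modeled on the 'old' rule on the slide."""
    loc = bio_location.lower()
    return (
        "louisville" in loc
        and not ("co " in loc or "colorado" in loc)
        and not ("tn " in loc or "tennessee" in loc)
    )


def new_rule(address):
    """Structured match modeled on profile_locality / profile_region."""
    return (
        address.get("locality", "").lower() == "louisville"
        and address.get("region", "").lower() == "kentucky"
    )


# "Louisville, CO" has no trailing space after "co", so the exclusion
# in the substring rule misses it and the profile is wrongly kept.
print(old_rule("Louisville, CO"))   # kept by the old rule
print(new_rule({"locality": "Louisville", "region": "Colorado"}))
print(new_rule({"locality": "Louisville", "region": "Kentucky"}))
```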
16. Social media and early-warning systems
See http://blog.gnip.com/tweeting-in-the-rain Parts 1, 2 & 3
17. Social media and early-warning systems
See http://blog.gnip.com/tweeting-in-the-rain Parts 1, 2 & 3
18. Apache Kafka @ Gnip
Kafka is used to help manage streaming traffic with the outside world.
First application was with outbound streams: Gnip → Customer
Helps provide an "on-disk" buffer for client streams.
Write data to disk for a short period.
If a client disconnects, their data buffer is "backfilled" when they reconnect.
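The buffer-and-backfill idea can be modeled in a few lines. The Python sketch below is a toy, in-memory stand-in for what Kafka provides on disk (a retained log that readers resume from by offset); it does not use Kafka, and the `BackfillBuffer` class and the five-minute retention are assumptions made for illustration.

```python
import time
from collections import deque


class BackfillBuffer:
    """Keep recent messages so a reconnecting client can catch up."""

    def __init__(self, retention_seconds):
        self.retention = retention_seconds
        self._items = deque()      # (offset, timestamp, payload)
        self._next_offset = 0

    def _expire(self, now):
        # Drop messages older than the retention window.
        while self._items and now - self._items[0][1] > self.retention:
            self._items.popleft()

    def append(self, payload, now=None):
        """Store a message and return the offset assigned to it."""
        now = time.time() if now is None else now
        offset = self._next_offset
        self._items.append((offset, now, payload))
        self._next_offset += 1
        self._expire(now)
        return offset

    def read_from(self, offset, now=None):
        """Return retained (offset, payload) pairs at or after offset."""
        now = time.time() if now is None else now
        self._expire(now)
        return [(o, p) for o, _, p in self._items if o >= offset]


buffer = BackfillBuffer(retention_seconds=300)
for second in range(5):
    buffer.append(f"tweet-{second}", now=second)

# The client saw offsets 0-2, dropped, and reconnects at t=10 s.
print(buffer.read_from(3, now=10))
```

A client that stays away longer than the retention window loses the expired messages, which matches the "short period" caveat on the slide.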
19. Apache Kafka @ Gnip
Next applied to inbound Publisher streams: Publisher → Gnip
Buffers incoming data and helps manage massive volume spikes.
Spikes are isolated to this ingest tier.
Downstream applications read data as fast as they can.
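A quick simulation shows how a buffer absorbs a spike while consumers drain at a fixed rate. The arrival and drain rates below are made-up numbers for illustration, and `backlog_over_time` is a hypothetical helper, not anything from Gnip's stack.

```python
def backlog_over_time(arrivals_per_sec, drain_per_sec):
    """Track how many messages sit in the buffer after each second."""
    backlog = 0
    history = []
    for arrivals in arrivals_per_sec:
        # The buffer grows by what arrives and shrinks by what is read,
        # but it can never go below empty.
        backlog = max(0, backlog + arrivals - drain_per_sec)
        history.append(backlog)
    return history


# Two seconds of a 4x spike in an otherwise steady stream (made-up data).
arrivals = [5_000, 5_000, 20_000, 20_000, 5_000, 5_000, 5_000, 5_000]
history = backlog_over_time(arrivals, drain_per_sec=8_000)
print(history)
```

The backlog peaks during the spike and then shrinks as the consumer catches up, so the downstream reader never sees more than its own drain rate.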
20. Apache Cassandra @ Gnip
Serves a moving window of Twitter data (currently 30 days). Will grow.
Chosen for its
• Write-speeds
• Reliability
• Redundancy
• Scalability
21. Apache Cassandra @ Gnip
• Serves a variety of data services, products and use-cases.
• For Search we have an Apache Lucene index helping to quickly find Cassandra data.
• Nearly 50 Cassandra servers across test/staging/production environments.
22. Streaming social media
# List the rules currently on the stream
curl -ujmoffitt@gnipcentral.com https://api.gnip.com:443/accounts/jim/publishers/twitter/streams/track/dev/rules.json
# Add a rule
curl -v -X POST -ujmoffitt@gnipcentral.com "https://api.gnip.com:443/accounts/jim/publishers/twitter/streams/track/dev/rules.json" -d '{"rules":[{"tag":"demo","value":"weather OR rain OR snow"}]}'
# Connect to the stream, with gzip compression
curl --compressed -v -ujmoffitt@gnipcentral.com "https://stream.gnip.com:443/accounts/jim/publishers/twitter/streams/track/dev.json"
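The same two steps (post a rule, then hold a streaming connection open) can be sketched in Python with only the standard library. This assumes HTTP Basic authentication, as the `-u` flag implies, and that the stream delivers one JSON activity per line with blank lines as keep-alives; both the line format and the helper names are assumptions, so treat this as a starting point rather than a reference client.

```python
import base64
import gzip
import json
import urllib.request


def basic_auth_header(user, password):
    """Build the header curl sends for -u user:password."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8"))
    return {"Authorization": "Basic " + token.decode("ascii")}


def build_rules_request(rules_url, user, password, rules):
    """Prepare the POST that adds rules, mirroring the curl -d call."""
    body = json.dumps({"rules": rules}).encode("utf-8")
    headers = basic_auth_header(user, password)
    headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        rules_url, data=body, headers=headers, method="POST"
    )


def iter_activities(lines):
    """Yield one dict per non-blank line of newline-delimited JSON."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumed keep-alive: a blank line between activities
        yield json.loads(line)


def stream_activities(stream_url, user, password):
    """Open the long-lived connection and yield activities as they arrive."""
    headers = basic_auth_header(user, password)
    headers["Accept-Encoding"] = "gzip"  # same intent as curl --compressed
    request = urllib.request.Request(stream_url, headers=headers)
    with urllib.request.urlopen(request) as response:
        if response.headers.get("Content-Encoding") == "gzip":
            response = gzip.GzipFile(fileobj=response)
        yield from iter_activities(response)
```

Usage would look like `for activity in stream_activities(url, user, password): ...`, with the URL and credentials taken from your own account. A production client would also need reconnect logic with back-off, which is omitted here.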
23. Code examples
Search GitHub for "Twitter Stream": "We've found 793 repository results"
Python Streaming Connection: HERE
Ruby Streaming Connection (using 'curb' libcurl gem): HERE
Ruby Streaming Connection (using EventMachine gem): HERE