Filtering from the Firehose
Real-time streaming of social network data

Jim Moffitt – Developer Advocate @gnip
@jimmoffitt
Who is this guy and what is he going to talk about?
•  Introduction
•  Social media firehoses
•  Data sources
•  Use-cases
•  Needle in the haystack
•  Filtering from the firehose
•  Example use-case
•  Server-side
•  Apache Kafka
•  Apache Cassandra
•  Client-side
•  HTTP streaming code examples
•  Live streaming and search
What is a firehose?

•  Continuous stream of flexibly structured (JSON) social media activities in near-real time.
•  Potentially extreme amounts of data.
Available firehoses and public APIs
Accessing Social Data for Analytics:

Crawling/Scraping
•  Pros: It's free
•  Cons: TOS issues, high latency, fragile

Public APIs
•  Pros: Open access
•  Cons: Rate limits, not guaranteed

Licensed Access: Publisher provides data "firehose"
•  Pros: No rate limits, compliant, reliable
•  Cons: Financial investment, not all publishers are covered
Example firehose volumes
Publisher             Daily Activity
Twitter               450 M
Tumblr                96 M + 54 M votes
Foursquare            4.3 M
Disqus                1.9 M
Wordpress Comments    1.4 M
Wordpress Posts       0.6 M
GetGlue               0.6 M
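To put the daily totals in perspective, a rough conversion to average per-second rates can help with capacity planning. The sketch below assumes the figures in the table are activities per day and simply divides by the number of seconds in a day. Real traffic is bursty, so peaks will sit well above these averages. The names `DAILY_ACTIVITY` and `average_per_second` are illustrative only.

```python
# Daily activity counts copied from the table above (activities per day).
DAILY_ACTIVITY = {
    "Twitter": 450_000_000,
    "Tumblr": 96_000_000,          # the additional 54 M votes are left out here
    "Foursquare": 4_300_000,
    "Disqus": 1_900_000,
    "Wordpress Comments": 1_400_000,
    "Wordpress Posts": 600_000,
    "GetGlue": 600_000,
}

SECONDS_PER_DAY = 24 * 60 * 60


def average_per_second(daily_count):
    """Average rate, assuming activity is spread evenly over the day."""
    return daily_count / SECONDS_PER_DAY


for publisher, daily in DAILY_ACTIVITY.items():
    rate = average_per_second(daily)
    print(f"{publisher:<20}{rate:>10,.1f} activities/s")
```

For Twitter this works out to roughly 5,200 activities per second on average, which is the baseline a full-firehose consumer would need to sustain before accounting for spikes.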
  
Daily Tweet Activity Count
[Figure: Tweets/Day plotted against Date (Jan through Dec), one panel per year from 2006 to 2011. The y-axis scale tops out at 5 k (2006), 200 k (2007), 1.6 M (2008), 25 M (2009), 80 M (2010) and 250 M (2011).]
Use-cases for Social Media Analysis
•  Sales & Marketing
•  Brand monitoring
•  Customer Service
•  Public Relations
•  Emergency Response
•  All kinds of academic research…
So you are building something around social media?
Some business considerations:

•  Objective: what are the questions that you are trying to answer?
•  Timeframe: real-time or historical use-case (or both)?
•  Coverage: do I need all the data or some statistical sample?
•  Licensing and Terms of Service
•  Budgets
   •  Data costs.
   •  Software development.
   •  Infrastructure (bandwidth, servers, storage).
So you are building something around social media?
Some technical considerations:

•  Data transfer protocols: RESTful or 'keep-alive' Streaming?
•  What software language?
•  Bandwidth: what does your peak volume need to be?
•  Data storage
   •  How and where are you storing the data?
   •  What metadata do you need to store?*
•  Redundant streams?
What data comes with a tweet?
{
  "id": "tag:search.twitter.com,2005:388326436685103105",
  "objectType": "activity",
  "actor": {
    "objectType": "person",
    "id": "id:twitter.com:17200003",
    "link": "http://www.twitter.com/jimmoffitt",
    "displayName": "jimmoffitt",
    "postedTime": "2008-11-05T23:06:37.000Z",
    "image": "https://si0.twimg.com/profile_images/3678478654/6aac91cc6bd5711b82c83ebab0a55de0_normal.jpeg",
    "summary": "Once studied snow hydrology.  Recently developed real-time weather monitoring and flood warning software.  Have started a new adventure at an amazing company...",
    "links": [{"href": null, "rel": "me"}],
    "friendsCount": 69,
    "followersCount": 71,
    "listedCount": 1,
    "statusesCount": 189,
    "twitterTimeZone": "Mountain Time (US & Canada)",
    "verified": false,
    "utcOffset": "-21600",
    "preferredUsername": "jimmoffitt",
    "languages": ["en"],
    "location": {"objectType": "place", "displayName": "Longmont, Colorado"},
    "favoritesCount": 17
  },
  "verb": "post",
  "postedTime": "2013-10-10T15:33:31.000Z",
  "generator": {"displayName": "TweetDeck", "link": "http://www.tweetdeck.com"},
  "provider": {"objectType": "service", "displayName": "Twitter", "link": "http://www.twitter.com"},
  "link": "http://twitter.com/jimmoffitt/statuses/388326436685103105",
  "body": "Looking forward to this \"All Things Cloud\" meet-up in Denver next Tuesday 10/15 http://t.co/EQSCWMW4hL @gnip",
  "object": {
    "objectType": "note",
    "id": "object:search.twitter.com,2005:388326436685103105",
    "summary": "Looking forward to this \"All Things Cloud\" meet-up in Denver next Tuesday 10/15 http://t.co/EQSCWMW4hL @gnip",
    "link": "http://twitter.com/jimmoffitt/statuses/388326436685103105",
    "postedTime": "2013-10-10T15:33:31.000Z"
  },
  "favoritesCount": 0,
  "twitter_entities": {
    "hashtags": [],
    "symbols": [],
    "urls": [{"url": "http://t.co/EQSCWMW4hL", "expanded_url": "http://meetu.ps/1Fywpg", "display_url": "meetu.ps/1Fywpg", "indices": [80, 102]}],
    "user_mentions": [{"screen_name": "gnip", "name": "Gnip, Inc.", "id": 16958875, "id_str": "16958875", "indices": [103, 108]}]
  },
  "twitter_filter_level": "medium",
  "twitter_lang": "en",
  "retweetCount": 0,
  "gnip": {
    "matching_rules": [{"value": "\"All Things Cloud\"", "tag": null}, {"value": "from:jimmoffitt", "tag": null}],
    "urls": [{"url": "http://t.co/EQSCWMW4hL", "expanded_url": "http://www.meetup.com/All-things-Cloud-PaaS-SaaS-PaaS-XaaS/events/124584092/"}],
    "klout_score": 49,
    "klout_profile": {
      "topics": [{"klout_topic_id": "10000000000000000020", "displayName": "Tablets", "link": "http://klout.com/topic/id/10000000000000000020"}],
      "klout_user_id": "26177177599171892",
      "link": "http://klout.com/user/id/26177177599171892"
    },
    "language": {"value": "en"},
    "profileLocations": [{
      "objectType": "place",
      "geo": {"type": "point", "coordinates": [-105.10193, 40.16721]},
      "address": {"country": "United States", "countryCode": "US", "locality": "Longmont", "region": "Colorado"},
      "displayName": "Longmont, Colorado, United States"
    }]
  }
}
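Most applications store only a handful of these fields. As a minimal sketch of that step, the Python below parses a trimmed copy of the sample activity and pulls out a few values. The choice of fields and the `summarize` helper are assumptions made for illustration, not a recommended schema.

```python
import json

# Trimmed copy of the sample activity above; only the fields used below.
# A raw string keeps the JSON escapes (\") intact.
SAMPLE_ACTIVITY = r'''
{
  "id": "tag:search.twitter.com,2005:388326436685103105",
  "verb": "post",
  "postedTime": "2013-10-10T15:33:31.000Z",
  "actor": {"preferredUsername": "jimmoffitt", "followersCount": 71},
  "body": "Looking forward to this \"All Things Cloud\" meet-up in Denver next Tuesday 10/15 http://t.co/EQSCWMW4hL @gnip",
  "gnip": {
    "matching_rules": [{"value": "\"All Things Cloud\"", "tag": null},
                       {"value": "from:jimmoffitt", "tag": null}],
    "profileLocations": [{"address": {"locality": "Longmont", "region": "Colorado"}}]
  }
}
'''


def summarize(activity):
    """Reduce a parsed activity to the few fields this example cares about."""
    gnip = activity.get("gnip", {})
    locations = gnip.get("profileLocations", [])
    address = locations[0].get("address", {}) if locations else {}
    return {
        # The numeric tweet ID is the part after the last colon of the tag URI.
        "tweet_id": activity["id"].rsplit(":", 1)[-1],
        "user": activity["actor"]["preferredUsername"],
        "posted": activity["postedTime"],
        "rules": [rule["value"] for rule in gnip.get("matching_rules", [])],
        "profile_region": address.get("region"),
    }


record = summarize(json.loads(SAMPLE_ACTIVITY))
print(record)
```

The `gnip.matching_rules` array is worth keeping in most cases, since it records which filter caused the activity to be delivered.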
  
Methods for filtering data
•  Token filter (e.g. "pizza", "beer")
•  Substrings (contains:sport)
•  Exact phrases ("all things cloud")
•  Operators: metadata (geo, language, profiles, account stats, ...)
•  Operators: sampling (e.g. sample:10%)
•  Publisher-specific Operators: hashtags, user mentions/from/to, retweets, ...

Examples:

    (pizza beer) "all things cloud" profile_region:colorado

    twins (baseball OR minnesota OR sports OR "small market") -(cute OR baby OR olsen OR olson)
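The difference between token, substring and exact-phrase matching is easy to blur, so here is a small Python sketch of the three. It is a toy illustration, not how the Gnip filter is implemented: the tokenizer (lowercase, split on anything that is not a letter or digit) and the function names are assumptions made for this example.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_token(text, token):
    """True when the token appears as a whole word in the text."""
    return token.lower() in tokenize(text)


def matches_substring(text, fragment):
    """True when the fragment appears anywhere, even inside a longer word."""
    return fragment.lower() in text.lower()


def matches_phrase(text, phrase):
    """True when the phrase's tokens appear consecutively and in order."""
    tokens = tokenize(text)
    wanted = tokenize(phrase)
    n = len(wanted)
    return n > 0 and any(
        tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1)
    )


tweet = "Watching motorsports with friends at the All Things Cloud meet-up"
print(matches_token(tweet, "sport"))              # whole-word match
print(matches_substring(tweet, "sport"))          # match inside "motorsports"
print(matches_phrase(tweet, "all things cloud"))  # consecutive tokens
```

With this tweet the token filter for `sport` does not match, while the substring filter does, which is the practical reason to reach for a `contains:` style operator when word variants matter.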

Example use-case: Early-warning systems
	
Is there a Twitter 'signal' around local rain and flood events?

Business logic:

rain OR raining OR rained OR pouring OR weather OR hail OR lightning OR contains:flood OR "cats and dogs" OR wxreport OR contains:storm OR contains:precip

See http://blog.gnip.com/tweeting-in-the-rain Parts 1, 2 & 3
Social media and early-warning systems
There are generally three methods for geo-referencing Twitter data:

•  Activity Location: tweets that are geo-tagged.
•  Mentioned Location: parsing the tweet message for geographic location.
•  Profile Location: parsing the Twitter Account Profile location provided by the user.

•  User account profile: 82%
•  Tweet text: 17%
•  Tweet geo-tagging: 1%

Social media and early-warning systems
•  Profile Location (old):
   •  bio_location_contains:louisville -(bio_location_contains:"co " OR bio_location_contains:colorado) -(bio_location_contains:"tn " OR bio_location_contains:tennessee)
•  Profile Location (new):
   •  profile_locality:louisville profile_region:kentucky
Apache Kafka @ Gnip
Kafka is used to help manage streaming traffic with the outside world.

First application was with outbound streams:

        Gnip → Customer

Helps provide an "on-disk" buffer for client streams. Data is written to disk for a short period. If a client disconnects, their data buffer is "backfilled" when they reconnect.
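The backfill idea can be sketched in a few lines of Python. This is not Kafka's API and not Gnip's implementation: it keeps messages in memory rather than on disk, and the `ReplayBuffer` class, its method names and the five-minute retention are all assumptions chosen to illustrate the idea of a short retention window indexed by offset.

```python
import time
from collections import deque


class ReplayBuffer:
    """Keeps recent messages for a short retention window, indexed by offset."""

    def __init__(self, retention_seconds, clock=time.monotonic):
        self.retention_seconds = retention_seconds
        self.clock = clock
        self.messages = deque()   # entries are (offset, timestamp, payload)
        self.next_offset = 0

    def append(self, payload):
        """Store a message and return the offset it was assigned."""
        offset = self.next_offset
        self.messages.append((offset, self.clock(), payload))
        self.next_offset += 1
        self._expire()
        return offset

    def _expire(self):
        """Drop messages older than the retention window."""
        cutoff = self.clock() - self.retention_seconds
        while self.messages and self.messages[0][1] < cutoff:
            self.messages.popleft()

    def read_from(self, offset):
        """Return retained payloads whose offset is >= the given offset."""
        self._expire()
        return [payload for off, _, payload in self.messages if off >= offset]


# A fake clock makes the example deterministic.
now = [0.0]
buffer = ReplayBuffer(retention_seconds=300, clock=lambda: now[0])

last_seen = buffer.append("tweet-1")   # client receives this one, then drops
now[0] = 60.0
buffer.append("tweet-2")
now[0] = 120.0
buffer.append("tweet-3")

# On reconnect the client asks for everything after the last offset it saw.
print(buffer.read_from(last_seen + 1))
```

The client only has to remember the last offset it processed. Anything older than the retention window is gone, which is why the buffer covers short disconnects rather than long outages.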
  
	
  
Apache Kafka @ Gnip
Next applied to inbound Publisher streams:

        Publisher → Gnip

Buffers incoming data and helps manage massive volume spikes.

Spikes are isolated to this ingest tier.

Downstream applications read data as fast as they can.
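A back-of-the-envelope simulation shows why a buffer in front of the consumers isolates a spike. The numbers below are invented for illustration (thousands of activities per second), and `backlog_over_time` is a hypothetical helper, not part of any Gnip or Kafka tooling.

```python
def backlog_over_time(arrivals, capacity):
    """Track how much data is waiting in the buffer after each time step.

    arrivals: items arriving in each step
    capacity: items the downstream reader can process per step
    """
    backlog = 0
    history = []
    for arriving in arrivals:
        # The reader drains up to `capacity`; the remainder stays buffered.
        backlog = max(0, backlog + arriving - capacity)
        history.append(backlog)
    return history


# Steady traffic of 5, with a two-step spike of 50 and 40.
arrivals = [5, 5, 50, 40, 5, 5, 5, 5]
print(backlog_over_time(arrivals, capacity=20))
```

The backlog rises during the spike and then drains while the reader keeps working at its own pace, so the downstream application never has to be sized for the peak itself.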
  
	
  
Apache Cassandra @ Gnip

Serves a moving window of Twitter data (currently 30 days). Will grow.

Chosen for its
•  Write-speeds
•  Reliability
•  Redundancy
•  Scalability
Apache Cassandra @ Gnip

•  Serves a variety of data services, products and use-cases.
•  For Search we have an Apache Lucene index helping to quickly find Cassandra data.
•  Nearly 50 Cassandra servers across test/staging/production environments.
Streaming social media
curl -ujmoffitt@gnipcentral.com https://api.gnip.com:443/accounts/jim/publishers/twitter/streams/track/dev/rules.json

curl -v -X POST -ujmoffitt@gnipcentral.com "https://api.gnip.com:443/accounts/jim/publishers/twitter/streams/track/dev/rules.json" -d '{"rules":[{"tag":"demo","value":"weather OR rain OR snow"}]}'

curl --compressed -v -ujmoffitt@gnipcentral.com "https://stream.gnip.com:443/accounts/jim/publishers/twitter/streams/track/dev.json"
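The same two steps, building the rules body and consuming the stream, can be sketched in Python without touching the network. The sketch assumes the stream delivers one JSON activity per line with blank keep-alive lines in between. `build_rules_payload` and `iter_activities` are hypothetical helper names, and the `io.BytesIO` object stands in for a real HTTP response body.

```python
import io
import json


def build_rules_payload(rules):
    """Build the JSON body for the rules POST from (value, tag) pairs."""
    return json.dumps(
        {"rules": [{"tag": tag, "value": value} for value, tag in rules]}
    )


def iter_activities(stream):
    """Yield one parsed activity per non-empty line of a byte stream."""
    for raw_line in stream:
        line = raw_line.strip()
        if not line:      # blank keep-alive line, nothing to parse
            continue
        yield json.loads(line)


print(build_rules_payload([("weather OR rain OR snow", "demo")]))

# Stand-in for a streaming response: two activities and a keep-alive line.
fake_stream = io.BytesIO(
    b'{"id": "1", "body": "rain in Denver"}\r\n'
    b'\r\n'
    b'{"id": "2", "body": "snow in Boulder"}\r\n'
)
for activity in iter_activities(fake_stream):
    print(activity["id"], activity["body"])
```

Against a live endpoint, `iter_activities` would be handed the open response object instead of `fake_stream`. Authentication, gzip decompression and reconnect logic are left out here.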
Code examples
Search GitHub for "Twitter Stream": "We've found 793 repository results"

•  Python Streaming Connection: HERE
•  Ruby Streaming Connection (using 'curb' libcurl gem): HERE
•  Ruby Streaming Connection (using EventMachine gem): HERE
Live Search Demo

https://search-demo.prod.gnip.com:8443

https://github.com/gnip/gnip-search-demo
Questions?

IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 

Kürzlich hochgeladen (20)

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 

Filtering From the Firehose: Real Time Social Media Streaming

  • 1. Filtering from the Firehose. Real-time streaming of social network data. Jim Moffitt, Developer Advocate @gnip (@jimmoffitt)
  • 2. Who is this guy and what is he going to talk about?
      • Introduction
      • Social media firehoses
      • Data sources
      • Use-cases
      • Needle in the haystack
      • Filtering from the firehose
      • Example use-case
      • Server-side
      • Apache Kafka
      • Apache Cassandra
      • Client-side
      • HTTP streaming code examples
      • Live streaming and search
  • 3. What is a firehose?
      • Continuous stream of flexibly structured (JSON) social media activities in near-real time.
      • Potentially extreme amounts of data.
  • 5. Accessing Social Data for Analytics
      • Crawling/Scraping. Pros: it's free. Cons: TOS issues, high latency, fragile.
      • Public APIs. Pros: open access. Cons: rate limits, not guaranteed.
      • Licensed Access (publisher provides data "firehose"). Pros: no rate limits, compliant, reliable. Cons: financial investment, not all publishers are covered.
  • 6. Example firehose volumes (Publisher: Daily Activity)
      • Twitter: 450 M
      • Tumblr: 96 M + 54 M votes
      • Foursquare: 4.3 M
      • Disqus: 1.9 M
      • Wordpress Comments: 1.4 M
      • Wordpress Posts: 0.6 M
      • GetGlue: 0.6 M
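To put the daily totals above in streaming terms, the following sketch converts each figure to an average per-second rate. It is only arithmetic on the numbers from the slide; real traffic is bursty, so peak rates will be higher than these averages. The names `DAILY_ACTIVITY_MILLIONS` and `average_per_second` are illustrative, not part of any Gnip tooling.

```python
# Daily activity counts, in millions, as listed on the slide.
DAILY_ACTIVITY_MILLIONS = {
    "Twitter": 450,
    "Tumblr": 96,  # posts only; the slide lists 54 M votes separately
    "Foursquare": 4.3,
    "Disqus": 1.9,
    "Wordpress Comments": 1.4,
    "Wordpress Posts": 0.6,
    "GetGlue": 0.6,
}

SECONDS_PER_DAY = 24 * 60 * 60


def average_per_second(millions_per_day: float) -> float:
    """Convert a daily total (in millions) to an average rate per second."""
    return millions_per_day * 1_000_000 / SECONDS_PER_DAY


for publisher, millions in DAILY_ACTIVITY_MILLIONS.items():
    rate = average_per_second(millions)
    print(f"{publisher:20s} {rate:10.1f} activities/s on average")
```

At 450 M activities per day, the Twitter figure works out to roughly 5,200 activities per second on average.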
  • 7. Daily Tweet Activity Count [Figure: Tweets/Day versus Date (Jan through Dec), one panel per year from 2006 to 2011; the top y-axis tick rises from 5 k (2006) to 200 k (2007), 1.6 M (2008), 25 M (2009), 80 M (2010), and 250 M (2011).]
  • 8. Use-cases for Social Media Analysis
      • Sales & Marketing
      • Brand monitoring
      • Customer Service
      • Public Relations
      • Emergency Response
      • All kinds of academic research…
  • 9. So you are building something around social media? Some business considerations:
      • Objective: what are the questions that you are trying to answer?
      • Timeframe: real-time or historical use-case (or both)?
      • Coverage: do I need all the data or some statistical sample?
      • Licensing and Terms of Service
      • Budgets
          • Data costs.
          • Software development.
          • Infrastructure (bandwidth, servers, storage).
  • 10. So you are building something around social media? Some technical considerations:
      • Data transfer protocols: RESTful or 'keep-alive' Streaming?
      • What software language?
      • Bandwidth: what does your peak volume need to be?
      • Data storage
          • How and where are you storing the data?
          • What metadata do you need to store?*
      • Redundant streams?
  • 11. What data comes with a tweet?
      {"id":"tag:search.twitter.com,2005:388326436685103105","objectType":"activity","actor":{"objectType":"person","id":"id:twitter.com:17200003","link":"http://www.twitter.com/jimmoffitt","displayName":"jimmoffitt","postedTime":"2008-11-05T23:06:37.000Z","image":"https://si0.twimg.com/profile_images/3678478654/6aac91cc6bd5711b82c83ebab0a55de0_normal.jpeg","summary":"Once studied snow hydrology.  Recently developed real-time weather monitoring and flood warning software.  Have started a new adventure at an amazing company...","links":[{"href":null,"rel":"me"}],"friendsCount":69,"followersCount":71,"listedCount":1,"statusesCount":189,"twitterTimeZone":"Mountain Time (US & Canada)","verified":false,"utcOffset":"-21600","preferredUsername":"jimmoffitt","languages":["en"],"location":{"objectType":"place","displayName":"Longmont, Colorado"},"favoritesCount":17},"verb":"post","postedTime":"2013-10-10T15:33:31.000Z","generator":{"displayName":"TweetDeck","link":"http://www.tweetdeck.com"},"provider":{"objectType":"service","displayName":"Twitter","link":"http://www.twitter.com"},"link":"http://twitter.com/jimmoffitt/statuses/388326436685103105","body":"Looking forward to this \"All Things Cloud\" meet-up in Denver next Tuesday 10/15 http://t.co/EQSCWMW4hL @gnip","object":{"objectType":"note","id":"object:search.twitter.com,2005:388326436685103105","summary":"Looking forward to this \"All Things Cloud\" meet-up in Denver next Tuesday 10/15 http://t.co/EQSCWMW4hL @gnip","link":"http://twitter.com/jimmoffitt/statuses/388326436685103105","postedTime":"2013-10-10T15:33:31.000Z"},"favoritesCount":0,"twitter_entities":{"hashtags":[],"symbols":[],"urls":[{"url":"http://t.co/EQSCWMW4hL","expanded_url":"http://meetu.ps/1Fywpg","display_url":"meetu.ps/1Fywpg","indices":[80,102]}],"user_mentions":[{"screen_name":"gnip","name":"Gnip, Inc.","id":16958875,"id_str":"16958875","indices":[103,108]}]},"twitter_filter_level":"medium","twitter_lang":"en","retweetCount":0,"gnip":{"matching_rules":[{"value":"\"All Things Cloud\"","tag":null},{"value":"from:jimmoffitt","tag":null}],"urls":[{"url":"http://t.co/EQSCWMW4hL","expanded_url":"http://www.meetup.com/All-things-Cloud-PaaS-SaaS-PaaS-XaaS/events/124584092/"}],"klout_score":49,"klout_profile":{"topics":[{"klout_topic_id":"10000000000000000020","displayName":"Tablets","link":"http://klout.com/topic/id/10000000000000000020"}],"klout_user_id":"26177177599171892","link":"http://klout.com/user/id/26177177599171892"},"language":{"value":"en"},"profileLocations":[{"objectType":"place","geo":{"type":"point","coordinates":[-105.10193,40.16721]},"address":{"country":"United States","countryCode":"US","locality":"Longmont","region":"Colorado"},"displayName":"Longmont, Colorado, United States"}]}}
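As a rough illustration of working with an activity like the one above, this sketch parses a trimmed-down copy of the payload and pulls out a few fields. The field names come from the slide; the `summarize` helper and the choice of which fields to keep are assumptions made for the example, and the embedded quotes in `body` are assumed to be backslash-escaped as valid JSON requires.

```python
import json

# A trimmed copy of the activity shown on the slide (only a few fields kept).
RAW = r'''
{
  "id": "tag:search.twitter.com,2005:388326436685103105",
  "actor": {
    "preferredUsername": "jimmoffitt",
    "followersCount": 71,
    "location": {"objectType": "place", "displayName": "Longmont, Colorado"}
  },
  "verb": "post",
  "postedTime": "2013-10-10T15:33:31.000Z",
  "body": "Looking forward to this \"All Things Cloud\" meet-up in Denver next Tuesday 10/15 http://t.co/EQSCWMW4hL @gnip",
  "gnip": {
    "matching_rules": [
      {"value": "\"All Things Cloud\"", "tag": null},
      {"value": "from:jimmoffitt", "tag": null}
    ]
  }
}
'''


def summarize(activity: dict) -> dict:
    """Keep a handful of fields that a downstream store might index (hypothetical selection)."""
    return {
        # The numeric tweet id is the last colon-separated part of the activity id.
        "tweet_id": activity["id"].rsplit(":", 1)[-1],
        "user": activity["actor"]["preferredUsername"],
        "posted": activity["postedTime"],
        "profile_location": activity["actor"].get("location", {}).get("displayName"),
        # Which filter rules caused this activity to be delivered.
        "rules": [rule["value"] for rule in activity["gnip"]["matching_rules"]],
    }


summary = summarize(json.loads(RAW))
print(summary)
```

The `gnip.matching_rules` array is worth noting: it records which rules matched, so a consumer can route an activity without re-evaluating the filters.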
  • 12. Methods for filtering data
      • Token filter (e.g. "pizza", "beer")
      • Substrings (contains:sport)
      • Exact phrases ("all things cloud")
      • Operators: metadata (geo, language, profiles, account stats, ...)
      • Operators: sampling (e.g. sample:10%)
      • Publisher-specific Operators: hashtags, user mentions/from/to, retweets, ...
      Examples:
          (pizza beer) "all things cloud" profile_region:colorado
          twins (baseball OR minnesota OR sports OR "small market") -(cute OR baby OR olsen OR olson)
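The difference between token, substring, and exact-phrase matching can be sketched in a few lines. This is a simplified illustration, not Gnip's rule engine: the tokenizer (lowercasing and splitting on non-word characters) and the helper names are assumptions, and the second example rule from the slide is hand-translated into Python rather than parsed.

```python
import re


def tokens(text: str) -> list:
    """Lowercase the text and split it into word-like tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9_#@']+", text.lower())


def matches_token(text: str, token: str) -> bool:
    """Token filter: the whole word must appear."""
    return token.lower() in tokens(text)


def matches_substring(text: str, fragment: str) -> bool:
    """Substring filter (contains:...): the fragment may appear inside a word."""
    return fragment.lower() in text.lower()


def matches_phrase(text: str, phrase: str) -> bool:
    """Exact phrase: the phrase's tokens must appear consecutively and in order."""
    words, target = tokens(text), tokens(phrase)
    n = len(target)
    if n == 0:
        return False
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


def twins_rule(text: str) -> bool:
    """Hand translation of:
    twins (baseball OR minnesota OR sports OR "small market") -(cute OR baby OR olsen OR olson)
    """
    wanted = (
        any(matches_token(text, t) for t in ("baseball", "minnesota", "sports"))
        or matches_phrase(text, "small market")
    )
    unwanted = any(matches_token(text, t) for t in ("cute", "baby", "olsen", "olson"))
    return matches_token(text, "twins") and wanted and not unwanted


print(matches_token("I love sports", "sport"))      # whole-token match
print(matches_substring("I love sports", "sport"))  # substring match
print(twins_rule("Twins win! Great night for Minnesota baseball"))
print(twins_rule("The Olsen twins love baseball"))
```

The negated group is what keeps the baseball rule from collecting tweets about the Olsen twins, which is the "needle in the haystack" problem in miniature.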
  • 13. Example use-case: Early-warning systems
      Is there a Twitter 'signal' around local rain and flood events?
      Business logic: rain OR raining OR rained OR pouring OR weather OR hail OR lightning OR contains:flood OR "cats and dogs" OR wxreport OR contains:storm OR contains:precip
      See http://blog.gnip.com/tweeting-in-the-rain Parts 1, 2 & 3
  • 14. Social media and early-warning systems
      There are generally three methods for geo-referencing Twitter data:
      • Activity Location: tweets that are geo-tagged.
      • Mentioned Location: parsing the tweet message for geographic location.
      • Profile Location: parsing the Twitter Account Profile location provided by the user.
      • User account profile: 82%
      • Tweet text: 17%
      • Tweet geo-tagging: 1%
      See http://blog.gnip.com/tweeting-in-the-rain Parts 1, 2 & 3
  • 15. Social media and early-warning systems
      • Profile Location (old):
          • bio_location_contains:louisville -(bio_location_contains:"co " OR bio_location_contains:colorado) -(bio_location_contains:"tn " OR bio_location_contains:tennessee)
      • Profile Location (new):
          • profile_locality:louisville profile_region:kentucky
      See http://blog.gnip.com/tweeting-in-the-rain Parts 1, 2 & 3
  • 16. Social media and early-warning systems. See http://blog.gnip.com/tweeting-in-the-rain Parts 1, 2 & 3
  • 17. Social media and early-warning systems. See http://blog.gnip.com/tweeting-in-the-rain Parts 1, 2 & 3
  • 18. Apache Kafka @ Gnip
      Kafka is used to help manage streaming traffic with the outside world.
      The first application was with outbound streams: Gnip → Customer.
      It helps provide an "on-disk" buffer for client streams: data is written to disk for a short period. If a client disconnects, its data buffer is "backfilled" when it reconnects.
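The backfill idea can be illustrated with a small in-memory stand-in for the buffer. This sketch does not use Kafka or any Kafka client API; it only mimics the concept of an append-only log with offsets and a bounded retention window, so that a client that remembers its last offset can catch up after a disconnect. The `ReplayBuffer` class and its capacity are hypothetical.

```python
from collections import deque


class ReplayBuffer:
    """Append-only log that retains only the most recent `capacity` messages."""

    def __init__(self, capacity: int):
        self._log = deque(maxlen=capacity)  # oldest entries fall off automatically
        self._next_offset = 0

    def append(self, message: str) -> int:
        """Store a message and return the offset assigned to it."""
        offset = self._next_offset
        self._log.append((offset, message))
        self._next_offset += 1
        return offset

    def read_from(self, offset: int) -> list:
        """Return every retained (offset, message) pair at or after `offset`."""
        return [(o, m) for o, m in self._log if o >= offset]


buffer = ReplayBuffer(capacity=5)

# The client is connected and consumes the first three activities (offsets 0-2).
for i in range(3):
    buffer.append(f"activity-{i}")
last_seen = 2

# The client disconnects; four more activities arrive in the meantime (offsets 3-6).
for i in range(3, 7):
    buffer.append(f"activity-{i}")

# On reconnect, the client asks for everything after its last seen offset.
backfill = buffer.read_from(last_seen + 1)
print(backfill)
```

If the client stays away longer than the retention window covers, the oldest messages are gone and the backfill is incomplete, which is why the slide describes the buffer as covering only "a short period".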
  • 19. Apache Kafka @ Gnip
      Next applied to inbound Publisher streams: Publisher → Gnip.
      Buffers incoming data and helps manage massive volume spikes.
      Spikes are isolated to this ingest tier.
      Downstream applications read data as fast as they can.
  • 20. Apache Cassandra @ Gnip
      Serves a moving window of Twitter data (currently 30 days). Will grow.
      Chosen for its:
      • Write speeds
      • Reliability
      • Redundancy
      • Scalability
  • 21. Apache Cassandra @ Gnip
      • Serves a variety of data services, products and use-cases.
      • For Search we have an Apache Lucene index helping to quickly find Cassandra data.
      • Nearly 50 Cassandra servers across test/staging/production environments.
  • 22. Streaming social media
      curl -ujmoffitt@gnipcentral.com https://api.gnip.com:443/accounts/jim/publishers/twitter/streams/track/dev/rules.json
      curl -v -X POST -ujmoffitt@gnipcentral.com "https://api.gnip.com:443/accounts/jim/publishers/twitter/streams/track/dev/rules.json" -d '{"rules":[{"tag":"demo","value":"weather OR rain OR snow"}]}'
      curl --compressed -v -ujmoffitt@gnipcentral.com "https://stream.gnip.com:443/accounts/jim/publishers/twitter/streams/track/dev.json"
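The client side of these requests can be sketched without any network code. The first helper builds the same JSON body that the `curl -d` command posts to the rules endpoint; the second consumes a stream delivered as one JSON activity per line. Both helper names are hypothetical, and the assumption that blank lines are keep-alive signals to be skipped is labeled as such in the code. The byte strings below are simulated input, not real API output; in practice the lines would come from an HTTP client's line iterator over the streaming connection.

```python
import json


def build_rules_payload(rules: list) -> str:
    """Build the JSON body for the rules POST from (tag, value) pairs."""
    return json.dumps({"rules": [{"tag": tag, "value": value} for tag, value in rules]})


def iter_activities(lines):
    """Yield one parsed activity per non-empty line of a streaming response."""
    for line in lines:
        line = line.strip()
        if not line:
            # Assumption: blank lines are keep-alive signals and carry no data.
            continue
        yield json.loads(line)


payload = build_rules_payload([("demo", "weather OR rain OR snow")])
print(payload)

# Simulated stream: two activities separated by a keep-alive line.
simulated_stream = [
    b'{"verb":"post","body":"rain in Boulder"}\r\n',
    b"\r\n",
    b'{"verb":"post","body":"snow tonight"}\r\n',
]
bodies = [activity["body"] for activity in iter_activities(simulated_stream)]
print(bodies)
```

Processing line by line matters because the connection never ends: a client that waits for the response to complete before parsing would never see any data.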
  • 23. Code examples
      Search GitHub for "Twitter Stream": "We've found 793 repository results"
      • Python Streaming Connection: HERE
      • Ruby Streaming Connection (using 'curb' libcurl gem): HERE
      • Ruby Streaming Connection (using EventMachine gem): HERE
  • 24. Live Search Demo
      https://search-demo.prod.gnip.com:8443
      https://github.com/gnip/gnip-search-demo