The document discusses real-time analytics for big data using Twitter as an example. It describes how Twitter processes hundreds of millions of tweets daily and the need for analytics systems to handle streaming data in real-time. Storm is presented as a framework for real-time analytics that can be enhanced with Hadoop and a data grid like GigaSpaces to provide high performance, scalability, and reliability. The combination of these technologies allows for both real-time and batch analytics on large, streaming datasets.
Bigdata analytics-twitter
1. Real Time Analytics for Big Data – Lessons from Twitter (and beyond)
DeWayne Filppi
@dfilppi
2. Big Data Predictions
“Over the next few years we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.”
Edd Dumbill, O’REILLY
® Copyright 2013 Gigaspaces Ltd. All Rights Reserved
3. The Two Vs of Big Data
Velocity Volume
4. We’re Living in a Real Time World…
Social
User Tracking & Engagement
Homeland Security
eCommerce
Financial Services
Real Time Search
5. The Flavors of Big Data Analytics
Counting Correlating Research
6. Analytics @ Twitter – Counting
How many signups, tweets, retweets for a topic?
What’s the average latency?
Demographics
Countries and cities
Gender
Age groups
Device types
…
7. Analytics @ Twitter – Correlating
What devices fail at the same time?
What features get users hooked?
What places on the globe are “happening”?
8. Analytics @ Twitter – Research
Sentiment analysis
“Obama is popular”
Trends
“People like to tweet after watching American Idol”
Spam patterns
How can you tell when a user spams?
9. It’s All about Timing
“Real time” (< few seconds)
Reasonably quick (seconds - minutes)
Batch (hours/days)
10. It’s All about Timing
• Event driven / stream processing
• High resolution – every tweet gets counted
• Ad-hoc querying
• Medium resolution (aggregations)
[Callout: This is what we’re here to discuss]
• Long running batch jobs (ETL, map/reduce)
• Low resolution (trends & patterns)
12. Twitter in Numbers (Jan 2013)
It takes a week for users to send 3 billion Tweets.
Source: http://blog.twitter.com/2011/03/numbers.html
13. Twitter in Numbers (Jan 2013)
On average, 500 million tweets get sent every day.
Source: http://news.cnet.com/8301-1023_3-57541566-93/report-twitter-hits-half-a-billion-tweets-a-day/l
14. Twitter in Numbers (Jan 2013)
The highest throughput to date is 33,388 tweets/sec.
Source: http://www.huffingtonpost.com/2013/01/02/tweets-per-second-record_n_2396915.html
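As a rough back-of-the-envelope check, the daily and peak figures quoted on these slides can be put on the same scale. The sketch below only reuses the two numbers from the slides (500 million tweets per day and a 33,388 tweets/sec peak); the variable names are illustrative.

```python
# Figures quoted on the slides; everything else is plain arithmetic.
TWEETS_PER_DAY = 500_000_000
PEAK_TWEETS_PER_SEC = 33_388
SECONDS_PER_DAY = 24 * 60 * 60

# Average sustained rate implied by the daily total.
average_per_sec = TWEETS_PER_DAY / SECONDS_PER_DAY

# How far the record peak sits above that average.
peak_to_average = PEAK_TWEETS_PER_SEC / average_per_sec

print(f"average: {average_per_sec:,.0f} tweets/sec")
print(f"peak is {peak_to_average:.1f}x the average")
```

With these figures the average works out to roughly 5,800 tweets/sec, so the record peak is a little under six times the average load. A real-time system would need to absorb that kind of burst.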
15. Twitter in Numbers (March 2011)
1,000,000 new accounts are created daily.
Source: http://www.mediabistro.com/alltwitter/50-twitter-fun-facts_b33589
16. Twitter in Numbers
5% of the users generate 75% of the content.
Source: http://www.sysomos.com/insidetwitter/
17. Analyze the Problem
(Tens of) thousands of tweets per second to process
Assumption: Need to process in near real time
Aggregate counters for each word
A few tens of thousands of words (or hundreds of thousands if we include URLs)
System needs to scale linearly
System needs to be fault tolerant
18. Key Elements in Real Time Big Data Analytics
19. Sharding (Partitioning)
Tokenizer 1 → Filterer 1 → Counter Updater 1
Tokenizer 2 → Filterer 2 → Counter Updater 2
Tokenizer 3 → Filterer 3 → Counter Updater 3
Tokenizer n → Filterer n → Counter Updater n
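One way to read the sharding diagram above is hash partitioning: each word is routed to exactly one counter updater, so shards never contend for the same counter. The following is a minimal sketch under that assumption. The shard count, the CRC32-based routing and the names `shard_for` and `update` are illustrative and not taken from the slides.

```python
import zlib
from collections import Counter

NUM_SHARDS = 4  # hypothetical number of counter updaters

def shard_for(word: str, num_shards: int = NUM_SHARDS) -> int:
    """Route a word to a shard with a stable hash.

    zlib.crc32 is used instead of the built-in hash() because hash()
    is randomized per process for strings.
    """
    return zlib.crc32(word.encode("utf-8")) % num_shards

# One independent counter per shard (stand-in for "Counter Updater i").
shards = [Counter() for _ in range(NUM_SHARDS)]

def update(word: str) -> None:
    """Increment the word's counter on the single shard that owns it."""
    shards[shard_for(word)][word] += 1

for w in "storm hadoop storm grid storm hadoop".split():
    update(w)

# A global view is just the sum of the per-shard counters.
total = sum(shards, Counter())
print(total)
```

Because a given word always lands on the same shard, adding shards spreads the counters without requiring any cross-shard coordination on updates.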
20. Use EDA (Event Driven Architecture)
Raw → Tokenizer → Tokenized → Filterer → Filtered → Counter / Aggregator
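The event flow on this slide can be sketched with a tiny in-process publish/subscribe bus, where each stage reacts to one event type and emits the next. This is a toy illustration only: the `EventBus` class, the topic names and the stop-word list are assumptions made for the example, not part of any product mentioned in the deck.

```python
from collections import Counter, defaultdict

class EventBus:
    """Minimal synchronous publish/subscribe bus (illustrative only)."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._handlers[topic].append(handler)

    def publish(self, topic, payload):
        for handler in self._handlers[topic]:
            handler(payload)

bus = EventBus()
counts = Counter()
STOP_WORDS = {"the", "is", "a"}  # hypothetical filter list

def tokenize(tweet):
    """Raw event in, one 'tokenized' event out per word."""
    for token in tweet.lower().split():
        bus.publish("tokenized", token)

def filter_token(token):
    """Drop stop words; forward everything else as 'filtered'."""
    if token not in STOP_WORDS:
        bus.publish("filtered", token)

def count(token):
    """Counter / aggregator stage."""
    counts[token] += 1

# Wire the stages together: Raw -> Tokenizer -> Filterer -> Counter.
bus.subscribe("raw", tokenize)
bus.subscribe("tokenized", filter_token)
bus.subscribe("filtered", count)

bus.publish("raw", "Storm is a stream processor")
bus.publish("raw", "the storm is coming")
print(counts)
```

Each stage only knows the event type it consumes and the one it produces, which is what lets the stages be scaled or replaced independently.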
21. Twitter Storm
22. Twitter Storm With Hadoop
23. Storm Overview
24. Storm Concepts
Streams
– Unbounded sequence of tuples
Spouts
– Source of streams (Queues)
Bolts
– Functions, Filters, Joins, Aggregations
Topologies
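To make the vocabulary concrete, here is a toy model of these concepts using plain Python generators. It does not use the Storm API; the function names are invented for the example, and the "topology" is a simple linear chain, whereas a real Storm topology is a graph of spouts and bolts running on a cluster.

```python
def sentence_spout():
    """Spout: a source of tuples. A real stream is unbounded;
    this one is finite so the example terminates."""
    for sentence in ["storm processes streams", "bolts transform streams"]:
        yield (sentence,)

def split_bolt(stream):
    """Bolt acting as a function: one sentence tuple in, many word tuples out."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)

def long_word_bolt(stream):
    """Bolt acting as a filter: keep only words longer than five characters."""
    for (word,) in stream:
        if len(word) > 5:
            yield (word,)

def run_topology(spout, bolts):
    """Toy 'topology': feed the spout's stream through each bolt in order."""
    stream = spout()
    for bolt in bolts:
        stream = bolt(stream)
    return list(stream)

result = run_topology(sentence_spout, [split_bolt, long_word_bolt])
print(result)
```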
25. Challenge – Word Count
Tweets → Count → Word:Count
• Hottest topics
• URL mentions
• etc.
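Ignoring distribution for a moment, the word-count challenge itself is small. The sketch below counts words and URL mentions separately over a handful of made-up tweets. The sample tweets, the whitespace tokenization and the prefix test for URLs are simplifying assumptions for illustration.

```python
from collections import Counter

# Made-up sample input; a real feed would be an unbounded stream.
tweets = [
    "Loving #storm for real time analytics http://storm-project.net",
    "real time beats batch",
    "Try #storm http://storm-project.net",
]

word_counts = Counter()
url_counts = Counter()

for tweet in tweets:
    for token in tweet.split():
        # Naive URL detection by prefix (an assumption, not a full parser).
        if token.startswith(("http://", "https://")):
            url_counts[token] += 1
        else:
            word_counts[token.lower()] += 1

# "Hottest topics": the most frequent words so far.
hottest = word_counts.most_common(3)
print(hottest)
print(url_counts.most_common(1))
```

The hard part in production is not the counting but keeping these counters correct and available at tens of thousands of updates per second, which is what the rest of the deck addresses.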
26. Streaming word count with Storm
27. Supercharging Storm
Storm doesn’t supply persistence, but provides for it
Storm optimizes IO to slow persistence (e.g. databases) using batching.
Storm processes streams. The stream provider itself needs to support persistence, batching, and reliability.
Tweets, events, whatever…
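The batching idea mentioned above can be illustrated generically: buffer individual updates and hand them to the slow store in groups, so the number of round trips drops. This is a plain-Python sketch of the technique, not Storm's own batching mechanism. `BatchingWriter`, its `batch_size` parameter and the list standing in for a database are all hypothetical.

```python
class BatchingWriter:
    """Buffers items and forwards them to a slow sink in batches."""

    def __init__(self, sink, batch_size=3):
        self._sink = sink              # callable that receives a list of items
        self._batch_size = batch_size
        self._buffer = []

    def write(self, item):
        self._buffer.append(item)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self):
        """Send whatever is buffered as one batch (no-op when empty)."""
        if self._buffer:
            self._sink(list(self._buffer))
            self._buffer.clear()

# Each list appended here represents one round trip to the slow store.
round_trips = []
writer = BatchingWriter(round_trips.append, batch_size=3)

for i in range(7):
    writer.write(i)
writer.flush()  # push the final partial batch

print(round_trips)
```

Seven writes become three round trips here. The trade-off is that buffered items are lost if the process dies before a flush, which is why the slide stresses that the stream provider must supply reliability.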
28. XAP Real Time Analytics
29. Two Layer Approach
Advantage: Minimal “impedance mismatch” between layers.
– Both NoSQL cluster technologies, with similar advantages
Grid layer serves as an in-memory cache for interactive requests.
Grid layer serves as a real-time computation fabric for CEP, and limited (to allocated memory) real-time distributed query capability.
[Diagram labels: Raw Event Stream, Real Time Events, In Memory Compute Cluster, Reporting Engine, Raw And Derived Events, NoSQL Cluster, SCALE]
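The "in-memory cache for interactive requests" role, bounded by allocated memory, can be sketched as a read-through cache in front of a slower store. The class below is an illustration under stated assumptions: a plain dict stands in for the NoSQL cluster, capacity is counted in entries rather than bytes, and least-recently-used eviction is a choice made for the example, not something the slides specify.

```python
from collections import OrderedDict

class ReadThroughCache:
    """Bounded in-memory layer that falls back to a backing store on a miss."""

    def __init__(self, backing_store, capacity):
        self._store = backing_store    # stand-in for the NoSQL cluster
        self._capacity = capacity      # stand-in for "allocated memory"
        self._cache = OrderedDict()
        self.store_reads = 0           # how often the slow layer was hit

    def get(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)       # mark as most recently used
            return self._cache[key]
        self.store_reads += 1
        value = self._store[key]
        self._cache[key] = value
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)    # evict least recently used
        return value

nosql = {"storm": 42, "hadoop": 17, "grid": 8}
grid_layer = ReadThroughCache(nosql, capacity=2)

for key in ["storm", "storm", "hadoop", "grid", "storm"]:
    grid_layer.get(key)

print(grid_layer.store_reads)
```

Repeated reads of a hot key are served from memory, while a working set larger than the capacity still falls through to the store. This matches the "limited (to allocated memory)" caveat on the slide.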
31. Key Concepts
Flowing event streams through memory for side effects
Event driven architecture executing in-memory
Raw events flushed, aggregations/derivations retained
All layers horizontally scalable
All layers highly available
Real-time analytics & cached batch analytics on same scalable layer
Data grid provides a transactional/consistent façade on NoSQL store (in this case eliminating SQL database entirely)
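The "raw events flushed, aggregations/derivations retained" idea can be sketched as follows: every event updates an in-memory aggregate as a side effect, while the raw events themselves are periodically moved out to a durable store. The `InMemoryLayer` class, its flush threshold and the list standing in for the NoSQL store are assumptions made for this illustration.

```python
from collections import Counter

class InMemoryLayer:
    """Keeps aggregates in memory; pushes raw events to an archive."""

    def __init__(self, flush_threshold):
        self.raw_events = []          # short-lived raw events
        self.aggregates = Counter()   # derived data that stays in memory
        self.archive = []             # stand-in for the NoSQL store
        self._flush_threshold = flush_threshold

    def on_event(self, word):
        """Handle one event: update the aggregate, then maybe flush raw data."""
        self.raw_events.append(word)
        self.aggregates[word] += 1
        if len(self.raw_events) >= self._flush_threshold:
            self.archive.extend(self.raw_events)
            self.raw_events.clear()

layer = InMemoryLayer(flush_threshold=2)
for w in ["storm", "grid", "storm"]:
    layer.on_event(w)

print(layer.aggregates, layer.raw_events, layer.archive)
```

Memory use is then driven by the number of distinct aggregates rather than by the volume of raw events, which is what makes keeping the hot layer in memory practical.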
32. Keep Things In Memory
Facebook keeps 80% of its data in memory (Stanford research)
RAM is 100-1000x faster than disk (random seek)
• Disk: 5-10 ms
• RAM: ~0.001 ms
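To make the comparison concrete, the two latencies listed above can be divided directly. This is only arithmetic on the slide's round figures, treated here as given rather than as measurements. With these particular values the ratio comes out higher than the 100-1000x range quoted, so the numbers are best read as orders of magnitude.

```python
# Latencies as listed on the slide, converted to microseconds so the
# arithmetic stays in whole numbers (1 ms = 1,000 microseconds).
DISK_SEEK_US = (5_000, 10_000)   # disk random seek: 5-10 ms
RAM_ACCESS_US = 1                # RAM access: ~0.001 ms

low_ratio = DISK_SEEK_US[0] // RAM_ACCESS_US
high_ratio = DISK_SEEK_US[1] // RAM_ACCESS_US

print(f"disk seek is {low_ratio:,}x to {high_ratio:,}x slower than RAM access")
```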
33. Take Aways
A data grid can serve different needs for big data analytics:
Supercharge a dedicated stream processing cluster like Storm.
– Provide fast, reliable, transactional tuple streams and state
Provide a general purpose analytics platform
– Roll your own
Simplify overall architecture while enhancing scalability
– Ultra high performance/low latency
– Dynamically scalable processing and in-memory storage
– Eliminate messaging tier
– Eliminate or minimize need for RDBMS
34. References
Realtime Analytics with Storm and Hadoop
http://www.slideshare.net/Hadoop_Summit/realtime-analytics-with-storm
Learn and fork the code on github:
https://github.com/Gigaspaces/storm-integration
Twitter Storm:
http://storm-project.net
XAP + Storm Detailed Blog Post
http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/