Analyzing streaming data in real time is becoming the fastest and most efficient way to obtain useful knowledge about what is happening now, allowing organizations to react quickly when problems appear and to detect new trends that help improve their performance. Evolving data streams are contributing to the growth of data created over the last few years: every two days we now create as much data as we did from the dawn of time up until 2003. Methods for evolving data streams are becoming a low-cost, green methodology for real-time online prediction and analysis. We discuss the current and future trends of mining evolving data streams, and the challenges that the field will have to overcome in the coming years.
9. Big Data
McKinsey Global Institute (MGI) Report on Big Data, 2011.
Big data refers to datasets whose size is beyond
the ability of typical database software tools to
capture, store, manage, and analyze.
19. Applications
sensor data: industry, cities
telecom data
social networks: Twitter, Facebook, Yahoo
marketing: sales business
Data may come from: humans, sensors, or
machines.
20. New applications: social networks
Twitter: A Massive Data Stream
Micro-blogging service
Built to discover what is happening at any moment in time,
anywhere in the world.
3 billion requests a day via its API.
MOA-TweetReader: a real-time system to
read tweets in real time
detect changes
find the terms whose frequency changed
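The three steps above can be sketched in a few lines. This is a minimal illustration and not MOA-TweetReader's actual implementation: it simply compares relative term frequencies between two consecutive windows of tweets. The function names `term_frequencies` and `changed_terms` and the `threshold` parameter are hypothetical.

```python
from collections import Counter

def term_frequencies(tweets):
    """Relative frequency of each lowercase, whitespace-separated term."""
    counts = Counter(term for tweet in tweets for term in tweet.lower().split())
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()} if total else {}

def changed_terms(old_window, new_window, threshold=0.1):
    """Terms whose relative frequency differs by more than `threshold`
    between two windows (hypothetical criterion, chosen for illustration)."""
    old = term_frequencies(old_window)
    new = term_frequencies(new_window)
    return sorted(
        term for term in set(old) | set(new)
        if abs(new.get(term, 0.0) - old.get(term, 0.0)) > threshold
    )
```

A real stream would use a sliding window and a statistically grounded change detector rather than a fixed threshold; the fixed value here only keeps the sketch short.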
21. Sentiment Analysis on Twitter
Sentiment analysis
Classifying messages into two categories depending on
whether they convey positive or negative feelings
Emoticons are visual cues associated with emotional states,
which can be used to define class labels for sentiment
classification
Positive emoticons    Negative emoticons
:)                    :(
:-)                   :-(
:)                    :(
:D
=)
Table: List of positive and negative emoticons.
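Using emoticons as class labels can be sketched as below. The emoticon lists are taken from the table above; the function name `emoticon_label` and the choice to leave messages with both kinds of emoticon unlabeled are assumptions made for this illustration.

```python
POSITIVE = (":)", ":-)", ":D", "=)")
NEGATIVE = (":(", ":-(")

def emoticon_label(message):
    """Return 'positive' or 'negative' based on the emoticons present.

    Returns None when the message has no emoticon or has emoticons of
    both kinds (treated as ambiguous here, which is an assumption).
    """
    has_pos = any(e in message for e in POSITIVE)
    has_neg = any(e in message for e in NEGATIVE)
    if has_pos == has_neg:
        return None
    return "positive" if has_pos else "negative"
```

Messages labeled this way can serve as training examples for a sentiment classifier without manual annotation.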
22. New problem: structured classification
New methods for structured classification
[Figure: tree-structured examples with nodes labeled A, B, C, D]
sequences, trees, graphs
23. New problem: structured classification
New methods for structured classification
[Figure: tree-structured examples with nodes labeled A, B, C, D]
sequences, trees, graphs
frequent pattern mining techniques
24. New problem: structured classification
New methods for structured classification
[Figure: tree-structured examples with nodes labeled A, B, C, D]
a, b → class1, class2
sequences, trees, graphs
frequent pattern mining techniques
multi-label data mining
Example: Lord of the Rings → Action, Adventure, Fantasy
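One common way to represent multi-label targets is a 0/1 indicator vector per example, which allows one binary classifier to be trained per label. The sketch below encodes the movie example that way; the helper `to_indicator` is hypothetical, and the extra label "Comedy" is added only to show a zero entry.

```python
def to_indicator(label_sets, all_labels):
    """Encode each label set as a 0/1 vector over `all_labels`."""
    return [
        [1 if label in labels else 0 for label in all_labels]
        for labels in label_sets
    ]

# "Comedy" is a hypothetical extra label, not part of the original example.
GENRES = ["Action", "Adventure", "Comedy", "Fantasy"]
movies = {"Lord of the Rings": {"Action", "Adventure", "Fantasy"}}

vectors = to_indicator(movies.values(), GENRES)
```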
30. Pig
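-- Load three integer columns from 'data'
A = LOAD 'data' USING PigStorage() AS
(f1:int, f2:int, f3:int);
-- Group the rows by f1
B = GROUP A BY f1;
-- Count the rows in each group (COUNT takes the bag A, not the group key $0)
C = FOREACH B GENERATE COUNT(A);
DUMP C;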
Pig: Similar to SQL
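For readers unfamiliar with Pig, the script groups the rows by `f1` and counts the rows in each group, much like a SQL `GROUP BY` with `COUNT`. A rough plain-Python equivalent is sketched below; the sample rows are hypothetical and stand in for the contents of 'data'.

```python
from collections import Counter

# Hypothetical rows standing in for the contents of 'data': (f1, f2, f3)
rows = [(1, 2, 3), (4, 2, 1), (1, 5, 6), (4, 3, 3), (7, 2, 5)]

def count_by_first_field(rows):
    """Group rows by f1 and count the rows in each group."""
    return Counter(f1 for f1, _f2, _f3 in rows)

counts = count_by_first_field(rows)
```

The difference is one of scale: Pig compiles the same logic into jobs that run over a Hadoop cluster, whereas this sketch runs in a single process.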
35. Storm
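[Diagram: Lambda Architecture]
All data → Hadoop → precomputed batch view (tools: ElephantDB, Voldemort)
New data stream (Kafka) → Storm → precomputed realtime view (tools: Cassandra, Riak, HBase)
Query: reads from both the batch view and the realtime view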
“Lambda Architecture”
Runaway complexity in Big Data
Nathan Marz, 2012
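The query side of this architecture can be sketched as follows, assuming the views hold simple counts. In-memory dictionaries stand in for the batch and realtime stores, and the hashtag counts and the `query` function are hypothetical.

```python
def query(key, batch_view, realtime_view):
    """Answer a count query by combining the precomputed batch view with
    the realtime view, which covers data not yet absorbed by a batch run."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

# Hypothetical view contents for illustration.
batch_view = {"#bigdata": 120, "#storm": 45}
realtime_view = {"#storm": 3, "#kafka": 1}
```

For example, `query("#storm", batch_view, realtime_view)` adds the 45 occurrences already in the batch view to the 3 seen since the last batch computation.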