Big Data Analysis Patterns: Tying real-world use cases to strategies for analysis using big data technologies and tools.
Big data is ushering in a new era for analytics, with large-scale data and relatively simple algorithms driving results rather than complex models built on sample data. When you are ready to extract benefits from your data, how do you decide what approach, what algorithm, what tool to use? The answer is simpler than you think.
This session tackles big data analysis with a practical description of strategies for several classes of application types, identified concretely with use cases. Topics include new approaches to search and recommendation using scalable technologies such as Hadoop, Mahout, Storm, Solr, and Titan.
6. The Good News in Big Data:
“Simple algorithms and lots of data
trump complex models”
Halevy, Norvig, and Pereira, Google
IEEE Intelligent Systems
7. The Challenge: So Many Solutions!
What solutions fit your business problem?
For example, do you need…
Apache Mahout?
Storm?
Apache Solr/Lucene?
Apache HBase (or MapR M7)?
Apache Drill (or Impala)?
d3.js or Tableau?
Node.js?
Apache Hadoop?
Titan?
8. Ask a Different Question
It may be more useful to better define the problem by asking some
of these questions:
How large is the data to be queried? (the analysis volume)
What time frame is appropriate for your query response?
How fast is data arriving? (bursts or continuously?)
Are queries by sophisticated users?
Are you looking for common patterns or outliers?
How large is the data to be stored?
How are your data sources structured?
9. Picking the Best Solution
Your responses to these questions can help you better:
define the problem
recognize the analysis pattern to which it belongs
guide the choice of solutions to try
But first, here’s a quick review of a few of the technologies you
might choose, and then we will focus on three of the questions as a
part of the landscape.
10. Apache Solr/Lucene
Solr/Lucene is a powerful search engine used for flexible, heavily
indexed queries including data such as
Full text
Geographical data
Statistically weighted data
Solr is a small data tool that has flourished in a big data world
11. Apache Mahout
Mahout provides a library of scalable machine learning algorithms
useful for big data analysis based on Hadoop or other storage
systems.
Mahout algorithms are mainly used for
Recommendation (collaborative filtering)
Clustering
Classification
Mahout can be used in conjunction with solutions such as Solr: you
might use Mahout to create a co-occurrence database that could
then be queried using a search tool such as Solr.
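
To make the co-occurrence idea concrete, here is a minimal sketch in plain Python. It does not use Mahout or Solr: the function names, the sample histories, and the raw-count threshold (min_count) are all assumptions made for illustration, and a simple count cutoff stands in for whatever statistical test a real pipeline would apply. It shows only the shape of the data: count which items appear together in user histories, then emit one record per item that a search engine could index.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_counts(histories):
    """Count how often each pair of items appears in the same user history."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in histories.values():
        # De-duplicate so an item repeated by one user is counted once.
        for a, b in combinations(sorted(set(items)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def indicator_documents(counts, min_count=2):
    """Build one record per item listing the items it co-occurs with often.

    min_count is a hypothetical cutoff chosen for this sketch.
    """
    docs = {}
    for item, neighbours in counts.items():
        indicators = sorted(n for n, c in neighbours.items() if c >= min_count)
        docs[item] = {"id": item, "indicators": indicators}
    return docs


# Hypothetical user histories (user id -> items the user interacted with).
histories = {
    "u1": ["a", "b", "c"],
    "u2": ["a", "b"],
    "u3": ["b", "c"],
    "u4": ["a", "d"],
}

counts = cooccurrence_counts(histories)
docs = indicator_documents(counts)
```

Each record in docs could then be indexed as a document, so that a query built from a user's recent items matches items whose indicator field contains them.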
15. Using the Answers to Guide Your Choices
For simplicity, let’s focus in on the first three questions:
How large is the data to be stored?
How large is the data to be queried? (the analysis volume)
What time frame is appropriate for your query response?
16. Big Data Decision Tree
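[Figure: decision tree. The layout was lost in extraction; only the labels below are recoverable.]
Question: How big is your data? Branches: <10 GB, mid, >200 GB
Question: What size queries? Branches: single element at a time; one pass over 100%; multiple passes over big chunks
Question: Response time? Branches: < 100 s (human scale); throughput, not response
Other labels: Big storage; Streaming
Outcomes: five leaves labeled A through E; two branches marked "?"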
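
As a rough illustration of how answers to the three questions might be recorded before picking a tool, here is a minimal Python sketch. It uses only the thresholds named on the slide (<10 GB, >200 GB, < 100 s); the Workload class, the function names, and the choice to treat a missing response-time target as "throughput, not response" are assumptions. It deliberately does not map answers to the outcomes A through E, because that mapping is not recoverable from the figure.

```python
from dataclasses import dataclass
from typing import Optional

# Query shapes as labeled in the decision tree.
QUERY_SHAPES = (
    "single element at a time",
    "one pass over 100%",
    "multiple passes over big chunks",
)


@dataclass
class Workload:
    """Hypothetical container for the answers to the three questions."""
    stored_gb: float
    query_shape: str
    max_response_s: Optional[float] = None  # None: no response-time target


def size_class(stored_gb):
    """Bucket stored data size using the slide's thresholds."""
    if stored_gb < 10:
        return "<10 GB"
    if stored_gb > 200:
        return ">200 GB"
    return "mid"


def response_class(max_response_s):
    """Bucket the response-time requirement.

    Assumption: a missing target, or one of 100 s or more, is treated as
    a throughput-oriented workload.
    """
    if max_response_s is not None and max_response_s < 100:
        return "< 100 s (human scale)"
    return "throughput, not response"


def describe(workload):
    """Return the branch taken at each of the three questions."""
    if workload.query_shape not in QUERY_SHAPES:
        raise ValueError(f"unknown query shape: {workload.query_shape!r}")
    return (
        size_class(workload.stored_gb),
        workload.query_shape,
        response_class(workload.max_response_s),
    )


example = describe(Workload(500, "one pass over 100%", max_response_s=30))
```

Writing the answers down in this form makes it easier to compare workloads side by side before deciding which technologies to try.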