Webinar Presentation from 2013-08-30
An introduction to what Big Data and Big Analytics can be used for and why they are relevant for you. Includes real-life examples and ideas, and concludes with a look at InfiniDB.
2. Agenda
• Big Data and Big Analytics – What is it?
• Big Analytics vs. the Data Warehouse?
• Big Analytics examples
• Database technologies for Big Analytics
• Questions and Answers
3. What is Big Data?
• Big Data is data that is not
immediately related to my own
business
• Big Data is largely unstructured
• Big Data consists of data from many
different sources, such as
Facebook, Twitter, web pages, blogs
and any other source you can find
• Big Data is all about volume and
analysis!
4. Because you want to grow your
business!
• You can get customers from your competitors
– The data on these customers are not in your CRM!
– Why did they choose someone else over
you? Your Data Warehouse has few answers to
this!
• You can grow the market
– Those new customers are not in your CRM or Data
Warehouse either, to a large extent!
• You can do both of these!
5. Why do I need all this data?
• “My Data Warehouse tells me all I ever want
to know, in gruesome detail, about my
customers, what more do I need?”
• “I get much more data from my CRM system
than I do from friggin’ Facebook!”
• “Why would I need all those pictures from
Facebook and all those twitter texts, they tell
me nuthin’!”
6. What is Big Analytics?
• To get insights from Big Data, you need a more
powerful analysis: Big Analytics
• Big Analytics often cannot rely
on simple BTREE indexes
• Big Analytics provides
increasingly better accuracy
the more data you have
7. What is Big Analytics useful for?
• For getting information on things
in the “outside world”
– My competitors
– My competitors’ customers
• For foreseeing trends
– What will be “the next big thing” in my business?
– What new markets are developing?
– What is happening in my current market?
9. Big Analytics use cases
• The higher the volume of your business, the
more useful Big Data becomes
– If you have very few customers, Big Data might be
less useful
• Retail is a common use case, but there are
many more
– Finance – Big Data trend analysis
– Intelligence – Analysis of new and unknown
trends and loosely tied groups
– Politics – What is my competition up to?
10. Big Analytics vs. Data Warehouse
• Your Data Warehouse is very focused and
contains high quality information on low level
data:
“John Doe bought Chocko Chocolate Chip
Cookies for $3.61 on Jan 12 2013”
• Big Data provides much more data, but each
information item has less detail to it:
“Chocko Chocolate Chip Cookies suck!”
“An increasing number of people tweet about
Chocolate Chip Cookies”
11. Big Analytics vs. Data Warehouse
• What Big Analytics lacks in terms of data item
correctness can be compensated for by:
– Volume: If more than 200,000 tweets agree that
our Chocko cookies suck, then we should probably
look into it.
– Proper analysis: Images can be analyzed for
content and stuff you didn’t think about: Maybe
“Ma Cookies” brand cookies have an edge on us in
that their packaging looks more pleasing? Do we
see “Ma Cookies” being eaten in unexpected
places or at unexpected times?
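The volume argument above can be illustrated with a little arithmetic. The sketch below assumes, purely for illustration, that each tweet independently gets the verdict right with probability 0.6, and computes how often a simple majority vote over n tweets is right. Real tweets are neither independent nor equally reliable, so this is an idealized model; the function name and the 0.6 figure are assumptions, not something taken from the webinar.

```python
from math import comb


def majority_correct_probability(n_items: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent items is correct.

    Each item is assumed (hypothetically) to be correct with probability
    p_correct, independently of all the others.
    """
    needed = n_items // 2 + 1  # smallest count that forms a strict majority
    return sum(
        comb(n_items, k) * p_correct**k * (1 - p_correct) ** (n_items - k)
        for k in range(needed, n_items + 1)
    )


# Individually unreliable items become a reliable signal in bulk.
for n in (1, 11, 101, 501):
    print(n, round(majority_correct_probability(n, 0.6), 4))
```

With one item the answer is only as good as that item; as n grows, the majority verdict approaches certainty, which is the sense in which volume can compensate for low per-item correctness.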
12. Big Analytics - Linguistic analysis
• This is for tweets, blogs, Facebook and similar.
Proper linguistic analysis is complex:
– Sentiment
“Ma Cookies might seem like they suck, but they
are actually quite tasty”
– Temporal
“In January 2011 we wrote that Chocko Cookies
used to taste like manure in 2008, but that they
have improved since then”
– Ranking
– Really complex for larger blocks of text
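To see why proper linguistic analysis is hard, here is a minimal sketch of the naive alternative: counting positive and negative words. The word lists and the function name are hypothetical and chosen only to fit the cookie examples on this slide; no real sentiment library is implied.

```python
# Hypothetical, tiny word lists for illustration only.
NEGATIVE_WORDS = {"suck", "sucks", "manure"}
POSITIVE_WORDS = {"tasty", "improved"}


def naive_sentiment(text: str) -> int:
    """Score text as (positive word count) minus (negative word count)."""
    words = [w.strip('.,!?"').lower() for w in text.split()]
    positives = sum(w in POSITIVE_WORDS for w in words)
    negatives = sum(w in NEGATIVE_WORDS for w in words)
    return positives - negatives


examples = [
    "Chocko Chocolate Chip Cookies suck!",
    "Ma Cookies might seem like they suck, but they are actually quite tasty",
    "In January 2011 we wrote that Chocko Cookies used to taste like manure "
    "in 2008, but that they have improved since then",
]
for text in examples:
    print(naive_sentiment(text), text)
```

The first sentence is scored as negative, which is right. The other two each contain one negative and one positive word, so the counter calls them neutral, even though a human reads both as favorable. Handling "but", hedging and time references is what makes real sentiment and temporal analysis complex.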
13. Other types of Big Analytics
• Image analysis is a fast developing field,
where we find new and interesting use cases
– What are the most popular colors?
– What color are people’s clothes?
– How long has that suitcase been standing on the
floor at the airport?
• Location analysis
– Where did this happen?
– In what city is that? What country?
• Temporal analysis
– When did this happen? When was it published?
14. New Visualizations for New Insights
• Visualizing data as a report with columns and
rows isn’t always effective
• With new and diverse types of data, we need
new ways of visualizing data
– Location on maps
– Timelines
– Sentiments
• Even with traditional Data Warehouse data,
new visualizations can provide new insights!
• Interactive visualizations
17. Map Visualization – Android or iOS
Visualizations by MapBox
• Smartphone OS metadata in Geography view
– iPhone is Red, Android is Green
– Based on data from Verizon passed to NSA
18. Big Analytics database issues
• Big Analytics is complex!
• Big Analytics doesn’t always allow the
“analyze-once-find-later” attribute
of a classic index!
• Big Analytics is compute intensive
• Big Analytics needs some
programming. Yikes!
19. Map-Reduce to the rescue
• Map-Reduce allows distributed processing on
large amounts of data
– Map – Algorithm that processes the data in parallel across the nodes
– Reduce – Algorithm that aggregates the results from the nodes
• Hadoop is the best-known and most widely used Map-Reduce
framework
• Map and Reduce still must be developed
• But we still need some kind of database
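The Map and Reduce steps can be sketched in a few lines of single-process Python. This is the classic word-count example, not Hadoop code: the function names are hypothetical, the sample documents are invented, and a real framework would run the map and reduce calls on different nodes and do the grouping ("shuffle") over the network.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate all values emitted for one key."""
    return key, sum(values)


# Invented sample input; each string stands in for data held on one node.
documents = [
    "chocko cookies suck",
    "ma cookies are tasty",
    "chocko cookies improved",
]

pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

Note that the map and reduce functions themselves still have to be written for each analysis, which is the point of the "must be developed" bullet.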
20. So, what we need is an Analytical
Database
• Support for complex analysis
• Support for distributed, parallel processing
(Map-Reduce for example)
• Support for storing and processing massive
amounts of data
• Some kind of cool index technology that works
with big data, for both reads and writes
– Or maybe. A scary idea just came to me…
21. No indexes! Because you don’t
need or want them!
• What! What’s wrong with good old BTREEs?
– They are not well suited to Big Data!
– Their usefulness declines as data grows
– Updates slow down significantly as the tree
grows!
– Skewed data doesn’t work well with them
• SPATIAL? FREETEXT? HASH? BITMAP?
– These are either too specialized or lack the
functionality we need
22. Calpont InfiniDB
Real-time, Consistent Query Performance
Linear Scale for Massive Data
Removes Limits to Dimensions and Granularity
Easy to Deploy and Maintain
23. Tiered Query Execution
• User Module – Processes SQL Requests
• Performance Module – Executes the Queries
Single Server or MPP
24. Map-Reduce for Powerful Analytics
SQL Operations are mapped to Performance Module threads
• Parallel/Distributed Data Access
• Parallel/Distributed Joins (Inner, Outer)
• Parallel/Distributed Sub-queries (From, Where, Select)
• Parallel/Distributed Group By, Distinct, and Aggregation
• Extensible with Parallel/Distributed User Defined Functions
Results are returned to User Module in Reduce Phase
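As a rough illustration of the idea, the sketch below runs a GROUP BY with SUM over several data partitions in parallel threads and then merges the partial results. It is not InfiniDB's implementation: the function names, the thread pool and the sample rows (amounts in cents) are all assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partial_group_sum(rows):
    """Map step: aggregate one partition, as one worker thread would."""
    totals = Counter()
    for brand, cents in rows:
        totals[brand] += cents
    return totals


def parallel_group_sum(partitions, workers=2):
    """Run the map step on every partition in parallel, then merge (reduce)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_group_sum, partitions))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts per key
    return dict(merged)


# Invented sample data: (brand, sale amount in cents), split into partitions.
partitions = [
    [("Chocko", 361), ("Ma", 299)],
    [("Chocko", 361)],
    [("Ma", 299), ("Ma", 150)],
]

# Roughly: SELECT brand, SUM(cents) FROM sales GROUP BY brand
totals = parallel_group_sum(partitions)
print(totals)
```

Because a sum of partial sums equals the overall sum, each partition can be aggregated independently and only the small partial results need to travel to the merging step.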
25. Calpont InfiniDB
• Support for Amazon EC2
– Full EBS support
– Prepackaged AMIs for ease of provisioning
• Hadoop connector
• Multiple parallel load
options
• Available now!
26. • This is true of analytics in general, but particularly
true when working with Big Analytics
• The more data you have, the more
relevant questions you can ask
• The more questions you ask, the more
you know
• The more you know, the more questions
you can ask
• The wider the range of data you have, the wider
the range of questions you can ask
If you think you have all the right answers,
you haven’t asked all the right questions