This document provides a high-level summary of NoSQL and Big Data:
1) It discusses the history of databases from COBOL to SQL and the development of NoSQL in response to the need to handle large, unstructured datasets.
2) It outlines some of the opportunities that NoSQL databases provide for storing and analyzing massive amounts of diverse data types.
3) It briefly mentions some examples of popular NoSQL databases like MongoDB, Cassandra, and DynamoDB that are well-suited for Big Data applications.
5. Billions of Keys & Values
• Google: GFS, Big Table
• Hadoop
• Cassandra
• Dynamo
6. How would you build a super-fast, FB-scale chat service, in 2012? (for example)
7. I want my own DB!
• Main Memory: Memcached, redis
• Distr. K-V: MongoDB
• Versions: CouchDB
• Social Graphs: Neo4j
8. BIG
Era:           60’s     80-96     96-’07            ’07-
(size):        KB       GB        TB                PB
Data Variety:  FILES    TABLES    Semi-Structured   Dynamic
Analytics:     STATS    Apps      OLAP Cube         Mahout
Language:      COBOL    SQL       XML               NoSQL
9. Following *AMAZING* Slides Courtesy: Gregory Piatetsky-Shapiro, kdnuggets.com
You can find all the slides from his talk at:
http://www.slideshare.net/gpiatetskyshapiro/analytics-and-data-mining-industry-overview
10. Data Tsunami
• In 2010 enterprises stored 7 exabytes = 7,000,000,000 GB of new data (McKinsey)
• 90 percent of the world's data has been generated in the past two years (IBM)
Image with apologies to KDD-2011
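The exabyte-to-gigabyte figure above can be checked with a few lines of Python. This is a minimal sketch, assuming decimal (SI) units where 1 GB = 10^9 bytes and 1 EB = 10^18 bytes; the `UNITS` table and the `convert` helper are illustrative names, not part of the original slides.

```python
# Decimal (SI) storage units, expressed in bytes (assumption: powers of ten,
# not the binary KiB/MiB/GiB variants).
UNITS = {
    "KB": 10**3,
    "MB": 10**6,
    "GB": 10**9,
    "TB": 10**12,
    "PB": 10**15,
    "EB": 10**18,
}


def convert(value, src, dst):
    """Convert a quantity from unit `src` to unit `dst` (hypothetical helper)."""
    return value * UNITS[src] / UNITS[dst]


# 7 exabytes expressed in gigabytes, as quoted on the slide.
gigabytes = convert(7, "EB", "GB")
print(f"7 EB = {gigabytes:,.0f} GB")
```

With binary units (1 GiB = 2^30 bytes) the number would come out differently, so the slide's round figure only holds under the decimal convention assumed here.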
11. Pre-history
Statistics is the biggest term in the 20th century, but
data mining and analytics appear only in the late 1990s
From Google Ngram viewer – English language books
Note: Our analysis uses only English language data.
Other languages, especially Chinese, need to be considered for the full picture
12. Recent History:
Analytics, Data Mining, Knowledge Discovery
Analytics has been used since 1800, but started to rise in 2005
Data Mining jumps around 1996 (soon after the first KDD conference) but declines after 2003 (TIA controversy, associated with gov. invasion of privacy).
Knowledge Discovery appears in 1989, jumps in 1996, and plateaus after 2000
18. Largest Dataset Analyzed?
2011 median dataset size ~10-20 GB, vs 8-10 GB in 2010.
Increase in the 10 GB to 1 PB range.
www.KDnuggets.com/polls/2011/largest-dataset-analyzed-data-mined.html
19. Which methods/algorithms did you use for data analysis in 2011?
(Bar chart; axis: % analysts who used it)
Decision Trees
Regression
Clustering
Statistics
Visualization
Time series/Sequence analysis
Support Vector (SVM)
Association rules
Ensemble methods
Text Mining
Neural Nets
Boosting
Bayesian
Bagging
Factor Analysis
Anomaly/Deviation detection
Social Network Analysis
Survival Analysis
Genetic algorithms
Uplift modeling
www.KDnuggets.com/polls/2011/algorithms-analytics-data-mining.html
20. Cloud Analytics is not common (yet)
www.KDnuggets.com/polls/2011/algorithms-analytics-data-mining.html
21. Shortage of Skills
• McKinsey: shortage by 2018 in the US of
– 140,000-190,000 people with deep analytical skills
– 1.5 M managers/analysts with the know-how to use the analysis of big data to make effective decisions.
Source: www.mckinsey.com/mgi/publications/big_data/
24. “Ground” Analytics (LinkedIn Skills)
~ 75,000 with Data Mining skill
~ 7,000 with Predictive Modeling
Also
~ 20,000 with Predictive Analytics
(not related to Predictive Modeling??)