This document discusses big data and related technologies. It begins with an overview of big data, describing its characteristics and sources. It then discusses different data types including structured and unstructured data. Next, it covers big data technologies for storage, processing, and transfer of large datasets. It compares SQL and NoSQL databases. The document also discusses big data security and trends. It concludes with references and a demonstration of MongoDB.
3. How much data?
7 billion people
Google processes 100 PB/day; 3 million servers
Facebook has 300 PB + 500 TB/day; 35% of
world’s photos
YouTube 1000 PB video storage; 4 billion
views/day
Twitter processes 124 billion tweets/year
SMS messages – 6.1T per year
US Cell Calls – 2.2T minutes per year
US Credit cards - 1.4B Cards; 20B
transactions/year
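To get a feel for these volumes, the yearly and daily totals above can be converted into per-second rates. A minimal sketch, assuming a 365-day year and decimal units (1 TB = 1000 GB); the helper name `per_second` is hypothetical:

```python
# Convert the slide's yearly/daily totals into per-second rates.
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # assumption: non-leap year


def per_second(total: float, period_seconds: int) -> float:
    """Average rate per second for a total spread evenly over a period."""
    return total / period_seconds


tweets_per_second = per_second(124e9, SECONDS_PER_YEAR)      # 124 billion tweets/year
card_tx_per_second = per_second(20e9, SECONDS_PER_YEAR)      # 20B transactions/year
facebook_gb_per_second = per_second(500 * 1000, SECONDS_PER_DAY)  # 500 TB/day in GB

print(f"Tweets: about {tweets_per_second:,.0f} per second")
print(f"US credit card transactions: about {card_tx_per_second:,.0f} per second")
print(f"Facebook new data: about {facebook_gb_per_second:.1f} GB per second")
```

These are averages only; real traffic is bursty, so peak rates will be higher.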
4. Contents
1. Big Data Overview
2. Big Data Technology Today
3. SQL vs NoSQL
4. Big Data Security
5. Big data trends
6. Demo with MongoDB & Ref docs
5. 1. Big Data Overview (cont.)
“Big data is not a single technology
but a combination of old and new
technologies that helps companies
gain actionable insight.”
(Big Data For Dummies, published by John Wiley & Sons, Inc.)
10. Structured Data(…)
Computer- or machine-generated:
Machine-generated data generally
refers to data that is created by a
machine without human intervention.
(Sensor data, Web log data, Point-of-
sale data, Financial data…)
Human-generated: This is data that
humans, in interaction with
computers, supply (Input data, Click-
stream data, Gaming-related data…)
12. Unstructured Data(…)
Unstructured data is everywhere
Machine-generated unstructured
data: Satellite images, Scientific
data, Photographs and video, Radar
or sonar data…
Human-generated unstructured
data: Text internal to your company,
Social media data, Mobile data…
14. Managing different data types
Integrating data types into a big data
environment requires:
Connectors: enable you to pull data
in from various big data sources
Metadata: the definitions,
mappings, and other characteristics
used to describe how to find, access,
and use a company’s data (and
software) components
15. What will we do with Big Data?
Analysis & Processing
Analysis
• Querying
• Statistics
• Modeling
• Data Mining
• Text analytics
Processing
• Data storage
• Data transfer
• Data monitoring
19. 2. Big Data Technology Today (cont.)
The Apache Hadoop software library is a
framework that allows for the distributed
processing of large data sets across clusters of
computers using simple programming models.
20. 2. Big Data Technology Today (cont.)
Instead of treating
memory as a cache,
why not treat it as a
primary data store?
Facebook keeps 80% of its
data in Memory (Stanford
research)
RAM is 100-1000x faster
than Disk (Random seek)
• Disk: 5-10 ms
• RAM: approximately 0.001 ms
[Diagram labels: Events; Facebook (x3); Memory Grid; Data Grid (x3)]
22. 2. Big Data Technology Today (cont.)
Apache Hadoop: an open-source software
framework modeled on Google's designs
• Google MapReduce → Hadoop Map/Reduce
• GFS (Google File System) → HDFS
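The Map/Reduce model can be illustrated with a word count. This is a minimal single-process sketch of the map, shuffle, and reduce phases, not Hadoop's API; all function names are hypothetical, and a real cluster would run the map and reduce steps in parallel on different nodes:

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(document: str) -> Iterator[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents: list[str]) -> dict[str, int]:
    """Run map, shuffle, and reduce in sequence over all input splits."""
    mapped = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Two hypothetical input splits, as if stored on two different nodes.
counts = word_count(["big data is big", "data about data"])
print(counts)
```

The programmer writes only the map and reduce functions; splitting the input, grouping by key, and distributing work are the framework's job.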
23. 3. SQL vs NoSQL
Data storage
• File
• DBMS
  • SQL
  • NoSQL
24. 3. SQL vs NoSQL (…)
A relational database is a set of tables
containing data fitted into predefined
categories.
Each table contains one or more data
categories in columns.
Each row contains a unique instance of
data for the categories defined by the
columns.
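The table, column, and row ideas above can be tried with Python's built-in sqlite3 module. A minimal sketch; the `customers` table and its contents are hypothetical examples:

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The table's columns are the predefined data categories.
conn.execute(
    """
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        city TEXT NOT NULL
    )
    """
)

# Each row is one unique instance of data for those columns.
conn.executemany(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    [(1, "Ada", "Atlanta"), (2, "Grace", "Boston"), (3, "Chi", "Atlanta")],
)

rows = conn.execute(
    "SELECT name FROM customers WHERE city = ? ORDER BY name", ("Atlanta",)
).fetchall()
print(rows)

conn.close()
```

Every row must fit the schema declared up front, which is the main contrast with the NoSQL stores that follow.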
25. 3. SQL vs NoSQL (…)
Key-value stores. As the name implies, a
key-value store is a system that stores
values indexed for retrieval by keys.
Some of the market
leaders:
Riak
Amazon Dynamo
Voldemort
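The key-value idea can be sketched in a few lines. This is an in-memory illustration only, not the API of Riak, Dynamo, or Voldemort; the class and method names are hypothetical:

```python
from typing import Any


class KeyValueStore:
    """Toy key-value store: values are opaque and looked up only by key."""

    def __init__(self) -> None:
        self._data: dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        """Store or overwrite the value for a key."""
        self._data[key] = value

    def get(self, key: str, default: Any = None) -> Any:
        """Return the value for a key, or a default if the key is absent."""
        return self._data.get(key, default)

    def delete(self, key: str) -> bool:
        """Remove a key; return True if it existed."""
        return self._data.pop(key, None) is not None


store = KeyValueStore()
store.put("session:42", {"user": "ada", "cart": ["book"]})
print(store.get("session:42"))
print(store.get("session:99", default="not found"))
```

The store never inspects the value, so there is no query by field; real systems add replication and partitioning of keys across nodes.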
26. 3. SQL vs NoSQL (…)
Column-oriented databases. Column-oriented
databases contain one extendable
column of closely related data.
Some of the market
leaders:
HBase
Cassandra
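One way to picture an extendable column of related data is a row key that maps to a set of columns which can differ from row to row. This is a simplified in-memory model under that assumption, not how HBase or Cassandra store data; the class and method names are hypothetical:

```python
from collections import defaultdict
from typing import Any


class ColumnFamily:
    """Toy column family: each row key holds its own set of columns."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._rows: dict[str, dict[str, Any]] = defaultdict(dict)

    def put(self, row_key: str, column: str, value: Any) -> None:
        """Add or overwrite one column in one row; no schema change needed."""
        self._rows[row_key][column] = value

    def get_row(self, row_key: str) -> dict[str, Any]:
        """Return a copy of all columns stored for a row."""
        return dict(self._rows.get(row_key, {}))

    def get_column(self, column: str) -> dict[str, Any]:
        """Return one column across every row that has it."""
        return {
            row_key: columns[column]
            for row_key, columns in self._rows.items()
            if column in columns
        }


profile = ColumnFamily("profile")
profile.put("user1", "name", "Ada")
profile.put("user1", "email", "ada@example.com")
profile.put("user2", "name", "Grace")  # user2 has no email column

print(profile.get_row("user1"))
print(profile.get_column("email"))
```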
27. 3. SQL vs NoSQL (…)
Document-based stores. These databases
store and organize data as collections of
documents, rather than as structured tables
with uniform-sized fields for each record.
Some of the
market
leaders:
MongoDB
CouchDB
SimpleDB
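A collection of documents can be sketched as a list of JSON-like dictionaries whose fields vary from one document to the next. This is a minimal in-memory illustration, not the MongoDB or CouchDB API; the `find` helper and the sample data are hypothetical:

```python
import json
from typing import Any

# Documents in one collection do not need to share the same fields.
articles: list[dict[str, Any]] = [
    {"_id": 1, "title": "Intro to Hadoop", "tags": ["hadoop", "hdfs"]},
    {"_id": 2, "title": "NoSQL basics", "author": "Ada"},
    {"_id": 3, "title": "GridFS notes", "author": "Ada", "pages": 12},
]


def find(collection: list[dict[str, Any]], **criteria: Any) -> list[dict[str, Any]]:
    """Return documents whose fields equal every given criterion."""
    return [
        doc
        for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


by_ada = find(articles, author="Ada")
print(json.dumps(by_ada, indent=2))
```

Unlike a key-value store, the database can look inside each document and query on its fields.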
28. 3. SQL vs NoSQL (…)
[Figure: SQL 2008 data storage capacity]
29. 3. SQL vs NoSQL (…)
GridFS stores files in two
collections:
• chunks stores the binary chunks. For
details, see The chunks Collection.
• files stores the file’s metadata. For
details, see The files Collection.
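The two-collection layout can be illustrated without a database. This is a minimal sketch of the chunking idea only, not the GridFS driver API; the function names are hypothetical, the document fields are modeled on GridFS (`files_id`, `n`, `data`, `length`, `chunkSize`), and the 255 KiB default chunk size is an assumption to check against the MongoDB documentation for your version:

```python
import uuid
from typing import Any


def split_into_gridfs_docs(
    filename: str, data: bytes, chunk_size: int = 255 * 1024
) -> tuple[dict[str, Any], list[dict[str, Any]]]:
    """Build one 'files' document and the matching 'chunks' documents."""
    file_id = uuid.uuid4().hex  # stand-in for a database-generated id
    files_doc = {
        "_id": file_id,
        "filename": filename,
        "length": len(data),
        "chunkSize": chunk_size,
    }
    chunk_docs = [
        {"files_id": file_id, "n": n, "data": data[offset:offset + chunk_size]}
        for n, offset in enumerate(range(0, len(data), chunk_size))
    ]
    return files_doc, chunk_docs


def reassemble(chunk_docs: list[dict[str, Any]]) -> bytes:
    """Rebuild the file by joining chunks in sequence-number order."""
    ordered = sorted(chunk_docs, key=lambda chunk: chunk["n"])
    return b"".join(chunk["data"] for chunk in ordered)


payload = bytes(range(256)) * 10  # 2560 bytes of sample data
files_doc, chunk_docs = split_into_gridfs_docs("sample.bin", payload, chunk_size=1024)
print(files_doc["length"], len(chunk_docs))
```

Splitting a file this way lets a store with a per-document size limit hold files of any size, and lets a reader fetch only the chunks it needs.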
30. 3. SQL vs NoSQL (…)
BSON Types
The chunks Collection
The files Collection
32. 4. Big Data Security
• Secure computations in distributed
programming frameworks
• Security best practices for non-relational
data stores
• Secure data storage and transaction logs
• Cryptographically enforced access control
and secure communication
• Granular access control
• Real-time security/compliance monitoring
33. 4. Big Data Security (…)
Technical recommendations for
security
• Use Kerberos for node authentication
• Use file layer encryption
• Data anonymization
• Use key management
• Deployment validation
• Use secure communication
• Tokenization
• Cloud database controls
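Two of the recommendations above, tokenization and data anonymization, can be sketched with the standard library. This is a minimal illustration under stated assumptions, not a production design: the `TokenVault` class and `pseudonymize` function are hypothetical, the vault is kept in memory, and a keyed hash gives pseudonymization, which is weaker than full anonymization:

```python
import hashlib
import hmac
import secrets


class TokenVault:
    """Toy tokenization: swap a sensitive value for a random token."""

    def __init__(self) -> None:
        self._token_to_value: dict[str, str] = {}
        self._value_to_token: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        """Return the existing token for a value, or create a new one."""
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = "tok_" + secrets.token_hex(8)  # random, not derived from value
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        """Look up the original value; only the vault can do this."""
        return self._token_to_value[token]


def pseudonymize(value: str, key: bytes) -> str:
    """Replace a value with a keyed SHA-256 digest (HMAC)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


vault = TokenVault()
card_token = vault.tokenize("4111-1111-1111-1111")
print(card_token)

secret_key = secrets.token_bytes(32)  # in practice, from key management
print(pseudonymize("ada@example.com", secret_key))
```

Tokens are reversible only through the vault, while the keyed digest is one-way; which one fits depends on whether the original value must ever be recovered.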
34. 5. Big data trends
• Big data – of the people, by the
people, for the people
• Big data and social computing
• Cloud computing
• In-memory computing
• Mobile Applications and HTML5
• Internet and big data
35. 6. Demo with MongoDB & Ref docs
Ref docs:
Judith Hurwitz, Alan Nugent, Dr. Fern Halper,
and Marcia Kaufman. Big Data For Dummies.
John Wiley & Sons, Inc., 2013.
Kaushal Amin (Chief Technology Officer, KMS
Technology, Atlanta, GA, USA). “Technology
Trends for 2013.”
Apache Hadoop website: http://hadoop.apache.org/
Demo with MongoDB