Having trouble telling Big Data, Hadoop and NoSQL apart, or seeing how they connect? This slide deck from the Savvycom team can help.
Enjoy reading!
3. 1 BIG DATA
WHAT IS BIG DATA?
Big Data refers to TECHNOLOGY and INITIATIVES that involve data that is too DIVERSE, FAST-CHANGING or MASSIVE for conventional technologies, skills and infrastructure to address efficiently.
7. 1 BIG DATA
BIG DATA CHARACTERISTICS
o VOLUME: high data capacity (terabytes or petabytes)
o VELOCITY: batch, real-time, streams
o VARIETY: various kinds (structured, unstructured, semi-structured)
o VERACITY: quality, consistency, reliability
8. 1 BIG DATA
BIG DATA TYPES
Type | Characteristics | Examples | Technology
STRUCTURED data | Entities with a pre-defined format/schema | RDBMS records | RDBMS, NoSQL
SEMI-STRUCTURED data | Less rigid structure; may have a schema | XML files, JSON files | NoSQL, MapReduce
UNSTRUCTURED data | NO structure | Email content, images, videos, PDF files | MapReduce
9-14. 1 BIG DATA
BIG DATA CHALLENGES IN STORAGE & ANALYSIS
1. SLOW PROCESSING, NOT SCALABLE
o IDE drive (75 MB/s, 10 ms seek)
o SATA (300 MB/s)
o SSD (800 MB/s, 2 ms seek)
2. UNRELIABLE MACHINES (risky)
3. RELIABILITY (scalability, data recovery, partial failure)
4. BACKUP
5. PARALLEL PROCESSING
6. EXPENSIVE COST
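To see why a single drive does not scale, here is a rough back-of-the-envelope sketch. It reads the drive figures on the slide as MB/s (an assumption), uses a decimal terabyte, ignores seek time, and the helper name `scan_seconds` is ours, purely for illustration.

```python
def scan_seconds(data_bytes: float, mb_per_s: float, disks: int = 1) -> float:
    """Idealized time to read data_bytes sequentially, split evenly over disks."""
    bytes_per_s = mb_per_s * 1_000_000  # decimal megabytes, an assumption
    return data_bytes / (bytes_per_s * disks)

ONE_TB = 1_000_000_000_000  # decimal terabyte, an assumption

# Throughput figures taken from the slide, read as MB/s.
for name, rate in [("IDE", 75), ("SATA", 300), ("SSD", 800)]:
    single = scan_seconds(ONE_TB, rate)
    spread = scan_seconds(ONE_TB, rate, disks=100)
    print(f"{name}: {single / 3600:.2f} h on one drive, "
          f"{spread:.1f} s spread over 100 drives")
```

Even under these generous assumptions, one IDE drive needs hours to scan a terabyte, while 100 drives reading in parallel need only minutes. That gap is the motivation for the parallel processing point above.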
16. 2 HADOOP
WHAT IS HADOOP?
A free, Java-based framework that allows the DISTRIBUTED PROCESSING of LARGE DATA SETS across CLUSTERS OF COMPUTERS using SIMPLE PROGRAMMING MODELS
17-18. 2 HADOOP
HADOOP ORIGIN
GOOGLE PUBLISHES THE GFS & MAPREDUCE PAPERS
2002-2004
DOUG CUTTING ADDS GFS & MAPREDUCE TO NUTCH
2004
YAHOO! HIRES DOUG AND BUILDS A TEAM TO DEVELOP HADOOP
2007
NY TIMES CONVERTS 4 TB OF ARCHIVES (100-NODE EC2 CLUSTER)
WEB-SCALE DEVELOPMENT AT YAHOO!, FACEBOOK, TWITTER
YAHOO! DOES THE FASTEST SORT OF A TB, IN 62 SEC
2009
YAHOO! SORTS A PB IN 16.25 HOURS (3658 NODES)
APACHE HADOOP IS NOW AN OPEN-SOURCE PROJECT
21. 2 HADOOP
HADOOP ARCHITECTURE
HDFS – Hadoop Distributed File System
Distributed across "NODES"
NAME NODE
o Master of the system
o Stores metadata: transaction log, list of files, list of blocks, data nodes
o Maintains and manages blocks on data nodes
DATA NODE
o Slaves; deployed on each machine
o Provide actual storage
o Responsible for serving read/write requests
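The split between name node (metadata) and data nodes (storage) can be pictured with a toy sketch. The block size, the replication factor, the round-robin placement and the function name `plan_blocks` are all assumptions for illustration; real HDFS placement is more involved (it takes racks into account, for example).

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes
REPLICATION = 3                 # assumed number of copies per block

def plan_blocks(file_size, data_nodes,
                block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return name-node style metadata: block index -> data nodes holding it."""
    if replication > len(data_nodes):
        raise ValueError("not enough data nodes for the requested replication")
    n_blocks = -(-file_size // block_size)  # ceiling division
    metadata = {}
    for block in range(n_blocks):
        # Round-robin placement: a simple stand-in, not the HDFS policy.
        metadata[block] = [data_nodes[(block + r) % len(data_nodes)]
                           for r in range(replication)]
    return metadata

plan = plan_blocks(300 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
for block, holders in plan.items():
    print(block, holders)
```

The dictionary plays the role of the name node's block list; the strings stand for the data nodes that would actually hold the bytes.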
29. 2 HADOOP
HADOOP ARCHITECTURE
MAP REDUCE FUNCTIONS
o Logical functions: MAPPER & REDUCER
o Hadoop handles distributing MAP & REDUCE tasks across the cluster
o Developers write the MAP & REDUCE functions, package them as .jar files and submit them to Hadoop clusters.
o Typically batch oriented.
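As a minimal sketch of the mapper/reducer idea, here is the classic word count run inside a single Python process. On a real cluster Hadoop would distribute the map and reduce tasks and do the grouping step itself; the names `mapper`, `reducer` and `run_job` are ours, not a Hadoop API.

```python
from collections import defaultdict

def mapper(line):
    """MAP: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    """REDUCE: combine all values emitted for one key."""
    return word, sum(counts)

def run_job(lines):
    """Run map, group values by key (the shuffle), then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())

result = run_job(["big data", "Big Hadoop"])
print(result)
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across many machines.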
36. 2 HADOOP
HADOOP FEATURES SUMMARY
o STORE ANYTHING: unstructured data, semi-structured data
o STORAGE CAPACITY: scales linearly; cost is not exponential
o DATA LOCALITY & PROCESS IN YOUR WAY
o FAILURE & FAULT TOLERANCE: detects failure & heals itself (data is replicated, failed tasks are re-run, no need to maintain backup data)
o COST EFFECTIVE
o PRIMARILY USED FOR BATCH PROCESSING, NOT REAL-TIME
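The fault-tolerance point (data replicated, failed task re-run) can be sketched as below. This is only an illustration of the idea, not how Hadoop schedules tasks; `NodeDown`, `read_block` and `run_task` are hypothetical names.

```python
class NodeDown(Exception):
    """Raised when a (simulated) data node cannot serve a block."""

def read_block(node, down_nodes):
    """Pretend to read a block from one node; fail if that node is down."""
    if node in down_nodes:
        raise NodeDown(node)
    return f"block data from {node}"

def run_task(replicas, down_nodes):
    """Re-run the task on the next replica holder until one succeeds."""
    attempts = []
    for node in replicas:
        attempts.append(node)
        try:
            return read_block(node, down_nodes), attempts
        except NodeDown:
            continue  # failed attempt: move on to another copy of the data
    raise RuntimeError("every replica is unavailable")

data, tried = run_task(["dn1", "dn2", "dn3"], down_nodes={"dn1"})
print(data, tried)
```

Since every block exists on several nodes, losing one machine costs a retry rather than the data.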
37. 2 HADOOP
WHO IS USING HADOOP & FOR WHAT
SEARCH
LOG PROCESSING
RECOMMENDATION SYSTEMS
DATA WAREHOUSE
VIDEO & IMAGE ANALYSIS
44. 3 NOSQL
WHAT IS NOSQL?
NOSQL = Not Only SQL
SCHEMA FREE
NOSQL CATEGORIES
o KEY-VALUE STORE: DYNAMO, AZURE, REDIS, MEMCACHED
o BIG TABLE / COLUMN STORE (GOOGLE): HBASE, CASSANDRA
Similar to RDBMS but handles semi-structured data
o GRAPH DB: NEO4J
o DOCUMENT STORE: MONGODB, COUCHDB
Similar to a key-value store, but the DB knows what the value is
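The difference between a key-value store and a document store can be sketched with plain Python structures. These are in-memory stand-ins, not any product's API; the sample records and the helpers `kv_get` and `doc_find` are made up for illustration.

```python
import json

# Key-value store: the value is an opaque blob, lookups go by key only.
kv_store = {
    "user:1": json.dumps({"name": "An", "city": "Hanoi"}),
    "user:2": json.dumps({"name": "Binh", "city": "Hue"}),
}

# Document store: the DB understands the fields inside each value.
doc_store = [
    {"_id": 1, "name": "An", "city": "Hanoi"},
    {"_id": 2, "name": "Binh", "city": "Hue"},
]

def kv_get(key):
    """Fetch by key; the caller must decode the blob itself."""
    return kv_store.get(key)

def doc_find(field, value):
    """Query by a field inside the documents."""
    return [doc for doc in doc_store if doc.get(field) == value]

print(kv_get("user:1"))
print(doc_find("city", "Hanoi"))
```

With the key-value store, finding everyone in Hanoi means fetching and decoding every value on the client; the document store can answer that query directly.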
45. 3 NOSQL
MONGO DB – DATA MODELING CONCEPT
Data in MongoDB has A FLEXIBLE SCHEMA.
Data is stored in the form of DOCUMENTS (JSON-like key-value pairs).
COLLECTION: a group of RELATED DOCUMENTS
46. 3 NOSQL
MONGO DB – DATA MODELING CONCEPT
No JOIN; instead, there are 2 types of DOCUMENT STRUCTURE: Reference and Embedded
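A small sketch of the two document structures, using plain Python dicts in place of MongoDB documents. The field names, the sample data and the helper `resolve_address` are assumptions for illustration.

```python
# Embedded: the related data lives inside the parent document.
user_embedded = {
    "_id": 1,
    "name": "An",
    "address": {"city": "Hanoi", "street": "Example Street"},
}

# Reference: the parent stores only the id of a document kept elsewhere.
users = {1: {"_id": 1, "name": "An", "address_id": 10}}
addresses = {10: {"_id": 10, "city": "Hanoi", "street": "Example Street"}}

def resolve_address(user_id):
    """Follow the reference by hand: a second lookup replaces the JOIN."""
    user = users[user_id]
    return addresses[user["address_id"]]

print(user_embedded["address"]["city"])  # one read
print(resolve_address(1)["city"])        # two reads
```

Embedding gives everything in one read; referencing avoids duplicating data that many documents share, at the price of an extra lookup.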
47. 3 NOSQL
MONGO DB – DATA MODELING CONCEPT
* Always consider how the data is used (queries or updates) when designing data models
MODEL RELATIONSHIPS BETWEEN DOCUMENTS
o One-to-one
o One-to-many
MODEL TREE STRUCTURES
o Parent reference
o Child reference
o Array of ancestors
o Materialized paths
o Nested sets
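Two of the tree patterns listed above, parent reference and materialized paths, can be compared in a short sketch. The category tree, the comma-delimited path format and the function names are assumptions for illustration.

```python
# Each node carries both a parent reference and a materialized path.
nodes = [
    {"_id": "Books", "parent": None, "path": None},
    {"_id": "Programming", "parent": "Books", "path": ",Books,"},
    {"_id": "Databases", "parent": "Programming",
     "path": ",Books,Programming,"},
    {"_id": "MongoDB", "parent": "Databases",
     "path": ",Books,Programming,Databases,"},
]
by_id = {node["_id"]: node for node in nodes}

def ancestors_by_parent(node_id):
    """Parent reference: walk upward one lookup at a time."""
    chain = []
    parent = by_id[node_id]["parent"]
    while parent is not None:
        chain.append(parent)
        parent = by_id[parent]["parent"]
    return chain

def descendants_by_path(node_id):
    """Materialized path: one scan for paths containing ',<id>,'."""
    marker = f",{node_id},"
    return [n["_id"] for n in nodes if n["path"] and marker in n["path"]]

print(ancestors_by_parent("MongoDB"))
print(descendants_by_path("Programming"))
```

Parent references keep updates cheap but need one lookup per level; materialized paths answer subtree queries in a single pass but must be rewritten when a node moves. Which one fits depends on how the data is queried.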
48. 3 NOSQL
MONGO DB – CRUD OPERATIONS
COMPARING: SQL VS MONGO STATEMENTS
o QUERY STATEMENT
o CREATE / INSERT / UPDATE / DELETE
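As a rough illustration of the query comparison, the sketch below runs a SQL query with Python's built-in sqlite3 and evaluates a MongoDB-style filter document against the same data. The `matches` helper is a tiny stand-in written for this example (it supports only equality and `$gt`); it is not the MongoDB driver. In the mongo shell the same filter would be written `db.users.find({status: "A", age: {$gt: 25}})`. The table, fields and rows are made up.

```python
import sqlite3

rows = [("An", 31, "A"), ("Binh", 22, "B"), ("Chi", 27, "A")]

# SQL side: table, rows, and a WHERE clause.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER, status TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
sql_result = [r[0] for r in conn.execute(
    "SELECT name FROM users WHERE status = 'A' AND age > 25 ORDER BY name")]

# Document side: the same records as dicts, filtered by a query document.
docs = [{"name": n, "age": a, "status": s} for n, a, s in rows]

def matches(doc, query):
    """Stand-in matcher: plain values mean equality, {'$gt': x} means > x."""
    for field, cond in query.items():
        if isinstance(cond, dict):
            if not (field in doc and doc[field] > cond["$gt"]):
                return False
        elif doc.get(field) != cond:
            return False
    return True

mongo_style_query = {"status": "A", "age": {"$gt": 25}}
doc_result = sorted(d["name"] for d in docs if matches(d, mongo_style_query))

print(sql_result, doc_result)
```

The SQL statement and the filter document express the same condition; the filter is itself a document, in keeping with the data model.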