Big Data refers to massive, often unstructured data that is beyond
the processing capabilities of traditional data management tools.
Big Data can take up terabytes and petabytes of storage space in
diverse formats including text, video, sound, images etc.
Traditional relational database management systems cannot deal
with such large masses of data.
Examples: user updates on Facebook,
clicks across the internet.
Volume refers to the huge amount of data
being generated every minute.
90% of the data we have now was created in
just the past 2 years.
IP traffic is expected to grow to 4X its
current level by 2015.
3 billion people are expected to be online by 2015.
Velocity refers to the speed at which new
data is generated and moves around.
It includes real-time systems
such as online banking.
These systems need low response times.
"In-memory analytics" technology is
employed to deal with data in motion.
Variety refers to the various data types
that we can now use.
Earlier focus was on neat and
structured data kept in form of tables in
80% of data available now is
Data types are heterogeneous, ranging from
text to video to audio to pictures.
Big Data Analytics is the process of examining large amounts of data of a variety of
types (big data) to uncover hidden patterns, unknown correlations
and other real-time insights.
Uses of Big Data Analytics: Google Search recommendations,
Satyamev Jayate, reading genes
Data Mining vs. Big Data Analytics
Data Mining: data constraints apply; data must be neat and clean.
Big Data Analytics: big data cannot be neat, as it is unstructured.
Data Mining: elaborate ETL is required, so results have to wait for the ETL cycle to complete.
Big Data Analytics: provides real-time insights.
Relational databases failed to store and process Big Data.
As a result, a new class of big data technology has emerged and is
being used in many big data analytics environments.
The technologies associated with big data analytics include
Hadoop is an open-source, Java-based programming framework
for storing and processing large data sets
in a distributed computing environment.
Components of Hadoop
HDFS (Hadoop Distributed File System)
HDFS stores data in a distributed, scalable and fault-tolerant manner.
The NameNode holds metadata about the data on the DataNodes.
The DataNodes hold the actual data in the form of blocks, and
they are capable of communicating with one another.
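The split between NameNode metadata and DataNode blocks can be sketched roughly as below. This is a toy model in plain Python, not the real HDFS API: the class names, the 4-character block size and the replication factor of 2 are all assumptions chosen for illustration.

```python
# Toy model of the HDFS roles; not the real HDFS API.
BLOCK_SIZE = 4      # assumed tiny block size (characters) for the demo
REPLICATION = 2     # assumed number of copies kept of each block


class DataNode:
    """Holds the actual data, as blocks keyed by block id."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.alive = True


class NameNode:
    """Holds only metadata: which blocks make up a file and where they live."""

    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.file_blocks = {}      # filename -> ordered list of block ids
        self.block_locations = {}  # block id -> DataNodes holding a copy

    def write(self, filename, data):
        block_ids = []
        for i in range(0, len(data), BLOCK_SIZE):
            block_id = f"{filename}#{i // BLOCK_SIZE}"
            # Place each copy on a different node, round-robin.
            start = len(self.block_locations)
            targets = [self.datanodes[(start + r) % len(self.datanodes)]
                       for r in range(REPLICATION)]
            for node in targets:
                node.blocks[block_id] = data[i:i + BLOCK_SIZE]
            self.block_locations[block_id] = targets
            block_ids.append(block_id)
        self.file_blocks[filename] = block_ids

    def read(self, filename):
        parts = []
        for block_id in self.file_blocks[filename]:
            # Use the first replica whose node is still alive.
            node = next(n for n in self.block_locations[block_id] if n.alive)
            parts.append(node.blocks[block_id])
        return "".join(parts)


datanodes = [DataNode(f"dn{i}") for i in range(3)]
namenode = NameNode(datanodes)
namenode.write("tweets.txt", "happy days are here")
datanodes[0].alive = False             # simulate one node failing
recovered = namenode.read("tweets.txt")
print(recovered)
```

In this sketch the file stays readable after one node fails because every block has a second copy on another node.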
HDFS: data is stored in the form of compressed files across n commodity servers.
Relational database: data is stored in the form of tables and the relations among them.
HDFS: fault tolerant; if one node fails, the system still works.
Relational database: if any one node crashes, it gives an error
so as to maintain
Any questions?
Copying the same file over all (thousands of) nodes?
Doesn't that seem like a waste of space?
It is actually not a waste of memory, for 2 reasons:
If one node fails, the system still works, as the data is available on other nodes.
The query is spread over the nodes, so it brings about faster
results due to parallel processing.
E.g. select the count of the word 'happy' on Twitter.
The query is split across multiple servers by a criterion
(here, months), and the results are consolidated.
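The 'happy' example can be sketched as follows: the tweets are partitioned by month, each partition is counted independently, and the partial counts are added up. The sample tweets and the function name are made up for illustration, and a thread pool merely stands in for separate servers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample data: tweets already partitioned by month.
tweets_by_month = {
    "Jan": ["happy new year", "so happy today"],
    "Feb": ["rainy day", "happy happy birthday"],
    "Mar": ["nothing to report"],
}


def count_word(tweets, word="happy"):
    """Count occurrences of `word` in one partition (one 'server')."""
    return sum(tweet.split().count(word) for tweet in tweets)


# Each partition is processed independently, so the work can run in parallel.
with ThreadPoolExecutor() as pool:
    partial_counts = list(pool.map(count_word, tweets_by_month.values()))

# Consolidate the per-month results into one answer.
total = sum(partial_counts)
print(partial_counts, total)
```

Because no partition depends on another, adding more months (or more servers) does not change the logic, only the amount of work done at once.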
MapReduce is a programming model designed for processing
large volumes of data in parallel by dividing the work into a set of
independent tasks,
as in the previous example, where the Twitter data was processed on
different servers on the basis of months.
Hadoop provides an implementation of MapReduce.
It is a combination of 2 Java functions: Mapper() and Reducer().
Example: using word count to check the popularity of a text.
The Mapper function maps the split files and provides input to the reducer.
Mapper(filename, file_contents):
    for each word in file_contents:
        emit(word, 1)
The Reducer function combines the input provided by the mapper and
emits the total for each word.
Reducer(word, values):
    sum = 0
    for each value in values:
        sum = sum + value
    emit(word, sum)
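The word-count pseudocode can be tried out with a small single-process sketch. It is written in Python rather than against Hadoop's Java API, the input files are hypothetical, and the grouping step between the two functions is done here by hand (in Hadoop the framework groups the mapper output by key before calling the reducer).

```python
from collections import defaultdict


def mapper(filename, file_contents):
    """Emit a (word, 1) pair for every word in the file."""
    for word in file_contents.split():
        yield word, 1


def reducer(word, values):
    """Sum all the counts emitted for one word."""
    return word, sum(values)


def run_word_count(files):
    # Map phase: run the mapper over every input file.
    grouped = defaultdict(list)
    for filename, contents in files.items():
        for word, count in mapper(filename, contents):
            # Shuffle step: group the emitted values by key.
            grouped[word].append(count)
    # Reduce phase: one reducer call per distinct word.
    return dict(reducer(word, values) for word, values in grouped.items())


# Hypothetical input files.
files = {"a.txt": "big data is big", "b.txt": "data is everywhere"}
counts = run_word_count(files)
print(counts)
```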
Can anyone think of any disadvantages?
There were 2 major disadvantages when Hadoop was developed,
which have now been resolved:
HDFS depended on a single NameNode.
Solution: a secondary NameNode is attached to the primary NameNode.
MapReduce is a Java framework and did not support SQL.
Solution: Facebook developed Hive, which allowed scientists
to work with SQL on a distributed database.
NoSQL: Not only SQL
Non-relational database management system
Used where no fixed schemas are required and data is scaled
4 categories of NoSQL databases:
Key-value store:
Keys are used to get the value from an opaque data store.
No provision for content-based queries.
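A minimal sketch of the key-value idea, using a plain Python dict as a stand-in for a real store. The `put`/`get` helpers and the sample records are hypothetical. The values are kept as raw bytes to stress that the store never looks inside them.

```python
# Toy key-value store: every value is an opaque blob to the store.
store = {}


def put(key, value):
    """Save a value under a key, overwriting any previous value."""
    store[key] = value


def get(key):
    """Look a value up by key; unknown keys return None."""
    return store.get(key)


put("user:42", b'{"name": "Asha", "city": "Pune"}')
put("user:43", b'{"name": "Ravi", "city": "Delhi"}')

profile = get("user:42")   # lookup by key is the only access path
missing = get("user:99")
print(profile, missing)
```

Finding "all users in Pune" would require the application to fetch and decode every value itself, which is what the lack of content-based queries means in practice.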
Document store:
• Again a key-value store, but the value is in
the form of a document.
• Documents do not have fixed schemas
• Documents can be nested
• Queries can be based on content as well as keys
• Use cases: blogging websites
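The points above can be illustrated with a small sketch in which nested Python dicts play the role of documents. The `find` helper and the blog posts are invented for the example and do not correspond to any particular database's API.

```python
# Toy document store: key -> document (a nested dict with no fixed schema).
posts = {
    "post:1": {"title": "Hello", "author": {"name": "Asha"}, "tags": ["intro"]},
    "post:2": {"title": "Big Data", "author": {"name": "Ravi"}},  # no "tags" field
}


def find(collection, predicate):
    """Return the keys of documents whose content satisfies `predicate`."""
    return [key for key, doc in collection.items() if predicate(doc)]


by_key = posts["post:2"]["title"]                                   # query by key
by_content = find(posts, lambda doc: doc["author"]["name"] == "Asha")  # query by content
tagged = find(posts, lambda doc: "intro" in doc.get("tags", []))
print(by_key, by_content, tagged)
```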
Column store:
Works on attributes (columns) rather than rows.
The key here is the column name
and the value is the contiguous column data.
Best for aggregation.
Typical query: select (1 or 2 columns' values) where
(the same or another column's value) = some value.
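A rough sketch of the column-oriented layout and of the query pattern described above. The table contents and the `select_where` helper are hypothetical; the point is only that each column is stored as one contiguous list, so a query touches just the columns it names.

```python
# Toy column store: column name -> contiguous list of that column's values.
# Row i of the table is made up of the i-th entry of every column.
columns = {
    "city":   ["Pune", "Delhi", "Pune", "Mumbai"],
    "amount": [120, 80, 200, 50],
}


def select_where(select_col, where_col, value):
    """select <select_col> where <where_col> = value, reading only 2 columns."""
    return [s for s, w in zip(columns[select_col], columns[where_col])
            if w == value]


pune_amounts = select_where("amount", "city", "Pune")
pune_total = sum(pune_amounts)   # aggregation over a single column
print(pune_amounts, pune_total)
```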
Graph database:
• Is a collection of nodes
• Nodes represent data,
while edges represent the
links between them
• Most dynamic and
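A minimal sketch of the node-and-edge model, assuming an undirected "friend" relationship. The people, the edge set and the `neighbours` helper are all made up for illustration.

```python
# Toy graph: nodes carry data, edges are the links between them.
people = {"asha": {"age": 29}, "ravi": {"age": 31}, "meera": {"age": 27}}
edges = {("asha", "ravi"), ("ravi", "meera")}  # undirected "friend" links


def neighbours(name):
    """Return the nodes directly linked to `name`, sorted for stable output."""
    linked = ({b for a, b in edges if a == name}
              | {a for a, b in edges if b == name})
    return sorted(linked)


ravi_friends = neighbours("ravi")
# Friends of friends: follow the edges twice, excluding the starting node.
asha_fof = sorted({f for n in neighbours("asha") for f in neighbours(n)} - {"asha"})
print(ravi_friends, asha_fof)
```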
Experts sound off on big data , Analytics and its tools
Big data and analytics hub
Research papers :
• MapReduce: Simplified Data Processing on Large Clusters.
Jeffrey Dean and Sanjay Ghemawat.
OSDI '04: Sixth Symposium on Operating System Design and Implementation,
San Francisco, CA, December 2004.
Data is the new oil
Without Big Data analysis, companies are deaf
and dumb, mere wanderers on the web... like
cattle on the highway!
Thank you !
Keep dreaming BIG :D