16. Hadoop is a framework that allows for distributed processing of large data sets across clusters of commodity computers using a simple programming model.
17. Hadoop was designed to enable applications to make the most of cluster
architecture by addressing two key points:
1. Layout of data across the cluster ensuring data is evenly distributed
2. Design of applications to benefit from data locality
This brings us to the two main mechanisms of Hadoop: HDFS and Hadoop MapReduce.
20. MapReduce is a mechanism for executing an application in parallel:
• by dividing it into tasks
• co-locating these tasks with parts of the data
• collecting and redistributing intermediate results
• and managing failures across all nodes of the cluster
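The steps above can be sketched with a tiny word-count simulation of the MapReduce model: map emits (key, value) pairs from each data split, intermediate results are grouped by key (the "shuffle"), and reduce aggregates each group. The function and variable names here are illustrative only, not part of any Hadoop API, and the whole thing runs in a single process rather than on a cluster.

```python
# Minimal word-count sketch of the MapReduce model (single-process,
# for illustration only -- a real Hadoop job distributes these phases
# across nodes and handles failures for you).
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for each word -- runs on each data split."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Group intermediate values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Aggregate all values for one key."""
    return (key, sum(values))

documents = ["hadoop stores data", "hadoop processes data"]
intermediate = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In a real job the map and reduce functions run on different machines, with the framework collocating map tasks with the data blocks and handling the shuffle over the network.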
21. Hadoop Key Characteristics
• Open source (Apache License)
• Large unstructured data sets (Petabytes)
• Simple Programming model running on Linux
• Scalable from single server to thousands of machines
• Runs on commodity hardware and the cloud
• Application level fault tolerance
• Multiple tools and libraries
23. Hadoop can handle small datasets, but they won't let you unleash its power.
There is overhead associated with each data distribution. If the dataset is small,
you won't get a huge advantage from Hadoop.
If the dataset is small and unstructured, you would first try to collate the data.
Areas where Hadoop is not a good fit today
Hello everyone. Welcome to the session on Demystifying Big Data & Hadoop.
In this session we will discuss the buzzwords Big Data and Hadoop: what is big data, and what is not.
I am Prakriti. I have 15 years of experience in Reporting and BI.
Some of you might have done some research on it, some know it, some might have heard of it, and for some it's just a buzzword, something like this: “FOREIGN LANG VIDEO”.
Let’s see what is Big Data.
It's beyond our storage capacity and beyond our processing power.
The challenges include capture, storage, search, sharing, transfer, analysis and visualization.
NYSE generates about 1 TB of trade data per day. Trade analysis is never done on a single day's data; it spans months or years. Imagine the huge volume of data that is crunched to run analytics on it.
With the increase in our processing capability brought by technological advancements, unstructured data has grown tremendously in recent years.
Data can be categorised into Structured, Semi-Structured and Un-Structured data.
Examples of semi-structured data are CSV, XML, logs, and some parts of email (to, from, subject, received flag, date/time).
Let’s see the Characteristics of Big Data
Scale up or scale out
It is an open-source data-management framework with scale-out storage and distributed processing.