2. A Glimpse on Big Data and Hadoop
Outline:
• Introduction to Big Data
• Big Data Architecture: Tools and Technologies
• What is Hadoop?
• Key Distinctions of Hadoop
• Core Hadoop components
3. A Glimpse on Big Data and Hadoop
What is Big Data?
• Big Data is a term for collections of data sets so large and complex that
they become difficult to process using on-hand database management
tools or traditional data-processing approaches.
• Lots of data
• A combination of structured and unstructured data
4. A Glimpse on Big Data and Hadoop
Big Data in four words (the four V's):
• Data Volume
• Data Velocity
• Data Variety
• Data Veracity
5. A Glimpse on Big Data and Hadoop
Challenges:
• Data capture
• Storage
• Search
• Sharing
• Analytics
• Visualization
6. A Glimpse on Big Data and Hadoop
Big Data Architecture: Tools and Technologies

Hadoop
• Low-cost, reliable scale-out architecture
• Distributed computing
• Proven success in Fortune 500 companies
• Exploding interest

NoSQL Databases
• Huge horizontal scaling and high availability
• Highly optimized for retrieval and appending
• Types:
Document stores
Key-value stores
Graph databases

Analytic RDBMS
• Optimized for bulk-load and fast aggregate query workloads
• Types:
Column-oriented
MPP (massively parallel processing)
In-memory
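Of the three NoSQL types listed above, the key-value store is the simplest to picture. The toy Python sketch below is purely illustrative (it is not the API of any real database); it shows why the model scales so easily: every operation touches a single key.

```python
# Toy in-memory key-value store illustrating the simplest NoSQL model:
# opaque keys mapped to values, optimized for retrieval and appending.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Store (or overwrite) the value under the key.
        self._data[key] = value

    def get(self, key, default=None):
        # Retrieval is a single hash lookup, independent of store size.
        return self._data.get(key, default)

    def append(self, key, value):
        # Appending extends the list stored under the key.
        self._data.setdefault(key, []).append(value)

store = KeyValueStore()
store.put("user:42", {"name": "Ada"})
store.append("events:42", "login")
store.append("events:42", "click")
```

Because each operation is keyed to exactly one entry, the keyspace can be partitioned across many machines with no cross-node coordination, which is the source of the "huge horizontal scaling" noted above.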
8. A Glimpse on Big Data and Hadoop
What is Hadoop?
• Apache Hadoop is an open-source framework for the distributed storage and
processing of large data sets on commodity hardware. Hadoop enables
businesses to quickly gain insight from massive amounts of structured and
unstructured data.
• Hadoop was created by Doug Cutting and Mike Cafarella.
• It is designed to scale up from a single server to thousands of machines.
• Hadoop provides a reliable shared storage and analysis system.
10. A Glimpse on Big Data and Hadoop
Why move to Hadoop?
Hadoop is in high demand because it:
• Allows distributed processing of large data sets across clusters of
computers using a simple programming model.
• Is cheaper than traditional proprietary technologies from vendors such
as Oracle and IBM, since it runs on low-cost commodity hardware.
• Has become the de facto standard for storing, processing, and
analyzing hundreds of terabytes to petabytes of data.
• Can handle all types of data from disparate systems, such as
server logs, emails, sensor readings, and images.
11. A Glimpse on Big Data and Hadoop
Hadoop core components:
• Hadoop is a system for large-scale data processing.
• It has two main components:
Hadoop Distributed File System (HDFS):
Data distributed across "nodes"
Natively redundant
NameNode tracks block locations
MapReduce:
Splits a task across processors
Shuffle and sort
Clustered storage
12. A Glimpse on Big Data and Hadoop
Hadoop Distributed File System
• HDFS is the primary distributed storage used by Hadoop applications.
• HDFS was designed to be a scalable, fault-tolerant, distributed storage
system that works closely with MapReduce.
• It supports shell-like commands for interacting with HDFS directly.
• Features of HDFS are:
Rack Awareness
Minimal data motion
Utilities
Highly operable
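The design described above can be sketched with a toy model of how a file is split into fixed-size blocks and each block replicated across data nodes, with a NameNode-style table recording where each block lives. This is a conceptual illustration only, not the real HDFS API; the function names, block size, and round-robin placement are simplified assumptions (real HDFS defaults to 128 MB blocks and uses rack-aware placement).

```python
# Conceptual sketch of HDFS-style block storage (not the real HDFS API):
# split a file into blocks, replicate each block across data nodes, and
# keep a NameNode-style mapping of block index -> node locations.
from itertools import cycle

BLOCK_SIZE = 8    # toy block size in bytes; real HDFS defaults to 128 MB
REPLICATION = 3   # HDFS's default replication factor

def store_file(data: bytes, datanodes: list) -> dict:
    """Split data into blocks and assign each block to REPLICATION nodes."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    placement = cycle(datanodes)          # simplistic round-robin placement
    block_map = {}
    for idx, _block in enumerate(blocks):
        block_map[idx] = [next(placement) for _ in range(REPLICATION)]
    return block_map

nodes = ["node1", "node2", "node3", "node4"]
block_map = store_file(b"a" * 20, nodes)  # 20 bytes -> 3 blocks of <= 8
print(block_map)
```

Because every block lives on several nodes, the loss of any single node leaves all data readable, which is what "natively redundant" means on the previous slide.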
14. A Glimpse on Big Data and Hadoop
MapReduce:
• MapReduce is a framework for processing parallelizable problems
across huge data sets.
• It uses clusters or grids of machines to process the data in parallel.
• MapReduce’s key benefits are:
Simplicity
Scalability
Speed
Built-in recovery
Minimal data motion
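The map, shuffle-and-sort, and reduce phases can be sketched in plain Python using word count, the canonical MapReduce example. Real Hadoop runs these same phases distributed across a cluster; this in-memory version only illustrates the data flow.

```python
# Minimal in-memory sketch of MapReduce's three phases using word count.
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle and sort: group all values by key, keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(grouped):
    """Reduce: sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in grouped}

docs = ["big data big hadoop", "hadoop stores big data"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'big': 3, 'data': 2, 'hadoop': 2, 'stores': 1}
```

In Hadoop, the map and reduce functions run on the nodes that already hold the data blocks, which is how the framework achieves the "minimal data motion" benefit listed above.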
16. A Glimpse on Big Data and Hadoop
Open Discussion
17. A Glimpse on Big Data and Hadoop
References:
• Apache Hadoop Wiki: http://en.wikipedia.org/wiki/Apache_Hadoop
• Apache Hadoop Project: http://hadoop.apache.org/
• IBM's definition of Big Data and Hadoop: http://www-01.ibm.com/software/data/infosphere/hadoop/
• Hortonworks Hadoop Sandbox: http://hortonworks.com/hadoop/
18. A Glimpse on Big Data and Hadoop
Thank you
Join me at:
Presented by:
Prashanth Yennampelli
pyennamp@gmail.com